Saturday 13 September 2014

The myth about AWS and High Availability

Amazon loves talking about "high availability": failover, health checks, disaster recovery, blah, blah, blah. Well, there's an annoying problem they don't tell you about, and that is democracy. Hint: democracy requires a majority.

The problem is this: how do you maintain quorum consensus across a cluster that spans only two availability zones when one of those zones goes down? The simple answer is, you can't.

Most clustering and failover technologies rely on a majority vote from the servers in the cluster before they will continue to function, otherwise known as quorum consensus. This is to prevent a scenario often referred to as split-brain. Clustered data stores that rely on such a mechanism include SQL Server AlwaysOn Availability Groups, MongoDB replica sets, Elasticsearch master nodes, and I'm sure there are many more.

If three nodes are split across two availability zones, there are two outage scenarios: either the AZ holding the majority of the nodes goes offline, or it doesn't.



It doesn't matter how you spread the nodes across the two availability zones either: one AZ must always hold at least half of the votes, so there is always a single-AZ failure that takes out quorum. Split the votes evenly and the cluster stops functioning whichever AZ goes down; split them unevenly and you have, at best, a 50% chance of staying up depending on which AZ dies. What annoys me a bit is that even the AWS white paper on deploying SQL Server AOAG doesn't mention this issue, and recommends setting up the file share witness on one of the DCs that reside in the same AZ as one of the SQL Server instances.
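
To make the two-AZ arithmetic concrete, here's a quick sketch (the AZ names and node placements are made-up examples) that checks whether a cluster keeps quorum after losing each AZ in turn:

# Quick sanity check: does a cluster keep quorum when a single AZ fails?
# Node placements below are hypothetical examples, not a recommendation.
def survives_az_failure(nodes_per_az):
    total = sum(nodes_per_az.values())
    majority = total // 2 + 1
    # For each AZ: do the remaining nodes still form a majority if it dies?
    return {az: total - count >= majority for az, count in nodes_per_az.items()}

# Two AZs, even split: losing either AZ loses quorum.
print(survives_az_failure({"az-a": 2, "az-b": 2}))   # {'az-a': False, 'az-b': False}

# Two AZs, 2 + 1 nodes: a coin flip depending on which AZ dies.
print(survives_az_failure({"az-a": 2, "az-b": 1}))   # {'az-a': False, 'az-b': True}

# Three AZs, one node each: any single AZ can go and quorum holds.
print(survives_az_failure({"az-a": 1, "az-b": 1, "az-c": 1}))  # all True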

Obviously not all regions are the same, and some have three or more availability zones: US East, US West (Oregon), Ireland, and Tokyo. Those regions can comfortably distribute one node in each of three AZs. However, that's only 4 out of the 8 regions in total that can truly host an HA data store with automatic failover.

While this makes deploying these data stores with high availability less attractive, it's still very important to do so. Server restarts on AWS are very real, and let's not forget that an HA cluster lets sysadmins service the servers far more easily without having to put up the "site down for maintenance" page.

One way to get around this issue is to establish a VPN connection to another data center or on-premise servers and host a lightweight node outside of the AWS infrastructure. SQL Server AOAG has the file share witness and MongoDB has arbiter nodes: members of the cluster that do nothing other than cast a vote to maintain quorum. It has also been mentioned that AWS VPC will in the future allow VPC peering across different regions, which could also potentially solve this problem.
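
As an illustration of the arbiter idea, here's a rough sketch of a MongoDB replica set with two data-bearing members in AWS and a vote-only arbiter sitting on-premise across the VPN. The hostnames and replica set name are made up, so treat it as a shape rather than a recipe:

# A hypothetical MongoDB replica set layout: two data-bearing members in
# separate AWS availability zones plus a vote-only arbiter on-premise,
# reachable over the VPN. Hostnames and the replica set name are made up.
from pymongo import MongoClient

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-az-a.example.internal:27017"},
        {"_id": 1, "host": "mongo-az-b.example.internal:27017"},
        # The arbiter stores no data and only casts a vote, so a tiny
        # on-premise box is enough to break the tie when one AZ goes down.
        {"_id": 2, "host": "arbiter.onprem.example.internal:27017",
         "arbiterOnly": True},
    ],
}

MongoClient("mongo-az-a.example.internal", 27017).admin.command(
    "replSetInitiate", config)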

Or get a piece of wood and knock on it, hard.

Thursday 4 September 2014

Deduping entries with logstash and elasticsearch

There's one annoying little problem with logstash and file / block based ingestion: in certain scenarios the whole file gets reprocessed, causing lines that have already been processed to be processed again. These scenarios include:
  1. S3 files being re-uploaded with additional data appended.
    Logstash processes all files modified after the datetime recorded by its last run.
  2. Local file system files being copy-pasted over with updates from other locations.
    Files are tracked by file descriptor, so they can be renamed or moved without affecting logstash tailing them. However, when a file is overwritten it is considered a completely different file and all of its data will be reprocessed.
  3. Losing or deleting the sincedb file used to track progress.
  4. And I'm sure there are more.
The solution is actually surprisingly simple: calculate a hash of the event message and use that as the document id for elasticsearch. Here's a sample config:

input {
  # something. anything.
}

filter {
  # Copy the raw message into a new field...
  mutate {
    add_field => ["logstash_checksum", "%{message}"]
  }
  # ...then overwrite that field with a keyed MD5 digest of itself.
  anonymize {
    fields    => ["logstash_checksum"]
    algorithm => "MD5"
    key       => "a"
  }
}

output {
  elasticsearch {
    host        => "127.0.0.1"
    # Identical messages hash to the same id, so a reprocessed line
    # overwrites its existing document instead of creating a duplicate.
    document_id => '%{logstash_checksum}'
  }
}
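
To see why this dedupes, here's a rough Python illustration of the principle (plain MD5 here, whereas the anonymize filter produces a keyed digest, so the actual values will differ): the same message always maps to the same document id, so indexing it again just overwrites the document it created the first time.

import hashlib

def doc_id(message):
    # Same message in, same id out, so re-indexing is idempotent.
    return hashlib.md5(message.encode("utf-8")).hexdigest()

line = "2014-09-04 10:15:00 GET /index.html 200"
print(doc_id(line))                   # deterministic checksum used as the document id
print(doc_id(line) == doc_id(line))   # True: reprocessing overwrites rather than duplicates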

Note that this works best with events that already contain a timestamp, such as web server access logs from IIS, Apache and the like, load balancer logs, etc. It would be a bad idea to apply this technique to stream-based log entries that rely on the timestamp assigned at ingestion time by logstash, since two distinct but identical-looking messages would hash to the same id and be collapsed into a single document.

Tuesday 2 September 2014

Something subtle about logstash s3 input

There are two settings on the s3 input plugin for logstash whose criticality went under my radar: delete and backup_to_bucket. In some situations they make or break the feasibility of the plugin entirely, so much so that I had to significantly rework the architecture I had in mind for log processing.

People who have worked with the s3 input for logstash extensively will have experienced a few major issues with bucket/key prefix combinations that contain a large number of objects:
  1. Extremely long start-up times due to the registration of the s3 plugin within the logstash runtime.
  2. Long latency before new files in s3 are picked up by logstash for processing.
Both of these problems stem from a quirk of the AWS s3 API: the list objects call only returns up to 1,000 keys at a time. The problem is further amplified by the fact that the logstash s3 input does not deal with this intelligently; it lists everything up front and then processes it sequentially, rather than processing the first 1,000 keys before moving on to the next batch.
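
You can see the pagination directly with a few lines of the AWS SDK for Python (boto3 here purely for illustration, with a made-up bucket and prefix): every extra thousand objects in the bucket is another sequential round trip before any processing can start.

import boto3  # purely illustrative; any s3 client exhibits the same behaviour

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects")

pages = 0
keys = 0
# Bucket and prefix are placeholders for your own drop zone.
for page in paginator.paginate(Bucket="my-log-bucket", Prefix="iis/"):
    pages += 1
    keys += len(page.get("Contents", []))

# Each page is a separate, sequential HTTP request returning at most 1,000 keys.
print("%d keys listed across %d list requests" % (keys, pages))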

The delete and backup_to_bucket settings allow a configuration where a drop-zone location holds only the files waiting to be processed by logstash, with backup_to_bucket pointing at another location for archival. That's as opposed to dumping all the data into a single s3 location, pointing logstash at it and telling it to poll for updates.

This approach also removes the reliance on the sincedb file the s3 input uses to keep track of where it last ingested up to: logstash can just crunch everything it finds in the drop zone. That means the logstash server can be put in an auto scaling group and thrown away at will, without having to worry about backing up, restoring or losing the sincedb file.
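
What the two settings implement is essentially a collect-process-archive loop. Here's a rough boto3 sketch of the same pattern (bucket names are made up), using s3's server-side copy rather than a re-upload, which also sidesteps the second issue mentioned at the end of this post:

import boto3  # illustrative sketch only; bucket names are made up

s3 = boto3.client("s3")
DROP_ZONE = "my-logstash-dropzone"
ARCHIVE = "my-logstash-archive"

def drain_drop_zone(process):
    """Process every object in the drop zone, archive it, then delete it."""
    paginator = s3.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=DROP_ZONE):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = s3.get_object(Bucket=DROP_ZONE, Key=key)["Body"].read()
            process(key, body)
            # Server-side copy to the archive bucket (no re-upload needed),
            # then remove the original so the drop zone only ever holds
            # unprocessed files.
            s3.copy_object(Bucket=ARCHIVE, Key=key,
                           CopySource={"Bucket": DROP_ZONE, "Key": key})
            s3.delete_object(Bucket=DROP_ZONE, Key=key)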

Take my setup as an example. We ship all our logs to an S3 location and fan out from there into EMR, Redshift, Logstash, etc. With our IIS logs, having only 10 web servers shipping 1 log per site per hour, and assuming an average of 2 IIS sites per server, there would be a total of 10 servers x 2 sites x 24 hours x 30 days = 14,400 s3 objects per month. In just 6 months there'd be around 86,000 objects in s3. That's over 86 sequential list requests (at 1,000 keys per response) just to scan the entire drop zone for changes, which makes it completely unfeasible. Just imagine what's going to happen 3 years down the line.

Quite frankly, I would rather see the logstash s3 input intelligently iterate over and prioritize the yyyy/mm/dd key partitioning that most people use when archiving large numbers of objects in s3, scanning for and ingesting the latest entries first. The combination of delete and backup_to_bucket, while it solves the problem, introduces new ones:
  1. It means that data cannot reach EMR and Redshift until logstash has finished processing it.
  2. The logstash s3 input plugin code, as of the time of this post, uploads the local copy of the file it downloaded to the backup location, as opposed to using s3's copy operation. This extra step can potentially corrupt the file under unforeseen circumstances.
Though I doubt my wish will come true anytime soon, so I guess delete and backup_to_bucket are the best we have to work with for the time being.