Wednesday 8 July 2015

ElasticSearch on AWS part 2

This is a continuation of my previous blog post on ElasticSearch on AWS. Eight months after running ElasticSearch on AWS with Auto Scaling Groups and Spot Instances, I'd like to share how things went.

The primary use case of our ElasticSearch cluster was basically to provide operational intelligence. We sent all of our server metrics and logs into it, and it served relatively well as a tool to increase transparency on everything that was going on across all of our environments.

Here is the list of logs we sent through it:

  1. AWS CloudTrail
  2. AWS Billing
  3. AWS ELB access log
  4. IIS log
  5. Windows Performance Counters
  6. Windows Event Log
  7. Wowza / Shoutcast / Icecast access log (A normalised dataset between the three)
  8. NLog .Net Application Log
Here are some numbers around the cluster:
  1. 1 billion-ish records
  2. 600GB of index in total
  3. ~2000 shards and ~350 indexes (we had 6 shards per index)
  4. 3x t2.small master nodes, 2x r3.large data nodes and 4x r3.large spot instances.
And so far, this is some of what we have "observed and learned".
  1. Monitor the load of the instances in the cluster continuously. Initially all the servers are loaded in about the same way, but as servers come and go from the Auto Scaling Groups, replication doesn't really balance the shards out too well. This creates "hot spots" and causes one server to be more heavily loaded than the rest (see the sketch after this list for a quick way to check).
  2. Not all log types have the same data rate. It's best to load test and find the optimal index pattern (per week or per day) and number of shards in a "staging cluster" every time a new log type enters the system. Sending queries to a shard that is too big can lock the cluster up.
  3. Better safe than sorry. Err on the side of shards and indexes being too small rather than too large. Querying many shards that are allocated too small will make the cluster slow. Querying shards that are too big (either in index size or document count) will lock the cluster.
  4. Scale up before scaling out. A few big servers work better than many small servers.
  5. Design a log buffering mechanism into log delivery. Kinesis and Kafka are very good candidates for real-time data ingestion. When the cluster was out of control and I had to stop the log ingestion, it was very reassuring to know the logs were still being shipped and held somewhere.
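
On the hot spot point, a quick way to eyeball the balance is the _cat/allocation API. Here's a minimal sketch in Python; it assumes the cluster answers on localhost:9200, and the 10-shard spread threshold is an arbitrary example value:

# Compare shard counts per data node via _cat/allocation to spot "hot" nodes.
# Assumes the cluster answers on localhost:9200; the 10-shard spread threshold
# is an arbitrary example value.
import requests

text = requests.get("http://localhost:9200/_cat/allocation?v").text
rows = text.strip().splitlines()[1:]          # drop the header row

shard_counts = {}
for row in rows:
    if "UNASSIGNED" in row:
        continue                              # skip the unassigned-shards row
    parts = row.split()
    shards = int(parts[0])                    # first column is the shard count
    label = " ".join(parts[5:])               # host / ip / node name columns
    shard_counts[label] = shards

for label, count in sorted(shard_counts.items(), key=lambda kv: -kv[1]):
    print("{:5d} shards  {}".format(count, label))

if shard_counts and max(shard_counts.values()) - min(shard_counts.values()) > 10:
    print("WARNING: shard allocation looks lopsided; expect hot spots")
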
Oh, and check out the template I used to deploy ElasticSearch (and the AWS Lego project while you're at it):
https://github.com/SleeperSmith/Aws-Lego/blob/master/src/special-blocks/elasticsearch.template
It's not perfect, but it's a good start (imo). Pull Requests are welcome.

Also, check out Bosun from the team at StackExchange, which allows you to create alerts from ElasticSearch (ABOUT TIME!):
https://bosun.org/

Saturday 8 November 2014

The Necessity of Unit Testing

Enough is enough. I'm tired of arguing with people about the necessity of unit testing. Next time someone wants to argue with me, I'll point them here. If you are one of those people, brace yourself.

Let's get some things straight first. People are not even qualified to state their argument if they:
  • Have not maintained over 40% useful and valid test coverage on software that's gone live and been maintained for months if not years. These people do not really know what unit testing is. They've never even done it themselves.
  • Have not overseen a team to ensure proper test coverage across the whole code base. It's an uphill battle that some people have never faced. Slipping is easy while rising is hard.
  • Do not know how to calculate test coverage for the language they use. Yes to point 1, yes to point 2, but 40% coverage by eyeballing the code? Hilarious.
  • Do not know the basics of a "good" unit test: AAA, A-TRIP. They've never experienced the benefit of true, proper unit tests. They don't even know how to identify the pile of spaghetti tests that is making their life hell on earth. (A minimal AAA sketch follows this list.)
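
For anyone wondering what AAA (Arrange, Act, Assert) actually looks like, here is a minimal sketch in Python; the ShoppingCart class is hypothetical and exists purely for illustration:

# Minimal AAA (Arrange, Act, Assert) sketch. ShoppingCart is a hypothetical
# class used purely for illustration.
import unittest

class ShoppingCart:
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)

class ShoppingCartTests(unittest.TestCase):
    def test_total_sums_item_prices(self):
        # Arrange: put the object under test into a known state
        cart = ShoppingCart()
        cart.add("book", 10.0)
        cart.add("pen", 2.5)

        # Act: perform exactly one action
        total = cart.total()

        # Assert: verify one observable outcome
        self.assertEqual(12.5, total)

if __name__ == "__main__":
    unittest.main()
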
I can quite safely say that the above alone rules out 95% of the people I mostly argue with. Yes, I'm being overly generous. I'm a generous person. They argue with me until they're blue in the face when they've never done it themselves and can't actually do it themselves. Yes, they talk and act like they know all about it.

With that out of the way, we can now have some proper intelligent discussion.
The code is simple and straightforward and it doesn't need tests
And we have a "code simplicity o' meter" that gives the code a score between 0 and 100, with 25 being the "complex code threshold" that requires tests? Simple code will have simple tests. If it's so straightforward that it doesn't need tests, writing one wouldn't take more than a few minutes. If it does take more than a few minutes, then either:
  a) the code is not as straightforward and simple as it was perceived, or
  b) the programmer doesn't know how to write unit tests, in which case it's good practice anyway.
(Yes, I'm aware of CCM and I use it myself. The people who argue this "simple code" thing with me usually don't know about it though, so shhhhh.......)
The code is too complex and writing tests takes too much time
Really. If you don't need tests for complicated code, I don't know what type of code you would classify as needing tests. Simple code? Not-so-simple but not-so-complicated code? Shall we bring in the "code simplicity o' meter" for your sea of grey?
Ensuring correctness to subtle mistakes in complex code is one of the main points of unit testing.



Unit tests cost extra time / money
I'd like to say I never have a problem burning through the same number of points in a sprint as people who don't write tests, but I can accept that there is a cost factor to tests. At this point, it becomes a complex debate.
  • Why can't the business afford the time / money to write these tests? Writing the actual code is only a small part of the development cost. If the margin is so low that an extra 20% (arbitrary number) on the 50% of the project cost that coding represents can't be paid up front for something with long and far-reaching benefits, maybe there isn't even a business model to begin with.
  • Or does it? It is a fact that the earlier an issue is fixed, the less costly it is. Unit tests are the second quickest feedback mechanism, behind the compiler. If a regression can be stopped at the unit tests, it costs far less than one caught in a staging/UAT environment, where resetting the environment can cost a lot of time and sometimes money.
Then, from the hot-shot developers:
I don't make mistakes and I haven't made one in X years.
But others do, and they will come change your code. Deal.
And lastly,
The code base is still changing. Time we spend writing tests could be a waste.
Do or do not, there's no try. If the code you are writing is not going to be run or used, why are you even writing it? If it is, even a simple crash in a client proof-of-concept application can potentially lose you the deal. Any production code needs tests. Period. As for that so-called prototype / throwaway code: how cute. When's the last time that happened, throwing away code?


ElasticSearch on AWS with AutoScaling Groups and Spot Instances

One of the most powerful features of ElasticSearch is its ability to scale horizontally in many different ways: routing, sharding, and time / pattern based index creation and query. It is a robust storage solution that is capable of starting out small and cheap and then growing and scaling as load and volume rise.
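
As a rough illustration of the time / pattern based indexing part, the sketch below writes to a daily index and then queries a whole month with a wildcard (Python; the host, index prefix and query are assumptions, not anything specific to our setup):

# Rough illustration of time-based index patterns: write to a daily index,
# then query a whole month with a wildcard. Host and index prefix are
# placeholder assumptions.
import datetime
import json
import requests

ES = "http://localhost:9200"
today = datetime.date.today()

# Each day's events land in their own index, e.g. logstash-2015.07.08
daily_index = "logstash-{:%Y.%m.%d}".format(today)
requests.post("{}/{}/events".format(ES, daily_index),
              data=json.dumps({"message": "hello", "@timestamp": today.isoformat()}))

# A query can target a whole month with a wildcard, and old indexes can be
# dropped wholesale instead of deleting individual documents.
resp = requests.get("{}/logstash-{:%Y.%m}.*/_search".format(ES, today),
                    data=json.dumps({"query": {"match": {"message": "hello"}}}))
print(resp.json()["hits"]["total"])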

Having single-handedly implemented the ElasticSearch cluster where I work, I'll go over a few points of interest. (The design is pretty straightforward for people who have worked with ElasticSearch. If you want to know how to set it up, RTFM.)


As marked on the diagram:
  1. First of all, you want to use a single security group across all your data nodes and master nodes for the purpose of discovery. This "shields" your nodes from other nodes or malicious attempts to join the cluster, and is part of your security. Open port 9300 to and from this security group itself.
  2. A split of "core" data nodes and "spot" data nodes. Basically, you have a small set of "core" data nodes that guarantee the safety of the data. A set of spot instances is then added to the cluster to boost performance.
    • Set rack_id and cluster.routing.allocation.awareness.attributes!!!
      I don't like stating the obvious, but THIS IS CRITICAL. Set the "core" nodes to use one rack_id and the spot instances to use another. This will force replication to store at least one complete set of the data on the "core" nodes. Also, install the kopf plugin and MAKE SURE THAT IS THE CASE. Spot instances are just that, spot instances. They can all disappear in the blink of an eye.
    • Your shard and replica count directly affects the maximum number of machines you can "Auto Scale" to. Self-explanatory.
    • You can update the instance size specified in the launch configuration and terminate the instances one by one to "vertically" scale the servers, should you run into the horizontal scaling limit imposed by the shard and replica count.
    • This is an incredibly economical set up. Taking r3.2xlarge instances as an example, even a 3-year heavy reserved instance costs $191 / month, while a spot instance costs around $60 / month. It is the ability to leverage spot instances that makes all managed hosting of ElasticSearch look like daylight robbery.
      For $4k / month, you can easily scale all the way up to 2TB+ of memory, 266+ cores and 10TB of SSD at 30k IOPS (assuming 50% of the $4k monthly fee is spent on spot instances and 25% on SSD). You get 60GB of RAM and 780GB of storage from Bonsai, btw.
  3. Set up your master nodes in a separate, dedicated Auto Scaling Group with at least 2 servers in it. This is to prevent the cluster from falling apart should any of the master nodes in the Auto Scaling Group get recycled.
    Another reason these dedicated master nodes are necessary is the volatility of the core data nodes and spot instances (they sit in Auto Scaling Groups) and the frequency at which the shard and cluster states get reshuffled as a result.
    They don't need to be particularly beefy.
  4. Front the master nodes with an Elastic Load Balancer.
    • Open port 9200 for HTTP. User queries will then be evenly distributed across the master nodes. You can use a separate ASG with a set of non-data nodes dedicated to queries if there is a high volume of query traffic. Optionally set up SSL here.
    • Open port 9300 for TCP. Logstash instances can then join the cluster using this endpoint as the "discovery point". Otherwise you will not have a stable address to set in the logstash elasticsearch output when using the node protocol.
  5. Configure an S3 bucket to snapshot to and restore from, using the master nodes as the gateway (see the sketch after this list).
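
Registering the S3 repository is roughly a one-off API call through the ELB. Here's a sketch in Python; the bucket, region, endpoint and snapshot names are placeholders, and it assumes the cloud-aws plugin is installed on the nodes with instance roles that can reach the bucket:

# Sketch of registering an S3 snapshot repository and kicking off a snapshot.
# Assumes the cloud-aws plugin is installed on the nodes and the instance
# roles grant access to the bucket; all names below are placeholders.
import json
import requests

ES = "http://my-es-elb.example.com:9200"   # the ELB in front of the master nodes

# Register the repository once
requests.put("{}/_snapshot/s3_backup".format(ES), data=json.dumps({
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",       # placeholder bucket
        "region": "ap-southeast-2",        # placeholder region
    },
}))

# Snapshots can then be taken, e.g. from a nightly cron job
requests.put("{}/_snapshot/s3_backup/snapshot-2015-07-08".format(ES),
             params={"wait_for_completion": "false"})
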
The cluster I'm currently running has 2x t2.small for masters, 2x r3.large for "core" nodes and 4x r3.large for spot instances. It's holding 200GB of index and 260 million records, and growing, without breaking a sweat. It has a long, long way to go before it hits its scaling limit of 12x r3.8xlarge, which should be good for upwards of many terabytes of index and billions of rows. Fingers crossed.

Saturday 13 September 2014

The myth about AWS and High Availability

Amazon loves talking about "high availability": failover, health checks, disaster recovery, blah, blah, blah. Well, there's actually an annoying problem they don't tell you about, and that is democracy. Hint: democracy involves the majority.

The problem is: how do you maintain quorum consensus across a cluster that spans only two availability zones when one of the zones goes out? The simple answer is, you can't.

Most clustering and failover technologies rely on having a majority vote from the servers in the cluster before they will continue to function, otherwise known as quorum consensus. This is to prevent a scenario often referred to as split-brain. Clustered data stores that rely on such a mechanism include Sql Server AlwaysOn Availability Groups, MongoDB clusters, ElasticSearch master nodes, and I'm sure there are many more.

If three nodes were to be partitioned across two availability zones, we actually see two outage scenarios: either the majority of the nodes goes offline, or it doesn't.



It doesn't matter how many nodes you place in which availability zone, either. If you split an even number of nodes evenly, the cluster stops functioning when either AZ goes down. If you have an odd number of nodes, you have a 50% chance of staying up should one of the AZs go down. What annoys me a bit is that even the AWS white paper on deploying Sql Server AOAG doesn't mention this issue, and recommends setting up the file share witness on one of the DCs that resides in the same AZ as one of the Sql Server instances.
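
To make the arithmetic concrete, here's a trivial sketch of the majority maths for a few two-AZ layouts (Python; in ElasticSearch terms the quorum below is what discovery.zen.minimum_master_nodes would be set to, and the layouts are just examples):

# Trivial sketch of quorum arithmetic across two availability zones.
# The layouts are example node counts per AZ.
def survives_az_loss(nodes_az_a, nodes_az_b):
    total = nodes_az_a + nodes_az_b
    quorum = total // 2 + 1                   # strict majority of all nodes
    # the cluster keeps working only if the surviving AZ still holds a quorum
    return {
        "lose AZ-A": nodes_az_b >= quorum,
        "lose AZ-B": nodes_az_a >= quorum,
    }

for layout in [(1, 1), (2, 1), (2, 2), (3, 2)]:
    print(layout, survives_az_loss(*layout))

# (1, 1) -> neither AZ loss is survivable
# (2, 1) -> only losing the 1-node AZ is survivable: a coin flip on which AZ dies
# (2, 2) -> neither AZ loss is survivable
# (3, 2) -> again, only losing the minority AZ is survivable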

Obviously not all regions are the same, and some have three availability zones: US East, US West (Oregon), Ireland, and Tokyo. The regions with three AZs can comfortably distribute one node to each AZ. However, that's only 4 out of the 8 regions in total that can truly host an HA data store with automatic failover.

While this makes deploying these data stores with high availability less attractive, it's still very important to do so. Server restarts on AWS are very real, and let's not forget these HA clusters allow sysadmins to service the servers more easily without having to put up the "site down for maintenance" page.

One way to get around this issue is to establish a VPN connection to another data center or on-premise servers and host a lightweight node outside of the AWS infrastructure. Sql Server AOAG has the file share witness and MongoDB has arbiter nodes: nodes in the cluster that do nothing other than cast votes and maintain quorum. It has also been mentioned that AWS VPC will in the future allow VPC peering across different regions, which could also potentially solve this problem.

Or get a piece of wood and knock on it, hard.

Thursday 4 September 2014

Deduping entries with logstash and elasticsearch

There's one annoying little problem with logstash and file / block based ingestion: the whole file gets reprocessed in certain scenarios, causing lines that have already been processed to be processed again. These scenarios include:
  1. S3 files being re-uploaded with additional data appended.
    Logstash processes all files modified after the datetime marked by the last processing run.
  2. Local file system files being copy-pasted over with updates from other locations.
    Files are handled by descriptor, so they can be renamed or moved without affecting logstash tailing the file. However, when they are overwritten, they are considered a completely different file and all the data in the file will be reprocessed.
  3. Losing / deleting the sincedb file used to track progress.
  4. And I'm sure there can be more.
The solution is actually surprisingly simple. Calculate a hash of the event message and use that as the document id for elasticsearch. Here's a sample config:

input {
  # something. anything.
}

filter {
  # copy the raw message into a new field...
  mutate {
    add_field => ["logstash_checksum", "%{message}"]
  }
  # ...then hash that field in place
  anonymize {
    fields => ["logstash_checksum"]
    algorithm => "MD5"
    key => "a"
  }
}

output {
  # use the hash as the document id, so a re-processed line overwrites the
  # existing document instead of creating a duplicate
  elasticsearch {
    host => "127.0.0.1"
    document_id => '%{logstash_checksum}'
  }
}

Note that this works best with events that already contain a timestamp, such as web server access logs from IIS, Apache, etc., load balancer logs, and so on. It would be a bad idea to apply this technique to stream-based log entries that rely on the timestamp assigned at ingestion time by logstash, since two identical messages arriving at different times would hash to the same document id and be collapsed into one.

Tuesday 2 September 2014

Something subtle about logstash s3 input

There are two settings for the s3 input plugin for logstash that went under my radar in terms of their criticality: delete and backup_to_bucket. In some situations, they actually make or break the feasibility of this plugin entirely. So much so that it resulted in me having to significantly rework the architecture I had in mind for log processing.

People who have worked with the s3 input for logstash extensively will have experienced a few major issues with bucket/key prefix combinations that contain a large number of objects:
  1. Extremely long start-up time due to the registration of the s3 plugin within the logstash runtime.
  2. The long latency at which new files in s3 are picked up by logstash for processing.
Both of these problems stem from a quirk of the AWS S3 API: the list objects call only returns 1000 items at a time (see the sketch below). The problem is further amplified by the fact that the logstash s3 input does not deal with this intelligently; it queries everything upfront and then processes it sequentially, rather than processing the first 1000 items before moving on to the next batch.
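
To see the 1000-key limit for yourself, here's a small sketch using boto3 (the bucket and prefix are placeholders):

# Sketch showing why scanning a big drop zone is slow: the S3 list operation
# returns at most 1000 keys per request, so the paginator makes one HTTP
# round trip per page. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects")

pages = 0
keys = 0
for page in paginator.paginate(Bucket="my-log-dropzone", Prefix="iislogs/"):
    pages += 1
    keys += len(page.get("Contents", []))

print("{} keys listed in {} HTTP requests".format(keys, pages))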

The delete and backup_to_bucket settings allow a configuration where logstash collects the files to be processed from a target location and then, via backup_to_bucket, archives them to another location, as opposed to dumping all the data into one S3 location, pointing logstash at it and telling it to poll for updates.

This approach also removes the reliance on the sincedb file the s3 input uses to keep track of where it last ingested up to. Logstash can just crunch everything it finds in the target location. This means the logstash server can now be put in an auto scaling group and thrown away at will, without having to worry about backing up, restoring or losing the sincedb file.

Take my setup as an example. We ship all our logs to an S3 location and then fan out from there into EMR, Redshift, Logstash, etc. With our IIS logs, having only 10 web servers shipping 1 log per site per hour and assuming an average of 2 IIS sites per server, there would be a total of 10 servers x 2 sites x 24 hours x 30 days = 14,400 S3 objects per month. In just 6 months, there'd be around 86k objects in S3. That's 86-odd consecutive HTTP requests just to scan the entire drop zone for changes, which makes it completely unfeasible. Just imagine what's going to happen 3 years down the line.

Quite frankly, I would rather see the logstash s3 input intelligently iterate over and prioritise the yyyy/mm/dd key partitioning that most people use for archiving large numbers of objects in S3, and scan for and ingest the latest entries first (roughly the idea sketched after the list below). The combination of delete and backup_to_bucket, while it solves the problem, introduces new ones:
  1. It means that data would not be able to reach EMR and Redshift until Logstash has finished processing it.
  2. The logstash s3 input plugin code, as of the time of this post, uploads the local copy of the file it downloaded to the backup location, rather than using S3's copy operation. This extra round trip can potentially introduce corruption into the file if anything goes wrong along the way.
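
What I mean by prioritising the date partitioning is, roughly, something like the sketch below (Python with boto3; the bucket name and key layout are assumptions):

# Rough sketch of date-prioritised scanning: list only the prefixes for the
# most recent few days, newest first, instead of walking the whole bucket.
# Assumes a yyyy/mm/dd key layout; bucket and prefix are placeholders.
import datetime

import boto3

s3 = boto3.client("s3")
bucket = "my-log-dropzone"

today = datetime.date.today()
for days_back in range(3):                    # newest day first
    day = today - datetime.timedelta(days=days_back)
    prefix = "iislogs/{:%Y/%m/%d}/".format(day)
    resp = s3.list_objects(Bucket=bucket, Prefix=prefix)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["LastModified"])
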
I doubt my wish will come true anytime soon, though, so I guess delete and backup_to_bucket are the best we have to work with for the time being.