Wednesday 8 July 2015

ElasticSearch on AWS part 2

This is a continuation of my previous blog post on ElasticSearch on AWS. Eight months into running ElasticSearch on AWS with Auto Scaling Groups and Spot Instances, I'd like to share how things went.

The primary use case of our ElasticSearch cluster was to provide operational intelligence. We sent all of our server metrics and logs into it, and it served relatively well as a tool to increase transparency into everything that was going on across all of our environments.

Here is the list of logs we sent through it:

  1. AWS CloudTrail
  2. AWS Billing
  3. AWS ELB access log
  4. IIS log
  5. Windows Performance Counters
  6. Windows Event Log
  7. Wowza / Shoutcast / Icecast access log (a normalised dataset across the three)
  8. NLog .NET application log

Here are some numbers around the cluster:
  1. 1 billion-ish records
  2. 600 GB of index data in total
  3. ~2,000 shards and ~350 indexes (we had 6 shards per index; see the sketch after this list)
  4. 3x t2.small master nodes, 2x r3.large data nodes and 4x r3.large spot instances.
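
As a point of reference, here's a minimal sketch (in Python, and not taken from our actual setup) of how a fixed shard count like the 6 per index above can be pinned with a legacy ElasticSearch index template. The endpoint, template name and index pattern are assumptions for illustration only.

```python
# Hedged sketch: pin shard/replica counts for new log indexes via a legacy
# index template. The endpoint, template name ("logs") and pattern ("logs-*")
# are assumptions, not the values from the cluster described in this post.
import requests

ES_ENDPOINT = "http://localhost:9200"  # assumed cluster address

template = {
    "template": "logs-*",          # hypothetical time-based index pattern
    "settings": {
        "number_of_shards": 6,     # the 6 shards per index mentioned above
        "number_of_replicas": 1,   # assumed replica count
    },
}

response = requests.put(ES_ENDPOINT + "/_template/logs", json=template)
response.raise_for_status()
print(response.json())
```
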
So far, this is some of what we have observed and learned:
  1. Monitor the load on the cluster's instances continuously. Initially all the servers are loaded in roughly the same way, but as servers come and go from the Auto Scaling Groups, shard reallocation doesn't balance things out very well. This creates "hot spots" and causes some servers to be much more heavily loaded than the rest (a shard-allocation check sketch follows this list).
  2. Not all log types have the same data rate. It's best to load test in a staging cluster and find the optimal index pattern (per week or per day) and number of shards every time a new log type enters the system. Sending queries to a shard that has grown too big can lock the cluster up.
  3. Better safe than sorry. Err on the side of shards and indexes being too small rather than too large. Querying many shards that are too small will make the cluster slow; querying shards that are too big (in either index size or document count) will lock the cluster up (a shard-size check sketch also follows this list).
  4. Scale up before scaling out. A few big servers work better than many small servers.
  5. Design a buffering mechanism into log delivery. Kinesis and Kafka are very good candidates for real-time data ingestion. When the cluster got out of control and I had to stop log ingestion, it was very reassuring to know the logs were still being shipped and held somewhere (see the Kinesis sketch below).
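
On point 1, this is roughly the kind of check that helps catch hot spots early: pull per-node shard counts from the _cat/allocation API and flag any node carrying noticeably more shards than average. The endpoint and the 1.5x threshold are assumptions.

```python
# Hedged sketch: flag data nodes holding far more shards than the average,
# which is one symptom of the "hot spots" described above.
import requests

ES_ENDPOINT = "http://localhost:9200"  # assumed cluster address

rows = requests.get(ES_ENDPOINT + "/_cat/allocation?format=json").json()
shard_counts = {
    row["node"]: int(row["shards"])
    for row in rows
    if row["node"] != "UNASSIGNED"   # skip the unassigned-shards row
}
average = sum(shard_counts.values()) / float(len(shard_counts))

for node, shards in sorted(shard_counts.items()):
    flag = "  <-- possible hot spot" if shards > 1.5 * average else ""
    print("%-30s %5d shards%s" % (node, shards, flag))
```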
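
On point 3, a similar sketch for spotting shards that have grown too big, using the _cat/shards API. The 30 GB threshold is an assumption; pick whatever your own load testing says is safe.

```python
# Hedged sketch: list primary shards whose on-disk size exceeds a threshold.
import requests

ES_ENDPOINT = "http://localhost:9200"   # assumed cluster address
MAX_SHARD_GB = 30.0                     # assumed "too big" threshold

shards = requests.get(ES_ENDPOINT + "/_cat/shards?format=json&bytes=gb").json()
for shard in shards:
    if shard["prirep"] != "p" or shard["store"] is None:
        continue  # skip replicas and unassigned shards
    size_gb = float(shard["store"])
    if size_gb > MAX_SHARD_GB:
        print("%s shard %s is %.0f GB - consider more shards or a finer index pattern"
              % (shard["index"], shard["shard"], size_gb))
```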
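
And on point 5, a minimal sketch of the buffering idea using Kinesis via boto3. The stream name, region and event shape are hypothetical; the point is simply that a separate consumer can replay the stream into ElasticSearch whenever the cluster is healthy again.

```python
# Hedged sketch: buffer log events through a Kinesis stream instead of writing
# them straight into ElasticSearch, so indexing can be paused without data loss.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="ap-southeast-2")  # assumed region

def ship_log(event):
    """Push one log event onto the stream; a consumer indexes it into ES later."""
    kinesis.put_record(
        StreamName="log-ingest",              # hypothetical stream name
        Data=json.dumps(event),
        PartitionKey=event.get("host", "unknown"),
    )

ship_log({"host": "web-01", "level": "INFO", "message": "example log line"})
```
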
Oh, and check out the template I used to deploy ElasticSearch (and the AWS Lego project too while you're at it):
https://github.com/SleeperSmith/Aws-Lego/blob/master/src/special-blocks/elasticsearch.template
It's not perfect, but it's a good start (imo). Pull Requests are welcome.

Also, check out Bosun from the team at Stack Exchange, which allows you to create alerts from ElasticSearch (ABOUT TIME!):
https://bosun.org/