Saturday 13 September 2014

The myth about AWS and High Availability

Amazon loves talking about "high availability": failover, health checks, disaster recovery, blah, blah, blah. Well, there's an annoying problem they don't tell you about, and that is democracy. Hint: democracy involves the majority.

The problem is this: how do you maintain quorum consensus across a cluster that spans only two availability zones when one of the zones goes down? The simple answer is, you can't.

Most clustering and failover technologies require a majority vote from the servers in the cluster before the cluster will continue to function, otherwise known as quorum consensus. This is to prevent a scenario often referred to as split-brain, where two partitioned halves of a cluster each keep accepting writes on their own. Clustered data stores that rely on such a mechanism include SQL Server AlwaysOn Availability Groups, MongoDB replica sets, Elasticsearch master nodes, and I'm sure there are many more.
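To make the majority rule concrete, here is a minimal sketch in Python. It assumes every node carries exactly one vote, and the function name has_quorum is made up for illustration; it is not the API of any of the products above.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True when the reachable voters form a strict majority of the cluster."""
    return reachable_voters > total_voters // 2


# A 3-node cluster partitioned 2/1: only the side with 2 nodes may continue.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False

# A 4-node cluster partitioned 2/2: neither side has a majority.
print(has_quorum(2, 4))  # False
```

Because only one side of a partition can ever hold a strict majority, at most one side keeps running, and that is what rules out split-brain.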

If three nodes are partitioned across two availability zones, there are two outage scenarios: either the AZ holding the majority of the nodes goes offline, or the other one does.



It doesn't matter how many nodes you place in which availability zone either. If you split an even number of nodes evenly, they will all stop functioning when either AZ goes down. With any uneven split, which is what you always get with an odd number of nodes, the cluster stays up only if the AZ that fails is the one holding the minority, so you have roughly a 50% chance of staying up should one of the AZs go down. What annoys me a bit is that even the AWS white paper on deploying SQL Server AOAG doesn't mention this issue, and recommends setting up the file share witness on one of the DCs that resides in the same AZ as one of the SQL Server instances.
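The claim can be checked by brute force. The sketch below assumes one vote per node and uses hypothetical helper names (outcomes, some_split_survives_either_outage). It tries every way of splitting a cluster across two AZs and asks whether the survivors still hold a majority after each AZ is lost.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True when the reachable voters form a strict majority of the cluster."""
    return reachable_voters > total_voters // 2


def outcomes(nodes_in_az_a: int, nodes_in_az_b: int) -> tuple:
    """Return (survives_loss_of_az_a, survives_loss_of_az_b)."""
    total = nodes_in_az_a + nodes_in_az_b
    return (has_quorum(nodes_in_az_b, total), has_quorum(nodes_in_az_a, total))


def some_split_survives_either_outage(total_nodes: int) -> bool:
    """Check every two-AZ split of total_nodes for one that survives both outages."""
    return any(all(outcomes(a, total_nodes - a)) for a in range(total_nodes + 1))


print(outcomes(2, 1))  # (False, True)
print(outcomes(2, 2))  # (False, False)
print(outcomes(3, 1))  # (False, True)
print(any(some_split_survives_either_outage(n) for n in range(1, 10)))  # False
```

No split works because the two AZs cannot both hold more than half of the votes at the same time.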

Obviously not all regions are the same, and some have three availability zones: US East, US West (Oregon), Ireland, and Tokyo. In those regions you can comfortably place one node in each AZ. However, that's only 4 of the 8 regions in total that can truly host an HA data store with automatic failover.

While this makes deploying these data stores for high availability less attractive, it's still very important to do so. Server restarts on AWS are very real, and let's not forget that an HA cluster lets sysadmins service the servers more easily without having to put up the "site down for maintenance" page.

One way to get around this issue is to establish a VPN connection to another data center or to on-premises servers and host a lightweight node outside of the AWS infrastructure. SQL Server AOAG has the file share witness and MongoDB has arbiter nodes. These are members of the cluster that do nothing other than cast a vote and maintain quorum. It has also been mentioned that AWS VPC will in the future allow VPC peering across different regions, and that could potentially solve this problem as well.
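Here is a rough sketch of why a third voting location helps, again assuming one vote per node. The site names ("az-a", "az-b", "offsite-witness") and the function survives_any_single_site_loss are hypothetical and only there for illustration.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True when the reachable voters form a strict majority of the cluster."""
    return reachable_voters > total_voters // 2


def survives_any_single_site_loss(voters_per_site: dict) -> bool:
    """True if the cluster keeps quorum no matter which one site disappears."""
    total = sum(voters_per_site.values())
    return all(has_quorum(total - lost, total) for lost in voters_per_site.values())


# Two AZs only: losing the AZ with two nodes takes the cluster down.
print(survives_any_single_site_loss({"az-a": 2, "az-b": 1}))  # False

# One data node per AZ plus a vote-only witness hosted outside the two AZs.
print(survives_any_single_site_loss({"az-a": 1, "az-b": 1, "offsite-witness": 1}))  # True
```

The same check passes for a region with three AZs and one node in each, since what matters is that no single location holds half or more of the votes.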

Or get a piece of wood and knock on it, hard.

12 comments:

  1. 100% right, democracy is over-rated

  2. Would a solution be to have 4 nodes instead of an odd number like 3? Democracy won't work then if half vote A and half vote B

    Replies
    1. Do they have to be A and B? What if they wanted to change their names to, say, C and D? How would that work?

    2. Interesting, I never considered that!

    3. Jon Snow, what are your thoughts on this matter?

    4. SQL Server and MongoDB just shut down completely.

      Less immature technology would just split brain.
