How to automate failover on EC2

amazon-ec2, amazon-elastic-ip, failover, heartbeat, scalr

Of the folks managing their own clusters (i.e. not using/paying for Amazon Autoscale, Rightscale, Scalr, etc.), how are you managing your instances on EC2 and handling (e.g.) failover? I'm wondering if most folks just end up writing their own boatloads of scripts against the EC2 API, as I suspect.

That's certainly our approach: whip up our own Python Boto-based monitoring/restarting daemon that runs off-site, listening for UDP keep-alives from our instances. On failure, we snapshot volumes, register images, start new instances, delete old volumes, and so on.
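For concreteness, here is a minimal sketch of the keep-alive half of such a daemon, not our actual code. It assumes each instance sends a UDP datagram whose payload is simply its instance id; the port number, the timeout, and the `on_failure` callback are all hypothetical placeholders. The Boto calls that snapshot volumes and start replacements would live inside `on_failure`.

```python
import socket
import time


class HeartbeatMonitor:
    """Track the last keep-alive seen from each instance and flag silent ones."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}  # instance id -> clock reading at the last keep-alive

    def record(self, instance_id):
        """Note that a keep-alive from instance_id has just arrived."""
        self.last_seen[instance_id] = self.clock()

    def failed_instances(self):
        """Return the sorted ids of instances silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

    def forget(self, instance_id):
        """Stop tracking an instance, e.g. once its replacement is under way."""
        self.last_seen.pop(instance_id, None)


def serve(monitor, on_failure, host="0.0.0.0", port=9999, poll_seconds=1.0):
    """Receive keep-alive datagrams forever and report instances that go silent.

    The port and the payload format (a bare instance id) are assumptions.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(poll_seconds)  # wake up regularly even when nothing arrives
    while True:
        try:
            payload, _addr = sock.recvfrom(1024)
            monitor.record(payload.decode("ascii", errors="replace").strip())
        except socket.timeout:
            pass
        for instance_id in monitor.failed_instances():
            monitor.forget(instance_id)  # report each failure only once
            on_failure(instance_id)      # hypothetical hook: snapshot, relaunch, ...
```

Keeping the timeout bookkeeping in a class separate from the socket loop means the failure-detection rule can be exercised with a fake clock, while `serve(HeartbeatMonitor(), on_failure=print)` is enough to watch it run.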

Every so often, when hacking on our scripts, I think there must be some open-source tools out there that deal with these issues already, and which don't have the constraints of (say) Scalr, but I always come back from Google empty-handed. (Things like Scalr are pretty limited in the set, versions, and configurations of software they support, and have specialized and IMO cumbersome ways of manipulating these setups.)

Also, the Linux-HA/Pacemaker ecosystem (Heartbeat, ldirectord, etc.) sounds like it isn't really suited for EC2. (But then I found this – though I'm not sure this is really a high-quality solution).

Best Answer

Well, I don't mean to just state the obvious, but the general idea is to push this complexity into the services managed by Amazon.

So on the front end, you would use Amazon Elastic Load Balancing (ELB) to provide highly available load balancing. On the back end, you use Amazon Relational Database Service (hosted MySQL), SimpleDB, and S3 for storage. All of these are managed by Amazon and include some form of high availability / failover handling.

This typically leaves your web application servers, and any less common server types you might be using (rendering servers, self-installed NoSQL data stores, etc.).

Webapp servers are usually handled well enough with the health checks built into ELB. You can accept a small performance degradation when one webapp server is down, or consistently provision one server more than you need. Or, if your config is simple, then when a webapp server fails, Auto Scaling together with ELB and CloudWatch can automatically spawn a new webapp server for you.
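The ELB health check only needs a URL on each webapp server that answers 200 when the server is fit to take traffic. As a rough sketch of what that could look like, assuming a plain Python HTTP service: the `/health` path, port 8080, and the individual checks are hypothetical, and you would point the load balancer's health check at whatever path and port you actually choose.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each named check and return (http_status, json_body).

    A check is any zero-argument callable; a falsy result or an exception
    counts as a failure, and any failure turns the response into a 503.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)


def make_handler(checks, path="/health"):
    """Build a request handler that serves the health report at `path`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            status, body = health_status(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep frequent health probes out of the access log

    return HealthHandler


def run(checks, port=8080):
    """Serve the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

For example, `run({"app": lambda: True})` serves a trivially healthy endpoint. In practice each check would test something the webapp depends on, so that a server which is up but unable to serve requests gets pulled out of rotation.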

Your own custom servers are another matter. For these it's true: you're on your own, and will need to make do with the application's built-in methods, or duct-tape something together with custom scripts / open-source HA tools.

Buying RightScale's solution might be too expensive. But less expensive Amazon tools such as ELB, basic CloudWatch alerting (now free at 5-minute resolution), or Auto Scaling are well worth it if you need high availability.
