EC2 – Do Pacemaker and Heartbeat Make Sense?

amazon-ec2 heartbeat pacemaker

Does the Pacemaker ecosystem (Corosync etc.) make sense in the context of EC2? Until some point, Corosync required IP multicast (not available on EC2), but I think it supports broadcast now. Still, are Pacemaker et al. the right tools for a cluster to manage itself on EC2, e.g. for nodes to monitor each other for failure and trigger the launch of new instances to replace failed ones?

I guess part of the problem is that I've been spending quite a bit of time just straightening out all the players here (Heartbeat, Corosync, OpenAIS, etc.), and I'm still trying to figure out what these things actually are (beyond nebulous terms, e.g. that Pacemaker is a "cluster resource manager" and that Corosync provides "reliable messaging and membership infrastructure").

Hence, apologies if my question itself is a bit bumbling or doesn't completely make sense. Any insights would be greatly appreciated. Thanks.

Best Answer

Does EC2 monitor the health of services inside the guests?

If not, and that is something you want, then Pacemaker would be relevant here. Corosync probably isn't an option yet, as it only does multicast and broadcast, so it would be a Pacemaker + Heartbeat scenario.

Here's a guide to how people do it with Linode instances; much of it is likely to be relevant on EC2 as well: http://library.linode.com/linux-ha/

To answer the question of what the pieces are, Pacemaker is the thing that starts and stops services and contains logic for ensuring both that they're running, and that they're running in only one location (to avoid data corruption).

But it can't do that without the ability to talk to itself on the other node(s), which is where Heartbeat and/or Corosync come in.

Think of Heartbeat and Corosync as a bus that any node can throw messages onto, knowing that they'll be received by all of its peers. The bus also ensures that everyone agrees on who is (and is not) connected to it, and tells you when that list changes.
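To make the bus analogy a bit more concrete, here is a toy, single-process sketch in Python. It is not how Heartbeat or Corosync are implemented, and it uses none of their APIs; the class `ToyClusterBus`, its methods, and the node names are all made up for illustration. It only shows the two things the bus gives you: every message reaches every connected node, and every node is told the same membership list whenever it changes.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


class ToyClusterBus:
    """Hypothetical in-process stand-in for a messaging + membership layer."""

    def __init__(self) -> None:
        # One inbox of (sender, message) pairs per connected node.
        self._inboxes: Dict[str, List[Tuple[str, str]]] = {}
        # One membership-change callback per connected node.
        self._listeners: Dict[str, Callable[[FrozenSet[str]], None]] = {}

    def members(self) -> FrozenSet[str]:
        """The single, shared view of who is connected."""
        return frozenset(self._inboxes)

    def join(self, node: str, on_change: Callable[[FrozenSet[str]], None]) -> None:
        self._inboxes[node] = []
        self._listeners[node] = on_change
        self._notify()

    def leave(self, node: str) -> None:
        # In a real cluster this is also what a detected node failure looks like.
        self._inboxes.pop(node, None)
        self._listeners.pop(node, None)
        self._notify()

    def broadcast(self, sender: str, message: str) -> None:
        if sender not in self._inboxes:
            raise ValueError(f"{sender} is not connected to the bus")
        # Deliver to every connected node (the sender included).
        for inbox in self._inboxes.values():
            inbox.append((sender, message))

    def inbox(self, node: str) -> List[Tuple[str, str]]:
        return list(self._inboxes[node])

    def _notify(self) -> None:
        # Every remaining node receives the same membership view.
        view = self.members()
        for callback in list(self._listeners.values()):
            callback(view)


# Usage: three made-up nodes join, one sends a message, one drops out.
views: Dict[str, FrozenSet[str]] = {}
bus = ToyClusterBus()
for name in ("node-a", "node-b", "node-c"):
    bus.join(name, lambda view, name=name: views.__setitem__(name, view))

bus.broadcast("node-a", "start webserver on node-b")
bus.leave("node-c")
```

After `node-c` leaves, both remaining nodes hold the same membership view, which is the kind of event a resource manager sitting on top of the bus would react to.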

For two nodes, Pacemaker could just as easily use plain sockets, but beyond that the complexity grows quite rapidly and is very hard to get right, so it really makes sense to use existing components that have proven to be reliable.