DRBD / Heartbeat on Virtual Machines

Tags: cluster, drbd, heartbeat, virtualization, vmware-vcenter

Does anyone have experience configuring DRBD with Heartbeat between two virtual Linux machines (VMware Infrastructure)?

The problem I am running into is that Heartbeat likes multiple data paths to see its peer node. For instance, it likes to have a network connection to the peer, maybe one to its gateway, and a serial cable to the peer; this improves the likelihood that when it detects a peer outage, the peer really is down and not just unreachable because of network congestion or something similar.
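
For illustration, this is the sort of multi-path ha.cf a physical pair would use; the node names, interface, serial device and gateway address below are placeholders, not our actual configuration:

    # /etc/ha.d/ha.cf -- illustrative only
    keepalive 2                 # seconds between heartbeats
    warntime 10                 # warn when a heartbeat is this late
    deadtime 30                 # declare the peer dead after this long
    bcast eth0                  # first path: heartbeat over the LAN interface
    serial /dev/ttyS0           # second path: null-modem cable to the peer
    baud 19200
    ping 192.168.1.1            # third opinion: the default gateway
    respawn hacluster /usr/lib/heartbeat/ipfail
    auto_failback off
    node node1 node2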

On a virtual machine, however, the serial port and Ethernet port (and all other ports) are virtual, so really there is only one data path (correct?).

(I know that VMware supports physical serial cables between machines, but our VMs are hosted remotely, and physical cables would prevent host migrations, which is not acceptable.)

In our case we are seeing timeouts between the Heartbeat peers, even though they are running on the same host machine.

How can I configure DRBD / Heartbeat to be more robust when running on virtual machines?

Best Answer

Have you checked whether the VMs complain about dropped interrupts or similar problems? Maybe the host hardware is simply overloaded, or not enough resources are allocated to your VMs.

If it's a flaky or overloaded network, the right thing to do would of course be to fix that; but if your hosting provider is not keen on that, can you use multiple physical paths by attaching multiple bridged networks to different host devices (ideally on different switches)?

Using redundant network paths via 802.3ad (LACP) bonding couldn't hurt in that case, either.
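
As a sketch, a Debian-style bonding stanza along these lines would aggregate two guest interfaces with 802.3ad; the interface names and addresses are placeholders, and LACP only helps if the host/switch side actually negotiates it:

    # /etc/network/interfaces (excerpt) -- illustrative sketch
    auto bond0
    iface bond0 inet static
        address 192.168.10.11
        netmask 255.255.255.0
        bond-slaves eth0 eth1       # the two bridged vNICs
        bond-mode 802.3ad           # LACP aggregation
        bond-miimon 100             # link monitoring interval in ms
        bond-lacp-rate fast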

A commenter on another question mentioned split-brain; that is one thing you want to avoid at all costs. Normally a STONITH script would, for example, switch off the networked PDU outlet feeding the other host so that it is down for sure; in a VM you might try a script that powers the other VM off via the VMware API.
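
As a rough sketch of that last idea, something like the following could be the guts of such a fencing helper, using the pyvmomi Python bindings to hard power off the peer VM. The vCenter host, credentials and VM name are placeholders, and in real use you would want proper certificate handling, error reporting, and to wrap it in whatever STONITH plugin interface your stack expects.

    #!/usr/bin/env python
    # Hypothetical fencing helper: hard power off the peer VM through vCenter.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask

    def power_off_peer(vcenter, user, password, vm_name):
        ctx = ssl._create_unverified_context()   # lab only; verify certificates in production
        si = SmartConnect(host=vcenter, user=user, pwd=password, sslContext=ctx)
        try:
            # Look the peer VM up by the DNS name it is registered with in vCenter.
            vm = si.RetrieveContent().searchIndex.FindByDnsName(None, vm_name, vmSearch=True)
            if vm is None:
                raise RuntimeError("peer VM %s not found" % vm_name)
            if vm.runtime.powerState == "poweredOn":
                # Hard power off (not a guest shutdown): the point is certainty, not grace.
                WaitForTask(vm.PowerOffVM_Task())
        finally:
            Disconnect(si)

    if __name__ == "__main__":
        power_off_peer("vcenter.example.com", "stonith-user", "secret", "peer-node.example.com")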

Finally, maybe DRBD is just not right for your scenario. If you have a SAN, you may want to open the same device on the fabric on both VMs as a raw disk and then run OCFS2 or a similar cluster filesystem on it. Friends of mine have seen OCFS2 run rock-solid on up to four nodes simultaneously, which would free you up to build multi-node clusters with Heartbeat 2 instead of being locked into the two-node failover that DRBD gives you with Heartbeat 1.
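
If you go that route, the O2CB cluster stack is driven by a small plain-text file kept identical on every node; a two-node example (node names and addresses are placeholders, and the real file is picky about whitespace) looks roughly like this, after which mkfs.ocfs2 can create the filesystem with slots for future nodes:

    # /etc/ocfs2/cluster.conf -- same file on every node; illustrative values
    cluster:
        node_count = 2
        name = ocfs2

    node:
        ip_port = 7777
        ip_address = 10.0.0.11
        number = 0
        name = node1
        cluster = ocfs2

    node:
        ip_port = 7777
        ip_address = 10.0.0.12
        number = 1
        name = node2
        cluster = ocfs2

    # then, for example, create the filesystem with room for up to four nodes:
    #   mkfs.ocfs2 -N 4 -L shared01 /dev/sdb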

Caveat emptor: Heartbeat 2 uses XML configuration files. Not everyone (e.g., me) likes that.