Linux – Heartbeat/DRBD failover didn’t work as expected. How can I make failover more robust?

drbd heartbeat linux redhat

I had a scenario where a DRBD/Heartbeat setup had a failed node but did not fail over. The primary node had locked up but didn't go fully down: it was inaccessible via ssh and the NFS mount was unresponsive, yet the host could still be pinged. The desired behavior would have been to detect this and fail over to the secondary node, but since the primary never went completely down (there is a dedicated network connection from server to server), Heartbeat's detection mechanism never registered the failure and therefore didn't fail over.

Has anyone seen this? Is there something I need to configure to get more robust cluster failover? DRBD otherwise seems to work fine (I had to resync after rebooting the old primary), but without reliable failover its use is limited.

  • heartbeat 3.0.4
  • drbd84
  • RHEL 6.1
  • We are not using Pacemaker

nfs03 is the primary server in this setup, and nfs01 is the secondary.

ha.cf

  # Heartbeat logging
logfacility daemon
udpport 694


ucast eth0 192.168.10.47
ucast eth0 192.168.10.42

# Cluster members
node nfs01.openair.com
node nfs03.openair.com

# Heartbeat communication timing.
# Sets the triggers and pulse times for failing over.
keepalive 1
warntime 10
deadtime 30
initdead 120


#fail back automatically
auto_failback on

and here is the haresources file:

nfs03.openair.com   IPaddr::192.168.10.50/255.255.255.0/eth0      drbddisk::data  Filesystem::/dev/drbd0::/data::ext4 nfs nfslock
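For reference, a haresources line is a preferred node followed by an ordered resource list: Heartbeat starts the resources left-to-right on takeover and stops them right-to-left on release. The same line, split field-by-field for readability (a formatting sketch of the line above, not a change to it):

```
# <preferred node>  <resources, started left-to-right, stopped right-to-left>
nfs03.openair.com \
    IPaddr::192.168.10.50/255.255.255.0/eth0 \
    drbddisk::data \
    Filesystem::/dev/drbd0::/data::ext4 \
    nfs nfslock
```

So on failover the secondary takes the cluster IP, promotes the DRBD resource, mounts it, and only then starts nfs and nfslock.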

Best Answer

I guess you will have to implement some external monitoring that checks whether your primary system actually behaves as expected. If any check fails, power the server off (through IPMI/iLO or a switched PDU) so that Heartbeat sees a genuinely dead node and can do its job.

I think you will always find a situation in which it doesn't work as you would expect it to do.
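A minimal sketch of such an external check, run from a third machine via cron. The BMC address `nfs03-ipmi`, the IPMI credentials, and the failure threshold are all made-up placeholders; the idea is just to probe the NFS service itself (which a hung-but-pingable primary will fail) and fence only after several consecutive failures:

```shell
#!/bin/sh
# Hypothetical external health check for the primary NFS node.
# Assumptions: runs on a third box (not a cluster member), and the
# primary has a BMC reachable as "nfs03-ipmi" (placeholder name).

SERVICE_IP=192.168.10.50         # cluster service IP from haresources
BMC_HOST=nfs03-ipmi              # hypothetical IPMI/BMC address
STATE_FILE=/tmp/nfs_check.fails  # consecutive-failure counter
FAILS_BEFORE_FENCE=3             # don't fence on a single blip

check_nfs() {
    # Ask the NFS RPC service to answer over TCP; a locked-up server
    # that still answers ping will typically fail this check.
    rpcinfo -t "$SERVICE_IP" nfs >/dev/null 2>&1
}

decide_action() {
    # $1 = consecutive failures so far; prints "fence" or "wait"
    if [ "$1" -ge "$FAILS_BEFORE_FENCE" ]; then
        echo fence
    else
        echo wait
    fi
}

if [ "${1:-}" = "--run" ]; then
    fails=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
    if check_nfs; then
        echo 0 > "$STATE_FILE"
    else
        fails=$((fails + 1))
        echo "$fails" > "$STATE_FILE"
        if [ "$(decide_action "$fails")" = "fence" ]; then
            # Hard power-off so Heartbeat sees a dead node and the
            # secondary takes over cleanly (credentials are placeholders).
            ipmitool -H "$BMC_HOST" -U admin -P secret chassis power off
        fi
    fi
fi
```

The counter avoids fencing on a transient RPC timeout; only a sustained failure triggers the power-off, after which Heartbeat's deadtime expiry promotes the secondary.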
