ARP failing after heartbeat failover

Tags: arp, heartbeat, lvs, route

I have an LVS-based load balancer that has been working just fine. It runs on two servers using heartbeat to provide failover.

I've added support for a second IP range to the system, but when a failover occurs, the server that takes over cannot ARP any IPs in this second range until I remove and re-add the route for that range.

Here's some more detail on what I see on the active load balancer right after failover:

# arp 

foo1.example.com  ether   00:20:ED:1A:0C:82   C                     eth0
foo2.example.com  ether   00:1E:C9:B0:F6:FE   C                     eth0
bar1.example.com          (incomplete)                              eth0

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
2.2.2.128       *               255.255.255.192 U     0      0        0 eth0
1.1.1.0         *               255.255.255.0   U     0      0        0 eth0
default         1.1.1.1         0.0.0.0         UG    100    0        0 eth0

So I can't ARP bar1.example.com, which is on the 2.2.2.* netblock.

What I've found is that removing and re-adding the route for that netblock fixes the issue:

ip route del 2.2.2.128/26 dev eth0
ip route add 2.2.2.128/26 dev eth0

If I then trigger an ARP lookup by pinging bar1.example.com, the ARP cache shows:

bar1.example.com  ether   00:22:19:51:71:E4   C                     eth0

Does anyone know what's going on here, or know of a way I could get the heartbeat daemon to delete and re-add this route when it performs the takeover?
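
In case it's useful, here's a rough, untested sketch of what I'm imagining: a small heartbeat resource script (the name "fixroute" and the /etc/ha.d/resource.d/ path just follow the usual haresources convention) that heartbeat would call with "start" at takeover, listed after the IP resources in haresources.

#!/bin/sh
# /etc/ha.d/resource.d/fixroute -- hypothetical helper, untested
# haresources entry would be something like:
#   foo1.example.com <virtual-ip> fixroute

case "$1" in
  start)
    # Re-install the connected route for the second range after takeover;
    # this is the manual fix that restores ARP for 2.2.2.128/26
    ip route del 2.2.2.128/26 dev eth0 2>/dev/null
    ip route add 2.2.2.128/26 dev eth0
    ;;
  stop)
    # Nothing to tear down; the route belongs to the interface anyway
    ;;
  status)
    ip route show | grep -q "^2.2.2.128/26" && echo running || echo stopped
    ;;
esac
exit 0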

Best Answer

Sometimes the switch keeps the old ARP mapping around for too long; I've had to use "arping -U" under Linux to send an unsolicited ARP and tell the upstream switch to flush its stale cache entry.
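
Something along these lines, run right after the takeover (the interface and the placeholder address are only illustrative; substitute the address heartbeat has just brought up on the 2.2.2.128/26 range):

arping -U -c 3 -I eth0 $TAKEOVER_IP   # $TAKEOVER_IP = address just taken over

The -U flag sends unsolicited (gratuitous) ARP, so neighbouring devices update their caches immediately instead of waiting for the stale entries to expire.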