Linux – How to fix a bad arp entry

arp cluster linux switch

I'm just guessing that arp is my problem…

I have a Linux DRBD server cluster set up and, due to some power issues, had to unplug the switch that connects the two servers. As a result, both servers became primary and took the same IP address for several seconds. (This caused a split-brain condition, but that's another issue.)

My problem is that now some servers seem to be able to see the shared IP address of the cluster, and some cannot. I am wondering if this could be a situation where some switches/ports are sending the traffic to one server, and some to the other?

And if this IS the problem, how can I resolve it?

  • and… is this done at the switch, or on the server?

Best Answer

If it's really an ARP issue, the problem will be confined to the network device doing the routing (since that's what ARP is for: mapping L3 addresses (IP) to L2 addresses (MAC)) or possibly to the ARP cache of a server sitting in the same IP subnet. It won't involve a switch unless it's an L3 switch.

To address the problem on a Cisco router, you can run the following command to clear the ARP cache and allow it to rebuild (on IOS the full command name is clear arp-cache):

clear arp

To remove the bad ARP entry from a server that may be caching bad information (that is, not the server that can't be reached, but the server that can't do the reaching), you can manually delete the bogus entry from its ARP cache, where <ip-address> is the IP of the server that can't be reached. Note that this same syntax appears to be valid on both Linux and Windows:

arp -d <ip-address>
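Before deleting anything, it can help to confirm that the client really has the wrong MAC cached. The following is a minimal sketch, assuming a Linux client whose kernel exposes the ARP cache in the usual /proc/net/arp column layout (IP address, HW type, Flags, HW address, Mask, Device). The function name cached_mac and the address 192.0.2.10 are hypothetical placeholders, not something from the original setup.

```shell
# Hypothetical helper: print the MAC address cached for one IP.
# Reads /proc/net/arp-style text on stdin; line 1 is the column header.
cached_mac() {
    awk -v ip="$1" 'NR > 1 && $1 == ip { print tolower($4) }'
}

# Usage on the host that cannot reach the cluster IP
# (192.0.2.10 is a placeholder for the shared cluster address):
#   cached_mac 192.0.2.10 < /proc/net/arp
```

If the MAC printed belongs to the node that is no longer primary, the cached entry is stale and deleting it as shown above should let the host relearn the correct one.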

You can also send a gratuitous ARP from the server that can't be reached to cause other hosts on the same IP subnet to update their ARP caches (I have this in my notes, but I admit I haven't used it in a long time. I can't remember whether this lets you skip the steps above, or just shortens the time it takes the other hosts to add an ARP entry after you run the commands above):

arping -q -A -c 1 -I eth0 <ip-address>
arping -q -U -c 1 -I eth0 <ip-address>
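To check whether both cluster nodes are still answering for the shared IP, one option is to run arping without -q from another host on the subnet and look at which MAC addresses reply. The sketch below only post-processes that output. It assumes the iputils arping reply format, where each reply line carries the responder's MAC in square brackets; the function name responding_macs and the address 192.0.2.10 are hypothetical.

```shell
# Hypothetical helper: list the distinct MAC addresses found in arping output.
# Reads arping's output on stdin and prints one lowercase MAC per line.
responding_macs() {
    grep -oE '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' | tr 'A-F' 'a-f' | sort -u
}

# Usage from a third host on the same subnet
# (192.0.2.10 is a placeholder for the shared cluster address):
#   arping -c 3 -I eth0 192.0.2.10 | responding_macs
```

More than one MAC in the result would suggest that both servers are still claiming the address, in which case clearing caches alone is unlikely to help until one node releases the IP.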

All of the above is for an ARP issue, but you specifically mention a switch in your question. If it's a switch that only uses L3 for management, then any data flow problems would have to come from its MAC address table, not the ARP cache. In that case, you could run the following on the switch to purge the dynamically learned entries:

clear mac-address-table dynamic