Linux – ARP requests with odd source IP go unanswered

arp ip linux networking vmware-esxi

I have a server with network connectivity issues that I presume come from problems with ARP handling.

Let's say the network topology is as follows:

  • network 192.168.106.0, netmask 255.255.255.0
  • router at 192.168.106.1
  • "problem server" at 192.168.106.2
  • another server at 192.168.106.3

Now, assume that the "problem server" may be silent on the network for periods long enough for its ARP entry on the router to expire.

When someone from outside this network attempts to connect to the "problem server", all attempts time out. Connections from within the network to the "problem server" succeed.

If the "problem server" itself connects to some address outside the network, the connection succeeds, and after this, connections from outside the network to the "problem server" also succeed for a while. Connections from the "problem server" to "another server" are fine as well.

Looking at ARP traffic after the "problem server" has been silent for a long time, I can see ARP requests on the network for the "problem server" address, but the "tell" address in them is the network address (192.168.106.0) instead of the router address (192.168.106.1). I assume this is the cause of the problem: for some reason the router puts the wrong sender address in its ARP requests.
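To make the "tell" observation concrete: the "tell" address that tcpdump prints is the sender protocol address field of the ARP payload. Below is a minimal Python sketch, assuming a raw 28-byte Ethernet/IPv4 ARP payload as laid out in RFC 826, that decodes the payload and flags a sender IP equal to the network or broadcast address of the subnet. The function names and the sample MAC address are made up for illustration and are not taken from the actual capture.

```python
import ipaddress
import struct

# Ethernet/IPv4 ARP payload (RFC 826), 28 bytes, network byte order:
# htype, ptype, hlen, plen, oper, sender MAC, sender IP, target MAC, target IP
ARP_IPV4 = struct.Struct("!HHBBH6s4s6s4s")


def parse_arp(payload: bytes) -> dict:
    """Decode an ARP payload (the bytes after the 14-byte Ethernet header)."""
    _htype, _ptype, _hlen, _plen, oper, sha, spa, _tha, tpa = ARP_IPV4.unpack(
        payload[:ARP_IPV4.size]
    )
    return {
        "oper": oper,  # 1 = request, 2 = reply
        "sender_mac": ":".join(f"{b:02x}" for b in sha),
        "sender_ip": ipaddress.IPv4Address(spa),  # the "tell" address
        "target_ip": ipaddress.IPv4Address(tpa),  # the "who-has" address
    }


def has_odd_sender(arp: dict, network: ipaddress.IPv4Network) -> bool:
    """True if the "tell" address is the network or broadcast address."""
    return arp["sender_ip"] in (network.network_address, network.broadcast_address)


if __name__ == "__main__":
    net = ipaddress.ip_network("192.168.106.0/24")
    # Hypothetical request: who-has 192.168.106.2 tell 192.168.106.0
    sample = ARP_IPV4.pack(
        1, 0x0800, 6, 4, 1,
        bytes.fromhex("00005e005301"),  # made-up sender MAC
        ipaddress.IPv4Address("192.168.106.0").packed,
        bytes(6),
        ipaddress.IPv4Address("192.168.106.2").packed,
    )
    arp = parse_arp(sample)
    print(arp["target_ip"], "tell", arp["sender_ip"], "odd:", has_odd_sender(arp, net))
```

The same check with the sender set to 192.168.106.1 comes back false, which is the request the router would be expected to send.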

The "another server" remains reachable, but I assume that is because it frequently makes connections to outside the local network, which keeps its ARP entry on the router from expiring.

Any comments / suggestions?

The servers in question are running Linux (CentOS 5.x?) as VMs within VMware ESXi (5.0?) (I'll check/fill in version details once I get back to work on Monday). The router make/model is unknown to me.

Responses to questions, further findings

Apologies for being slow to return to this.

Unfortunately, my visibility into the network side (anything beyond the VMware platform itself) is severely limited.

Judging by the requester MAC address in the ARP request packets from the router, it is a Juniper product.

This is a small network, so consider the topology to be a router, a switch, and a single VMware server hosting several virtual machines.

As for the originator of the odd ARP requests, it pretty much has to be the network gateway: they only appear when I try to connect to the "problem" machine from outside the network, and cease when the attempt times out or is cancelled. A minor oddity is that the MAC address in these requests is not the same as the one seen for the router in the server's ARP table after establishing an outbound connection. However, both the MAC address present in these "odd" requests and the MAC address shown in the server's ARP table have a Juniper-assigned OUI.

One possibly related finding: it seems that Linux won't respond to ARP requests whose "tell" address is the network address, whereas Windows (Vista at least) does. I wasn't able to test this in the actual problem environment, only with my own toys at home.
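For anyone wanting to repeat that kind of test, here is a rough Python sketch of how such a request could be crafted by hand. It is only an assumption about one way to do it, not the method used for the test described above: `build_arp_request` and `send_frame` are hypothetical helper names, the interface name and MAC address are placeholders, and the sending part relies on Linux `AF_PACKET` raw sockets, so it needs root (or CAP_NET_RAW).

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def build_arp_request(src_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build a broadcast Ethernet frame: who-has target_ip tell sender_ip."""
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, src_mac, ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # MAC length, IPv4 address length
        1,        # operation: request
        src_mac,
        socket.inet_aton(sender_ip),  # the "tell" address under test
        b"\x00" * 6,                  # target MAC is unknown in a request
        socket.inet_aton(target_ip),
    )
    return eth_header + arp_payload


def send_frame(iface: str, frame: bytes) -> None:
    """Send a raw Ethernet frame on iface (Linux only, requires root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((iface, 0))
        s.send(frame)


if __name__ == "__main__":
    # Placeholder MAC; replace with the MAC of the sending interface.
    frame = build_arp_request(
        bytes.fromhex("00005e005301"), "192.168.106.0", "192.168.106.2"
    )
    print(len(frame), "byte frame built; call send_frame('eth0', frame) as root to send it")
```

Running tcpdump for ARP traffic on the target while sending this with the sender IP set first to .0 and then to .1 should show whether a reply comes back in each case.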

Also, it looks like I'm not completely alone with this issue; a similar experience can be found here: alpacapowered.wordpress.com

Best Answer

Today brought an interesting change of situation.

Eventually, it boiled down to two things:

  • The Juniper router, or actually a clustered firewall system, had somehow lost configuration synchronisation between the cluster members. As a result, not all parts of the firewall cluster had an up-to-date configuration, and this caused the ARP requests to be wrong (yes, the bad ARP requests did originate from the router/firewall).
  • The management application for the firewall was also misbehaving, trying to push something other than the current, correct configuration to at least part of the firewall cluster.

I don't have the details on what was done to the firewall itself or to the management application, but the end result is that the "tell" address in the ARP requests is now the router IP address (.1 from the original description) instead of the network address (.0).

The Linux server responds to these ("who-has ... tell ... .1") ARP requests just as it should, and inbound connections work just dandy, even long after any trace of the server address has been lost from the router's ARP cache.