Cluster failover and strange gratuitous ARP behavior

arp, cluster, failover, windows-server-2008-r2

I am experiencing a strange Windows 2008 R2 cluster-related issue that is bothering me. I feel that I have come close to identifying the issue, but I still don't fully understand what is happening.

I have a two-node Exchange 2007 cluster running on two 2008 R2 servers. The Exchange cluster application works fine when running on the "primary" cluster node.
The problem occurs when failing the cluster resource over to the secondary node.

When failing the cluster over to the "secondary" node, which is on the same subnet as the "primary", the failover initially works fine and the cluster resource continues to work for a couple of minutes on the new node. That means the receiving node does send out a gratuitous ARP reply that updates the ARP tables on the network. But after some amount of time (typically within 5 minutes) something updates the ARP tables again, because all of a sudden the cluster service no longer answers pings.

So basically I start a ping to the Exchange cluster address while it's running on the "primary" node, and it works just fine. I fail the cluster resource group over to the "secondary" node and lose only one ping, which is acceptable. The cluster resource still answers for some time after the failover, and then all of a sudden the pings start timing out.

This tells me that the ARP table is initially updated by the secondary node, but then something (which I haven't identified yet) wrongfully updates it again, probably with the primary node's MAC.
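To catch the exact moment the entry flips, something like the following could be run on an affected client. This is only a rough Python sketch: the address is the masked placeholder from the layout further down, and it just shells out to the standard Windows ping and arp commands to log the cached MAC alongside whether a ping still gets through.

```python
# Sketch of a client-side watcher: logs the ARP cache entry for the clustered
# address and whether a single ping still succeeds, so the time at which the
# cached MAC changes can be correlated with the pings starting to time out.
import re
import subprocess
import time

CLUSTER_IP = "A.B.6.212"   # placeholder: the clustered application address
last_mac = None

while True:
    # One echo request so the stack uses/refreshes the ARP entry for the address.
    ping = subprocess.run(["ping", "-n", "1", CLUSTER_IP],
                          capture_output=True, text=True)
    reachable = "TTL=" in ping.stdout   # crude "got a reply" heuristic

    # Read the client's ARP cache entry for that address.
    arp = subprocess.run(["arp", "-a", CLUSTER_IP],
                         capture_output=True, text=True)
    match = re.search(r"([0-9a-f]{2}(?:-[0-9a-f]{2}){5})", arp.stdout, re.I)
    mac = match.group(1) if match else None

    stamp = time.strftime("%H:%M:%S")
    if mac != last_mac:
        print(f"{stamp} ARP entry changed: {last_mac} -> {mac} (reachable={reachable})")
        last_mac = mac
    elif not reachable:
        print(f"{stamp} ping timed out, ARP entry still {mac}")
    time.sleep(5)
```

Run from one client on the same subnet and one on another subnet, it should show whether the cached MAC really changes at the point the pings start failing.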

Why does this happen – has anyone experienced the same problem?

The cluster is NOT running NLB, and the problem stops immediately after failing back to the primary node, where there are no problems.

Each node uses Intel NIC teaming with ALB. Both nodes are on the same subnet and have the gateway and so on entered correctly as far as I can tell.

Edit:
I was wondering if it could be related to the network binding order. The only difference I have noticed from node to node is in the local ARP table: on the "primary" node the ARP table is listed with the cluster address as the source interface, while on the "secondary" it is listed from the node's own network card.

Any input on this?
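For comparing the two nodes, `arp -a -N <interface address>` lists the ARP entries attributed to one specific interface. A rough sketch (the addresses are the masked placeholders from the layout below, so adjust them to the node you run this on) could be:

```python
# Sketch: dump the ARP entries per source interface on a cluster node, to see
# whether the table is attributed to the cluster address or to the node's own
# public (teamed) address.
import subprocess

INTERFACE_ADDRESSES = [
    "A.B.6.210",   # the node's own public (teamed) address
    "A.B.6.208",   # the cluster address, if currently owned by this node
]

for addr in INTERFACE_ADDRESSES:
    print(f"--- ARP entries attributed to interface {addr} ---")
    out = subprocess.run(["arp", "-a", "-N", addr],
                         capture_output=True, text=True)
    print(out.stdout.strip() or "(no entries / interface not present)")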

Edit:
Ok here is the connection layout.

Cluster address: A.B.6.208/25
Exchange application address: A.B.6.212/25

Node A:
3 physical NICs.
Two teamed using Intel's teaming software with the address A.B.6.210/25, called "public".
The last one used for cluster traffic, called "private", with 10.0.0.138/24.

Node B:
3 physical NICs.
Two teamed using Intel's teaming software with the address A.B.6.211/25, called "public".
The last one used for cluster traffic, called "private", with 10.0.0.139/24.

Each node sits in a separate datacenter, and the two are connected together. The end switches are Cisco in DC1 and Nexus 5000/2000 in DC2.

Edit:
I have been testing a little more.
I have now created an empty application on the same cluster and given it another IP address on the same subnet as the Exchange application. After failing this empty application over, I see exactly the same problem occurring. After one or two minutes, clients on other subnets cannot ping the virtual IP of the application, yet another server from another cluster on the same subnet has no trouble pinging it. If I then fail back to the original state, the situation is reversed: clients on the same subnet cannot ping it, while clients on other subnets can.
We have another cluster set up the same way on the same subnet, with the same Intel network cards, the same drivers and the same teaming settings. There we are not seeing this, so it's somewhat confusing.

Edit:
OK, I have done some more research. I removed the NIC teaming on the secondary node, since it didn't work anyway. After some standard problems following that, I finally managed to get it up and running again with the old NIC teaming settings on one single physical network card. Now I am not able to reproduce the problem described above. So it is somehow related to the teaming; maybe some kind of bug?

Edit:
I did some more failovers without being able to reproduce the failure, so removing the NIC team looks like it was a workaround. I have now re-established the Intel NIC teaming with ALB (as it was before) and I still cannot make it fail. This is annoying, because now I actually cannot pinpoint the root of the problem. It just seems to be some kind of MS/Intel hiccup, which is hard to accept: what if the problem reoccurs in 14 days? One strange thing did happen, though. After recreating the NIC team I was not able to rename the team to "PUBLIC", which is what the old team was called. So something has not been cleaned up in Windows, even though the server HAS been restarted!

Edit:
OK, after re-establishing the ALB teaming the error came back. So I am now going to do some thorough testing and I will get back with my observations. One thing is for sure: it is related to the Intel 82575EB NICs, ALB and gratuitous ARP.


I am somewhat happy to hear that 🙂 I am now going to find out what causes this by doing intensive testing, and I hope to get back with some results. I have not seen these problems with Broadcom.

@Kyle Brandt: What driver versions do you have on the system where you saw this happen? Please provide both the NIC driver version and the teaming driver version.

I am running 11.7.32.0 and 9.8.17.

I know for a fact that these drivers are VERY old indeed, but as this problem only occurs periodically, it is very hard to tell whether updating the drivers solves the issue. So far I have, for example, tried this action plan:

1. Remove ALB teaming: could not provoke the error.
2. Re-establish ALB teaming: the issue appeared again.
3. Try AFT (Adapter Fault Tolerance): issue gone again.
4. Install the newest drivers and run ALB teaming again (tried with 11.17.27.0): issue gone.
5. Roll the drivers back: this action is still pending, but until now the system works fine.
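To be sure which NIC and teaming driver versions are actually installed at each step of the plan above, a sketch like this (querying Win32_PnPSignedDriver via wmic; the "Intel" filter is only an assumption about how the devices are named) can dump them:

```python
# Sketch: list installed driver versions for Intel-named devices on Windows
# by querying the Win32_PnPSignedDriver WMI class through wmic.
import subprocess

query = [
    "wmic", "path", "Win32_PnPSignedDriver",
    "where", "DeviceName like '%Intel%'",   # drop this filter to list everything,
    "get", "DeviceName,DriverVersion",      # e.g. if the team device is named differently
]
out = subprocess.run(query, capture_output=True, text=True)
print(out.stdout)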

Yet again I find it frustratingly hard to troubleshoot this periodic problem, as I now have no idea which of the above steps solved the issue. Most probably it was installing the new drivers, but I don't know that for a fact right now.

I hope that some of you who are experiencing the same issue can add some notes/ideas/observations so that we can get to the root of this.

Best Answer

I've started to see machines getting incorrect ARP table entries for several SQL Server instances in a failover cluster.

Client servers are alternately populating their ARP tables with the MAC address of the correct NIC team and with the MAC address of one of the physical NICs (not necessarily the one corresponding to that server's NIC team MAC) on a different cluster node.

This is causing intermittent connection failures for clients on the same LAN as the SQL Cluster.

This behavior has been seen from both VM clients and physical boxes.

This occurs after a failover and lasts for days.

In order to mitigate this, I've had to set static ARP entries on the more troublesome clients.
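A sketch of what that looks like, with placeholder interface name, IP and MAC rather than real values, wrapping the netsh neighbor commands:

```python
# Sketch: pin a static ARP/neighbor entry on a troublesome client so it keeps
# resolving the clustered IP to the team MAC of the owning node.
import subprocess

INTERFACE = "Local Area Connection"   # placeholder: the client's LAN interface name
CLUSTER_IP = "10.0.0.50"              # placeholder: the clustered instance IP
TEAM_MAC = "00-1b-21-aa-bb-cc"        # placeholder: MAC of the owning node's NIC team

# Equivalent to running:
#   netsh interface ipv4 add neighbors "Local Area Connection" 10.0.0.50 00-1b-21-aa-bb-cc
subprocess.run(["netsh", "interface", "ipv4", "add", "neighbors",
                INTERFACE, CLUSTER_IP, TEAM_MAC], check=True)

# To undo it after a failback:
#   netsh interface ipv4 delete neighbors "Local Area Connection" 10.0.0.50
```

The entry has to be updated (or deleted) after every failover, since the correct MAC changes with the owning node, so this really is only a stop-gap.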

ENVIRONMENT:

  • Windows 2008 R2 SP1 Servers in a failover cluster
  • SQL Server 2008 R2 Instances
  • Teamed Intel Gigabit NICs
  • HP 28XX switches
  • Virtual Machines hosted on Windows Server 2008 R2 SP1 Hyper-V

The Intel NIC team creates a virtual adapter with the MAC address of one of the physical NICs.
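To see which physical MAC the team has adopted on each node (and to compare it against what clients have cached for the clustered IP), a quick sketch parsing getmac output could look like this:

```python
# Sketch: list connection names, adapter names and MAC addresses on a node by
# parsing "getmac /v /fo csv", so the team's MAC can be matched against the
# physical NICs' MACs and against client-side ARP entries.
import csv
import io
import subprocess

out = subprocess.run(["getmac", "/v", "/fo", "csv"],
                     capture_output=True, text=True)
reader = csv.reader(io.StringIO(out.stdout))
next(reader, None)   # skip the CSV header row
for row in reader:
    if len(row) >= 3:
        connection_name, adapter_name, mac = row[0], row[1], row[2]
        print(f"{connection_name:30} {mac:20} {adapter_name}")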

I have a suspicion that the Intel NIC teaming software is the culprit, but any other troubleshooting thoughts or solutions would be appreciated.

I'm likely going to rebuild the cluster hosts with Server 2012 and use the in-box NIC teaming there, as I have not seen this issue in my testing on that platform.
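For reference, creating an in-box team on Server 2012 is a single PowerShell cmdlet. The sketch below just invokes it from Python with placeholder team and adapter names; it is not the exact configuration I will end up using.

```python
# Sketch only: create a Server 2012 in-box NIC team by calling PowerShell's
# New-NetLbfoTeam cmdlet. Team name and adapter names are placeholders.
import subprocess

ps_command = (
    "New-NetLbfoTeam -Name 'PUBLIC' "
    "-TeamMembers 'Ethernet','Ethernet 2' "
    "-TeamingMode SwitchIndependent -Confirm:$false"
)
subprocess.run(["powershell.exe", "-NoProfile", "-Command", ps_command],
               check=True)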
