Cluster failover and strange gratuitous ARP behavior

arp, cluster, failover, windows-server-2008-r2

I am experiencing a strange Windows 2008 R2 cluster-related issue that is bothering me. I feel that I have come close to identifying the issue, but I still don't fully understand what is happening.

I have a two-node Exchange 2007 cluster running on two 2008 R2 servers. The Exchange cluster application works fine when running on the "primary" cluster node.
The problem occurs when failing the cluster resource over to the secondary node.

When failing the cluster over to the "secondary" node, which is on the same subnet as the "primary", the failover initially works fine and the cluster resource continues to work for a couple of minutes on the new node. That means the receiving node does send out a gratuitous ARP reply that updates the ARP tables on the network. But after some amount of time (typically within 5 minutes) something updates the ARP tables again, because all of a sudden the cluster service no longer answers pings.

So basically I start a ping to the Exchange cluster address while it's running on the "primary" node, and it works just fine. I fail the cluster resource group over to the "secondary" node and lose only one ping, which is acceptable. The cluster resource still answers for some time after the failover, and then all of a sudden the pings start timing out.

This tells me that the ARP table is initially updated by the secondary node, but then something (which I haven't identified yet) wrongfully updates it again, probably with the primary node's MAC.
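To catch the exact moment the entry flips, something like the following could be run on an affected client. This is only a rough Python sketch: the address is the masked placeholder from the layout further down, and it just shells out to the standard Windows ping and arp commands to log the cached MAC alongside whether a ping still gets through.

```python
# Sketch of a client-side watcher: logs the ARP cache entry for the clustered
# address and whether a single ping still succeeds, so the time at which the
# cached MAC changes can be correlated with the pings starting to time out.
import re
import subprocess
import time

CLUSTER_IP = "A.B.6.212"   # placeholder: the clustered application address
last_mac = None

while True:
    # One echo request so the stack uses/refreshes the ARP entry for the address.
    ping = subprocess.run(["ping", "-n", "1", CLUSTER_IP],
                          capture_output=True, text=True)
    reachable = "TTL=" in ping.stdout   # crude "got a reply" heuristic

    # Read the client's ARP cache entry for that address.
    arp = subprocess.run(["arp", "-a", CLUSTER_IP],
                         capture_output=True, text=True)
    match = re.search(r"([0-9a-f]{2}(?:-[0-9a-f]{2}){5})", arp.stdout, re.I)
    mac = match.group(1) if match else None

    stamp = time.strftime("%H:%M:%S")
    if mac != last_mac:
        print(f"{stamp} ARP entry changed: {last_mac} -> {mac} (reachable={reachable})")
        last_mac = mac
    elif not reachable:
        print(f"{stamp} ping timed out, ARP entry still {mac}")
    time.sleep(5)
```

Run from one client on the same subnet and one on another subnet, it should show whether the cached MAC really changes at the point the pings start failing.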

Why does this happen – has anyone experienced the same problem?

The cluster is NOT running NLB, and the problem stops immediately after failing back to the primary node, where there are no problems.

Each node uses Intel NIC teaming with ALB. Both nodes are on the same subnet and have the gateway and so on entered correctly as far as I can tell.

Edit:
I was wondering if it could be related to the network binding order. The only difference I have noticed from node to node is in the local ARP table: on the "primary" node the ARP table is listed with the cluster address as the source interface, while on the "secondary" it is listed from the node's own network card.

Any input on this?
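For comparing the two nodes, `arp -a -N <interface address>` lists the ARP entries attributed to one specific interface. A rough sketch (the addresses are the masked placeholders from the layout below, so adjust them to the node you run this on) could be:

```python
# Sketch: dump the ARP entries per source interface on a cluster node, to see
# whether the table is attributed to the cluster address or to the node's own
# public (teamed) address.
import subprocess

INTERFACE_ADDRESSES = [
    "A.B.6.210",   # the node's own public (teamed) address
    "A.B.6.208",   # the cluster address, if currently owned by this node
]

for addr in INTERFACE_ADDRESSES:
    print(f"--- ARP entries attributed to interface {addr} ---")
    out = subprocess.run(["arp", "-a", "-N", addr],
                         capture_output=True, text=True)
    print(out.stdout.strip() or "(no entries / interface not present)")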

Edit:
Ok here is the connection layout.

Cluster address: A.B.6.208/25
Exchange application address: A.B.6.212/25

Node A:
3 physical NICs.
Two teamed using Intel's teaming software with the address A.B.6.210/25, called "public".
The last one used for cluster traffic, called "private", with 10.0.0.138/24.

Node B:
3 physical NICs.
Two teamed using Intel's teaming software with the address A.B.6.211/25, called "public".
The last one used for cluster traffic, called "private", with 10.0.0.139/24.

Each node sits in a separate datacenter, and the two are connected together. The end switches are Cisco in DC1 and Nexus 5000/2000 in DC2.

Edit:
I have been testing a little more.
I have now created an empty application on the same cluster and given it another IP address on the same subnet as the Exchange application. After failing this empty application over, I see exactly the same problem occurring. After one or two minutes, clients on other subnets cannot ping the virtual IP of the application, yet another server from another cluster on the same subnet has no trouble pinging it. If I then fail back to the original state, the situation is reversed: clients on the same subnet cannot ping it, while clients on other subnets can.
We have another cluster set up the same way on the same subnet, with the same Intel network cards, the same drivers and the same teaming settings. There we are not seeing this, so it's somewhat confusing.

Edit:
OK, I have done some more research. I removed the NIC teaming on the secondary node, since it didn't work anyway. After some standard problems following that, I finally managed to get it up and running again with the old NIC teaming settings on one single physical network card. Now I am not able to reproduce the problem described above. So it is somehow related to the teaming; maybe some kind of bug?

Edit:
I did some more failovers without being able to reproduce the failure, so removing the NIC team looks like it was a workaround. I have now re-established the Intel NIC teaming with ALB (as it was before) and I still cannot make it fail. This is annoying, because now I actually cannot pinpoint the root of the problem. It just seems to be some kind of MS/Intel hiccup, which is hard to accept: what if the problem reoccurs in 14 days? One strange thing did happen, though. After recreating the NIC team I was not able to rename the team to "PUBLIC", which is what the old team was called. So something has not been cleaned up in Windows, even though the server HAS been restarted!

Edit:
OK, after re-establishing the ALB teaming the error came back. So I am now going to do some thorough testing and I will get back with my observations. One thing is for sure: it is related to the Intel 82575EB NICs, ALB and gratuitous ARP.


I am somewhat happy to hear that 🙂 I am now going to find out what causes this by doing intensive testing, and I hope to get back with some results. I have not seen these problems with Broadcom.

@Kyle Brandt: What driver versions do you have on the system where you saw this happen? Please provide both the NIC driver version and the teaming driver version.

I am running 11.7.32.0 and 9.8.17.

I know for a fact that these drivers are VERY old indeed, but as this problem only occurs periodically, it is very hard to tell whether updating the drivers solves the issue. So far I have, for example, tried this action plan:

1. Remove ALB teaming: could not provoke the error.
2. Re-establish ALB teaming: the issue appeared again.
3. Try AFT (Adapter Fault Tolerance): issue gone again.
4. Install the newest drivers and run ALB teaming again (tried with 11.17.27.0): issue gone.
5. Roll the drivers back: this action is still pending, but until now the system works fine.
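To be sure which NIC and teaming driver versions are actually installed at each step of the plan above, a sketch like this (querying Win32_PnPSignedDriver via wmic; the "Intel" filter is only an assumption about how the devices are named) can dump them:

```python
# Sketch: list installed driver versions for Intel-named devices on Windows
# by querying the Win32_PnPSignedDriver WMI class through wmic.
import subprocess

query = [
    "wmic", "path", "Win32_PnPSignedDriver",
    "where", "DeviceName like '%Intel%'",   # drop this filter to list everything,
    "get", "DeviceName,DriverVersion",      # e.g. if the team device is named differently
]
out = subprocess.run(query, capture_output=True, text=True)
print(out.stdout)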

Yet again I find it frustratingly hard to troubleshoot this periodic problem, as I now have no idea which of the above steps solved the issue. Most probably it was installing the new drivers, but I don't know that for a fact right now.

I hope that some of you who are experiencing the same issue can add some notes/ideas/observations so that we can get to the root of this.

Best Answer

I've started to see machines getting incorrect ARP table entries for several SQL Server instances in a failover cluster.

Client servers are alternately populating their ARP tables with the MAC address of the correct NIC team and with the MAC address of one of the physical NICs (not necessarily the one corresponding to that server's NIC team MAC) on a different cluster node.

This is causing intermittent connection failures for clients on the same LAN as the SQL Cluster.

This behavior has been seen from both VM clients and physical boxes.

This occurs after a failover and lasts for days.

In order to mitigate this, I've had to set static ARP entries on the more troublesome clients.
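A sketch of what that looks like, with placeholder interface name, IP and MAC rather than real values, wrapping the netsh neighbor commands:

```python
# Sketch: pin a static ARP/neighbor entry on a troublesome client so it keeps
# resolving the clustered IP to the team MAC of the owning node.
import subprocess

INTERFACE = "Local Area Connection"   # placeholder: the client's LAN interface name
CLUSTER_IP = "10.0.0.50"              # placeholder: the clustered instance IP
TEAM_MAC = "00-1b-21-aa-bb-cc"        # placeholder: MAC of the owning node's NIC team

# Equivalent to running:
#   netsh interface ipv4 add neighbors "Local Area Connection" 10.0.0.50 00-1b-21-aa-bb-cc
subprocess.run(["netsh", "interface", "ipv4", "add", "neighbors",
                INTERFACE, CLUSTER_IP, TEAM_MAC], check=True)

# To undo it after a failback:
#   netsh interface ipv4 delete neighbors "Local Area Connection" 10.0.0.50
```

The entry has to be updated (or deleted) after every failover, since the correct MAC changes with the owning node, so this really is only a stop-gap.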

ENVIRONMENT:

  • Windows 2008 R2 SP1 Servers in a failover cluster
  • SQL Server 2008 R2 Instances
  • Teamed Intel Gigabit NICs
  • HP 28XX switches
  • Virtual Machines hosted on Windows Server 2008 R2 SP1 Hyper-V

The Intel NIC team creates a virtual adapter with the MAC address of one of the physical NICs.
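To see which physical MAC the team has adopted on each node (and to compare it against what clients have cached for the clustered IP), a quick sketch parsing getmac output could look like this:

```python
# Sketch: list connection names, adapter names and MAC addresses on a node by
# parsing "getmac /v /fo csv", so the team's MAC can be matched against the
# physical NICs' MACs and against client-side ARP entries.
import csv
import io
import subprocess

out = subprocess.run(["getmac", "/v", "/fo", "csv"],
                     capture_output=True, text=True)
reader = csv.reader(io.StringIO(out.stdout))
next(reader, None)   # skip the CSV header row
for row in reader:
    if len(row) >= 3:
        connection_name, adapter_name, mac = row[0], row[1], row[2]
        print(f"{connection_name:30} {mac:20} {adapter_name}")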

I have a suspicion that the Intel NIC teaming software is the culprit, but any other troubleshooting thoughts or solutions would be appreciated.

I'm likely going to rebuild the cluster hosts with Server 2012 and use the in-box NIC teaming there, as I have not seen this issue in my testing on that platform.
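For reference, creating an in-box team on Server 2012 is a single PowerShell cmdlet. The sketch below just invokes it from Python with placeholder team and adapter names; it is not the exact configuration I will end up using.

```python
# Sketch only: create a Server 2012 in-box NIC team by calling PowerShell's
# New-NetLbfoTeam cmdlet. Team name and adapter names are placeholders.
import subprocess

ps_command = (
    "New-NetLbfoTeam -Name 'PUBLIC' "
    "-TeamMembers 'Ethernet','Ethernet 2' "
    "-TeamingMode SwitchIndependent -Confirm:$false"
)
subprocess.run(["powershell.exe", "-NoProfile", "-Command", ps_command],
               check=True)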
