Windows Server 2008 R2 network adapter stops working, requires hard reboot

Tags: broadcom, networking, windows-server-2008-r2

TL;DR version: Turns out this was a deep Broadcom networking bug in Windows Server 2008 R2. Replacing with Intel hardware fixed it. We don't use Broadcom hardware any more. Ever.

We have been using HAProxy along with heartbeat from the Linux-HA project. We are using two Linux instances to provide failover. Each server has its own public IP, and a single IP is shared between the two using a virtual interface (eth1:1) at 69.59.196.211.

The virtual interface (eth1:1) IP 69.59.196.211 is configured as the gateway for the Windows servers behind them, and we use ip_forwarding to route traffic.

We are experiencing an occasional network outage on one of our Windows servers behind our Linux gateways. HAProxy detects that the server is offline, which we can verify by remoting to the failed server and attempting to ping the gateway:

Pinging 69.59.196.211 with 32 bytes of data:
Reply from 69.59.196.220: Destination host unreachable.

Running arp -a on this failed server shows that there is no entry for the gateway address (69.59.196.211):

Interface: 69.59.196.220 --- 0xa
Internet Address      Physical Address      Type
69.59.196.161         00-26-88-63-c7-80     dynamic
69.59.196.210         00-15-5d-0a-3e-0e     dynamic
69.59.196.212         00-21-5e-4d-45-c9     dynamic
69.59.196.213         00-15-5d-00-b2-0d     dynamic
69.59.196.215         00-21-5e-4d-61-1a     dynamic
69.59.196.217         00-21-5e-4d-2c-e8     dynamic
69.59.196.219         00-21-5e-4d-38-e5     dynamic
69.59.196.221         00-15-5d-00-b2-0d     dynamic
69.59.196.222         00-15-5d-0a-3e-09     dynamic
69.59.196.223         ff-ff-ff-ff-ff-ff     static
224.0.0.22            01-00-5e-00-00-16     static
224.0.0.252           01-00-5e-00-00-fc     static
225.0.0.1             01-00-5e-00-00-01     static

On our Linux gateway instances, arp -a shows:

peak-colo-196-220.peak.org (69.59.196.220) at <incomplete> on eth1
stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1
peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1
peak-colo-196-222.peak.org (69.59.196.222) at 00:15:5d:0a:3e:09 [ether] on eth1
peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1
peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1

Why would arp occasionally set the entry for this failed server as <incomplete>? Should we be defining our arp entries statically? I've always left arp alone since it works 99% of the time, but in this one instance it appears to be failing. Are there any additional troubleshooting steps we can take to help resolve this issue?
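To catch this state as it happens rather than after HAProxy marks the server down, the gateway's arp -a output can be scanned for <incomplete> entries. The following is a minimal Python sketch, assuming the net-tools output format shown above; the function name incomplete_neighbors is made up for illustration.

```python
import re
from typing import List

# Matches one line of net-tools `arp -a` output, for example:
#   host.example (192.0.2.1) at 00:11:22:33:44:55 [ether] on eth1
#   host.example (192.0.2.2) at <incomplete> on eth1
ARP_LINE = re.compile(
    r"^(?P<name>\S+) \((?P<ip>[\d.]+)\) at (?P<hw>\S+) .*\bon (?P<iface>\S+)$"
)


def incomplete_neighbors(arp_output: str) -> List[str]:
    """Return the IPs whose hardware address is listed as <incomplete>."""
    ips = []
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line.strip())
        if match and match.group("hw") == "<incomplete>":
            ips.append(match.group("ip"))
    return ips
```

Fed the gateway listing above, this would report 69.59.196.220. In practice the text could come from running arp -a through subprocess on a schedule and logging a timestamp whenever the list is non-empty.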

THINGS WE HAVE TRIED

I added a static arp entry for testing on one of the Linux gateways, which still didn't help.

root@haproxy2:~# arp -a
peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1
peak-colo-196-221.peak.org (69.59.196.221) at 00:15:5d:00:b2:0d [ether] on eth1
stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1
peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1
peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1
peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1
peak-colo-196-220.peak.org (69.59.196.220) at 00:21:5e:4d:30:8d [ether] PERM on eth1

root@haproxy2:~# arp -i eth1 -s 69.59.196.220 00:21:5e:4d:30:8d
root@haproxy2:~# ping 69.59.196.220
PING 69.59.196.220 (69.59.196.220) 56(84) bytes of data.
--- 69.59.196.220 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6006ms

Rebooting the Windows web server solves this issue temporarily, with no other changes to the network, but our experience shows the issue will come back.

Swapping network cards and switches

I noticed that the link light on the switch port for the failed Windows server indicated 100Mb instead of 1Gb. I moved the cable to several other open ports, and the link indicated 100Mb on each port I tried. I also swapped the cable, with the same result. I tried changing the properties of the network card in Windows, and the server locked up and required a hard reset after clicking Apply. This Windows server has two physical network interfaces, so I have swapped the cables and network settings on the two interfaces to see if the problem follows the interface. If the public interface goes down again, we will know that it is not an issue with the network card.

(We also tried another switch we had on hand; no change.)

Changing network hardware driver versions

We've had the same problem with the latest Broadcom driver, as well as the built-in driver that ships in Windows Server 2008 R2.

Replacing network cables

As a last-ditch effort, we remembered that another change had occurred: the replacement of all of the patch cords between our servers and switch. We had purchased two sets: one of green cables, 1ft – 3ft long, for the private interfaces, and another of red cables for the public interfaces. We swapped out all of the public interface patch cables with a different brand and ran our servers without issue for a full week … aaaaaand then the problem recurred.

Disable checksum offload, remove TProxy

We also tried disabling TCP/IP checksum offload in the driver; no change. We're now pulling out TProxy and moving to a more traditional X-Forwarded-For network arrangement without any fancy IP address rewriting. We'll see if that helps.

Switch Virtualization providers

On the off chance this was related to Hyper-V in some way (we do host Linux VMs on it), we switched to VMware Server. No change.

Switch host model

We've reached the end of our troubleshooting rope and are now formally involving Microsoft support. They recommended changing the host model:

We did that, and we also got some unpublished kernel hotfixes which were presumably rolled into 2008 R2 SP1. No fix.

Replacing network card hardware

Ultimately, replacing the Broadcom network hardware with Intel network hardware fixed this issue for us. So I am inclined to think that the Broadcom Windows Server 2008 R2 drivers are at fault!

http://blog.serverfault.com/post/broadcom-die-mutha/

Best Answer

From http://linux-ip.net/html/ether-arp.html:

If no ARP cache entry exists for a requested destination IP, the kernel will generate mcast_solicit ARP requests until receiving an answer. During this discovery period, the ARP cache entry will be listed in an incomplete state. If the lookup does not succeed after the specified number of ARP requests, the ARP cache entry will be listed in a failed state. If the lookup does succeed, the kernel enters the response into the ARP cache and resets the confirmation and update timers.

It looks like your Windows server is not responding (or is responding too slowly) to ARP requests from your gateway box. Does that <incomplete> eventually switch to <failed>? What network hardware do you have between the server and the gateway? Is it possible that broadcast ARP requests are being filtered or blocked somewhere between the two hosts?
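The lookup sequence quoted above can be written down as a toy model. This is a simplified Python sketch, not the kernel's actual neighbour code: the function name neighbor_state is made up, the default of 3 for mcast_solicit is the usual Linux sysctl default, and "reachable" stands in for "the response was entered into the cache".

```python
def neighbor_state(replies, mcast_solicit=3):
    """Model one ARP lookup.

    replies: iterable of booleans, one per ARP request sent so far;
             True means an answer arrived for that request.
    Returns "reachable", "incomplete" (still probing) or "failed".
    """
    sent = 0
    for answered in replies:
        if sent >= mcast_solicit:
            break  # the kernel stops probing after mcast_solicit requests
        sent += 1
        if answered:
            return "reachable"
    # No answer yet: failed once all probes are used up, otherwise still probing.
    return "failed" if sent >= mcast_solicit else "incomplete"
```

Under this model, an entry that shows as incomplete every time you look is one that is being probed repeatedly and never answered, which is consistent with the target host not replying to ARP at all.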

Related Solutions

High Instances of Zero Window Messages

The question is somewhat aged already. I am not sure if it is still unresolved, but I will offer some troubleshooting advice nonetheless.

First of all, it is important to check where the zero-window announcements occur. At certain points in the protocol exchange they may be perfectly valid: the web server may simply not expect any data to come back at a given moment and may have set the receive buffer to 0 for a given socket, or the receive buffer may have filled up because nothing has been fetched from it for a while. Debugging this would require knowledge of the protocol (or, better yet, of the implementations) used.

You should not need to tune any of the TCP parameters for a common LAN setup; TCP is mainly self-tuning except in extreme cases such as networks with variable latencies or unpredictable packet loss.
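For reference, a zero-window announcement is simply a segment whose 16-bit window field is 0. Here is a minimal Python sketch that reads that field from a raw TCP header; the function names are made up, and ignoring SYN, FIN and RST segments is an assumption borrowed from how packet analyzers commonly flag zero windows.

```python
import struct

FIN, SYN, RST = 0x01, 0x02, 0x04  # TCP flag bits in header byte 13


def tcp_window(tcp_header: bytes) -> int:
    """Return the raw (unscaled) window field of a TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # The window field is a big-endian 16-bit value at byte offset 14.
    (window,) = struct.unpack_from("!H", tcp_header, 14)
    return window


def is_zero_window(tcp_header: bytes) -> bool:
    """True if the segment advertises a zero receive window.

    Assumption: SYN, FIN and RST segments are skipped, since a zero
    window there does not indicate a stalled receiver.
    """
    flags = tcp_header[13]
    return tcp_window(tcp_header) == 0 and not (flags & (FIN | SYN | RST))
```

Because a window of 0 stays 0 under any window-scale factor, the raw field is enough for this check. Noting which side of the connection sends these segments, and at what point in the exchange, is the first step in deciding whether they are expected.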

What network loads require NIC polling vs interrupts

Great question that had me doing some reading to try and figure it out. Wish I could say I have an answer... but maybe some hints.

I can at least answer your question, "should it be able to live on single interrupt per packet". I think the answer is yes, based on a very busy firewall that I have access to:

Sar output:

03:04:53 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
03:04:54 PM        lo     93.00     93.00      6.12      6.12      0.00      0.00      0.00
03:04:54 PM      eth0 115263.00 134750.00  13280.63  41633.46      0.00      0.00      5.00
03:04:54 PM      eth8  70329.00  55480.00  20132.62   6314.51      0.00      0.00      0.00
03:04:54 PM      eth9  53907.00  66669.00   5820.42  21123.55      0.00      0.00      0.00
03:04:54 PM     eth10      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM     eth11      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth2 146520.00 111904.00  45228.32  12251.48      0.00      0.00     10.00
03:04:54 PM      eth3    252.00  23446.00     21.34   4667.20      0.00      0.00      0.00
03:04:54 PM      eth4      8.00     10.00      0.68      0.76      0.00      0.00      0.00
03:04:54 PM      eth5      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth6   3929.00   2088.00   1368.01    183.79      0.00      0.00      1.00
03:04:54 PM      eth7     13.00     17.00      1.42      1.19      0.00      0.00      0.00
03:04:54 PM     bond0 169170.00 201419.00  19101.04  62757.00      0.00      0.00      5.00
03:04:54 PM     bond1 216849.00 167384.00  65360.94  18565.99      0.00      0.00     10.00

As you can see, some very high packet per second counts, and no special ethtool tweaking was done on this machine. Oh... Intel chipset, though. :\

The only thing that was done was some manual irq balancing with /proc/irq/XXX/smp_affinity, on a per-interface basis. I'm not sure why they chose to go that way instead of with irqbalance, but it seems to work.
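For anyone who hasn't done this by hand: smp_affinity takes a hexadecimal bitmask in which bit N selects CPU N. A small Python sketch for building the mask follows; the helper name affinity_mask is made up, and limiting it to CPUs 0-31 is a simplification to stay within a single 32-bit mask word.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0-31")
        mask |= 1 << cpu  # bit N set means CPU N may service the IRQ
    return format(mask, "x")
```

For example, affinity_mask([0, 1]) gives "3" and affinity_mask([2]) gives "4". The resulting string is what would be written, as root, to /proc/irq/XXX/smp_affinity for the interface's IRQ.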

I also thought about the math required to answer your question, but I think there are way too many variables. So... to summarise, in my opinion, the answer is no, I don't think you can predict the outcomes here, but with enough data capture you should be able to tweak it to a better level.

Having said all that, my gut feel is that you're somehow hardware-bound here... as in a firmware or interop bug of some kind.
