We rent 10 Windows servers from a large provider in Europe. They all worked as expected until one server suddenly crashed and customer care found that both disks were broken. They replaced the disks and also the memory (they ran a test and found a faulty DIMM).
After this, the server was reinstalled from scratch with Windows Server 2008 R2.
Now we're experiencing random TCP socket disconnections in our software (it's server software that handles ~200 real-time TCP connections). We ran a lot of tests, but we cannot reproduce the problem, which appears to be totally random.
Sometimes VNC, SSH and RDP connections also drop, so it's not related to our software.
I reinstalled Windows 2008 again and, as the first thing, downloaded Firefox… the download stopped due to a disconnection.
It definitely seems to be a hardware issue.
All the other servers are running the same OS on the same hardware. We have never had problems with them.
How can I reproduce this issue to show the provider that there is a hardware problem? Is there a specific Windows-based test suite for network problems?
I'm open to suggestions.
Update 1
A Wireshark capture shows that the server suddenly sends a RST,ACK packet.
There are a lot of retransmissions, and a few packets before the RST there is a packet like this:
[TCP ACKed unseen segment] https > 60226 [ACK] Seq=42906 Ack=79 Win=253 Len=0 SLE=27 SRE=53 443 60226
Update 2
The NIC is a Realtek PCIe GBE Family Controller. The driver is rt64win7.sys from Realtek, version 7.065.1025.2012. It's the one supplied with the preinstalled server for every customer. It works correctly with the very same driver on the other servers.
Update 3
I installed the latest version of Wireshark. I ran wget -m --limit-rate 1000 somesite
to generate some requests with TCP traffic.
There are a lot of warnings in the Wireshark capture: Window Full, ZeroWindow. I tried different sites with wget and the warnings always appear. Could this be our problem?
11 0.569100000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP segment of a reassembled PDU]
12 0.569356000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP Window Full] [TCP segment of a reassembled PDU]
13 0.569376000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 58572 > http [ACK] Seq=266 Ack=1828 Win=1000 Len=0
14 0.655205000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP segment of a reassembled PDU]
15 0.655443000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP Window Full] [TCP segment of a reassembled PDU]
16 0.655457000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 58572 > http [ACK] Seq=266 Ack=2828 Win=1000 Len=0
17 0.741237000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP segment of a reassembled PDU]
18 0.741498000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP Window Full] [TCP segment of a reassembled PDU]
19 0.741516000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 [TCP ZeroWindow] 58572 > http [ACK] Seq=266 Ack=3828 Win=0 Len=0
20 1.060906000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 [TCP Window Update] 58572 > http [ACK] Seq=266 Ack=3828 Win=1000 Len=0
21 1.146737000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP segment of a reassembled PDU]
22 1.146993000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 554 [TCP Window Full] [TCP segment of a reassembled PDU]
23 1.147007000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 [TCP ZeroWindow] 58572 > http [ACK] Seq=266 Ack=4828 Win=0 Len=0
24 1.634966000 xx.xx.xxx.216 xx.xxx.xxx.80 TCP 60 [TCP Keep-Alive] http > 58572 [ACK] Seq=4827 Ack=266 Win=15616 Len=0
25 1.634981000 xx.xxx.xxx.80 xx.xx.xxx.216 TCP 54 [TCP ZeroWindow] 58572 > http [ACK] Seq=266 Ack=4828 Win=0 Len=0
Update 4
I spotted some weird behaviour in a Wireshark capture.
44176 1183.719018000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP segment of a reassembled PDU]
44177 1183.724259000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 78 [TCP Dup ACK 44174#1] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130437 TSecr=8617727 SLE=869588 SRE=871028
44178 1183.724297000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP segment of a reassembled PDU]
44179 1183.725337000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 78 [TCP Dup ACK 44174#2] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130437 TSecr=8617727 SLE=869588 SRE=872468
44180 1183.725353000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP segment of a reassembled PDU]
44181 1183.753811000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#3] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130445 TSecr=8617727 SLE=873908 SRE=875348 SLE=869588 SRE=872468
44182 1183.753838000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP Fast Retransmission] [TCP segment of a reassembled PDU]
44183 1183.758173000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#4] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130447 TSecr=8617727 SLE=873908 SRE=876788 SLE=869588 SRE=872468
44184 1183.768334000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#5] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130449 TSecr=8617727 SLE=873908 SRE=878228 SLE=869588 SRE=872468
44185 1183.770232000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#6] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130449 TSecr=8617727 SLE=873908 SRE=879668 SLE=869588 SRE=872468
44186 1183.773544000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#7] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130450 TSecr=8617727 SLE=873908 SRE=881108 SLE=869588 SRE=872468
44187 1183.784085000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#8] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130452 TSecr=8617727 SLE=873908 SRE=882548 SLE=869588 SRE=872468
44188 1183.784097000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 101 [TCP Retransmission] [TCP segment of a reassembled PDU]
44189 1183.789043000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 86 [TCP Dup ACK 44174#9] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130453 TSecr=8617727 SLE=873908 SRE=883988 SLE=869588 SRE=872468
44190 1183.789056000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1471 [TCP Retransmission] [TCP segment of a reassembled PDU]
44191 1183.793926000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 94 [TCP Dup ACK 44174#10] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130454 TSecr=8617727 SLE=895508 SRE=896948 SLE=873908 SRE=883988 SLE=869588 SRE=872468
44192 1183.793939000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP segment of a reassembled PDU]
44193 1183.800204000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 94 [TCP Dup ACK 44174#11] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130457 TSecr=8617727 SLE=895508 SRE=898388 SLE=873908 SRE=883988 SLE=869588 SRE=872468
44194 1183.800217000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP segment of a reassembled PDU]
44195 1183.803615000 xx.x.xxx.88 xx.xxx.xxx.80 TCP 94 [TCP Dup ACK 44174#12] 57852 > http [ACK] Seq=94 Ack=868148 Win=175872 Len=0 TSval=190130457 TSecr=8617727 SLE=895508 SRE=899828 SLE=873908 SRE=883988 SLE=869588 SRE=872468
44196 1183.803640000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP Retransmission] [TCP segment of a reassembled PDU]
44197 1183.803654000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP Retransmission] [TCP segment of a reassembled PDU]
44198 1183.803660000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP Retransmission] [TCP segment of a reassembled PDU]
44199 1183.803665000 xx.xxx.xxx.80 xx.x.xxx.88 TCP 1506 [TCP Retransmission] [TCP segment of a reassembled PDU]
... followed by a lot more of these.
The errors last for about 1 second. I had 2 connections open and neither of them dropped; the retransmissions saved the connections.
I will go to the provider and show them all these errors. I hope to convince them.
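To turn captures like the ones above into numbers for the provider, the bracketed Wireshark notes could be tallied with a short script. This is only a sketch: it assumes the capture was exported as plain-text summary lines in the same format as shown above, and the function name tally_tcp_flags and the sample lines are illustrative, not part of Wireshark.

```python
import re
from collections import Counter

# Matches Wireshark's bracketed TCP analysis notes, e.g. "[TCP Dup ACK 44174#3]".
# The optional " <frame>#<n>" suffix is dropped so all Dup ACKs share one key.
FLAG_RE = re.compile(r"\[TCP ([A-Za-z][A-Za-z\- ]*?)(?: \d+#\d+)?\]")

# Notes that describe normal reassembly rather than a problem.
IGNORED = {"segment of a reassembled PDU"}


def tally_tcp_flags(lines):
    """Count TCP analysis notes per flag name across exported summary lines."""
    counts = Counter()
    for line in lines:
        for name in FLAG_RE.findall(line):
            if name not in IGNORED:
                counts[name] += 1
    return counts


# Illustrative input: a few summary lines shortened from the captures above.
sample = [
    "44181 1183.753811000 TCP 86 [TCP Dup ACK 44174#3] 57852 > http [ACK] Seq=94",
    "44182 1183.753838000 TCP 1506 [TCP Fast Retransmission] [TCP segment of a reassembled PDU]",
    "44188 1183.784097000 TCP 101 [TCP Retransmission] [TCP segment of a reassembled PDU]",
    "19 0.741516000 TCP 54 [TCP ZeroWindow] 58572 > http [ACK] Seq=266 Ack=3828 Win=0 Len=0",
]

for name, count in tally_tcp_flags(sample).most_common():
    print(f"{name}: {count}")
```

Running the same tally on a capture from one of the healthy servers under the same wget load would give a baseline to compare against.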
Best Answer
There are two things I would do. First, use the Realtek diagnostics to check for a hardware issue.
Second, set up performance counter logging on the interface. Specifically, look at the following counters:
Network Interface\Packets Outbound Errors
Network Interface\Packets Received Errors
TCPv4\Connection Failures
TCPv4\Connections Reset
This should help record the issue you're seeing, and the logs can be given to your host as evidence of the problems you're having. Also, set up the same performance counter logging on some healthy servers for comparison.
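As a rough sketch of how that logging could be scripted, the counters above can be collected with the typeperf command-line tool that ships with Windows. The helper names (build_typeperf_command, log_counters), the 5-second interval, the sample count and the output file name are assumptions made for illustration; adjust them to your setup.

```python
import subprocess

# Counter paths from the list above; "(*)" selects every network interface instance.
COUNTERS = [
    r"\Network Interface(*)\Packets Outbound Errors",
    r"\Network Interface(*)\Packets Received Errors",
    r"\TCPv4\Connection Failures",
    r"\TCPv4\Connections Reset",
]


def build_typeperf_command(out_csv, interval_s=5, samples=720):
    """Return the typeperf argument list that logs COUNTERS to a CSV file."""
    return [
        "typeperf", *COUNTERS,
        "-si", str(interval_s),  # seconds between samples
        "-sc", str(samples),     # number of samples to collect
        "-f", "CSV",             # output format
        "-o", out_csv,           # output file
        "-y",                    # answer yes to prompts (e.g. overwrite)
    ]


def log_counters(out_csv, interval_s=5, samples=720):
    """Run typeperf (Windows only) and block until all samples are written."""
    subprocess.run(build_typeperf_command(out_csv, interval_s, samples), check=True)
```

Calling log_counters("suspect-server.csv") on the problem machine and the same function with a different file name on a healthy one gives two CSV files covering the same period (one hour with the defaults above) that can be compared side by side.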