Windows – Tons of TCP connections in TIME_WAIT state on Windows Server 2008 running on Amazon AWS

amazon ec2, tomcat6, windows, windows-server-2008

OS: Windows Server 2008 SP2 (running on Amazon EC2).

Running a web app using Apache httpd and Tomcat server 6.02; the web server has keep-alive configured.

There are around 69,250 (HTTP port 80) + 15,000 (ports other than 80) TCP connections in the TIME_WAIT state (checked with netstat and TCPView). These connections don't seem to close even after stopping the web server (I waited 24 hours).
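For anyone who wants to reproduce counts like these, here is a minimal Python sketch that tallies TCP states from `netstat -an` output. It assumes the Windows column layout (`Proto  Local Address  Foreign Address  State`); the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP rows per state in Windows-style `netstat -an` output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows look like: TCP <local addr:port> <remote addr:port> <state>
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


def time_wait_by_local_port(netstat_text, state="TIME_WAIT"):
    """Count rows in the given state, grouped by local port."""
    ports = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].upper() == "TCP" and fields[3] == state:
            # rsplit copes with IPv6 addresses such as [::1]:80
            ports[fields[1].rsplit(":", 1)[1]] += 1
    return ports


def run_netstat():
    """Return the text of `netstat -an` (requires netstat on the PATH)."""
    return subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    ).stdout
```

Calling `count_tcp_states(run_netstat())` on the affected server should give a per-state total, and `time_wait_by_local_port(run_netstat())` splits the TIME_WAIT entries by local port (for example port 80 versus everything else).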

Performance monitor counters:

  • TCPv4 Active Connections: 145K
  • TCPv4 Passive Connections: 475K
  • TCPv4 Failure Connections: 16K
  • TCPv4 Connections Reset: 23K

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters does not have a TcpTimedWaitDelay value, so the default should apply (2*MSL, 4 minutes).
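The registry check can also be scripted. The sketch below uses Python's standard `winreg` module (Windows only) to read TcpTimedWaitDelay and falls back to the 4-minute default mentioned above when the value is absent. The helper names and the 240-second constant are assumptions for illustration, not something taken from Microsoft documentation.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

TCPIP_PARAMETERS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
DEFAULT_TIME_WAIT_SECONDS = 240  # assumption: the 4-minute (2*MSL) default cited above


def read_tcp_timed_wait_delay():
    """Return the configured TcpTimedWaitDelay in seconds, or None if it is not set."""
    if winreg is None:
        return None  # not running on Windows, so there is no registry to read
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMETERS) as key:
            value, _value_type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            return int(value)
    except FileNotFoundError:
        return None  # key or value is missing, so the default applies


def effective_time_wait_seconds(configured):
    """Use the configured delay if there is one, otherwise the assumed default."""
    return DEFAULT_TIME_WAIT_SECONDS if configured is None else configured
```

On the server described here, `read_tcp_timed_wait_delay()` would be expected to return `None`, so `effective_time_wait_seconds(None)` gives the assumed 240 seconds.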

Even if thousands of connection requests arrive at the same time, why isn't Windows able to clean them up eventually?
What could be the reasons behind this situation?
Is there any way to forcefully close all these TIME_WAIT connections without restarting Windows?

After a few days the app stops accepting any new connections.

Best Answer

We've been dealing with this issue too. It looks like Amazon found the root cause and corrected it. Here is the info they gave me.

Hi, I am pasting below an explanation of what was causing this issue. Good news is that this has been fixed very recently by our engineering team. To get the fix, all you'll have to do is STOP/START the Windows Server 2008 instances where you are seeing this issue. Again, I am not talking about REBOOT which is different. STOP/START causes the instance to move to a different (healthy) host. When these instances launch again, they will be running on hosts that have the fix in place so they won't have this issue again.

Now below is the engineering explanation of this issue.

After an in-depth investigation, we've found that when running Windows 2008 x64 on most available instance types, we've identified an issue which may result in TCP connections that remain in TIME_WAIT/CLOSE_WAIT for excessively long periods of time (in some cases, remaining in this state indefinitely). While in these states, the particular socket pairs remain unusable and if enough accumulate, will result in port exhaustion for the ports in question. If this particular circumstance occurs, the only solution to clear the socket pairs in question is to reboot the instance in question.

We have determined the cause to be the values produced by a timer function in Windows 2008 kernel API which, on many of our 64-bit platforms, will occasionally retrieve a value that is extremely far in the future. This affects the TCP stack by causing the timestamps on the TCP socket pairs to be stamped significantly far in the future. According to Microsoft, there is a stored cumulative counter which will not be updated unless the value produced by this API call is larger than the cumulative value. The ultimate result is that sockets created after this point will all be stamped too far in the future until that future time is reached. In some cases, we have seen this value several hundred days into the future, thus the socket pairs appear to be stuck forever.
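To make the mechanism in that explanation easier to picture, here is a toy Python model of it. This is only a sketch of the behaviour as described, not the actual Windows kernel logic: the `CumulativeClock` class, the 240-second TIME_WAIT period and the 200-day jump are all assumptions chosen for illustration.

```python
DAY = 86400              # seconds per day
TIME_WAIT_SECONDS = 240  # assumption: the 4-minute (2*MSL) default from the question


class CumulativeClock:
    """Toy counter that only moves forward when a reading exceeds the stored value."""

    def __init__(self, start=0):
        self.value = start

    def update(self, reading):
        # Readings at or below the stored value are ignored, as described above.
        if reading > self.value:
            self.value = reading
        return self.value


def time_wait_expiry(clock, reading):
    """Stamp a socket entering TIME_WAIT and return the time it would be released."""
    return clock.update(reading) + TIME_WAIT_SECONDS


clock = CumulativeClock()

# Normal case: the socket is released 240 seconds after it is stamped.
normal_expiry = time_wait_expiry(clock, 1000)

# One bad timer reading, 200 days in the future, poisons the stored counter.
clock.update(1010 + 200 * DAY)

# Real time is now 1020, but the stamp comes from the stored (future) value.
stuck_expiry = time_wait_expiry(clock, 1020)
remaining_days = (stuck_expiry - 1020) // DAY

print("normal socket released at t =", normal_expiry)
print("socket stamped after the bad reading waits about", remaining_days, "days")
```

In this model every socket stamped after the bad reading inherits the future value, which matches the reported symptom of TIME_WAIT entries that never seem to expire until real time catches up with the counter.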