Tcp – Troubleshooting “TCP Zero Window” Issues

packet-analysis tcp troubleshooting wireshark

I'm currently having a problem troubleshooting a trading application. Let me give a simple diagram of the current network setup

(Gov't Stock Exchange Network Router)X-->(256kbps Leased Line)<--X(Telco Router)<--(100mbps Fast Ethernet link)-->Our Network Devices (5 Switches, 1 Firewall)<-->Trading Server

Our users report slowness between roughly 9:30 and 9:45am. I checked the CPU, memory, response time and link utilization of all our network devices and interfaces, and all of them report normal levels.

Part of the trading process is the communication between the Stock Exchange Network and our Trading Servers so if there is any slowness on that 256kbps leased line link, surely it would contribute to the slowness. Unfortunately, the telco router is not being monitored by the Telco and we're still asking for permission if we can add their device to our Solarwinds.

So the closest link I could look at is the 100mbps link from our switch going to the leased line router on our side.

When the traders are experiencing 3ms to 5ms latency in trading, that link shows this:

Transmit: 1500bps – 1900bps

Receive: 2000bps – 2400bps

Bytes Transferred per Minute: 44KB-60KB

Wireshark reports no problems at this time.

The period from 9:34 to 9:37 each day deserves a special note, because that is when they experience 10ms to 15ms latency in trading:

Transmit: 1900bps – 2400bps

Receive: 2400bps – 3200bps

Bytes Transferred per Minute: 90KB – 170KB

Wireshark reports TCP Zero Window errors (the trade server sending the zero window alert to the stock exchange server), but they only last for a few milliseconds and only happen two or three times a day.

And there was even one incident when our traders were experiencing crazy delays of 1 to 3 minutes in trading:

Transmit: 4000bps

Receive: 5600bps

Wireshark reported TCP Zero Window errors (the trade server sending the zero window alert to the stock exchange server) for the whole trading period of that day. This only happened once and, to this day, I have not been able to resolve the issue.

The Trading Server team reports that their CPU, Memory and NIC utilization is normal and of course, everyone is blaming the network guys.

So here are my questions:

0.) Is there something wrong with the way I am troubleshooting this problem? I figured I should write this as question no. 0, haha.

1.) When TCP Zero Window happens, what things and devices should I check? Because server team reports that the Memory and NIC utilization of their trading server is normal.

2.) Is there a way to graph the transmit/receive bps and bytes received in Wireshark? What I currently do is go to Statistics -> Conversation -> IPv4, check "Limit Display to this Filter", use the filter ip.addr eq X.X.X.X and ip.addr eq Y.Y.Y.Y and (frame.time ge "DATE HH:MM:SS.000000000" and frame.time le "DATE HH:MM:SS.999999999"), and then look at the bps and bytes received.

3.) Are there other things I could look at or check(network devices, etc.)?

Thanks a lot for all your help guys! 🙂

Best Answer

Well. Since you've captured packets showing that your Trading Server is sending TCP ACKs with a window size of 0, you at least know the problem is almost certainly on your side: a zero window means the receiver's buffer is full, i.e. the server is not draining incoming data as fast as it arrives. That is actually a good thing, because you are in a position to fix it. (There is one thing that might be a problem on their end; I'll talk about that later.)
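
To make it concrete what Wireshark is flagging, here is a minimal sketch in Python of the same check on a raw TCP header. In Wireshark itself the display filter tcp.analysis.zero_window finds these segments; the function and helper names below are my own and only illustrative, and the sketch assumes you already have the TCP header bytes in hand.

```python
import struct

# TCP flag bits, stored in byte 13 of the TCP header.
FIN, SYN, RST = 0x01, 0x02, 0x04


def is_zero_window(tcp_header: bytes) -> bool:
    """Return True if a raw TCP header advertises a receive window of 0.

    SYN, FIN and RST segments are skipped, because a zero window on
    those is not a flow-control signal.
    """
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    flags = tcp_header[13]
    # The 16-bit window field sits at byte offsets 14-15, network byte order.
    (window,) = struct.unpack("!H", tcp_header[14:16])
    return window == 0 and not flags & (FIN | SYN | RST)


def build_header(window: int, flags: int = 0x10) -> bytes:
    """Hypothetical helper: build a minimal 20-byte TCP header for testing.

    Fields: src port, dst port, seq, ack, data offset, flags, window,
    checksum, urgent pointer. The default flags value 0x10 is a plain ACK.
    """
    return struct.pack("!HHIIBBHHH", 50000, 443, 1, 1, 5 << 4, flags, window, 0, 0)


if __name__ == "__main__":
    print(is_zero_window(build_header(0)))      # plain ACK, window 0
    print(is_zero_window(build_header(8192)))   # plain ACK, window 8192
```

Note that this looks at the raw window field only; the window scale option does not matter here, since 0 scaled by anything is still 0.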

You've also traced the issue to happening during times of increased throughput, also a good thing.

You said the CPU/RAM usage on your Trading Server was reported as normal. Is the application you are using by chance configured to use a limited amount of RAM on the host OS? Maybe a limited percentage? If so, it would stand to reason that as you had more connections and more throughput, there was less RAM available to the application, and therefore fewer resources available for TCP.

Either way, what OS is your Trading Server using? If you haven't already, you should look into tuning the OS to dedicate more RAM to TCP. In Windows, there are Registry values you can modify. In Linux, there are config files you can edit.
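
Before changing any OS settings, it can help to see what receive buffer the OS hands to a TCP socket, since the advertised window is derived from the free space in that buffer. The following is a rough Python sketch, not a tuning recipe: it inspects a fresh, unconnected socket, not the trading application's own sockets, which may set their own buffer sizes. The function names are made up for illustration.

```python
import socket


def default_receive_buffer() -> int:
    """Return the receive buffer size (bytes) the OS gives a new TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def request_receive_buffer(size: int) -> int:
    """Ask for a receive buffer of `size` bytes and return what the OS granted.

    The OS may round, double, or cap the request, so the returned value
    can differ from the requested one.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("default SO_RCVBUF:", default_receive_buffer())
    print("after requesting 1 MiB:", request_receive_buffer(1 << 20))
```

If the value granted after a large request is much smaller than what was asked for, the OS-level limits are the thing to look at.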

It would also be wise to make sure that neither your firewall nor anything else in between is trying to proxy your TCP sessions. That way you know you are dealing with the full "client to server" TCP connection, and not something in between.

The last thing I can offer is to study the TCP packets being sent from the Stock Exchange to your server just before your server sends a Window Size of 0. In particular, look for incoming packets with the value 11 in the IP header's ECN field (Explicit Congestion Notification: the last two bits of what used to be the ToS byte, following the six DSCP bits, which makes them bits 14 and 15 of the IP header counting from 0). If both the client and the server support ECN and a router in transit detected congestion, there is a chance that it set these bits to tell the client and server to slow down their transfers. (This is the thing I mentioned earlier that might be a problem on their end.)
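
As a sketch of where those two bits live, the Python below pulls the ECN field out of a raw IPv4 header. In Wireshark the display filter ip.dsfield.ecn == 3 should select the same packets. The function name and the sample headers are illustrative only; the code points follow RFC 3168.

```python
# ECN code points as defined in RFC 3168.
ECN_NAMES = {
    0b00: "Not-ECT (sender does not use ECN)",
    0b01: "ECT(1) (ECN-capable transport)",
    0b10: "ECT(0) (ECN-capable transport)",
    0b11: "CE (congestion experienced)",
}


def ecn_bits(ipv4_header: bytes) -> int:
    """Return the 2-bit ECN field from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("IPv4 header must be at least 20 bytes")
    if ipv4_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    # Byte 1 holds DSCP (upper 6 bits) and ECN (lower 2 bits).
    return ipv4_header[1] & 0b11


if __name__ == "__main__":
    # Made-up sample headers: version/IHL 0x45, then the DSCP/ECN byte.
    marked = bytes([0x45, 0x03]) + bytes(18)
    unmarked = bytes([0x45, 0xB8]) + bytes(18)
    print(ECN_NAMES[ecn_bits(marked)])
    print(ECN_NAMES[ecn_bits(unmarked)])
```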

I think that (tries to) answer questions 0,1,3. I'll have to dig around a bit more to give you a reliable answer for 2. But I'm pretty confident there is a way.
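
For question 2, Wireshark's Statistics -> IO Graph should be able to plot bits or bytes per interval for a display filter, which is probably the quickest route. If you would rather compute the numbers yourself, the following Python sketch buckets packets into one-second bins. It assumes you have already exported (timestamp, frame length) pairs from the capture, for example the frame.time_epoch and frame.len fields; the function name and sample data are made up.

```python
from collections import defaultdict


def bits_per_second(packets):
    """Bucket (timestamp_seconds, frame_length_bytes) pairs into 1-second bins.

    Returns a dict mapping each whole second to the number of bits
    seen during that second, ordered by time.
    """
    buckets = defaultdict(int)
    for timestamp, length in packets:
        buckets[int(timestamp)] += length * 8
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    # Made-up sample: three frames across two seconds.
    sample = [(0.1, 100), (0.9, 200), (1.5, 50)]
    for second, bits in bits_per_second(sample).items():
        print(second, bits)
```

Running it once per direction (filtering on source address before exporting) gives separate transmit and receive series.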