Linux clients can’t connect to server: TCP Window Size / Timestamps issue

linux-networking tcp timestamp

We have a problem where a number of clients (all running Ubuntu Linux) are sometimes unable to connect to a remote server over SSH. While the problem is occurring, Windows clients are not affected and can connect just fine.

I found this other question with a similar problem:
Why would a server not send a SYN/ACK packet in response to a SYN packet

Disabling TCP Timestamping on the server does indeed solve the problem, but I would like to know what the real problem is. I don't really see why this should cause any problems, definitely not when establishing the connection.
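For anyone wanting to check the setting before changing it: on Linux the option is exposed as the sysctl `net.ipv4.tcp_timestamps`. A minimal sketch for reading it, assuming the usual `/proc/sys` location (the helper name `tcp_timestamps_enabled` is just something I made up for illustration):

```python
from pathlib import Path

# Usual location of the Linux sysctl net.ipv4.tcp_timestamps (assumed here).
TCP_TIMESTAMPS_PATH = "/proc/sys/net/ipv4/tcp_timestamps"


def tcp_timestamps_enabled(path=TCP_TIMESTAMPS_PATH):
    """Return True if the sysctl file at `path` holds a non-zero value."""
    value = int(Path(path).read_text().strip())
    return value != 0


if __name__ == "__main__":
    try:
        print("TCP timestamps enabled:", tcp_timestamps_enabled())
    except (OSError, ValueError) as exc:
        # Not on Linux, or the file is not readable in this environment.
        print("could not read sysctl:", exc)
```

A value of 0 means timestamps are disabled; any non-zero value means they are enabled.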

When using Wireshark, I see that the Windows clients use a window size of 8192 whereas the Linux clients use a window size of 29200. The Windows clients receive a SYN/ACK, the Linux clients don't. Is it possible that this higher initial window size is the reason the server does not send the SYN/ACK? I can't come up with a sensible explanation as to why it would cause the problem, but since it's the only difference visible to me, it does look that way. Am I missing something?

*** EDIT
After more searching, thinking and some voodoo magic, I think I might have come up with a plausible explanation. It does take some assumptions and specific conditions to be in place, but I do believe that these might just be possible in this particular situation.

Both users are behind the same NAT device (in our case, a Fortigate firewall). This firewall will assign local ports on its external interface/IP to each NAT'ed connection. If the port is already in use by another user, it is skipped. If the connection is closed, the port is released and returned to the NAT pool. If that port is then assigned to the other user, but the server still has some record of the old connection (TIME_WAIT, final FIN/ACK not received) and the timestamp of the new packet is lower than that of the previous connection, the packet will be silently discarded.

OK, there are a lot of ifs in there, but…
– the two users are developing on the same website so they will be making a lot of connections to the same remote server
– the firewall (Fortigate) apparently keeps a sequential counter of the NAT port per source IP/destination IP/destination port. If the counters of both users are close to each other, the chance of such a "collision" happening with two connections to that server is not that small, given that both destination IP and port are the same. That would explain why the problem only occurs sporadically.
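To make the theory above concrete, here is a toy model of it. This is only a sketch of my hypothesis, not how any real TCP stack is implemented: the `ToyServer` class, the port number 40001 and the timestamp values are all made up, and 203.0.113.10 is a documentation address standing in for the firewall's external IP. A SYN without a timestamp (`tsval=None`) skips the check entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Syn:
    src_ip: str            # external IP of the NAT device
    src_port: int          # port picked by the NAT device
    tsval: Optional[int]   # TCP timestamp value, None if the option is absent


class ToyServer:
    """Hypothetical server that remembers the last timestamp per (IP, port)."""

    def __init__(self):
        self.last_tsval = {}

    def handle_syn(self, syn):
        key = (syn.src_ip, syn.src_port)
        previous = self.last_tsval.get(key)
        if syn.tsval is None:
            # No timestamp option: nothing to compare, so the SYN is answered.
            return "syn-ack"
        if previous is not None and syn.tsval < previous:
            # Timestamp appears to go backwards: silently discard.
            return "dropped"
        self.last_tsval[key] = syn.tsval
        return "syn-ack"


def run_scenario():
    server = ToyServer()
    nat_ip = "203.0.113.10"
    results = []
    # User A (large timestamp clock) connects through NAT port 40001.
    results.append(server.handle_syn(Syn(nat_ip, 40001, 900_000)))
    # The port is released and handed to user B, whose clock is much lower.
    results.append(server.handle_syn(Syn(nat_ip, 40001, 1_200)))
    # A client that sends no timestamps reuses the same port.
    results.append(server.handle_syn(Syn(nat_ip, 40001, None)))
    return results


if __name__ == "__main__":
    print(run_scenario())
```

In this model the second SYN is dropped purely because of the reused NAT port and the lower timestamp, while a client that sends no timestamps at all is never affected.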

The only problem with this theory is that I can't find any evidence of this happening on the server side. There are no connections stuck in TIME_WAIT or anything like that, and I do assume that once they disappear from the netstat output, the server has forgotten about them.

I do believe that the initial window size does not play a role in this, so I am striking that one off the list of suspects.

Best Answer

So if the Windows clients don't have the problem, my guess is they are not requesting TCP timestamps while the Linux ones are. You can verify this by checking the TCP options of the SYN packets in the Wireshark captures from both clients.
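If you would rather check this programmatically than by eye, the following is a rough sketch that pulls the window size and the timestamp option (option kind 8) out of a raw TCP header. The function names are mine, and `build_syn` only fabricates sample headers for demonstration; getting the real header bytes out of your capture is left to you.

```python
import struct

TCPOPT_EOL, TCPOPT_NOP, TCPOPT_TIMESTAMP = 0, 1, 8


def parse_tcp_syn(header: bytes):
    """Return the window size and timestamp value of a raw TCP header."""
    if len(header) < 20:
        raise ValueError("TCP header is at least 20 bytes")
    window = struct.unpack_from("!H", header, 14)[0]
    data_offset = (header[12] >> 4) * 4   # header length in bytes
    options = header[20:data_offset]
    tsval = None
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:            # single-byte padding option
            i += 1
            continue
        if i + 1 >= len(options):
            break
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                         # malformed option, stop parsing
        if kind == TCPOPT_TIMESTAMP and length == 10:
            tsval, _tsecr = struct.unpack_from("!II", options, i + 2)
        i += length
    return {"window": window, "tsval": tsval}


def build_syn(window, options=b""):
    """Build a sample SYN header (demo helper, made-up ports 40001 -> 22)."""
    padded = options + b"\x00" * (-len(options) % 4)
    offset_words = 5 + len(padded) // 4
    fixed = struct.pack("!HHIIBBHHH", 40001, 22, 0, 0,
                        offset_words << 4, 0x02, window, 0, 0)
    return fixed + padded


if __name__ == "__main__":
    ts_option = bytes([TCPOPT_TIMESTAMP, 10]) + struct.pack("!II", 1200, 0)
    print(parse_tcp_syn(build_syn(29200, ts_option)))  # Linux-like SYN
    print(parse_tcp_syn(build_syn(8192)))              # SYN without timestamps
```

A `tsval` of `None` means the SYN did not carry the timestamp option at all.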

To start troubleshooting the underlying cause of the timestamp issue, the first order of business would be to make sure client and server are synchronized to NTP servers. If they just have a free running clock, it could very well be the cause of the issue. For example:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
========================================================================
*utcnist2.colora .ACTS.           1 u   92 1024  377   50.242    2.041   1.847
+time-c.timefreq .ACTS.           1 u  623 1024  377   55.413   -1.781   0.418

Make sure at least one has the asterisk in front. That means it's in sync. In any case, it is strange to see the TCP session stall at the very beginning. One would expect it to stall after a few packets with timestamp values have been exchanged, more precisely when the timestamp value of one packet appears to go backwards in time relative to the previous packet.
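To illustrate what "backwards in time" means here: TCP timestamp values are 32-bit counters that wrap around, so the comparison has to be done modulo 2^32 rather than with a plain `<`. A small sketch of such a check, with function names I chose for illustration (not taken from any real stack):

```python
def ts_is_older(tsval: int, ts_recent: int) -> bool:
    """True if tsval is 'before' ts_recent in 32-bit modular arithmetic."""
    diff = (tsval - ts_recent) & 0xFFFFFFFF
    # A difference in the upper half of the range means tsval is behind.
    return diff >= 0x80000000


def first_backwards_packet(tsvals):
    """Index of the first packet whose timestamp goes backwards, else None."""
    ts_recent = None
    for i, tsval in enumerate(tsvals):
        if ts_recent is not None and ts_is_older(tsval, ts_recent):
            return i
        ts_recent = tsval
    return None


if __name__ == "__main__":
    print(first_backwards_packet([100, 105, 90, 110]))
    print(ts_is_older(5, 0xFFFFFFF0))
```

In the first example the third packet (index 2) is the one that appears to go backwards. The second example shows that a small value following a value near the top of the range is treated as newer, because the counter has simply wrapped.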