Linux – Clarification about Linux TCP window size and delays

linuxtcptcp-slow-starttcpdump

I have delays when sending data through a TCP channel I am not able to understand. The link is a 1Gb link with a end to end latency of roughly 40ms. In my current setup, latency (the time from one message to go from the sender user space to the receiver user space) can reach 100ms.

The sender socket is configured with the TCP_NODELAY option. Sender buffer (SO_SNDBUF) is configured to be 8MB. Receive buffer (SO_RCVBUF) is also configured to be 8MB. Tcp window scaling is activated.

update-1: I am using the zeromq 3.1.1 middleware to carry data. Socket configuration, including the TCP_NODELAY flag is performed by the middleware. Some options are accessible like rx and tx emit buffer sizes but not TCP_NODELAY. As far as I have understood, the TCP_NODELAY is activated to ensure that the data is sent as possible. In the meantime, actual socket sends and decision to send a message are performed in two separate threads. A proper batching is done if several messages are available at the time the first message in the batch is to be sent.

I ran a capture with tcpdump from which the frames below have been extracted. After the initial TCP handshake, the sender (172.17.152.124) starts sending data. The initial window size is 5840 bytes for the receiver & 5792 bytes for the sender.

My problem is that the sender sends two frames (#6 and #7) then stops, waiting for an ack to come back from the receiver. As far as I can see, the window size of the receiver is not reached and the transfer should not stop (384 bytes outstanding with an initial receive window size of 5840 bytes). I am starting to think that I have no understood correctly what TCP is. Can someone help clarifying ?

update-2: My data payload consists of a magic number followed by a timestamp. I have isolated the delayed packets by comparing the timestamps of the payloads with the timestamps put by tcpdump. The payload ts of frame #9 is very close to the one of frame #6 and #7 and clearly less than the timestamp of the received ack in frame #8.

update-1: The fact that frame #9 is not sent immediately can be explained by the slow-start of the TCP channel. In fact, the problem also appears once the connection is running for several minutes so the slow-start does not seem to be the general explanation.

  1. 20:53:26.017415 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [S], seq 2473022771, win 5840, options [mss 1460,sackOK,TS val 4219180820 ecr 0,nop,wscale 8], length 0

  2. 20:53:26.017423 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [S.], seq 2948065596, ack 2473022772, win 5792, options [mss 1460,sackOK,TS val 186598852 ecr 219180820,nop,wscale 9], length 0

  3. 20:53:26.091940 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 1, win 23, options [nop,nop,TS val 4219180894 ecr 186598852], length 0

  4. 20:53:26.091958 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [P.], seq 1:15, ack 1, w in 23, options [nop,nop,TS val 4219180895 ecr 186598852], length 14

  5. 20:53:26.091964 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], ack 15, win 12, options [nop,nop,TS val 186598927 ecr 4219180895], length 0

  6. 20:53:26.128298 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1:257, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 256

  7. 20:53:26.128519 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 257:385, ack 15, win 12, options [nop,nop,TS val 186598963 ecr 4219180895], length 128

  8. 20:53:26.202465 IP 172.17.60.9.39943 > 172.17.152.124.56001: Flags [.], ack 257, win 27, options [nop,nop,TS val 4219181005 ecr 186598963], length 0

  9. 20:53:26.202475 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [.], seq 385:1833, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 1448

  10. 20:53:26.202480 IP 172.17.152.124.56001 > 172.17.60.9.39943: Flags [P.], seq 1833:2305, ack 15, win 12, options [nop,nop,TS val 186599037 ecr 4219181005], length 472

If this matters, both ends are Linux RHEL5 boxes, with 2.6.18 kernels and network cards are using e1000e drivers.

update-3
Content of /etc/sysctl.conf

[jlafaye@localhost ~]$ cat /etc/sysctl.conf | grep -v "^#" | grep -v "^$" 
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.ipv4.tcp_rmem = 65536 4194304 16777216
net.ipv4.tcp_wmem = 65536 4194304 16777216 
net.core.netdev_max_backlog = 10000 
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_mem = 262144 4194304 16777216
kernel.shmmax = 68719476736

Best Answer

After doing a little more digging into my traffic, I was able to see that my data was nothing but a sequence of small bursts with small idle periods between them.

With the useful tool ss, I was able to retrieve the current congestion window size of my connection (see the cwnd value in the output):

[user@localhost ~]$ /usr/sbin/ss -i -t -e | grep -A 1 56001

ESTAB 0 0 192.168.1.1:56001
192.168.2.1:45614 uid:1001 ino:6873875 sk:17cd4200ffff8804 ts sackscalable wscale:8,9 rto:277 rtt:74/1 ato:40 cwnd:36 send 5.6Mbps rcv_space:5792

I ran the tool several times and discovered that the congestion window size was regularly reset to the initial value (10ms, on my Linux box). The connection was constantly looping back to the slow start phase. During the slow start period, bursts with a number of messages exceeding the window size were delayed, waiting for the acks related to the first packets of the burst.

The fact that the traffic consists of a sequence of bursts likely explains the reset of the congestion window size.

By deactivating the slow start mode after idle period, I was able to get rid of the delays.

[user@host ~]$ cat /proc/sys/net/ipv4/tcp_slow_start_after_idle 0

Related Topic