Linux – TCP congestion control for low-latency 10GbE -> 1GbE network

Tags: linux, networking, tcp

I have a server with a 10GbE connection to a switch, and 10 clients each with a 1GbE connection to the same switch.

Running nuttcp in parallel on each of the clients, I can push 10 TCP streams of data to the server simultaneously at close to wire speed (i.e. just shy of 100 megabytes per second from each of the 10 clients, or nearly 10 Gb/s aggregate at the server).

However, when I reverse the direction and send data from the server to the clients — i.e., 10 TCP streams, one going to each client — the TCP retransmissions skyrocket and the performance drops to 30, 20, or even 10 megabytes per second per client. I want to get these numbers up, because this traffic pattern is representative of certain applications I care about.
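For reference, the two experiments boil down to something like the following (hostnames are placeholders, and the exact flags may vary with your nuttcp version):

    # On the 10GbE server, start the nuttcp server once:
    nuttcp -S

    # Experiment 1, on each 1GbE client: transmit to the server
    # (nuttcp's default direction), running for 30 seconds
    nuttcp -T30 10g-server

    # Experiment 2, on each 1GbE client: -r reverses the direction,
    # so the server transmits and the client receives
    nuttcp -r -T30 10g-server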

I have verified that my server is capable of saturating a 10GbE link by performing the same experiment over a 10GbE connection to a similar server. I have verified that there are no errors on any of my ports.
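On the Linux side, the relevant counters can be checked as follows (switch port counters live in the switch's own CLI; eth0 is a placeholder interface name):

    # Kernel-level RX/TX error and drop counters for the interface
    ip -s link show dev eth0

    # Driver/NIC-level statistics, which on many drivers include
    # per-queue drop counters
    ethtool -S eth0 | grep -iE 'err|drop|discard'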

Finally, when I forcibly clamp (limit) the receiver's TCP window size, I can get the bandwidth somewhat higher (30-40 megabytes per second); and if I clamp it extremely low, I can get the retransmissions down to zero (with the bandwidth ludicrously low).
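For the record, the clamping can be done per test from the receiving side; nuttcp's -w flag sets the socket buffer size in kilobytes (the 64 KB value below is illustrative), or the clients can be clamped system-wide via sysctl:

    # Receive from the server with the window clamped to 64 KB
    nuttcp -r -w64 -T30 10g-server

    # Or clamp all TCP receive buffers on a client
    # (min / default / max, in bytes)
    sysctl -w net.ipv4.tcp_rmem='4096 65536 65536'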

Thus I am reasonably confident I am overrunning the buffers in my switch, resulting in packet loss due to congestion. However, I thought TCP's congestion control was supposed to deal with this nicely, eventually stabilizing at something above 50% of wire speed.
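To put rough numbers on that: the server can burst at 10 Gb/s into an egress port that drains at only 1 Gb/s, so that port's queue fills at up to 9 Gb/s, roughly 1.1 gigabytes per second. If the switch has, say, 128 KB of packet buffer per port (a figure I am assuming purely for illustration; shallow-buffered switches are in this ballpark), that buffer absorbs only about 115 microseconds of such a burst before the switch has to start dropping, and with 10 flows sharing the buffer pool the situation is worse still.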

So my first question is very simple: Which TCP congestion control algorithm would be best for my situation? There are a ton of them available, but they mostly seem to be targeted at lossy networks, high-bandwidth high-latency networks, or wireless networks, none of which applies to my situation.
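(For reference, the algorithm in use and the ones already available can be listed with sysctl; most of the others ship as loadable kernel modules:)

    # Currently selected algorithm
    sysctl net.ipv4.tcp_congestion_control

    # Algorithms already loaded or compiled in
    sysctl net.ipv4.tcp_available_congestion_control

    # Additional ones are usually modules, e.g.:
    modprobe tcp_vegas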

Second question: Is there anything else I can try?

Best Answer

  1. You want a congestion control algorithm that does not drastically cut its congestion window when a packet is dropped. It is that drastic window reduction that produces the sudden throughput collapse in TCP traffic: classic Reno halves the window on every loss, and even CUBIC reduces it by about 30%, so a steady trickle of drops on the congested egress ports keeps throughput pinned well below the link rate. An algorithm with a gentler backoff should recover faster (a sketch of switching algorithms follows after this list).

  2. If your switch and your server's NIC support Ethernet flow control (IEEE 802.3x pause frames), try enabling it; see the example after this list. How well this works depends almost entirely on the switch's silicon and firmware. Basically, the switch detects egress congestion on the port that is connected to a client, determines where the packets came from, and sends pause frames out the ingress port (i.e. back to the server). If the server's NIC understands pause frames, it reduces its transmission speed. If it all works well, you get optimal throughput with virtually zero packet drops in the switch's egress buffers.
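For point 1, switching the algorithm is a one-line sysctl on the sending side. Which algorithm actually performs best on your hardware is something only testing will tell you, so treat the choice of tcp_illinois below purely as an illustration (it is often described as backing off less sharply than Reno's halving):

    # Load an alternative congestion control module (illustrative choice)
    modprobe tcp_illinois

    # Make it the default for new connections
    sysctl -w net.ipv4.tcp_congestion_control=illinois

    # Applications can also select an algorithm per socket via
    # setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "illinois", 8)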
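For point 2, the Linux side of flow control is visible through ethtool; the switch side is configured in the switch's own management interface. Assuming the server's 10GbE interface is eth0 (adjust the name to match your system):

    # Show the NIC's current pause-frame settings
    ethtool -a eth0

    # Enable RX and TX flow control (with autonegotiation)
    ethtool -A eth0 autoneg on rx on tx on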