Python – Socket TCP server w/ high RTT and retransmissions

python socket tcp

I have a TCP server built with sockets in Python. The application I'm building is time-sensitive and the integrity of the data is important, so we need TCP. The bandwidth used is very low.

A client requests data from the server every 50 ms. In response it gets either the requested data or, if the server doesn't have the data, an OK message.

Whenever the client makes a request to the server, it sends a frame of 5 bytes (not including the 40 extra bytes that come from IP and TCP).
The server responds either with a frame of 5 bytes (in most cases) or with a frame of more than 70 bytes (generally once every second).
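
For scale, the figures above can be turned into a rough bandwidth estimate. This is only a sketch: it assumes IPv4 and TCP headers without options (the 40 bytes mentioned above), ignores link-layer framing and pure ACK segments, and the constant and function names are made up for illustration.

```python
HEADER_BYTES = 40                   # IPv4 (20) + TCP (20), assuming no options
REQUEST_PAYLOAD = 5                 # bytes per request frame
REQUESTS_PER_SECOND = 1000 // 50    # one request every 50 ms


def bits_per_second(payload_bytes, packets_per_second):
    """On-the-wire rate for fixed-size packets, headers included."""
    return (payload_bytes + HEADER_BYTES) * packets_per_second * 8


upstream_bps = bits_per_second(REQUEST_PAYLOAD, REQUESTS_PER_SECOND)
print(upstream_bps)  # 7200 bits/s, i.e. about 7 kbit/s from client to server
```

The downstream rate is of the same order, since most responses are also 5-byte frames.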

On both sides the sockets are set like this:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # server only; omitted on the client
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)  # 8 KiB send buffer
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle's algorithm
sock.settimeout(0.5)  # blocking calls time out after 0.5 s

Everything runs fine on the local network (no lag at all), but whenever I connect to the server through the public IP (I'm port-forwarding) it lags a lot. The lag can reach 15 seconds, at which point the connection times out. Most of the time the RTT stays at 200-210 ms. In Wireshark I can see lots of (spurious) retransmissions and duplicate ACKs.

What can I do? I've already disabled Nagle's algorithm, but with no success so far.

Best Answer

I've had a good look over the capture files provided and here is my analysis. In summary, I believe this is an issue with your router, which appears to be a Technicolor device of some sort.

Client Side Capture

  • Your client is having major issues trying to connect to a variety of websites. HTTPS websites (www.bing.com, wdcp.microsoft.com, etc.) are getting no response after the Client Hello stage, resulting in retransmissions and an eventual timeout from your device. Another set of HTTP requests to an Akamai-hosted website (104.90.152.18) is resulting in a 408 Request Time-out.
  • Looking specifically at the traffic from the client to the server, the vast majority of the sessions start reasonably OK but then encounter packet loss, resulting in retransmissions from the client and timeouts. For example, examine packets 161 to 207. At packet 161 the client sends a data packet to the server but gets no response back, causing the client to retransmit for around 15 seconds before the connection is torn down.

    The majority of the TCP streams demonstrate this behaviour, so we can conclude that either the data packets from the client are not reaching the server OR the response from the server is not reaching the client.

  • Looking at the latency, there is a significant (and volatile) delay between the SYN and SYN/ACK response from the server, ranging from 168ms to 770ms.
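
The SYN to SYN/ACK delay can also be sampled from the client without a packet capture, because a TCP connect returns once the handshake has completed. The sketch below is one way to do that, not something taken from the captures; the function names are hypothetical, and the measurement includes a little local overhead on top of the network round trip.

```python
import socket
import time


def connect_time_ms(host, port, timeout=2.0):
    """Return how long the TCP handshake to (host, port) took, in milliseconds."""
    start = time.perf_counter()
    # create_connection returns once the handshake has completed.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


def sample_connect_times(host, port, samples=10, pause=0.2):
    """Collect several handshake timings; failed attempts are recorded as None."""
    results = []
    for _ in range(samples):
        try:
            results.append(connect_time_ms(host, port))
        except OSError:  # covers timeouts and refused connections
            results.append(None)
        time.sleep(pause)
    return results
```

Running `sample_connect_times` against the public IP and forwarded port, and then against the server's local IP, should show whether the volatility only appears when the router is in the path.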

Server Side Capture

  • Unfortunately, the server side capture does not capture the same events as the client side capture. I am also unsure where exactly in the network this has been captured as it includes client and server traffic. ICMP redirects are also being sent which indicates sub-optimal routing. I do not believe this to be causing the issue however.
  • If you apply a Wireshark display filter for tcp.stream eq 1 || tcp.stream eq 2 you can see both sides of the communication. Specifically, Client > Firewall and then Firewall > Server (and vice versa). Again, everything starts OK and then around packet 407 things get interesting.

    Packet #407 marks the point when the client sends a chunk of new data to the server. The router receives this and forwards it to the server. The server sends an Acknowledgement packet back (packet #410) as well as another small data packet (#411). What we don't see however is the router passing these packets back to the client - this is the best evidence I have found of this being a router issue.

Compare this to one of the many successful exchanges slightly further up in the trace - packet 394 to 406 for example:

  1. (#394) Client sends a data packet to the public IP of the server
  2. (#396) Router receives this and forwards it to the local IP of the server
  3. (#397) Server sends an acknowledgement back to the NAT'd IP of the client
  4. (#398) Server sends a small data packet back to the NAT'd IP of the client
  5. (#401) Router sends the acknowledgement back to the client's local IP
  6. (#402) Router sends the small data packet back to the client's local IP
  7. (#403) Client sends an acknowledgement back to the public IP of the server to confirm it received the data the server sent
  8. (#406) The router forwards the acknowledgement to the local IP of the server.

When things fail, everything stops after stage 4 - the two packets sent from the server appear to be dropped at the router.
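
The comparison above can be automated once the relevant fields have been exported from the capture (for example the Wireshark fields frame.number, tcp.seq, tcp.ack and tcp.len). The following is a minimal sketch under two assumptions: the router's NAT leaves TCP sequence and acknowledgement numbers untouched, and the `Segment` record and `unforwarded` function are hypothetical names introduced here, not part of any tool.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    frame: int    # frame number in the capture
    seq: int      # TCP sequence number
    ack: int      # TCP acknowledgement number
    length: int   # TCP payload length in bytes


def unforwarded(sent_by_server, seen_toward_client):
    """Return server segments that have no matching copy on the router-to-client leg.

    Segments are matched on (seq, ack, length), which assumes NAT rewrites
    only addresses and ports.
    """
    forwarded = {(s.seq, s.ack, s.length) for s in seen_toward_client}
    return [s for s in sent_by_server if (s.seq, s.ack, s.length) not in forwarded]
```

Feeding it the server-to-router segments and the router-to-client segments of the two streams should return exactly the packets that stopped at stage 4.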

Final Thoughts

  • Most of your TCP connections, not just your Python application, seem to be suffering from performance issues, as demonstrated by the many connection issues in your client side capture.
  • There is reasonable evidence in your server side capture that packets are being blackholed when they have to be forwarded through your router.
  • Your testing has shown that there are no issues when testing this application locally, when traffic does not need to traverse the router for port forwarding.
  • Unfortunately, I am not familiar with Technicolor routers at all, and the only thing I can suggest is to check whether there are any Firewall or Quality of Service rules enabled on the router which could be impacting performance. If you can, test with an alternative router or host your application in another network to see whether the issues persist.
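
To compare networks or routers with numbers rather than impressions, a small probe can mimic the client's traffic pattern and record how long each request takes. This is a sketch only: it assumes the server answers every request with at least one byte, the payload `b"PING!"` is a made-up 5-byte stand-in for the real request frame, and `probe` and `summarize` are hypothetical helper names.

```python
import socket
import time


def probe(host, port, count=100, interval=0.05, payload=b"PING!", timeout=0.5):
    """Send `count` small requests and return per-request round-trip times in ms.

    A request that times out is recorded as None. Simplification: a reply that
    arrives late is attributed to the next request.
    """
    rtts = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            try:
                sock.sendall(payload)
                reply = sock.recv(4096)
                if not reply:  # server closed the connection
                    break
                rtts.append((time.perf_counter() - start) * 1000.0)
            except socket.timeout:
                rtts.append(None)
            time.sleep(interval)
    return rtts


def summarize(rtts):
    """Count timeouts and report the worst successful round trip."""
    ok = [r for r in rtts if r is not None]
    return {
        "sent": len(rtts),
        "timeouts": len(rtts) - len(ok),
        "max_ms": max(ok) if ok else None,
    }
```

Running the same probe on the local network, through the current router, and through an alternative router or network gives three summaries that can be compared directly.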