TCP Sequence Numbering
There are a couple of things to remember when decoding TCP traces...
- TCP sequence numbers are directional (i.e. if someone sends me a payload, I don't increase my sequence number based on bytes that I received)
- TCP sequence numbers point to the head of a packet's payload (i.e. the sequence number is the offset of the first payload byte, not the last)
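To make those two points concrete, here's a minimal sketch (illustrative variable names, not from any real TCP stack) of how each direction's sequence number advances:

```python
# Each direction of a TCP connection keeps its own sequence number.
# A side's sequence number advances only by the bytes (plus SYN/FIN flags)
# that *it* sends; receiving data advances its ACK number instead.

def next_seq(seq, payload_len, syn=False, fin=False):
    """Sequence number of the next byte this side will send.

    The sequence number in a segment points to the first byte of its
    payload; SYN and FIN each consume one sequence number."""
    return seq + payload_len + int(syn) + int(fin)

# Client sends a SYN (no payload), then 100 bytes of data:
client_seq = 1000                                 # initial sequence number
client_seq = next_seq(client_seq, 0, syn=True)    # 1001 after the SYN
client_seq = next_seq(client_seq, 100)            # 1101 after 100 bytes

# The server receiving those 100 bytes does NOT advance its own
# sequence number -- only its ACK number (which mirrors client_seq).
server_seq = 5000
server_ack = client_seq        # ACK points just past the received bytes
print(client_seq, server_seq, server_ack)   # 1101 5000 1101
```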
However, those points alone don't account for the missing ACKs for seq numbers 5841-14600 between packets 6 and 7. My best guess (and that's all I can really do at this point) is that ACK packets were dropped somewhere between the NIC and Wireshark. You can tell when this happens if you see messages like this (from a Linux xterm or ssh session)...
19431 packets captured
38863 packets received by filter
572 packets dropped by kernel <----------------
7 packets dropped by interface <----------------
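If you want to check for this programmatically rather than by eyeball, you can parse the summary counters out of that exit message. A quick sketch (the regex assumes the exact wording shown above):

```python
import re

# The summary a capture tool prints on exit, as shown above.
summary = """\
19431 packets captured
38863 packets received by filter
572 packets dropped by kernel
7 packets dropped by interface
"""

# Map each label ("captured", "dropped by kernel", ...) to its count.
counts = {label.strip(): int(num)
          for num, label in re.findall(r"(\d+) packets ([a-z ]+)", summary)}

# A nonzero drop count means the trace is missing packets
# (e.g. the ACKs you were looking for).
if counts["dropped by kernel"] or counts["dropped by interface"]:
    print("capture is incomplete -- tune your buffers")
```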
Solutions to wireshark packet drops
- Reduce the snapshot length, i.e. the number of bytes Wireshark keeps per packet (100 bytes per packet is normally more than enough for header analysis)
- Disable DNS lookups and live capture scrolling (disk buffer capture is most efficient)
In Linux you can fix these drops by adjusting buffers on the NIC and between the kernel and libpcap (see Note 1)...
ethtool -G eth0 rx 768
sysctl -w net.core.netdev_max_backlog=30000
If you're in Windows it helps to give Wireshark more buffer space (the -B CLI option) when you call it...
Note 1. YMMV, buffer settings are specific to your system... play with them until you no longer see "packets dropped" messages
I realize that this answer is simplified, and not as explicit as I'd like it to be, so if you have questions about a step, please ask!
Scrolling down a bit after opening this file in Wireshark we see some frames in different color. Looks really bad, right? Well, it's not that bad. Hold on, we'll get there.
Checking the SYN packet (frame 37) we see SACK and Window Scaling in the TCP Options. Good. Same thing in the SYN/ACK (frame 38): SACK and Window Scaling. Awesome. Don't see anything weird regarding SACK.
An estimate of the unloaded RTT is the time between the SYN packet and the first ACK (frame 39). It's about 9.3 ms, which matches your findings. Note that the time between SYN/ACK and ACK (frames 38 and 39) is much lower than between SYN and SYN/ACK (37 and 38). This means that this capture file is taken at the receiver, and to see why that's not ideal, we'll have to go back to school.
Between the sender and the receiver there is one link in the network path with the smallest bandwidth, and that link limits the overall throughput. The RTT estimate we just got from the handshake gives us an estimate of the length of this network path. A measure of how much data we can fit in this pipe is the Pipe Capacity, or the Bandwidth Delay Product: PC [bits] = R [bits/s] * RTT [s], where R is the smallest bandwidth. Pipe Capacity is thus a measure of volume.
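Plugging in round numbers (the ~3 Mbit/s and ~10 ms that show up later in this answer) gives a feel for just how small this particular pipe is. A quick sketch:

```python
# Pipe Capacity / Bandwidth Delay Product: PC [bits] = R [bits/s] * RTT [s]
R = 3_000_000        # bottleneck bandwidth, bits/s (~3 Mbit/s)
RTT = 0.010          # round-trip time, seconds (~10 ms)
MSS = 1448           # data bytes per segment, as seen in this trace

pc_bits = R * RTT                 # 30000 bits
pc_bytes = pc_bits / 8            # 3750 bytes
pc_segments = pc_bytes / MSS      # ~2.6 segments fill the pipe

print(pc_bytes, round(pc_segments, 1))   # 3750.0 2.6
```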
Imagine a garden hose. Its volume is defined by its length and its width, in the same way, right? To get the most water out of it, it needs to be completely filled with water; otherwise there will be air gaps limiting the water flow. In case we manage to fill it completely, it might overflow. We can use a bucket so that we won't get the floor wet, and if the bucket overflows, that doesn't affect the water flow.
Turns out that it's exactly the same in the network path. We need to fill the pipe... In other words, Pipe Capacity is the smallest number of bytes in flight (how much water we have in the pipe + bucket) between the sender and the receiver that fully utilizes the smallest bandwidth (doesn't cause air gaps). So if the bytes in flight > PC then we're good!
Looking at the TCP trace Statistics -> TCP StreamGraph -> Time Sequence Graph (tcptrace) we can see bytes on the Y-axis, and time on the X-axis. The derivative of this curve is bytes/second, or throughput. Note how the black "line" is flat, meaning throughput is stable! It's interrupted by blue lines a couple of times though (those are the SACK ranges in the duplicate ACKs), but as can be seen it does not affect throughput.
See how the lower right gray solid line (zoom in a bit; that's the ACKs) is really close to the black TCP segments? The time between a TCP segment and its ACK is the RTT, and here it's almost 0! It means that there are not many segments in flight past this capture point. This in turn means that we can't use this capture to estimate the bytes in flight, and this is why a sender-side packet capture is way better.
Packets here are naturally lost before the capture point. Each data segment that was in flight at the time of the loss triggers a duplicate ACK. Therefore we can use the number of duplicate ACKs to estimate the bytes in flight at the time of the packet loss. Here we see about 9, 16 and 23 segments. Each segment has 1448 bytes of data, so that gives us bytes in flight between 13 kB and 33 kB. The throughput here was about 3 Mbit/s (from the IO graph), and with the RTT we measured before we get a Pipe Capacity of 3e6 [bits/s] * 10e-3 [s] / 8 [bits/byte] = 3750 bytes, or less than 3 segments.
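The estimate above can be written out as a small sketch (segment counts read off this trace, 1448 data bytes per segment):

```python
MSS = 1448                        # data bytes per segment in this trace
dup_ack_counts = [9, 16, 23]      # duplicate ACKs seen at each loss event

# Each in-flight segment at the time of a loss triggers one duplicate ACK,
# so dup-ACK count * MSS estimates the bytes in flight at that moment.
bytes_in_flight = [n * MSS for n in dup_ack_counts]
print(bytes_in_flight)            # [13032, 23168, 33304] -> ~13k to ~33k

pipe_capacity = 3_000_000 * 0.010 / 8   # R * RTT / 8 = 3750 bytes

# Every loss happened with bytes in flight well above the Pipe Capacity,
# which is why these losses didn't dent throughput.
assert all(b > pipe_capacity for b in bytes_in_flight)
```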
Because the bytes in flight at the time of these losses are not really the same (hard to tell here with so few samples), I can't really say whether these are random losses (which are bad bad bad) or losses occurring because a queue/bucket overflows, but they occur when bytes in flight > PC, so throughput is not affected.
Your answer seems to indicate that they were indeed random, but not so many to cause low throughput.
Best Answer
Probably not enough information, but here is some general guidance:
If other remote clients are ok and do not experience the symptom, the problem probably is not with the server. It may be the connection for that client.
A retransmit typically means that a packet was not acknowledged, so there usually will not be actual "errors" in a packet capture. It means that one end was sending packets and the other end was not acknowledging them. You may want to perform the capture from both ends, to determine if the retransmits are one-way only, or both ways.
If you ping your host from the client, what is the response time? If it is over 150 ms, you may have a suboptimal connection.
The server network adapter setting for Large Send Offload should be disabled. Windows should be smart enough to know it cannot send large packets to machines on different subnets, but sadly this is not always the case. If your server is a Hyper-V guest, this is almost certainly the problem.
MTU. Generally speaking, when accessing a server remotely and you are not on the same subnet, the MTU should always be whatever the smallest MTU is between the two endpoints. And that usually means the client. For remote clients that connect over VPN, it is not uncommon to have an MTU of 1400 or even lower. It can be beneficial to set the server MTU to match what the lowest MTU would be, to avoid issues where the MTU cannot be discovered properly (sometimes referred to as black hole routers). To find the MTU for your connection you can enter the following command from your client:
ping <server address> -f -l xxxx
Where xxxx is the payload size. Start with 1400. If it works, increase it until it displays the message "Packet needs to be fragmented but DF set". If 1400 does not work, decrease it until it does. The highest number that works is your payload size. Add 28 to the payload size and that is your MTU.
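The "add 28" step is just putting the IP and ICMP header overhead back on top of the ping payload. A quick sketch of the arithmetic:

```python
# ping's payload size excludes the headers that ride along on the wire:
# a 20-byte IP header and an 8-byte ICMP header, hence the +28.
IP_HEADER = 20
ICMP_HEADER = 8

def mtu_from_payload(largest_working_payload):
    return largest_working_payload + IP_HEADER + ICMP_HEADER

print(mtu_from_payload(1372))   # 1400 -- a common VPN MTU
print(mtu_from_payload(1472))   # 1500 -- standard Ethernet MTU
```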
You can set the server MTU at the following registry location (an MTU DWORD value under the key for the interface's GUID):
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface GUID>
FYI - RDP packets are always sent with the "Do Not Fragment" bit set.