Why Packet Loss AFTER tcpdump has logged the packet

Tags: packet-loss, tcp, tcpdump

We are encountering some strange packet loss and want to know the reason for it.

We have an image server and a second server for stress-testing it.
Both are located in the same datacenter.

First we run a load test like this (command shortened for readability):

ab -n 50 -c 5 http://testserver/img/de.png

The image is only about 300 bytes. The responses are very fast:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:     1    3   0.7      3       4
Waiting:        1    3   0.7      3       3
Total:          1    3   0.7      3       4

When we increase the concurrency we see some lag (command shortened for readability):

sudo ab -n 500 -c 50 http://testserver/img/de.png

Results with concurrency 50:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     2   35 101.6     12     614
Waiting:        2   35 101.6     12     614
Total:          3   36 101.7     12     615

So most requests are pretty fast, but a few of them are very slow.

We dumped the whole network traffic with tcpdump and saw some strange retransmissions.

(Screenshot of the tcpdump capture: http://vygen.de/screenshot1.png)

This dump was taken on the image server!

So you can see that the initial packet (No. 306) containing the GET request arrives at the image server, but it seems the packet gets lost after tcpdump has logged it. It looks to me like this packet never reaches the Tomcat process on the image server.

The retransmission is triggered by the requesting server 200 ms later and everything runs fine afterwards.
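For anyone who wants to reproduce this, a plain full-packet capture on the image server is enough; we ran something along these lines (interface name and port filter are only examples):

tcpdump -i eth0 -s 0 -w imageserver.pcap port 80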

Do you know any reason why a packet can get lost after it was received?

Our machines are both:

  • Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
  • 8 GB RAM
  • Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
  • Debian version 5.0.5

So we do not have any problems concerning memory or CPU load.

We had some problems with our NIC a while ago and handled them by switching drivers: we are now using r8168 instead of r8169.

But we saw the same problem of lost packets with an Intel NIC:
– Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)

So we see the same problem on identical machines with different Ethernet cards.

Until now I thought packet loss could only happen on the wire between the servers, for example when a packet gets corrupted.

We would really like to know what reasons there might be for packets getting lost after tcpdump has logged them.

Your help is very much appreciated.

Best Answer

We found the root cause of this: we had an acceptCount of 25 in our Tomcat server.xml.
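For illustration, the attribute sits on the HTTP Connector element in server.xml; our setup was roughly like this (port, protocol and the other attributes are placeholders, only acceptCount is the relevant part):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           acceptCount="25" />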

acceptCount is documented like this:

acceptCount

The maximum queue length for incoming connection requests when all possible request processing threads are in use. Any requests received when the queue is full will be refused. The default value is 100.

But this is not the whole story about acceptCount. In short: acceptCount is the backlog parameter passed when the listening socket is opened, so this value determines the listen backlog even when not all threads are busy. It becomes important when requests come in faster than Tomcat can accept them and hand them to waiting threads. The default acceptCount is 100, which is still a small value for absorbing a sudden peak in requests.
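A good way to confirm that the backlog is the culprit is to watch the kernel's TCP listen-queue counters while the load test runs (the grep pattern is just one way to find the relevant lines):

netstat -s | grep -i listen

If a line like "... times the listen queue of a socket overflowed" keeps increasing, connections are being dropped by the kernel after tcpdump has already seen the packets, exactly as in the capture above.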

We checked the same thing with Apache and nginx and saw the same strange packet loss, just at higher concurrency values. The corresponding setting in Apache is ListenBacklog, which defaults to 511.
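If you want to raise it for Apache, it is a one-line change in the main server configuration (the value here is just an example):

ListenBacklog 1024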

BUT, on Debian (and other Linux-based systems) the default maximum value for the backlog parameter is 128:

$ sysctl -a | grep somaxc
net.core.somaxconn = 128

So whatever you put into acceptCount or ListenBacklog, it is silently capped at 128 until you raise net.core.somaxconn.

For a very busy web server 128 is not enough. You should change it to something like 500, 1000 or 3000, depending on your needs.
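For example, to raise the cap to 1000 (run as root; the first command changes the running kernel, the second makes the setting survive a reboot):

sysctl -w net.core.somaxconn=1000
echo "net.core.somaxconn = 1000" >> /etc/sysctl.conf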

After setting acceptCount to 1000 and net.core.somaxconn to 1000 we no longer had those dropped packets. (Now we have a bottleneck somewhere else, but that is another story...)
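As a quick sanity check you can verify the effective backlog of the listening socket; on reasonably recent kernels ss shows it in the Send-Q column for sockets in LISTEN state:

ss -lnt

If the Send-Q value for the Tomcat connector port is still 128 after restarting Tomcat, the somaxconn cap is still in effect.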