Debian – Very high packet loss in burst


We sometimes see 90%+ packet loss on our server, but it does not always happen. Right now everything works perfectly, but just half an hour ago we had exactly that problem.

Our service provider is telling us to boot into a recovery system to test whether this is really a hardware problem rather than software on our side. However, I don't see what could cause packet loss on our side, especially since the problem is not consistent.

Is there anything we could check before doing another test in the recovery system?

We have a dedicated server at Hetzner.de, connected via 100 Mbit Ethernet. We have not tried changing anything on the hardware side, because our provider wants us to rule out our software before continuing with the hardware checks.
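Before booting into the recovery system, a few local checks can at least rule the NIC and its driver in or out. This is only a sketch; the interface name eth0 is taken from the output below:

ethtool eth0                                 # negotiated speed/duplex, link state
ethtool -S eth0 | grep -iE 'err|drop|fifo'   # NIC driver error counters
ip -s link show eth0                         # kernel RX/TX errors and drops
dmesg | grep -i eth0                         # link flaps or driver resets since boot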

Here are the mtr reports I made. While they were running we had three bursts of packet loss; the rest of the time the server was reachable.
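The exact mtr invocation is not shown; judging by the 1000-probe Snt column it was presumably something like the following (a hypothetical reconstruction, using the server address from the ifconfig output further down):

mtr -r -c 1000 5.9.43.98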

Client to server

HOST: mbp                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.1.1                   0.0%  1000    0.4   0.2   0.2   3.4   0.2
  2.|-- 10.0.1.1                   0.3%  1000   27.5  29.7   5.9 237.3  34.6
  3.|-- 10.170.172.121             0.4%  1000   17.2  41.9   7.2 334.1  44.2
  4.|-- 216.113.123.158            1.4%  1000   44.4  58.6  10.6 299.6  49.2
  5.|-- 216.113.123.194            1.1%  1000   36.6  72.9  19.4 330.7  48.1
  6.|-- paix-nyc.init7.net         0.7%  1000   57.1  75.8  18.4 313.8  49.1
  7.|-- r1lon1.core.init7.net      1.4%  1000  199.8 150.9  87.1 373.7  56.4
  8.|-- r1fra1.core.init7.net      0.6%  1000  244.2 150.1  98.6 438.6  53.6
  9.|-- gw-hetzner.init7.net       1.4%  1000  175.3 140.6 100.5 397.2  49.7
 10.|-- hos-bb2.juniper2.rz16.het 39.0%  1000  120.0 136.7 103.5 362.6  44.3
 11.|-- hos-tr4.ex3k13.rz16.hetzn  0.8%  1000  145.4 132.2 106.8 393.3  36.9
 12.|-- static.98.43.9.5.clients. 39.8%  1000  116.0 131.5 106.1 371.8  34.4

Server to client

HOST: thetransitapp               Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. static.97.43.9.5.clients.you 29.0%  1000    7.2   7.4   0.9  24.9   1.9
  2. hos-tr1.juniper1.rz16.hetzne 38.7%  1000    6.1   9.6   0.2  78.8   7.6
  3. hos-bb2.juniper4.ffm.hetzner 36.2%  1000   11.8  11.4   5.8  29.0   1.5
  4. r1fra1.core.init7.net        38.1%  1000   12.4  13.9   5.5  22.9   3.9
  5. r1lon1.core.init7.net        36.3%  1000   23.5  26.5  17.6  37.6   4.4
  6. r1nyc1.core.init7.net        35.5%  1000   92.3  93.8  86.1 103.0   3.7
  7. paix-ny.ia-unyc-bb05.vtl.net 35.5%  1000   95.5  96.4  87.6 134.7   5.3
  8. 216.113.123.169              36.3%  1000  101.5 102.0  94.4 124.9   3.6
  9. 216.113.124.42               34.7%  1000  113.1 107.7  96.7 117.6   3.6
 10. 216.113.123.157              37.5%   999  106.5 107.4 101.5 115.0   1.5
 11. ???                          100.0   999    0.0   0.0   0.0   0.0   0.0
 12. modemcable004.103-176-173.mc 36.7%   999  111.2 147.9 107.2 342.0  48.3
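One caveat when reading these reports: loss that appears only at an intermediate hop (such as the 39% at hop 10 of the client-to-server trace, while hop 11 behind it shows 0.8%) is often just ICMP rate limiting on that router. Loss that persists all the way to the final hop, as in the server-to-client trace, is real. A plain ping to the end host confirms end-to-end loss independently of mtr; a sketch, again using the server address from the ifconfig output below:

# Real forwarding loss must show up end to end. 0.2 s is the minimum
# interval ping allows without root.
ping -c 1000 -i 0.2 5.9.43.98 | tail -2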

Here is the Ethernet configuration (output of ethtool eth0):

Settings for eth0:
    Supported ports: [ TP MII ]
    Supported link modes:   10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Half 1000baseT/Full 
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Half 1000baseT/Full 
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Link partner advertised link modes:  10baseT/Half 10baseT/Full 
                                         100baseT/Half 100baseT/Full 
                                         1000baseT/Full 
    Link partner advertised pause frame use: No
    Link partner advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000033 (51)
    Link detected: yes

ifconfig of eth0:

eth0      Link encap:Ethernet  HWaddr c8:60:00:bd:2f:9d  
          inet addr:5.9.43.98  Bcast:5.9.43.127  Mask:255.255.255.224
          inet6 addr: fe80::ca60:ff:febd:2f9d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3521 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2117 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2882770 (2.7 MiB)  TX bytes:910907 (889.5 KiB)
          Interrupt:30 Base address:0x8000 
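Note that these counters show zero errors and drops, but they are cumulative since the interface came up (and only about 3,500 packets have been received here), so they say little about a burst that happened earlier. Watching them live while a burst is in progress is more telling; a sketch:

# Rising error/drop/missed counters during a burst would point at the NIC,
# cable or switch port; flat counters point upstream of the server.
watch -n 1 'ip -s link show eth0; ethtool -S eth0 | grep -iE "err|drop|miss|fifo"'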

Best Answer

In my opinion it's Hetzner's fault. I've been arguing with them for a very long time about a similar case.

We had the same problems and kept reporting them to the hosting company. The answer was always the same - "Please attach mtr in both directions" - even while the fault was in progress. So we wrote a small watchdog script that launches mtr each time there is any packet loss between our servers:

#!/bin/bash
# Watchdog: whenever ping sees any packet loss to the target, log a
# timestamped mtr report. The 10-probe ping also paces the loop.
if [ -z "$1" ]; then
    echo "Give target host"
else
    host=$1
    while true; do
        # ping's summary line looks like: "... 10 received, 0% packet loss, ..."
        loss=$(ping -c 10 "$host" | grep 'packet loss' | awk '{print $6}' | sed 's/%//')
        if [ "$loss" -ge 1 ]; then
            date >> /root/scripts/loss_measure_mtr.log
            mtr -s 1500 -r -c 1000 -i 0.1 "$host" >> /root/scripts/loss_measure_mtr.log
        fi
    done
fi
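To use it, save the script (the name loss_measure.sh and the log path are just examples), make it executable, and leave it running in the background; mtr needs root for the 0.1 s probe interval:

chmod +x loss_measure.sh
./loss_measure.sh 5.9.43.98 &    # target host is an example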

Then, with this information, they answered:

At this time there was an incoming attack in the subnet. In this case it is possible that packet-loss occurs at servers in the same subnet.

Best Regards

Michael Straetz

Hetzner Online AG
Support
90431 Nürnberg / Germany
Tel: +49 (911) 234 226 54
Fax: +49 (911) 234 226 8 977
http://www.hetzner.de

What exactly is happening? I don't know, but it looks almost the same:

Sun Aug 12 01:13:20 CEST 2012
HOST: app                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.                      94.1%  1000    0.2   0.2   0.1   0.4   0.1
  2. static.1.24.24.46.clients.you  0.0%  1000    3.0   1.9   0.7  19.4   1.5
  3. hos-tr4.juniper2.rz13.hetzne  9.4%  1000    0.6   1.9   0.4 133.2   8.0
  4. hos-bb2.juniper1.rz1.hetzner  5.4%  1000   38.6   7.1   3.0 112.9  11.5
  5. hos-tr1.ex3k3.rz1.hetzner.de 10.9%  1000    4.4   5.1   3.6  23.6   1.8
  6. static.88-128-24-108.clients 15.5%  1000    3.6   3.5   3.4   4.6   0.1
HOST: app                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.                  94.5%  1000    0.2   0.2   0.1   0.6   0.1
  2. static.1.24.24.46.clients.you  0.0%  1000    1.2   1.9   0.7  19.3   1.6
  3. hos-tr4.juniper2.rz13.hetzne  9.3%  1000    0.6   1.8   0.4 136.8   7.9
  4. hos-bb2.juniper1.rz1.hetzner  2.7%  1000    3.3   7.0   3.0 113.1  11.5
  5. hos-tr1.ex3k3.rz1.hetzner.de  8.5%  1000    7.0   5.1   3.6  26.8   2.0
  6. static.88-128-24-108.clients 12.8%  1000    3.6   3.5   3.3   4.5   0.1

I have dozens of mtr reports like this.

In my opinion it's an infrastructure problem on their side. Notice that the loss is occurring at their own nodes: hos-tr1.ex3k3.rz1.hetzner.de, hos-tr4.juniper2.rz13.hetzner.de, and so on.
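With many reports accumulated, it is easy to tally which hops fail most often. A sketch against the log format the script above produces, where the hop hostname is column 2 and Loss% is column 3:

# Count, per hop, how many logged reports show non-zero loss; worst first.
awk '$3 ~ /%$/ && $3+0 > 0 { n[$2]++ } END { for (h in n) print n[h], h }' \
    /root/scripts/loss_measure_mtr.log | sort -rn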

If they don't fix this, I'll probably migrate to Linode or Amazon.
