Linux – Debugging perceived network saturation

debugging, linux, networking

I'm having a network issue where one machine says it is sending data at 150 Mbit/s, but the other machine is receiving at only 100 Mbit/s. The sending application eventually crashes, complaining that it is out of memory (it's a custom DSP, so there's no point digging into its error messages).

So I suspect the receiving machine isn't handling the load properly, but I'm not very experienced in debugging this kind of situation.
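To rule out the applications' own counters, I suppose I could cross-check the actual wire rates from the kernel's interface byte counters on both machines. A rough sketch, with eth0 standing in for the real interface name:

> awk '/eth0:/ {print "rx:", $2, "tx:", $10}' /proc/net/dev
> sleep 10
> awk '/eth0:/ {print "rx:", $2, "tx:", $10}' /proc/net/dev

The byte counts are cumulative, so the difference between the two samples, times 8, divided by 10, gives the average bit rate over that window.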

I've been running vmstat, but I have no idea whether any of these numbers are alarming:

r  b swpd     free   buff   cache  si  so  bi  bo    in     cs us sy id wa st
0  0    0 13317200 262808 2311648   0   0   0   0  1131   3403  4  1 96  0  0
0  0    0 13309140 262808 2311652   0   0   0   9  2092   9235 10  2 89  0  0
0  0    0 13295748 262808 2311652   0   0   0   0  4521  22710 14  4 82  0  0
5  0    0 13279620 262808 2311652   0   0   0   0 13835  66325 30 10 60  0  0
6  0    0 13257432 262808 2311656   0   0   0   0 20092  92365 43 14 43  0  0
3  0    0 13232756 262808 2311660   0   0   0   0 22522 117367 49 17 34  0  0
3  0    0 13207832 262808 2311664   0   0   0  10 23419 149649 54 20 26  0  0
7  0    0 13159720 262808 2311668   0   0   0   8 23816 168436 56 21 23  0  0
8  0    0 13122148 262808 2311668   0   0   0   0 26267 168578 54 20 26  0  0
8  0    0 13119544 262808 2311668   0   0   0   0 30498 164004 53 24 24  0  0
7  0    0 13117312 262808 2311668   0   0   0   0 29853 163340 55 23 23  0  0
8  0    0 13116832 262808 2311664   0   0   0   3 29942 162609 55 22 24  0  0
8  0    0 13118824 262808 2311668   0   0   0   0 30098 162232 55 21 24  0  0
8  0    0 13118212 262808 2311668   0   0   0   0 29213 159902 45 18 37  0  0
8  0    0 13116352 262808 2311668   0   0   0   3 29552 161978 55 21 24  0  0
7  0    0 13117468 262808 2311664   0   0   0   9 30218 162704 56 22 22  0  0
5  0    0 13116972 262808 2311672   0   0   0   0 30172 164399 57 19 24  0  0
8  0    0 13115608 262808 2311672   0   0   0   8 30068 163894 56 18 26  0  0
0  0    0 13181080 262808 2311676   0   0   0   0 19062 151066 46 20 34  0  0
6  0    0 13186536 262808 2311676   0   0   0   0  6812  85690 15 19 66  0  0
1  0    0 13186784 262808 2311676   0   0   0   0  6733  82150 19 22 59  0  0
0  0    0 13203400 262808 2311716   0   0   0   9  2659  33015  5  5 90  0  0
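Since the us/sy/id figures above are averaged across all 8 cores, I also thought of watching per-core load to see whether a single core (for example the one taking the NIC interrupts) is saturated while the others idle. A sketch, assuming the sysstat package is installed:

> mpstat -P ALL 1

One CPU sitting near 100% in %irq/%soft or %sys while the rest are mostly idle would point at an interrupt/softirq bottleneck rather than overall CPU starvation.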

I also checked sockstat, but I don't know if those numbers are worrisome either:

> cat /proc/net/sockstat
sockets: used 920
TCP: inuse 82 orphan 0 tw 0 alloc 91 mem 8228
UDP: inuse 271 mem 20
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
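If it helps, a way to watch the per-socket queues while the transfer runs (ss is from iproute2; the one-second interval is arbitrary):

> watch -n 1 'ss -tmn'

Growing Recv-Q values on the receiving machine would mean the receiving application isn't draining its sockets fast enough; growing Send-Q values on the sender would mean TCP itself can't push the data out any faster.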

Here are the TCP buffer memory settings (I tried setting them to the values I saw on another machine in the lab, roughly triple these, with no change):

> sysctl -a | grep tcp.*mem
net.ipv4.tcp_mem = 69888        93184   139776
net.ipv4.tcp_wmem = 4096        16384   16777216
net.ipv4.tcp_rmem = 4096        87380   16777216
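For reference, tcp_rmem and tcp_wmem are in bytes (min, default, max per socket), while tcp_mem is in pages across the whole stack. If I retry the experiment, the runtime way to change them would be something like this (the values are only illustrative, copied from above):

> sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
> sysctl -w net.ipv4.tcp_wmem='4096 16384 16777216'

Changes made with sysctl -w don't survive a reboot; they would need to go into /etc/sysctl.conf to persist.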

As for the hardware, I have 8 cores of this:

> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5472  @ 3.00GHz
stepping        : 10
cpu MHz         : 2403.000
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
bogomips        : 5999.77
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
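Given the ~30k interrupts/second in vmstat, one more thing I can check is whether all of the NIC's interrupts land on a single core (eth0 and the IRQ number NN below are placeholders):

> grep eth0 /proc/interrupts
> cat /proc/irq/NN/smp_affinity

If one CPU's counter is climbing while the rest stay flat, pinning the IRQ elsewhere or running irqbalance might be worth trying.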

What other utilities can I use to debug this? If anyone has good resources for this kind of debugging, I would be grateful!

Best Answer

What does this show? sudo sysctl -a | grep tcp.*mem

Also: sudo ethtool ethWhateverYoursIs
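For example (eth0 stands in for your interface; the exact statistics names vary by driver):

> ethtool eth0          # negotiated speed, duplex, link state
> ethtool -S eth0       # driver/NIC counters: look for rx_dropped, rx_fifo_errors and the like
> ethtool -g eth0       # ring buffer sizes, in case the RX ring is too small

A link that negotiated 100 Mbit/s, or RX rings overflowing on the receiver, would both produce exactly this kind of mismatch.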

My guess is that the sending application isn't respecting TCP flow control, so it eventually runs out of TCP send-buffer memory and crashes.
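One way to test that theory would be to watch the sender's socket backlog and TCP memory while the transfer runs (ss is from iproute2):

> watch -n 1 'ss -tn; grep ^TCP /proc/net/sockstat'

If the application's connection shows Send-Q growing without bound and the TCP "mem" figure climbs toward the tcp_mem limits (93184/139776 pages on your box), that would confirm the sender is queueing data faster than the receiver acknowledges it.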
