We replaced our aging firewall with this server, running Ubuntu 16.04.
It does (almost) nothing other than run iptables with about 900 rules (filter and nat combined).
The aging server it replaced worked fine and there were no issues whatsoever.
Every once in a while (anywhere from once an hour to every 30 seconds) the latency between the new firewall and any other host on the LAN jumps from 0.1-0.2 ms to 10, 40, 100, and even 3000 ms for a few seconds (sometimes it even lasts minutes). I first noticed it as lag on an SSH connection to a host in the DMZ (where there shouldn't be any lag), and then confirmed it with simple continuous, high-rate (-i 0.1) ping tests to various hosts.
I tested this on both the 10 Gbps interface and one of the 1 Gbps interfaces. The server is nowhere near its network limits (~10 Kpps, 100-400 Mbps up and down combined), and the CPU is about 99% idle.
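For reference, the tests looked roughly like this (the addresses are placeholders for various LAN/DMZ hosts):

    # High-rate ping with kernel timestamps (-D), so spikes can be
    # matched across hosts and logs later.
    ping -i 0.1 -D 10.0.0.10 | tee ping-dmz.log

    # The same against several hosts in parallel, to check whether the
    # spikes are correlated (pointing at the firewall, not a single host).
    for h in 10.0.0.10 10.0.0.11 10.0.0.12; do
        ping -i 0.1 -D "$h" > "ping-$h.log" &
    done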
During one of the longer "outages" I connected to the firewall from the internet to debug it, and noticed that the other interfaces were all fine, with no latency issues.
To take the switch out of the equation, I moved the 1 Gbps interface to a different switch outside our stack and added another server to that switch to test against. The problem still persists: I ran a constant ping to multiple machines, and they all spike to 2-3 seconds every once in a while, including the one on the "immediate" switch.
dmesg shows nothing, ifconfig shows no errors, and /proc/interrupts shows that all cores participate in handling the NIC(s) (although I am pretty sure that for such low throughput even one core would suffice…).
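These are the kinds of checks I ran (eno1 is the 1 Gbps interface from the ethtool output below; the ethtool -S call is an extra counter check beyond what's described above, and its output depends on the driver):

    dmesg | tail -50                            # kernel messages around the spikes
    ifconfig eno1                               # RX/TX error and drop counters
    cat /proc/interrupts | grep eno1            # IRQ distribution across cores
    ethtool -S eno1 | grep -iE 'err|drop|miss'  # per-driver NIC stats, if exposed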
Any suggestions or ideas how to debug such a scenario would be appreciated.
Thanks!
EDIT: Adding ethtool output:
Settings for eno1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
EDIT 2:
Maybe it's irrelevant, but I did see this during one of the (really long) outages:
%Cpu(s):  0.1 us,  3.3 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem : 16326972 total, 14633008 free,   296636 used,  1397328 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 15540780 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29163 root      20   0       0      0      0 S   8.0  0.0  14:08.45 kworker/4:0
31722 root      20   0       0      0      0 S   7.3  0.0   9:39.76 kworker/6:0
11677 root      20   0       0      0      0 S   5.6  0.0   0:04.65 kworker/3:1
  149 root      20   0       0      0      0 S   4.0  0.0  27:21.36 kworker/2:1
   46 root      20   0       0      0      0 S   0.3  0.0   0:06.93 ksoftirqd/6
Unusually high kworker CPU usage (normally it's around 1%).
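One way to find out what those kworker threads are actually doing is the kernel's workqueue tracepoints (a rough sketch; assumes debugfs is mounted at /sys/kernel/debug and is run as root during a spike):

    cd /sys/kernel/debug/tracing
    # Record which work functions get queued and executed.
    echo 1 > events/workqueue/workqueue_queue_work/enable
    echo 1 > events/workqueue/workqueue_execute_start/enable
    timeout 10 cat trace_pipe > /tmp/wq.trace
    echo 0 > events/workqueue/enable
    # See what the busy kworker (e.g. kworker/4:0) was running.
    grep 'kworker/4' /tmp/wq.trace | head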
Any idea?
Best Answer
I have had a similar situation, and this link helped us solve our issues!
Essentially, you probably need to cap the maximum TCP socket receive buffer size at around 2-4 MB, maybe even smaller if it doesn't affect your service, since you are seeing so many large spikes.
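This is roughly how that tuning looks with sysctl (the 4 MB cap is an example value, not gospel; adjust to your service):

    # Cap the maximum socket receive buffer at 4 MB.
    sysctl -w net.core.rmem_max=4194304
    # TCP autotuning bounds: min, default, max (bytes).
    sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
    # Add the same lines to /etc/sysctl.conf to persist across reboots.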
Hope this is helpful!