We replaced our aging firewall with this server, running Ubuntu 16.04.
It does (almost) nothing other than run iptables with about 900 rules (filter and nat combined).
The aging server it replaced worked fine and there were no issues whatsoever.
Every once in a while (anywhere from once an hour to every 30 seconds) the latency between the new firewall and any other host on the LAN jumps from 0.1-0.2 ms to 10, 40, 100, and even 3000 ms for a few seconds (sometimes it even lasts minutes). I first noticed it as lag on an SSH connection to a host in the DMZ (where there shouldn't be any lag), and then confirmed it with simple continuous, high-rate (-i 0.1) ping tests to various hosts.
I tested this on both the 10 Gbps interface and one of the 1 Gbps interfaces. The server is nowhere near its network limits (~10 Kpps, 100-400 Mbps up and down combined), and the CPU is about 99% idle.
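For reference, the tests looked roughly like this (the addresses are placeholders for various LAN/DMZ hosts):

    # High-rate ping with kernel timestamps (-D), so spikes can be
    # matched across hosts and logs later.
    ping -i 0.1 -D 10.0.0.10 | tee ping-dmz.log

    # The same against several hosts in parallel, to check whether the
    # spikes are correlated (pointing at the firewall, not a single host).
    for h in 10.0.0.10 10.0.0.11 10.0.0.12; do
        ping -i 0.1 -D "$h" > "ping-$h.log" &
    done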
During one of the longer "outages" I connected to the firewall from the internet to debug it, and noticed that the other interfaces were all fine, with no latency issues.
To take the switch out of the equation, I moved the 1 Gbps interface to a different switch outside our stack and added another server to that switch to test against. The problem still persists: I ran a constant ping to multiple machines, and they all spike to 2-3 seconds every once in a while, including the one on the "immediate" switch.
dmesg shows nothing, ifconfig shows no errors, and /proc/interrupts shows that all cores participate in handling the NIC(s) (although I am pretty sure that for such low throughput even one core would suffice…).
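These are the kinds of checks I ran (eno1 is the 1 Gbps interface from the ethtool output below; the ethtool -S call is an extra counter check beyond what's described above, and its output depends on the driver):

    dmesg | tail -50                            # kernel messages around the spikes
    ifconfig eno1                               # RX/TX error and drop counters
    cat /proc/interrupts | grep eno1            # IRQ distribution across cores
    ethtool -S eno1 | grep -iE 'err|drop|miss'  # per-driver NIC stats, if exposed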
Any suggestions or ideas how to debug such a scenario would be appreciated.
Thanks!
EDIT: Adding ethtool output:
Settings for eno1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
EDIT 2:
Maybe it's irrelevant, but I did see this during one of the (really long) outages:
%Cpu(s):  0.1 us,  3.3 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem : 16326972 total, 14633008 free,   296636 used,  1397328 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 15540780 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29163 root      20   0       0      0      0 S   8.0  0.0  14:08.45 kworker/4:0
31722 root      20   0       0      0      0 S   7.3  0.0   9:39.76 kworker/6:0
11677 root      20   0       0      0      0 S   5.6  0.0   0:04.65 kworker/3:1
  149 root      20   0       0      0      0 S   4.0  0.0  27:21.36 kworker/2:1
   46 root      20   0       0      0      0 S   0.3  0.0   0:06.93 ksoftirqd/6
Unusually high kworker CPU usage (normally it's around 1%).
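One way to find out what those kworker threads are actually doing is the kernel's workqueue tracepoints (a rough sketch; assumes debugfs is mounted at /sys/kernel/debug and is run as root during a spike):

    cd /sys/kernel/debug/tracing
    # Record which work functions get queued and executed.
    echo 1 > events/workqueue/workqueue_queue_work/enable
    echo 1 > events/workqueue/workqueue_execute_start/enable
    timeout 10 cat trace_pipe > /tmp/wq.trace
    echo 0 > events/workqueue/enable
    # See what the busy kworker (e.g. kworker/4:0) was running.
    grep 'kworker/4' /tmp/wq.trace | head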
Any idea?
Best Answer
I have had a similar situation, and this link helped us solve our issues!
Essentially, you probably need to cap the maximum TCP socket receive buffer size at around 2-4 MB, maybe even smaller if it doesn't affect your service, since you are seeing so many large spikes.
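This is roughly how that tuning looks with sysctl (the 4 MB cap is an example value, not gospel; adjust to your service):

    # Cap the maximum socket receive buffer at 4 MB.
    sysctl -w net.core.rmem_max=4194304
    # TCP autotuning bounds: min, default, max (bytes).
    sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
    # Add the same lines to /etc/sysctl.conf to persist across reboots.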
Hope this is helpful!