Linux – How to optimize throughput on a Linux NAT/router

linux networking performance-tuning router

I am trying to use an old Fujitsu RX300S2 with a quad-core Intel Xeon CPU @ 2.80 GHz as a Gigabit NAT router. It has a dual gigabit NIC on board, attached over PCI-X.

The router will also forward multicast traffic from the external interface to the internal network. Multicast routing is handled by the upstream Cisco router so the NAT router only has to "leak" multicast traffic between eth1 (upstream) and eth0 (internal).

This has been properly set up using igmpproxy, which essentially makes the L3 router act as an L2 bridge with respect to multicast traffic.

When testing the throughput, I have no problem receiving ~850-900Mbit multicast traffic on 200 groups/streams (approx 80'000 p/s) to a local process in userspace, which also analyses the 200 streams in realtime without packet loss. The local process maxes one core at 100%.

The streams consist of IPTV MPEG transport streams encapsulated in UDP/IP packets: 7 × 188 = 1316 bytes of payload.

But when testing the throughput in forwarding mode, i.e. multicast traffic enters eth1 and is routed at kernel level out eth0 into the local network, the NAT router cannot forward all the traffic it receives.

The external interface eth1 receives all multicast traffic (~900 Mbit), but the outgoing interface only transmits ~600 Mbit, and all streams suffer from packet loss according to the receiving test machine attached to eth0.

When analysing the load, ksoftirqd/3 maxes out at 100% CPU while the other three cores stay below 10%, so it seems that not all four cores participate in the load.

The /proc/interrupts also shows that eth0 and eth1 share irq16:

          CPU0   CPU1   CPU2        CPU3
     16:     0      0  92155  208280892   IO-APIC  16-fasteoi  uhci_hcd:usb2, uhci_hcd:usb5, eth1, eth0

As can be seen, CPU3 handles a disproportionate amount of interrupts.

I have read through various texts regarding CPU affinity and have tried pinning CPU cores to network queues. Unfortunately this Broadcom NIC (tg3 driver) does not support multiple queues, but it should still be possible to share the load between more cores on this quad-core system.
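For a single-queue NIC like this, Receive Packet Steering (RPS) can spread the softirq processing across cores in software. A minimal sketch, assuming a kernel built with RPS support and the interface names from this setup (run as root):

```shell
# Spread receive processing of each single-queue NIC across all 4 CPUs.
# Mask 0xf selects CPUs 0-3.
echo f > /sys/class/net/eth1/queues/rx-0/rps_cpus
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Optionally pin the shared IRQ to one CPU so the hard-IRQ load is
# predictable; note that eth0 and eth1 share IRQ 16, so they move together.
echo 8 > /proc/irq/16/smp_affinity   # mask 0x8 = CPU3 only
```

With RPS enabled, the hard interrupt still lands on one core, but the per-packet protocol processing is distributed, which is exactly the part that ksoftirqd/3 is saturating here.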

Or is the PCI-X bus the bottleneck? If so, throughput should be reduced on both the incoming eth1 and the outgoing eth0. At first it seemed that packets were being lost somewhere between eth1 and eth0, but that is not the case: when packets are lost in the router, /sys/class/net/eth1/statistics/rx_missed_errors is incremented a lot (about 1000 p/s), so the drops happen on eth1's receive side.
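The per-second drop rate can be watched directly from the sysfs counter. A small diagnostic sketch for a live system (the interface name is this setup's; substitute your ingress interface):

```shell
# Print the per-second increase of rx_missed_errors on the ingress NIC.
IFACE=eth1
prev=$(cat /sys/class/net/$IFACE/statistics/rx_missed_errors)
while sleep 1; do
  cur=$(cat /sys/class/net/$IFACE/statistics/rx_missed_errors)
  echo "$(( cur - prev )) missed packets/s"
  prev=$cur
done
```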

When only 100 channels (approx 500 Mbit) are forwarded, there is no packet loss and ksoftirqd/3 consumes only about 5-6% CPU. But when 600 Mbit is forwarded, ksoftirqd/3 consumes 100%, so it seems some bottleneck is hit.

Is it out of the question that an old server like this can forward 1 Gbit of UDP traffic in one direction between two built-in NICs? Even though the packets are large (1316 bytes of payload), that is only a moderate 80-90 kp/s at 1 Gbit.
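As a sanity check, that packet rate follows from the on-wire frame size. Header sizes below are standard (UDP 8, IPv4 20, Ethernet 14 + 4 FCS); the 20 bytes of preamble and inter-frame gap assume plain Ethernet framing:

```shell
# Estimate the packet rate needed to saturate 1 Gbit/s with 1316-byte payloads.
awk 'BEGIN {
  wire = 1316 + 8 + 20 + 14 + 4 + 20   # bytes per packet on the wire
  pps  = 1e9 / (wire * 8)              # packets per second at 1 Gbit/s
  printf "%.0f packets/s\n", pps
}'
# prints "90449 packets/s"
```

So the ~80'000 p/s observed at ~900 Mbit is consistent with these stream parameters.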

Best Answer

We abandoned the server; by spec, the two on-board network interfaces were not supposed to drive full gigabit traffic. The second interface was intended to be used for management.

A standard desktop Core i5 with PCIe and two Intel i210 gigabit adapters was able to forward 1 Gbit of multicast UDP traffic with no problem.

It did, however, require tweaking the RX and TX ring buffers (ethtool -G) due to burstiness in the traffic. An x2 or x4 PCIe link would probably further reduce the risk of missed packets due to PCIe bus congestion.
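The ring-buffer tweak looked roughly like this (run as root; the 4096-descriptor value is illustrative, use whatever maximum `ethtool -g` reports for your adapter):

```shell
# Inspect current and maximum RX/TX ring sizes, then raise them to absorb bursts.
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096
ethtool -g eth1
ethtool -G eth1 rx 4096 tx 4096
```

Larger rings give the driver more headroom during traffic bursts before rx_missed_errors starts climbing, at the cost of slightly higher latency.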
