Red Hat – How to fine-tune TCP performance on Linux with a 10Gb fiber connection

Tags: kernel, networking, redhat, tcp

We have two Red Hat servers dedicated to customer speed tests. Both have 10Gb fiber connections and sit on 10Gb links, and all network gear between them fully supports 10Gb/s. Using iperf or iperf3, the best I can get is around 6.67Gb/s. One server is in production (customers are hitting it) while the other is online but not in use (we are using it for testing at the moment). I should also mention that the 6.67Gb/s is one way. We'll call these server A and server B.

When server A acts as the iperf server, we get the 6.67Gb/s speeds. When server A acts as the client to server B, it can only push about 20Mb/s.
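
For reference, the numbers above come from plain memory-to-memory iperf3 runs along these lines (a sketch; "serverA" is a placeholder host name, and -R is only available in reasonably recent iperf3 versions):

# on the receiving side
iperf3 -s

# on the sending side: 4 parallel streams for 30 seconds, then the same test reversed
iperf3 -c serverA -P 4 -t 30
iperf3 -c serverA -P 4 -t 30 -R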

What I have done:

So far the only thing I have done is increase the TX/RX ring buffers on both servers to their maximum. One was set to 512, the other to 453 (RX only; TX was already maxed out). Here is what that looks like on both after the update (the ethtool commands are sketched after the output):

Server A:
Ring parameters for em1:
Pre-set maximums:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096
Current hardware settings:
RX:     4096
RX Mini:    0
RX Jumbo:   0
TX:     4096

Server B:
Ring parameters for p1p1:
Pre-set maximums:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078
Current hardware settings:
RX:     4078
RX Mini:    0
RX Jumbo:   0
TX:     4078
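
For completeness, the rings above were read and raised with ethtool, roughly like this (interface name as above):

# show pre-set maximums and current settings
ethtool -g em1

# raise RX and TX rings to the hardware maximum reported above
ethtool -G em1 rx 4096 tx 4096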

The NICs look like this:

Server A: 
ixgbe 0000:01:00.0: em1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Server B:
bnx2x 0000:05:00.0: p1p1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit

Server A ethtool stats:
 rx_errors: 0
 tx_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_csum_offload_errors: 123049

 Server B ethtool stats:
 [0]: rx_phy_ip_err_discards: 0
 [0]: rx_csum_offload_errors: 0
 [1]: rx_phy_ip_err_discards: 0
 [1]: rx_csum_offload_errors: 0
 [2]: rx_phy_ip_err_discards: 0
 [2]: rx_csum_offload_errors: 0
 [3]: rx_phy_ip_err_discards: 0
 [3]: rx_csum_offload_errors: 0
 [4]: rx_phy_ip_err_discards: 0
 [4]: rx_csum_offload_errors: 0
 [5]: rx_phy_ip_err_discards: 0
 [5]: rx_csum_offload_errors: 0
 [6]: rx_phy_ip_err_discards: 0
 [6]: rx_csum_offload_errors: 0
 [7]: rx_phy_ip_err_discards: 0
 [7]: rx_csum_offload_errors: 0
 rx_error_bytes: 0
 rx_crc_errors: 0
 rx_align_errors: 0
 rx_phy_ip_err_discards: 0
 rx_csum_offload_errors: 0
 tx_error_bytes: 0
 tx_mac_errors: 0
 tx_carrier_errors: 0
 tx_deferred: 0
 recoverable_errors: 0
 unrecoverable_errors: 0

Potential issue: Server A has tons of rx_csum_offload_errors. Server A is the one in production, and I can't help but think that CPU interrupts may be an underlying factor here and what's causing the errors I see.

cat /proc/interrupts from Server A:

122:   54938283          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-0
123:   51653771          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-1
124:   52277181          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-2
125:   51823314          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-3
126:   57975011          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-4
127:   52333500          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-5
128:   51899210          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-6
129:   61106425          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-7
130:   51774758          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-8
131:   52476407          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-9
132:   53331215          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-TxRx-10
133:   52135886          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0

Would disabling rx-checksumming help if that is the issue? (A quick way to toggle it for a test is sketched after the feature listing below.) Also, I see no CPU interrupts on the server that's not in production, which makes sense, since its NIC isn't handling enough traffic to need CPU time.

Server A:
 ethtool -k em1
Features for em1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-unneeded: off
tx-checksum-ip-generic: off
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: on [fixed]
tx-checksum-sctp: on [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
loopback: off [fixed]
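
For the record, this is how I would toggle it for a test and watch the counter (interface name as above; just a sketch):

# disable RX checksum offload, rerun iperf, then re-check the counter
ethtool -K em1 rx off
ethtool -S em1 | grep rx_csum_offload_errors

# re-enable it afterwards
ethtool -K em1 rx on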

Other than using jumbo frames, which is not possible because our network gear does not support them, what else can I do or check to get the best TCP performance out of my 10Gb network? The 6.67Gb/s is not that bad, I guess, considering that one of the servers is in production and given my hypothesis about the CPU interrupts the NIC is generating. But 20Mb/s in the other direction on a 10Gb link is simply not acceptable. Any help would be greatly appreciated.

Server A specs:
x64, 24 vCPUs
32GB RAM
RHEL 6.7

Server B Specs:
x64, 16 vCPUs
16GB RAM
RHEL 6.7

Best Answer

On Linux with Intel hardware, I would use the following methodology for performance analysis:

Hardware:

  • turbostat
    Look for C/P states for cores, frequencies, number of SMIs. [1]
  • cpufreq-info
    Look for current driver, frequencies, and governor.
  • atop
    Look for interrupt distribution across cores
    Look for context switches, interrupts.
  • ethtool
    -S for stats; look for errors, drops, overruns, missed interrupts, etc.
    -k for offloads; enable GRO/GSO, RSS(/RPS/RFS)/XPS
    -g for ring sizes; increase them
    -c for interrupt coalescing (a quick pass over these four is sketched just below)
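
That pass looks roughly like this (em1 is just the example interface from the question; -G and -C are the corresponding setters):

ethtool -S em1 | egrep -i 'err|drop|over|miss'   # error/drop/overrun counters
ethtool -k em1                                   # current offload settings
ethtool -g em1                                   # ring sizes (change with -G)
ethtool -c em1                                   # interrupt coalescing (change with -C)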

Kernel:

  • /proc/net/softnet_stat[2] and /proc/interrupts[3]
    Again, look at distribution, missed or delayed interrupts, and (optionally) NUMA affinity
  • perf top
    Look where kernel/benchmark spends its time.
  • iptables
    Check whether there are any rules that may affect performance.
  • netstat -s, netstat -m, /proc/net/*
    Look for error counters and buffer counts
  • sysctl / grub
    So much to tweak here. Try increasing hashtable sizes, playing with memory buffers, congestion control, and other knobs (a few common starting points are sketched right after this list).
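
As an illustration only (these are common 10GbE starting points, not values tuned for your boxes), the socket-buffer side of that sysctl tuning usually looks something like:

# /etc/sysctl.conf -- apply with 'sysctl -p' and re-test after each change
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# congestion control: cubic is the default on RHEL 6; htcp is another common choice to test
net.ipv4.tcp_congestion_control = cubic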

In your case the main problem is interrupt distribution across the cores: in your /proc/interrupts output everything lands on CPU0. Fixing that will be your best course of action.
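
A minimal sketch of spreading those em1-TxRx-N queues across cores by hand (IRQ numbers taken from the /proc/interrupts output above; the values written are hex CPU masks):

# stop irqbalance first so it does not overwrite the affinities
service irqbalance stop

# pin queue 0 (IRQ 122) to CPU0, queue 1 (IRQ 123) to CPU1, queue 2 to CPU2, ...
echo 1 > /proc/irq/122/smp_affinity
echo 2 > /proc/irq/123/smp_affinity
echo 4 > /proc/irq/124/smp_affinity
# and so on, one queue per core; the set_irq_affinity script shipped with the
# Intel ixgbe source (see the scripts directory mentioned below) automates this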

PS. Do not forget that in those kinds of benchmarks kernel and driver/firmware versions play a significant role.

PPS. You probably want to install the newest ixgbe driver from Intel[4]. Do not forget to read the README there and examine the scripts directory. It has lots of performance-related tips.
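
From memory, the out-of-tree build goes roughly like this (version number is a placeholder; check the README for the exact steps on RHEL 6):

tar zxf ixgbe-<version>.tar.gz
cd ixgbe-<version>/src
make install                      # builds and installs the module for the running kernel
rmmod ixgbe && modprobe ixgbe     # reload the driver (this briefly drops the link)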

[0] Intel also has nice docs about scaling network performance
https://www.kernel.org/doc/Documentation/networking/scaling.txt
[1] You can pin your processor to a specific C-state:
https://gist.github.com/SaveTheRbtz/f5e8d1ca7b55b6a7897b
[2] You can analyze that data with:
https://gist.github.com/SaveTheRbtz/172b2e2eb3cbd96b598d
[3] You can set affinity with:
https://gist.github.com/SaveTheRbtz/8875474
[4] https://sourceforge.net/projects/e1000/files/ixgbe%20stable/
