Debian – Linux optical 10GbE networking, how to diagnose performance problems

bonding, debian, lacp, linux-networking, sfp

I have a small cluster consisting of 3 servers. Each has two 10GbE SFP+ optical network cards. There are two separate 10GbE switches. On all servers, one NIC is connected to switch 1 and the second NIC is connected to switch 2 to provide fault tolerance.

The physical interfaces are bonded at the server level using LACP.
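For context, the bonds are configured roughly like this (a minimal /etc/network/interfaces sketch; the interface names and address are placeholders, not my exact values):

    auto bond0
    iface bond0 inet static
        # placeholder address and slave names
        address 10.0.0.11/24
        bond-slaves enp3s0f0 enp3s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate fast
        bond-xmit-hash-policy layer3+4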

All servers can ping each other, but to one of them there is a small (~4%) packet loss (over the bonded interface, which looks suspicious to me).

When I check transfer rates with iperf3 between the two good servers, they show about 9.8 Gbit/s in both directions.

Those two good servers can also download from the problematic one at about 9.8 Gbit/s.

iperf3 shows something strange when run as a client on the problematic server: it starts at a few hundred megabits per second in the first interval, then the speed drops to 0 bit/s (while a concurrent ICMP ping still has a ~96% success rate). This happens only in one direction.
When the other servers download from this one, they get full speed.
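For reference, the tests were of this form (IP addresses are placeholders):

    # on the remote (good) server
    iperf3 -s

    # on the problematic server: client -> server direction
    iperf3 -c 10.0.0.12 -t 30

    # same path, reversed (server -> client), for comparison
    iperf3 -c 10.0.0.12 -t 30 -R

    # several parallel streams, so the LACP hash can spread flows
    iperf3 -c 10.0.0.12 -t 30 -P 4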

It's all running on the same hardware, and even the firmware versions are identical (Dell R620 servers, Mellanox ConnectX-3 EN NICs, Opton SFP+ modules, MikroTik CRS309-1G-8S switches). The OS is also the same: the latest stable Debian with all updates and exactly the same installed packages.

There is no firewall; all iptables rules are cleared on all servers.

On the problematic server I checked the interfaces; both NICs show UP and are running at 10 Gbit/s full duplex.

cat /proc/net/bonding/bond0 also shows both interfaces UP and active, with no physical link errors.
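The checks I ran are along these lines (interface names are placeholders):

    # bond status: per-slave state, aggregator IDs, link failure counters
    cat /proc/net/bonding/bond0

    # per-NIC hardware counters: look for CRC/symbol errors, drops, discards
    ethtool -S enp3s0f0 | grep -Ei 'err|drop|discard'
    ethtool -S enp3s0f1 | grep -Ei 'err|drop|discard'

    # negotiated speed/duplex and software counters on the bond
    ethtool enp3s0f0
    ip -s link show bond0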

I checked/replaced the SFP+ modules, used different fiber patch cords, and tried different switch ports; nothing changes. This one problematic server still gets poor download speed from the others and small packet loss (over the bonded interface!).

I also tried different patch cord combinations (both connected, first connected / second disconnected, first disconnected / second connected). Again, no change.
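An equivalent software-level test is to force traffic over one slave at a time and repeat the ping/iperf3 runs (interface names and addresses are placeholders):

    # test over the first slave only
    ip link set enp3s0f1 down
    ping -c 500 -i 0.05 10.0.0.12
    iperf3 -c 10.0.0.12 -t 30

    # swap: test over the second slave only
    ip link set enp3s0f1 up
    ip link set enp3s0f0 down
    ping -c 500 -i 0.05 10.0.0.12
    iperf3 -c 10.0.0.12 -t 30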

Any ideas how I can diagnose this better?

Best Answer

Unless the switches support stacking and support LACP across chassis, LACP cannot work that way. In fact, static LAG trunking won't work either.

Generally, link aggregation only works against a single switch on the other end (or a stack acting like one).

With simple L2 redundancy, you can only run the NICs as an active/passive pair with failover. Using multiple L3 links with appropriate load balancing, plus IP migration on failover or monitoring by an external load balancer, would also work in your scenario.
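For example, a minimal active/passive bond in /etc/network/interfaces would look roughly like this (interface names and address are placeholders):

    auto bond0
    iface bond0 inet static
        # placeholder address and slave names
        address 10.0.0.11/24
        bond-slaves enp3s0f0 enp3s0f1
        bond-mode active-backup
        bond-miimon 100
        bond-primary enp3s0f0

This keeps one link idle as a hot standby, so it works across two independent switches without any LACP or MLAG support on the switch side.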
