FreeBSD link aggregation no faster than single link

Tags: bandwidth, bonding, freebsd, lacp

We put a 4-port Intel I340-T4 NIC in a FreeBSD 9.3 server[1] and configured it for link aggregation in LACP mode, in an attempt to decrease the time it takes to mirror 8 to 16 TiB of data from a master file server to 2-4 clones in parallel. We were expecting to get up to 4 Gbit/sec of aggregate bandwidth, but no matter what we've tried, it never comes out faster than 1 Gbit/sec aggregate.[2]

We're using iperf3 to test this on a quiescent LAN.[3] The first instance nearly hits a gigabit, as expected, but when we start a second one in parallel, the two clients drop in speed to roughly ½ Gbit/sec. Adding a third client drops all three clients' speeds to ~⅓ Gbit/sec, and so on.
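The tests themselves are nothing exotic; roughly the following, with the duration being illustrative rather than exact:

# on the file server
$ iperf3 -s

# on each test client, started one at a time
$ iperf3 -c 10.0.0.2 -t 60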

We've taken care in setting up the iperf3 tests that traffic from all four test clients comes into the central switch on different ports:

[Diagram: LACP test setup]

We've verified that each test machine has an independent path back to the rack switch and that the file server, its NIC, and the switch all have the bandwidth to pull this off by breaking up the lagg0 group and assigning a separate IP address to each of the four interfaces on this Intel network card. In that configuration, we did achieve ~4 Gbit/sec aggregate bandwidth.
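That control setup was nothing more than the four igb ports un-lagged, each with its own address; roughly this in /etc/rc.conf, with the addresses shown here being illustrative:

ifconfig_igb0="inet 10.0.0.10 netmask 255.255.255.0"
ifconfig_igb1="inet 10.0.0.11 netmask 255.255.255.0"
ifconfig_igb2="inet 10.0.0.12 netmask 255.255.255.0"
ifconfig_igb3="inet 10.0.0.13 netmask 255.255.255.0"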

When we started down this path, we were doing this with an old SMC8024L2 managed switch. (PDF datasheet, 1.3 MB.) It wasn't the highest-end switch of its day, but it's supposed to be able to do this. We thought the switch might be at fault, due to its age, but upgrading to a much more capable HP 2530-24G did not change the symptom.

The HP 2530-24G switch claims the four ports in question are indeed configured as a dynamic LACP trunk:

# show trunks
Load Balancing Method:  L3-based (default)

  Port | Name                             Type      | Group Type    
  ---- + -------------------------------- --------- + ----- --------
  1    | Bart trunk 1                     100/1000T | Dyn1  LACP    
  3    | Bart trunk 2                     100/1000T | Dyn1  LACP    
  5    | Bart trunk 3                     100/1000T | Dyn1  LACP    
  7    | Bart trunk 4                     100/1000T | Dyn1  LACP    

We've tried both passive and active LACP.

We've verified that all four NIC ports are getting traffic on the FreeBSD side with:

$ sudo tshark -n -i igb$n    # run once for each member, n = 0 through 3

Oddly, tshark shows that in the case of just one client, the switch splits the 1 Gbit/sec stream over two ports, apparently ping-ponging between them. (Both the SMC and HP switches showed this behavior.)
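Per-interface counters are a lighter-weight way to watch the same distribution, if you'd rather not capture packets:

$ netstat -w 1 -I igb0    # one-second in/out counters for a single member
$ systat -ifstat          # live per-interface throughput, all NICs at once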

Since the clients' aggregate bandwidth only comes together in a single place — at the switch in the server's rack — only that switch is configured for LACP.

It doesn't matter which client we start first, or which order we start them in.

ifconfig lagg0 on the FreeBSD side says:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=401bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
    ether 90:e2:ba:7b:0b:38
    inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
    inet6 fe80::92e2:baff:fe7b:b38%lagg0 prefixlen 64 scopeid 0xa 
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    laggproto lacp lagghash l2,l3,l4
    laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
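For reference, that lagg0 is built the standard lagg(4) way in /etc/rc.conf; approximately:

ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_igb2="up"
ifconfig_igb3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 laggport igb2 laggport igb3 10.0.0.2 netmask 255.255.255.0"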

We've applied as much of the advice in the FreeBSD network tuning guide as makes sense to our situation. (Much of it is irrelevant, such as the stuff about increasing max FDs.)

We've tried turning off TCP segmentation offloading, with no change in the results.
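That was done with the usual ifconfig capability flags, something like:

$ sudo ifconfig igb0 -tso    # and likewise for igb1 through igb3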

We do not have a second 4-port server NIC to set up a second test. Because of the successful test with 4 separate interfaces, we're going on the assumption that none of the hardware is damaged.[4]

We see these paths forward, none of them appealing:

  1. Buy a bigger, badder switch, hoping that SMC's LACP implementation just sucks, and that the new switch will be better. (Upgrading to an HP 2530-24G didn't help.)

  2. Stare at the FreeBSD lagg configuration some more, hoping that we missed something.[5]

  3. Forget link aggregation and use round-robin DNS to effect the load balancing instead. (A sketch of what that would look like follows this list.)

  4. Replace the server NIC and switch again, this time with 10 GigE stuff, at about 4× the hardware cost of this LACP experiment.
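For option 3, the round-robin DNS fallback would amount to publishing one A record per un-lagged igb port under a single name. An illustrative BIND zone fragment (the name and addresses are hypothetical):

; four A records for the file server, handed out round-robin by the resolver
bart    IN  A   10.0.0.10
bart    IN  A   10.0.0.11
bart    IN  A   10.0.0.12
bart    IN  A   10.0.0.13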


Footnotes

  1. Why not move to FreeBSD 10, you ask? Because FreeBSD 10.0-RELEASE still uses ZFS pool version 28, and this server's been upgraded to ZFS pool 5000, a new feature in FreeBSD 9.3. The 10.x line won't get that until FreeBSD 10.1 ships about a month hence. And no, rebuilding from source to get onto the 10.0-STABLE bleeding edge isn't an option, since this is a production server.

  2. Please don't jump to conclusions. Our test results later in the question explain why this is not a duplicate of that other question.

  3. iperf3 is a pure network test. While the eventual goal is to fill that 4 Gbit/sec aggregate pipe from disk, we are not yet involving the disk subsystem.

  4. Buggy or poorly designed, maybe, but no more broken than when it left the factory.

  5. I've already gone cross-eyed from doing that.

Best Answer

What is the load balancing algorithm in use on both the system and the switch?

All my experience with this is on Linux and Cisco, not FreeBSD and SMC, but the same theory still applies.

The default load-balancing policy in the Linux bonding driver's LACP mode, and on older Cisco switches like the 2950, is to hash on MAC address only.

This means that if all your traffic is going from one system (the file server) to one other MAC address (either a default gateway or a Switched Virtual Interface on the switch), then the source and destination MAC pair is the same for every frame, so only one slave will ever be used.
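To make that concrete, the layer-2 policy hashes little more than the two MAC addresses, so a fixed MAC pair always selects the same slave; a simplified sketch:

# layer-2 hash, simplified:
#   slave_index = (src_MAC XOR dst_MAC) mod number_of_slaves
# With the file server's MAC fixed and every routed flow carrying the
# gateway/SVI as its destination MAC, the result is identical for every
# TCP stream, so every stream rides the same slave.
# An L3/L4 policy mixes source/destination IPs and ports into the hash,
# so different clients can land on different slaves.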

From your diagram it doesn't look like you're sending traffic to a default gateway, but I'm not sure if the test servers are in 10.0.0.0/24, or if the test systems are in other subnets and being routed via a Layer 3 interface on the switch.

If you are routing on the switch, there's your answer.

The solution to this is to use a different load balancing algorithm.

Again, I don't have experience with BSD or SMC, but Linux and Cisco can balance either on L3 information (IP addresses) or on L4 information (port numbers).

As each of your test systems must have a different IP, try balancing based on L3 information. If that still doesn't work, change a few IPs around and see if you change the load balancing pattern.
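I can't give you the exact FreeBSD or HP syntax from memory, but the knobs you're after look roughly like these; treat them as sketches and check your versions' documentation:

# FreeBSD lagg(4): hash on L3/L4 information instead of l2,l3,l4
$ sudo ifconfig lagg0 lagghash l3,l4

# Linux bonding: set the transmit hash policy as a bond option
xmit_hash_policy=layer3+4

# Cisco IOS: hash port-channels on source/destination IP
Switch(config)# port-channel load-balance src-dst-ip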
