FreeBSD link aggregation no faster than single link

Tags: bandwidth, bonding, freebsd, lacp

We put a 4-port Intel I340-T4 NIC in a FreeBSD 9.3 server[1] and configured it for link aggregation in LACP mode, in an attempt to decrease the time it takes to mirror 8 to 16 TiB of data from a master file server to 2-4 clones in parallel. We were expecting to get up to 4 Gbit/sec of aggregate bandwidth, but no matter what we've tried, it never comes out faster than 1 Gbit/sec aggregate.[2]

We're using iperf3 to test this on a quiescent LAN.[3] The first instance nearly hits a gigabit, as expected, but when we start a second one in parallel, the two clients drop in speed to roughly ½ Gbit/sec. Adding a third client drops all three clients' speeds to ~⅓ Gbit/sec, and so on.
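The tests themselves are nothing exotic; roughly the following, with the duration being illustrative rather than exact:

# on the file server
$ iperf3 -s

# on each test client, started one at a time
$ iperf3 -c 10.0.0.2 -t 60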

We've taken care in setting up the iperf3 tests that traffic from all four test clients comes into the central switch on different ports:

[Diagram: LACP test setup]

We've verified that each test machine has an independent path back to the rack switch and that the file server, its NIC, and the switch all have the bandwidth to pull this off by breaking up the lagg0 group and assigning a separate IP address to each of the four interfaces on this Intel network card. In that configuration, we did achieve ~4 Gbit/sec aggregate bandwidth.
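That control setup was nothing more than the four igb ports un-lagged, each with its own address; roughly this in /etc/rc.conf, with the addresses shown here being illustrative:

ifconfig_igb0="inet 10.0.0.10 netmask 255.255.255.0"
ifconfig_igb1="inet 10.0.0.11 netmask 255.255.255.0"
ifconfig_igb2="inet 10.0.0.12 netmask 255.255.255.0"
ifconfig_igb3="inet 10.0.0.13 netmask 255.255.255.0"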

When we started down this path, we were doing this with an old SMC8024L2 managed switch. (PDF datasheet, 1.3 MB.) It wasn't the highest-end switch of its day, but it's supposed to be able to do this. We thought the switch might be at fault, due to its age, but upgrading to a much more capable HP 2530-24G did not change the symptom.

The HP 2530-24G switch claims the four ports in question are indeed configured as a dynamic LACP trunk:

# show trunks
Load Balancing Method:  L3-based (default)

  Port | Name                             Type      | Group Type    
  ---- + -------------------------------- --------- + ----- --------
  1    | Bart trunk 1                     100/1000T | Dyn1  LACP    
  3    | Bart trunk 2                     100/1000T | Dyn1  LACP    
  5    | Bart trunk 3                     100/1000T | Dyn1  LACP    
  7    | Bart trunk 4                     100/1000T | Dyn1  LACP    

We've tried both passive and active LACP.

We've verified that all four NIC ports are getting traffic on the FreeBSD side with:

$ sudo tshark -n -i igb$n    # run once for each member, n = 0 through 3

Oddly, tshark shows that in the case of just one client, the switch splits the 1 Gbit/sec stream over two ports, apparently ping-ponging between them. (Both the SMC and HP switches showed this behavior.)
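Per-interface counters are a lighter-weight way to watch the same distribution, if you'd rather not capture packets:

$ netstat -w 1 -I igb0    # one-second in/out counters for a single member
$ systat -ifstat          # live per-interface throughput, all NICs at once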

Since the clients' aggregate bandwidth only comes together in a single place — at the switch in the server's rack — only that switch is configured for LACP.

It doesn't matter which client we start first, or which order we start them in.

ifconfig lagg0 on the FreeBSD side says:

lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=401bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO>
    ether 90:e2:ba:7b:0b:38
    inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
    inet6 fe80::92e2:baff:fe7b:b38%lagg0 prefixlen 64 scopeid 0xa 
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    laggproto lacp lagghash l2,l3,l4
    laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
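For reference, that lagg0 is built the standard lagg(4) way in /etc/rc.conf; approximately:

ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_igb2="up"
ifconfig_igb3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 laggport igb2 laggport igb3 10.0.0.2 netmask 255.255.255.0"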

We've applied as much of the advice in the FreeBSD network tuning guide as makes sense to our situation. (Much of it is irrelevant, such as the stuff about increasing max FDs.)

We've tried turning off TCP segmentation offloading, with no change in the results.
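That was done with the usual ifconfig capability flags, something like:

$ sudo ifconfig igb0 -tso    # and likewise for igb1 through igb3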

We do not have a second 4-port server NIC to set up a second test. Because of the successful test with 4 separate interfaces, we're going on the assumption that none of the hardware is damaged.[4]

We see these paths forward, none of them appealing:

  1. Buy a bigger, badder switch, hoping that SMC's LACP implementation just sucks, and that the new switch will be better. (Upgrading to an HP 2530-24G didn't help.)

  2. Stare at the FreeBSD lagg configuration some more, hoping that we missed something.[5]

  3. Forget link aggregation and use round-robin DNS to effect the load balancing instead. (A sketch of what that would look like follows this list.)

  4. Replace the server NIC and switch again, this time with 10 GigE stuff, at about 4× the hardware cost of this LACP experiment.
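For option 3, the round-robin DNS fallback would amount to publishing one A record per un-lagged igb port under a single name. An illustrative BIND zone fragment (the name and addresses are hypothetical):

; four A records for the file server, handed out round-robin by the resolver
bart    IN  A   10.0.0.10
bart    IN  A   10.0.0.11
bart    IN  A   10.0.0.12
bart    IN  A   10.0.0.13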


Footnotes

  1. Why not move to FreeBSD 10, you ask? Because FreeBSD 10.0-RELEASE still uses ZFS pool version 28, and this server's been upgraded to ZFS pool 5000, a new feature in FreeBSD 9.3. The 10.x line won't get that until FreeBSD 10.1 ships about a month hence. And no, rebuilding from source to get onto the 10.0-STABLE bleeding edge isn't an option, since this is a production server.

  2. Please don't jump to conclusions. Our test results later in the question explain why this is not a duplicate of that other question.

  3. iperf3 is a pure network test. While the eventual goal is to fill that 4 Gbit/sec aggregate pipe from disk, we are not yet involving the disk subsystem.

  4. Buggy or poorly designed, maybe, but no more broken than when it left the factory.

  5. I've already gone cross-eyed from doing that.

Best Answer

What is the load balancing algorithm in use on both the system and the switch?

All my experience with this is on Linux and Cisco, not FreeBSD and SMC, but the same theory still applies.

The default load-balancing policy in the Linux bonding driver's LACP mode, and on older Cisco switches like the 2950, is to hash on MAC address only.

This means that if all your traffic is going from one system (the file server) to one other MAC address (either a default gateway or a Switched Virtual Interface on the switch), then the source and destination MAC pair is the same for every frame, so only one slave will ever be used.
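To make that concrete, the layer-2 policy hashes little more than the two MAC addresses, so a fixed MAC pair always selects the same slave; a simplified sketch:

# layer-2 hash, simplified:
#   slave_index = (src_MAC XOR dst_MAC) mod number_of_slaves
# With the file server's MAC fixed and every routed flow carrying the
# gateway/SVI as its destination MAC, the result is identical for every
# TCP stream, so every stream rides the same slave.
# An L3/L4 policy mixes source/destination IPs and ports into the hash,
# so different clients can land on different slaves.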

From your diagram it doesn't look like you're sending traffic to a default gateway, but I'm not sure if the test servers are in 10.0.0.0/24, or if the test systems are in other subnets and being routed via a Layer 3 interface on the switch.

If you are routing on the switch, there's your answer.

The solution to this is to use a different load balancing algorithm.

Again, I don't have experience with BSD or SMC, but Linux and Cisco can balance either on L3 information (IP addresses) or on L4 information (port numbers).

As each of your test systems must have a different IP, try balancing based on L3 information. If that still doesn't work, change a few IPs around and see if you change the load balancing pattern.
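I can't give you the exact FreeBSD or HP syntax from memory, but the knobs you're after look roughly like these; treat them as sketches and check your versions' documentation:

# FreeBSD lagg(4): hash on L3/L4 information instead of l2,l3,l4
$ sudo ifconfig lagg0 lagghash l3,l4

# Linux bonding: set the transmit hash policy as a bond option
xmit_hash_policy=layer3+4

# Cisco IOS: hash port-channels on source/destination IP
Switch(config)# port-channel load-balance src-dst-ip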
