Linux Networking – Switch Flooding When Bonding Interfaces

                                 +--------+
                                 | Host A |
                                 +----+---+
                                     | eth0 (AA:AA:AA:AA:AA:AA)
                                     |
                                     |
                                +----+-----+
                                | Switch 1 | (layer2/3)
                                +----+-----+
                                     |
                                +----+-----+
                                | Switch 2 |
                                +----+-----+
                                     |
                          +----------+----------+
+-------------------------+       Switch 3      +-------------------------+
|                         +----+-----------+----+                         |
|                              |           |                              |
|                              |           |                              |
|     eth0 (B0:B0:B0:B0:B0:B0) |           | eth4 (B4:B4:B4:B4:B4:B4)     |
|                         +----+-----------+----+                         |
|                         |        Host B       |                         |
|                         +----+-----------+----+                         |
|     eth1 (B1:B1:B1:B1:B1:B1) |           | eth5 (B5:B5:B5:B5:B5:B5)     |
|                              |           |                              |
|                              |           |                              |
+------------------------------+           +------------------------------+

Topology overview
- Host A has a single NIC.
- Host B has four NICs which are bonded using the balance-alb mode.
- Both hosts run RHEL 6.0, and both are on the same IPv4 subnet.
Traffic analysis
- Host A is sending data to Host B using some SQL database application.
- Traffic from Host A to Host B: The source int/MAC is eth0/AA:AA:AA:AA:AA:AA, the destination int/MAC is eth5/B5:B5:B5:B5:B5:B5.
- Traffic from Host B to Host A: The source int/MAC is eth0/B0:B0:B0:B0:B0:B0, the destination int/MAC is eth0/AA:AA:AA:AA:AA:AA.
- Once the TCP connection has been established, Host B sends no further frames out eth5.
- The MAC address of eth5 expires from the bridge tables of both Switch 1 & Switch 2.
- Switch 1 continues to receive frames from Host A which are destined for B5:B5:B5:B5:B5:B5.
- Because Switch 1 and Switch 2 no longer have bridge table entries for B5:B5:B5:B5:B5:B5, they flood the frames out all ports on the same VLAN (except for the port each frame arrived on, of course).
Reproduce
- If you ping Host B from a workstation which is connected to either Switch 1 or 2, B5:B5:B5:B5:B5:B5 re-enters the bridge tables and the flooding stops.
- After five minutes (the default bridge table timeout), flooding resumes.
Question
- It is clear that on Host B, frames arrive on eth5 and exit out eth0. This seems ok as that's what the Linux bonding algorithm is designed to do – balance incoming and outgoing traffic. But since the switch stops receiving frames with the source MAC of eth5, it gets timed out of the bridge table, resulting in flooding.
- Is this normal? Why aren't any more frames originating from eth5? Is it because there is simply no other traffic going on (the only connection is a single large data transfer from Host A)?

I've researched this for a long time and haven't found an answer. Documentation states that no switch changes are necessary when using mode 6 of the Linux interface bonding (balance-alb). Is this behavior occurring because Host B doesn't send any further packets out of eth5, whereas in normal circumstances it's expected that it would? One solution is to set up a cron job which pings Host B to keep the bridge table entries from timing out, but that seems like a dirty hack.

Best Answer

Yes, this is expected. You've hit a fairly common issue with NIC bonding to hosts: unicast flooding. As you've noted, the timers on your switches for the hardware addresses in question are expiring because no frames sourced from these addresses are being observed.
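
To make the mechanism concrete, here is a minimal Python sketch of a learning switch with a 300-second aging timer (the five-minute default mentioned in the question). Everything else in it is a simplification assumed for illustration: a single switch stands in for Switches 1 to 3, the port numbers are made up, and Host A sends one frame every 10 seconds.

```python
from dataclasses import dataclass, field

MAC_A = "AA:AA:AA:AA:AA:AA"   # Host A eth0
MAC_B0 = "B0:B0:B0:B0:B0:B0"  # Host B eth0 (source of B -> A traffic)
MAC_B5 = "B5:B5:B5:B5:B5:B5"  # Host B eth5 (destination of A -> B traffic)


@dataclass
class LearningSwitch:
    aging: int = 300                           # seconds before an idle entry expires
    table: dict = field(default_factory=dict)  # MAC -> (port, last_seen)

    def receive(self, now, in_port, src_mac, dst_mac):
        """Learn or refresh the source MAC, then return the egress port or "flood"."""
        self.table[src_mac] = (in_port, now)
        entry = self.table.get(dst_mac)
        if entry is None or now - entry[1] > self.aging:
            self.table.pop(dst_mac, None)      # expired or never learned
            return "flood"
        return entry[0]


def simulate(duration=600, step=10):
    sw = LearningSwitch()
    # Assumed: the last frame Host B ever sources from eth5 (connection setup).
    sw.receive(0, 3, MAC_B5, MAC_A)
    log = []
    for t in range(0, duration + 1, step):
        to_b = sw.receive(t, 1, MAC_A, MAC_B5)  # data from Host A
        to_a = sw.receive(t, 2, MAC_B0, MAC_A)  # replies leave via eth0
        log.append((t, to_b, to_a))
    return log


log = simulate()
first_flood = next(t for t, to_b, _ in log if to_b == "flood")
print(first_flood)  # 310: first frame sent after the B5 entry aged out
```

Replies to Host A never flood in this model because Host A keeps refreshing its own entry. Only the silent eth5 address drops out of the table, which matches what you're seeing.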

Here are the general options:

1.) Longer address table timeouts. On a mixed L2/L3 switch the ARP and CAM timers should be close to one another (with the CAM timer running a few seconds longer). This recommendation stands regardless of the rest of the configuration. On the L2 switch the timers can generally be set longer without too many problems. That said, unless you disable the timers altogether you'll be back in the same situation eventually if there isn't some kind of traffic sourcing from those other addresses.

2.) You could hard-code the MAC addresses on the switches in question (all of the switches in the diagram, unfortunately). This is obviously not optimal for a number of reasons.

3.) Change the bonding mode on the Linux side to one that uses a common source MAC (i.e. 802.3ad / LACP). This has a lot of operational advantages if your switch supports it.

4.) Generate gratuitous ARPs via a cron job from each interface. You may need some dummy IPs on the various interfaces to prevent an oscillation condition (i.e. the host's IP cycles through the various hardware addresses).

5.) If it's a traffic issue, just go to 10GE! (sorry - had to throw that in there)

The LACP route is probably the most common and supportable, and the switches can likely be configured to balance inbound traffic to the server fairly evenly across the various links. Failing that, I think the gratuitous ARP option is going to be the easiest to integrate.
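
If you go the gratuitous ARP route, the frame itself is simple to build. The following is a rough Python sketch, not a tested production tool: the interface name, the dummy IP (192.0.2.55, from the documentation range) and both function names are hypothetical, and the send path assumes Linux `AF_PACKET` raw sockets run as root. Whether the bonding driver lets a raw frame out of a particular slave in your setup is something you would need to verify.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast ARP request whose sender and target IP are the same."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: Ethernet/IPv4, 6-byte MACs, 4-byte IPs, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes      # sender hardware / protocol address
    arp += b"\x00" * 6 + ip_bytes    # target hardware (unused) / same IP
    return eth + arp


def send_gratuitous_arp(ifname: str, mac: str, ip: str) -> None:
    """Linux only; needs root or CAP_NET_RAW. ifname is e.g. a bond slave."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(build_gratuitous_arp(mac, ip))


# Hypothetical values: eth5's MAC from the diagram and a dummy per-NIC IP.
frame = build_gratuitous_arp("B5:B5:B5:B5:B5:B5", "192.0.2.55")
```

A cron entry would then call something like `send_gratuitous_arp("eth5", ...)` for each slave at an interval shorter than the switch's aging time. The switch only cares about the source MAC in the Ethernet header, so any frame sourced from that address would refresh the entry; ARP is just a convenient carrier.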

Related Solutions

Networking – How Does Layer 3 LACP Destination Address Hashing Work?

What you're looking for is commonly called a "transmit hash policy" or "transmit hash algorithm". It controls the selection of a port from a group of aggregate ports with which to transmit a frame.

Getting my hands on the 802.3ad standard has proven difficult because I'm not willing to spend money on it. Having said that, I've been able to glean some information from a semi-official source that sheds some light on what you're looking for. Per this presentation from the 2007 Ottawa, ON, CA IEEE High Speed Study Group meeting, the 802.3ad standard does not mandate particular algorithms for the "frame distributor":

This standard does not mandate any particular distribution algorithm(s); however, any distribution algorithm shall ensure that, when frames are received by a Frame Collector as specified in 43.2.3, the algorithm shall not cause a) Mis-ordering of frames that are part of any given conversation, or b) Duplication of frames. The above requirement to maintain frame ordering is met by ensuring that all frames that compose a given conversation are transmitted on a single link in the order that they are generated by the MAC Client; hence, this requirement does not involve the addition (or modification) of any information to the MAC frame, nor any buffering or processing on the part of the corresponding Frame Collector in order to re-order frames.

So, whatever algorithm a switch / NIC driver uses to distribute transmitted frames must adhere to the requirements as stated in that presentation (which, presumably, was quoting from the standard). There is no particular algorithm specified, only a compliant behavior defined.

Even though there's no algorithm specified, we can look at a particular implementation to get a feel for how such an algorithm might work. The Linux kernel "bonding" driver, for example, has an 802.3ad-compliant transmit hash policy that applies the function (see bonding.txt in the Documentation/networking directory of the kernel source):

Destination Port = (((<source IP> XOR <dest IP>) AND 0xFFFF)
    XOR (<source MAC> XOR <destination MAC>)) MOD <ports in aggregate group>

This causes both the source and destination IP addresses, as well as the source and destination MAC addresses, to influence the port selection.
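
As a rough illustration, the formula can be written out literally in Python. This is a sketch of the documented formula only, with a function name of my own choosing; the actual driver code may differ in detail (for example, in how much of each MAC address it feeds into the XOR).

```python
import ipaddress


def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def layer23_port(src_ip, dst_ip, src_mac, dst_mac, n_ports):
    """Pick an egress port index from the layer 2 and layer 3 addresses."""
    ip_part = (int(ipaddress.IPv4Address(src_ip))
               ^ int(ipaddress.IPv4Address(dst_ip))) & 0xFFFF
    mac_part = mac_to_int(src_mac) ^ mac_to_int(dst_mac)
    return (ip_part ^ mac_part) % n_ports


# Same addresses always give the same port, so one conversation stays on
# one link and its frames cannot be reordered by the distributor.
p1 = layer23_port("10.0.0.1", "10.0.0.2",
                  "00:00:00:00:00:01", "00:00:00:00:00:02", 4)
p2 = layer23_port("10.0.0.1", "10.0.0.3",
                  "00:00:00:00:00:01", "00:00:00:00:00:02", 4)
print(p1, p2)  # 0 1
```

Because nothing in the function depends on packet contents beyond the addresses, the ordering requirement quoted above is satisfied by construction.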

The destination IP address used in this type of hashing would be the address that's present in the frame. Take a second to think about that. The router's IP address, in an Ethernet frame headed away from your server to the Internet, isn't encapsulated anywhere in such a frame. The router's MAC address is present in the header of such a frame, but the router's IP address isn't. The destination IP address encapsulated in the frame's payload will be the address of the Internet client making the request to your server.

A transmit hash policy that takes into account both source and destination IP addresses, assuming you have a widely varied pool of clients, should do pretty well for you. In general, more widely varied source and/or destination IP addresses in the traffic flowing across such an aggregated infrastructure will result in more efficient aggregation when a layer 3-based transmit hash policy is used.

Your diagrams show requests coming directly to the servers from the Internet, but it's worth pointing out what a proxy might do to the situation. If you're proxying client requests to your servers then, as chris speaks about in his answer, you may cause bottlenecks. If that proxy is making the request from its own source IP address, instead of from the Internet client's IP address, you'll have fewer possible "flows" in a strictly layer 3-based transmit hash policy.
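
A quick way to see the effect of varied versus proxied addresses is to count which link each flow lands on. The sketch below reuses the layer 2+3 formula from the bonding documentation; the server, router and proxy addresses, the four-link aggregate and the 256 sequential client IPs are all made-up values for illustration.

```python
import ipaddress
from collections import Counter


def layer23_port(src_ip, dst_ip, src_mac, dst_mac, n_ports):
    """Layer 2+3 transmit hash, written out from the documented formula."""
    ip_part = (int(ipaddress.IPv4Address(src_ip))
               ^ int(ipaddress.IPv4Address(dst_ip))) & 0xFFFF
    mac_part = (int(src_mac.replace(":", ""), 16)
                ^ int(dst_mac.replace(":", ""), 16))
    return (ip_part ^ mac_part) % n_ports


# Hypothetical addresses (documentation ranges, locally administered MACs).
SERVER_IP, SERVER_MAC = "192.0.2.10", "02:00:00:00:00:10"
ROUTER_MAC = "02:00:00:00:00:01"
PROXY_IP, PROXY_MAC = "192.0.2.20", "02:00:00:00:00:20"
N_PORTS = 4

clients = [f"198.51.100.{i}" for i in range(256)]

# Server replies straight to each client: destination IP varies per flow.
direct = Counter(
    layer23_port(SERVER_IP, c, SERVER_MAC, ROUTER_MAC, N_PORTS) for c in clients
)
# Server replies to the proxy: every flow has the same address tuple.
proxied = Counter(
    layer23_port(SERVER_IP, PROXY_IP, SERVER_MAC, PROXY_MAC, N_PORTS)
    for _ in clients
)

print(sorted(direct.values()))  # [64, 64, 64, 64]
print(len(proxied))             # 1
```

With varied clients the 256 flows spread evenly over the four links, while the proxied case pins all of them to a single link. Real client populations won't be this tidy, but the contrast is the point.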

A transmit hash policy could also take layer 4 information (TCP / UDP port numbers) into account, so long as it kept with the requirements in the 802.3ad standard. Such an algorithm is in the Linux kernel, as you reference in your question. Beware that the documentation for that algorithm warns that, due to fragmentation, traffic may not necessarily flow along the same path and, as such, the algorithm isn't strictly 802.3ad-compliant.
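
To illustrate the fragmentation caveat, here is a small Python sketch. It assumes the layer 3+4 formula as I recall it from older versions of bonding.txt, `((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xFFFF)) MOD <ports>`, with the port numbers left out for fragmented packets. Treat both the formula and the function name as assumptions and check the documentation for your kernel version.

```python
import ipaddress


def layer34_port(src_ip, dst_ip, n_ports, src_port=None, dst_port=None):
    """Layer 3+4 style hash; ports are omitted (None) for fragmented packets."""
    ip_part = (int(ipaddress.IPv4Address(src_ip))
               ^ int(ipaddress.IPv4Address(dst_ip))) & 0xFFFF
    if src_port is None or dst_port is None:
        return ip_part % n_ports
    return ((src_port ^ dst_port) ^ ip_part) % n_ports


# Same conversation, hypothetical addresses and ports, four-link aggregate.
unfragmented = layer34_port("10.0.0.1", "10.0.0.2", 4, 40001, 80)
fragmented = layer34_port("10.0.0.1", "10.0.0.2", 4)
print(unfragmented, fragmented)  # 2 3
```

The two packets belong to the same conversation but hash to different links, so they could arrive out of order. That is the sense in which such a policy isn't strictly compliant with the ordering requirement.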

Cisco – Switch Floods Packets that should be Unicast

A quick primer:

ARP table - An L3 device (router, host, etc.) maintains a mapping between a given IP address and a corresponding MAC address.

CAM table - This may be known by other names in particular switch platforms, but the upshot is that a given L2 switching device maintains a mapping between a given hardware address and one or more physical switch ports.

What's happening in the case above is called unicast flooding. This is a condition where the router still has a live ARP entry even though the switch's CAM table has flushed the corresponding entry. As a result, when the router receives a packet for a given host it is simply forwarded to the switch without first sending an ARP request (the IP : MAC mapping is still cached). The switch, however, no longer knows the port to which this MAC address is mapped (this entry having been aged out earlier). If the switch doesn't have a CAM entry for a given unicast MAC then it will flood packets for that MAC to all ports until it sees a response (i.e. the response to an ARP request).

For obscure reasons ARP and CAM timers are generally quite different on Cisco switches. The values vary somewhat but the mismatch continues through the most modern Nexus devices. Best practice is to set the ARP and CAM timers to similar values - ideally with the CAM table set to 5 seconds or so longer than ARP. It's better for the router to re-ARP than for the switch to have to flood. Setting both values to ~600 seconds (10 minutes) generally isn't too bad, but some environments might want to go a bit longer if excessive ARP traffic is seen on the router.
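
A back-of-the-envelope way to reason about the mismatch is to compute how long the router can hold an ARP entry after the switch has already aged out the MAC. The Python sketch below is only that simple arithmetic for a host that stays silent; the function name is made up, and the 14400 s / 300 s figures are the Cisco IOS defaults as I remember them, so check your own platform.

```python
def flooding_window(arp_timeout_s: int, cam_aging_s: int) -> int:
    """Seconds during which the ARP entry is live but the CAM entry may be gone."""
    return max(0, arp_timeout_s - cam_aging_s)


# Assumed defaults: 4-hour ARP timeout, 5-minute CAM aging.
print(flooding_window(14400, 300))  # 14100
# Recommended alignment: CAM a few seconds longer than ARP.
print(flooding_window(600, 605))    # 0
```

With the CAM timer slightly longer than the ARP timer, the router re-ARPs before the switch forgets the port, and the reply refreshes the CAM entry.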
