Fixing a Bond Slave Interface Aggregator ID Mismatch with LACP

high-availability, lacp, networking, rhel6, rhel7

I have a bug on some servers where LACP (802.3ad) is not working.
All servers have a bonding device bond0 with two Ethernet slaves; each interface is plugged into a different switch, and both switches are configured for LACP.

Everything seems to be OK, but a network engineer detected that MLAG (the Arista multi-chassis LACP implementation) was not working even though the physical links were up.

When I looked at /proc/net/bonding/bond0 on the affected servers, I found that each slave interface has a different Aggregator ID. On healthy servers the Aggregator ID is the same on both slaves.
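
On an affected server the mismatch shows up with something like this (bond0 is the bond from my config; the pattern just pulls the per-slave aggregator lines out of the procfs file):

grep -E "Slave Interface|Aggregator ID" /proc/net/bonding/bond0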

The issue can be reproduced by switching the port off and on on the switch; we can then observe that, despite the physical link being up, MLAG is down. The bug is present on both RHEL 6 and RHEL 7 (but not all servers are affected).

Configuration

#/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
MACADDR=14:02:ec:44:e9:80
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4"
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no

# /etc/sysconfig/network-scripts/ifcfg-eno49 (same for other interface)
HWADDR=14:02:ec:44:e9:80
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no

We have a workaround now (bringing the slave interface down and back up on the server), but this is not ideal.
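
For reference, the workaround we use is roughly the following (eno49 is one of the slaves on my servers; ifdown/ifup on the slave also works):

ip link set eno49 down
ip link set eno49 up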

To check the LACP protocol, I ran:

tcpdump -i eno49 -tt -vv -nnn ether host 01:80:c2:00:00:02

I can see a packet every 30 seconds on one interface, but on the other I see a packet every second, as if it were still trying to establish the LACP session.

Do you have a way to troubleshoot and fix that?

(Sorry if I did not use the right networking terminology; I'm not really skilled in LACP.)

Thanks

Best Answer

After digging into some documentation and doing some testing, I found out that when a server uses bonding you need to explicitly enable link monitoring with the miimon parameter of the bonding module.

While looking at /proc/net/bonding/bond0 I should have noticed that one of the slaves had its MII Status reported as down even though the link was up at the physical level.
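
A quick way to compare what the bonding driver reports with the physical link state is something like this (the exact ethtool wording can vary by driver):

grep -E "Slave Interface|MII Status" /proc/net/bonding/bond0
ethtool eno49 | grep "Link detected"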

https://access.redhat.com/articles/172483#Link_Monitoring_Modes states that:

It is critical that a link monitoring mode, either the miimon or arp_interval and arp_ip_target parameters be specified. Configuring a bond without a link monitoring mode is not a valid use of the bonding driver
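
On a running bond you can check whether any link monitoring is configured at all via the sysfs attributes exposed by the bonding driver (a value of 0 means disabled):

cat /sys/class/net/bond0/bonding/miimon
cat /sys/class/net/bond0/bonding/arp_interval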

So, to set that up, you pass the parameter in the BONDING_OPTS options of the ifcfg-bond0 file:

#/etc/sysconfig/network-scripts/ifcfg-bond0
...
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer2+2 miimon=100"
...

This makes the bonding driver poll the link state of each slave every 100 ms.
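
If you want to test it on a live bond first, the same parameter can be changed at runtime through sysfs; note that this is not persistent across reboots, hence the ifcfg-bond0 change above:

echo 100 > /sys/class/net/bond0/bonding/miimon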

Restart the network service to apply the change.
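
For example:

service network restart      # RHEL 6
systemctl restart network    # RHEL 7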