Linux – Configuring second NIC knocks server off the network

linux, networking, ubuntu

Yesterday I spent 4 hours trying to get my network's DHCP/DNS/SMB server back online. Long story short, it took numerous wildly frustrated shots in the dark (no DNS = no internet resources for help) and no fewer than half a dozen reboots to finally restore my server to functioning order.

What precipitated this was configuring and enabling my server's second Ethernet port in /etc/network/interfaces. That's when it all hit the fan. I've finally gotten eth1 disabled again and eth0 is working as before, but this isn't the state I want this server to be in.

eth0 and eth1 are both gigabit ports built into the motherboard (an ASUS something-or-other). They were previously bonded together (round-robin, I think), but the server has been completely reformatted and re-installed since then after a hard drive failure, so I would think that anything the bonding driver had configured would be dead and gone.

While the server was offline, ifconfig seemed to be showing that it was receiving packets just fine, but every single outgoing packet was being dropped. (I should have saved the output from ifconfig during the issue, but the 'TX' line showed "packets:0" and "dropped:123"; also "errors:0 … overrun:0 carrier:0".)

eth0 is configured with a static IP; I did the same for eth1. Here is /etc/network/interfaces:

root@odin:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
    address 10.12.0.50
    netmask 255.0.0.0
    gateway 10.12.0.2

# The secondary network interface
# Commented out now because this was the only way I could get it to work again
#auto eth1
#iface eth1 inet static
#   address 10.12.0.51
#   netmask 255.0.0.0
#   gateway 10.12.0.2

ethtool shows:

root@odin:~# ethtool eth0
Settings for eth0:
    Supported ports: [ MII ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: MII
    PHYAD: 1
    Transceiver: external
    Auto-negotiation: on
    Supports Wake-on: g
    Wake-on: d
    Link detected: yes

The output for eth1 is identical, except that it shows "Link detected: no" because it's disabled currently; "Link detected" was always "yes" for either interface when it was supposedly enabled, even when eth0 was apparently unable to send any packets.

/var/log/syslog shows numerous entries like this:

May 11 21:55:08 odin kernel: [  797.050022] forcedeth 0000:00:08.0: eth0: Got tx_timeout. irq: 00000020
May 11 21:55:08 odin kernel: [  797.050026] forcedeth 0000:00:08.0: eth0: Ring at 112804000
May 11 21:55:08 odin kernel: [  797.050029] forcedeth 0000:00:08.0: eth0: Dumping tx registers
May 11 21:55:08 odin kernel: [  797.050035] forcedeth 0000:00:08.0: eth0:   0: 00000020 000000df 00000003 0001000d 00000000 00000000 00000000 00000000
[bunch more lines like this one, though none reference eth1]

Also in syslog are countless repetitions of the following lines:

May 11 21:54:42 odin kernel: [  770.480861] martian source 10.12.0.50 from 10.42.0.206, on dev eth1
May 11 21:54:42 odin kernel: [  770.480865] ll header: ff:ff:ff:ff:ff:ff:00:1e:65:d6:6c:6a:08:06
May 11 21:54:42 odin kernel: [  770.987932] martian source 10.12.0.51 from 10.12.0.2, on dev eth1
May 11 21:54:42 odin kernel: [  770.987937] ll header: ff:ff:ff:ff:ff:ff:00:13:46:ed:e2:4a:08:06

The "from" address is different, but it's always eth1 and always "source" 10.12.0.50 or .51. That "martian" thing reminded me that I am running Shorewall, but turning it off (and verifying that iptables -L showed nothing but accepting everything from/to anywhere) had no effect whatsoever. I'm not even sure why eth1 would be seeing traffic intended for eth0's address in the first place, given that they're connected to a switch that (in my understanding, anyway) would only send packets to their intended destinations. (It is an unmanaged gigabit switch, Linksys I think.)

I don't even know how to begin to diagnose or troubleshoot what went wrong here. Frankly, I'm afraid to try to start eth1 again, especially since I don't even know what finally fixed the problem so I don't know that I could get it reverted again to its current state. What can I do to figure out what happened, and to fix it so that I can again turn on eth1 without blowing up the server's networking again? Could the hardware still be mis-configured from the previous system install using the bonding driver? How could I determine that and, if that's the case, fix it?

Both ports worked perfectly independently on the previous install before I set up bonding, and I had no issues at all during that time. I re-installed the system about 4-ish weeks ago, and eth1 has been disabled since then (Ubuntu detected it during the installation routine, but I of course chose eth0 as my "primary" interface during the install and Ubuntu apparently made no effort to configure eth1 after that).

Best Answer

Couple of notes:

  • If both ports of a bond are connected to the same unmanaged switch, the switch won't support the protocols needed to aggregate them. You must use mode=active-backup.
  • No, your previous configuration won't affect your setup now.
  • The martians are a result of having two NICs on the same subnet. They reach eth1 because they're broadcast packets, which the switch sends to every port. Other than cluttering your logs, they shouldn't cause trouble in your setup.
  • The transmit timeouts look like some sort of hardware problem.
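
To check the broadcast claim against the log, the "ll header" line printed after each martian entry can be decoded by hand: on Ethernet it is the 14-byte frame header, made up of the destination MAC (6 bytes), the source MAC (6 bytes) and the EtherType (2 bytes). The small shell function below is only an illustrative sketch; the name decode_ll_header is made up for this example, and it recognises just the two EtherType values needed here.

```shell
#!/bin/sh
# decode_ll_header (hypothetical helper): split the colon-separated Ethernet
# header from a kernel "ll header:" log line into its three fields.
# The function body runs in a subshell "( ... )" so the IFS change stays local.
decode_ll_header() (
    IFS=':'
    # Word-split the argument on ':' into positional parameters $1..$14.
    set -- $1
    dst="$1:$2:$3:$4:$5:$6"
    src="$7:$8:$9:${10}:${11}:${12}"
    ethertype="${13}${14}"

    case "$dst" in
        ff:ff:ff:ff:ff:ff) kind="broadcast" ;;
        *)                 kind="not broadcast" ;;
    esac

    case "$ethertype" in
        0806) proto="ARP" ;;
        0800) proto="IPv4" ;;
        *)    proto="unknown" ;;
    esac

    printf 'dst=%s (%s) src=%s ethertype=0x%s (%s)\n' \
        "$dst" "$kind" "$src" "$ethertype" "$proto"
)

# Header taken from the first martian entry in the question's syslog excerpt.
decode_ll_header ff:ff:ff:ff:ff:ff:00:1e:65:d6:6c:6a:08:06
```

Both headers quoted in the question decode as broadcast ARP frames, which is why an unmanaged switch delivers them to eth1 as well as eth0.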

What you should do:

  • Try running: ip addr flush dev eth1; ip link set up dev eth1 (this brings eth1 up with no IP address assigned) to see if merely bringing up eth1 causes eth0 to fail. If it does, you likely have hardware problems.
  • Set up a single bonded interface (mode=active-backup) with both eth0 and eth1 as slaves and assign the server's IP address to that.
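
As a rough sketch of the second suggestion, an active-backup bond in /etc/network/interfaces might look something like the fragment below. It assumes the ifenslave package is installed and that your Ubuntu release accepts the bond-* option names shown (older releases use a slightly different syntax, so check the documentation shipped with ifenslave). The interface name bond0 and the miimon value of 100 ms are illustrative choices, not anything taken from the question; the address, netmask and gateway are the ones already used for eth0.

```
# Sketch only: option names assume the ifenslave hooks for ifupdown.
auto eth0
iface eth0 inet manual
    bond-master bond0
    bond-primary eth0

auto eth1
iface eth1 inet manual
    bond-master bond0

# The bond carries the server's single IP address.
auto bond0
iface bond0 inet static
    address 10.12.0.50
    netmask 255.0.0.0
    gateway 10.12.0.2
    bond-mode active-backup
    bond-miimon 100
    bond-slaves none
```

With this layout only one port carries traffic at a time, so the server has a single address on the subnet and the unmanaged switch needs no special support.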