Switch – How to diagnose a bridging (ethernet) loop


Given that spanning tree has failed (or you aren't running spanning tree at all) and you get an Ethernet loop, what's the best way to diagnose where the problem is?

Which switch? Which cable? And so on.

Best Answer

OK, so assume you have a topology like:

          SW1
         /   \
        /     \
       /       \
PC A--SW2-----SW3--PC B

For some reason there is a bridging loop: STP is disabled, someone applied a filter (such as a BPDU filter) in the wrong place, or similar.

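PC A wants to communicate with PC B, so it first sends an ARP request for PC B's MAC address. The destination is the broadcast MAC ffff.ffff.ffff and the source MAC is PC A's, so SW2 floods the frame to both SW1 and SW3. SW1 then floods its copy towards SW3, and SW3 floods the copy it received from SW2 towards SW1. Because Ethernet frames have no TTL, nothing ever removes these copies: they keep circling the loop in both directions.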
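The circulating broadcast can be illustrated with a toy simulation. This is a minimal sketch assuming the triangle topology above, pure flooding (no MAC learning, no STP) and a hypothetical `flood_rounds` helper; it counts how many copies of PC A's ARP broadcast are on inter-switch links after each forwarding round.

```python
from typing import Dict, List, Tuple

# Hypothetical model of the triangle above: each switch lists its neighbours
# (other switches and attached PCs). STP is off, so every link forwards.
LINKS: Dict[str, List[str]] = {
    "SW1": ["SW2", "SW3"],
    "SW2": ["SW1", "SW3", "PC A"],
    "SW3": ["SW1", "SW2", "PC B"],
}


def flood_rounds(origin: str, first_switch: str, rounds: int) -> List[int]:
    """Count copies of one broadcast frame on switch-to-switch links after
    each forwarding round (pure flooding, no MAC learning, no STP)."""
    # A frame in flight is (switch it arrives at, neighbour it came from).
    in_flight: List[Tuple[str, str]] = [(first_switch, origin)]
    history: List[int] = []
    for _ in range(rounds):
        next_flight: List[Tuple[str, str]] = []
        for switch, ingress in in_flight:
            # Flood out every port except the one the frame arrived on;
            # copies sent to PCs leave the loop, so only switches are tracked.
            for neighbour in LINKS[switch]:
                if neighbour != ingress and neighbour in LINKS:
                    next_flight.append((neighbour, switch))
        in_flight = next_flight
        history.append(len(in_flight))
    return history


print(flood_rounds("PC A", "SW2", 6))  # [2, 2, 2, 2, 2, 2]
```

In this triangle the count never drops to zero: two copies chase each other around the loop in opposite directions. Where a switch has more than two looped links, each flood sends out more copies than it received, so the count grows every round.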

SW1 and SW3 learned PC A's MAC address when the first copy of the frame came in. When the next copy arrives from the opposite direction, they have to relearn it on the other port. Because this happens so fast and so repeatedly, you will see log messages complaining about MAC flapping, something like "MAC FLAP 0000.0000.0001 is flapping between Gi0/24 and Gi0/23". This is a strong indication that you have a loop.
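The relearning that produces these messages can be sketched as a toy MAC table. This assumes a hypothetical `MacTable` class and hypothetical port assignments on SW1 (Gi0/23 towards SW2, Gi0/24 towards SW3); the message text only mimics the illustrative example above and is not a real syslog format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MacTable:
    """Toy MAC address table for one switch that counts port moves."""
    entries: Dict[str, str] = field(default_factory=dict)  # MAC -> port
    moves: Dict[str, int] = field(default_factory=dict)    # MAC -> relearn count

    def learn(self, src_mac: str, port: str) -> Optional[str]:
        """Learn src_mac on port; return a flap message if it changed ports."""
        old_port = self.entries.get(src_mac)
        self.entries[src_mac] = port
        if old_port is not None and old_port != port:
            self.moves[src_mac] = self.moves.get(src_mac, 0) + 1
            # Illustrative message text only, not a real device log format.
            return f"MAC FLAP {src_mac} is flapping between {old_port} and {port}"
        return None


# Hypothetical: copies of PC A's broadcast reach SW1 alternately
# via SW2 (Gi0/23) and via SW3 (Gi0/24).
sw1 = MacTable()
pc_a = "0000.0000.0001"
for port in ["Gi0/23", "Gi0/24", "Gi0/23", "Gi0/24"]:
    message = sw1.learn(pc_a, port)
    if message:
        print(message)
```

This sketch reports every single move; real switches apply their own thresholds before logging a flap.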

You can then try to trace this MAC address. Look in the ARP cache of a device in the same subnet to see which IP address the host has. With the MAC address you can follow the host switch by switch using show mac address-table (show mac-address-table on older IOS releases); with the IP address, you may be able to look it up in a list of all IPs and where they are connected.
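The switch-by-switch trace can be written as a small loop. This is a sketch assuming you have already collected each switch's MAC table and know which ports are inter-switch links; the tables, port names and the `trace_mac` helper are all hypothetical. The visited set matters here: while MAC addresses are flapping, the tables can point you around the loop forever.

```python
from typing import Dict, List, Tuple

# Hypothetical data: per-switch MAC tables (MAC -> port), as read from
# "show mac address-table", and which ports are uplinks to which switch.
MAC_TABLES: Dict[str, Dict[str, str]] = {
    "SW1": {"0000.0000.0001": "Gi0/23"},
    "SW2": {"0000.0000.0001": "Gi0/1"},
    "SW3": {"0000.0000.0001": "Gi0/24"},
}
UPLINKS: Dict[str, Dict[str, str]] = {
    "SW1": {"Gi0/23": "SW2", "Gi0/24": "SW3"},
    "SW2": {"Gi0/23": "SW1", "Gi0/24": "SW3"},
    "SW3": {"Gi0/23": "SW1", "Gi0/24": "SW2"},
}


def trace_mac(mac: str, start: str) -> Tuple[List[Tuple[str, str]], bool]:
    """Follow mac from switch to switch. Return the (switch, port) hops and
    True if the trace ended on an access port, False if the MAC was not
    found or the trace came back to a switch it had already visited."""
    path: List[Tuple[str, str]] = []
    visited = set()
    switch = start
    while switch not in visited:
        visited.add(switch)
        port = MAC_TABLES.get(switch, {}).get(mac)
        if port is None:
            return path, False          # MAC not learned on this switch
        path.append((switch, port))
        next_switch = UPLINKS.get(switch, {}).get(port)
        if next_switch is None:
            return path, True           # not an uplink: the host is here
        switch = next_switch
    return path, False                  # revisited a switch: trace looped


print(trace_mac("0000.0000.0001", "SW3"))
```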

If the host gets its IP address from a DHCP server, you can also check the server to find where the host is connected. If you have DHCP Option 82 enabled, that is a great help, because the relay agent adds information about the switch and port the request arrived on.
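Option 82 (DHCP Relay Agent Information, RFC 3046) is a list of sub-options in code/length/value form: sub-option 1 is the Agent Circuit ID and sub-option 2 the Agent Remote ID, which relays typically fill with port and switch identifiers. Their contents are vendor-specific, so this sketch (with a hypothetical `parse_option82` helper and made-up example bytes) only splits out the raw sub-options.

```python
from typing import Dict

# Sub-option codes from RFC 3046 (DHCP Relay Agent Information Option).
SUBOPTION_NAMES = {1: "circuit-id", 2: "remote-id"}


def parse_option82(value: bytes) -> Dict[str, bytes]:
    """Split the payload of DHCP option 82 (excluding the option's own code
    and length bytes) into its raw sub-options."""
    result: Dict[str, bytes] = {}
    i = 0
    while i < len(value):
        if i + 2 > len(value):
            raise ValueError("truncated sub-option header")
        code, length = value[i], value[i + 1]
        data = value[i + 2 : i + 2 + length]
        if len(data) != length:
            raise ValueError("truncated sub-option data")
        result[SUBOPTION_NAMES.get(code, f"sub-option-{code}")] = data
        i += 2 + length
    return result


# Made-up example: a 4-byte circuit ID and a 6-byte remote ID (a switch MAC).
payload = bytes([1, 4, 0x00, 0x0A, 0x01, 0x05]) + bytes([2, 6]) + bytes.fromhex("001122334455")
print(parse_option82(payload))
```

Decoding the circuit ID into a VLAN or port number depends on the relay's vendor and configuration, so check your switch documentation for its format.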

Other signs are a very sluggish CLI and very high CPU load. Switches do almost all forwarding in ASICs, so a switch running its CPU above 50% is probably in trouble. You should implement SNMP monitoring and watch for high CPU load, and look for the MAC flap messages as well. If the switches are in a loop, the port LEDs will probably be blinking like crazy.
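The CPU check can be sketched as a simple alerting rule. This assumes the CPU percentages have already been collected by whatever SNMP poller you use (polling is not shown); the 50% threshold comes from the rule of thumb above, while the consecutive-sample requirement and the `suspicious_switches` name are hypothetical choices to avoid alerting on a single spike.

```python
from typing import Dict, List

CPU_ALERT_PERCENT = 50   # rule of thumb: sustained CPU above 50% on a switch is suspicious
SUSTAINED_SAMPLES = 3    # hypothetical: consecutive high polls required before alerting


def suspicious_switches(samples: Dict[str, List[float]]) -> List[str]:
    """Return switches whose CPU stayed above CPU_ALERT_PERCENT for
    SUSTAINED_SAMPLES consecutive polls."""
    flagged: List[str] = []
    for switch, values in samples.items():
        run = 0
        for cpu in values:
            run = run + 1 if cpu > CPU_ALERT_PERCENT else 0
            if run >= SUSTAINED_SAMPLES:
                flagged.append(switch)
                break
    return flagged


# Made-up polling results, oldest first.
polls = {
    "SW1": [12, 15, 97, 99, 98],
    "SW2": [10, 55, 20, 60, 12],
    "SW3": [95, 96, 99, 97, 98],
}
print(suspicious_switches(polls))  # ['SW1', 'SW3']
```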

Things you could do to protect against loops:

  • Enable STP! (duh)
  • SNMP monitoring of CPU load
  • Enable SNMP traps for certain events like STP topology changes
  • Enable storm control on the ports to limit broadcast traffic
  • Don't stretch your VLANs across more of the L2 topology than necessary
  • Enable port security and limit the number of MAC addresses per port
  • Enable DHCP Option 82 if you run DHCP
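The idea behind storm control can be shown with a simplified model. Real storm control is enforced in hardware and, depending on the platform, configured as a percentage of bandwidth or in packets per second; this sketch assumes a packets-per-second limit over 1-second intervals, and the `BroadcastLimiter` class is hypothetical.

```python
from typing import Dict, Tuple


class BroadcastLimiter:
    """Simplified storm-control model: forward at most limit_pps broadcast
    frames per port in each 1-second interval and drop the rest."""

    def __init__(self, limit_pps: int) -> None:
        self.limit_pps = limit_pps
        # port -> (current whole second, broadcasts forwarded in that second)
        self.window: Dict[str, Tuple[int, int]] = {}

    def allow(self, port: str, timestamp: float) -> bool:
        second = int(timestamp)
        current, count = self.window.get(port, (second, 0))
        if current != second:
            count = 0  # a new interval has started: reset the counter
        if count >= self.limit_pps:
            self.window[port] = (second, count)
            return False  # over the limit for this interval: drop
        self.window[port] = (second, count + 1)
        return True


limiter = BroadcastLimiter(limit_pps=100)
# A loop-like burst: 1,000 broadcasts arriving on one port within one second.
forwarded = sum(limiter.allow("Gi0/24", 5 + i / 1000) for i in range(1000))
print(forwarded)  # 100
```

Storm control does not remove the loop; it only caps how much of the broadcast storm each port passes, which keeps links and CPUs usable while you track the loop down.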