Packet Loss Troubleshooting – Packet Loss Within Private Network (Server to Server)


I am troubleshooting a bizarre case of packet loss. We have a cabinet of servers with a top of cabinet switch (Brocade FESX648-PREM). That switch runs BGP sessions with our transit providers.

We have one server (referred to below as the "bad server") that's experiencing 50% packet loss. The server is running Windows Server 2012 R2 and had been running for months without issue until this morning. At this point, I suspect something might be wrong with the switch itself, so I'm asking this community for help with additional troubleshooting rather than going to ServerFault or SuperUser for server-related troubleshooting.

This is what I've checked so far to rule out possible causes of the packet loss on the bad server:

  1. No other servers in the cabinet are experiencing packet loss.
  2. The gateway switch and bad server can ping each other without issue.
  3. If I log into another server in the cabinet and ping the bad server, I do see the packet loss.
  4. The routing table on the bad server is fine: the default route points to the proper gateway, and no other entries exist (except for local IPv4 assignments).
  5. Firewalls have been disabled.
  6. No VPN setup is in effect (i.e., routing table on the bad server just has the default route).
  7. CPU load and network traffic are both very low.
  8. Server has been power cycled.
  9. Speed and duplex settings are set to auto-negotiate and match on both the switch and the server.
  10. Forcing 100 Mbit/s full duplex on both ends did not help; the packet loss remained.
  11. There are no port errors (no drops, collisions, FCS etc) recorded on the switch.
  12. CPU utilization on the switch is low (http://pastebin.com/q24QSqEz).
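
For anyone repeating checks #2 and #3 from several vantage points, here is a minimal Python sketch that pulls the loss percentage out of saved ping output and lists the paths that show loss. The host labels and sample output lines are illustrative assumptions, not captures from the servers described here, and the function names are hypothetical. The regular expression is written against the usual summary line of Windows ping and Linux iputils ping.

```python
import re

# Matches the loss figure in the summary line of Windows ping ("(50% loss)")
# and Linux iputils ping ("50% packet loss").
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% (?:packet )?loss")


def parse_loss_percent(ping_output):
    """Return the loss percentage found in ping output, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def lossy_paths(results, threshold=1.0):
    """Return the (source, destination) pairs whose loss exceeds threshold."""
    return sorted(
        path
        for path, output in results.items()
        if (parse_loss_percent(output) or 0.0) > threshold
    )


# Illustrative sample output only (hypothetical hosts and counts).
samples = {
    ("bad-server", "gateway"):
        "Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),",
    ("other-server", "bad-server"):
        "100 packets transmitted, 50 received, 50% packet loss, time 99123ms",
}

print(lossy_paths(samples))
```

Comparing the lossy paths against the clean ones is what makes the combination of #2 and #3 stand out: the loss only appears on traffic that crosses the switch between two server ports.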

Anyone have any ideas where I should look next? The results of #2, #3, and #11 in particular are really throwing me for a loop…

Best Answer

This ended up being a failing switch. A couple of days later we started having issues on ports 37-48. The FESX648-PREM is built around port ASICs, each of which controls a region of ports. Those regions are: 1-12, 13-24, 25-36 and 37-48. One of the failure modes on this box is that a port ASIC can die and cause forwarding problems for the ports in its region.

The "bad server" above was the only server we had in use on the 37-48 region. So when we moved it to another port in that region and re-tested, we got the same result, because the failing ASIC affected all of the ports it controls.
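
If you hit something similar, it can help to check whether the misbehaving ports all sit behind the same port ASIC before swapping ports at random. The following is a rough Python sketch that uses only the region boundaries listed above; the function names and the example port numbers are hypothetical.

```python
# Port regions as listed above: one port ASIC per block of 12 ports.
REGIONS = [(1, 12), (13, 24), (25, 36), (37, 48)]


def region_for_port(port):
    """Return the (first, last) port region that contains the given port."""
    for low, high in REGIONS:
        if low <= port <= high:
            return (low, high)
    raise ValueError(f"port {port} is outside 1-48")


def shared_region(ports):
    """Return the one region containing every port, or None if they span several."""
    regions = {region_for_port(p) for p in ports}
    return regions.pop() if len(regions) == 1 else None


# Hypothetical port numbers: an original port and a replacement port.
print(shared_region([38, 45]))
print(shared_region([38, 20]))
```

A re-test only says something about the ASIC when the replacement port falls in a different region; in the second call the two ports span regions, so `shared_region` returns `None`.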

We replaced the entire switch and that resolved the issue.