Packet Loss Troubleshooting – Packet Loss Within Private Network (Server to Server)

packet-loss

I am troubleshooting a bizarre case of packet loss. We have a cabinet of servers with a top of cabinet switch (Brocade FESX648-PREM). That switch runs BGP sessions with our transit providers.

We have one server (referred to below as the "bad server") that's experiencing 50% packet loss. The server is running Windows Server 2012 R2 and had been running for months without issue until this morning. At this point, I suspect something might be wrong with the switch itself, so I'm turning to this community for help with additional troubleshooting rather than ServerFault or SuperUser for server-related troubleshooting.

This is what I've checked so far to rule out possible causes of the packet loss on the bad server:

1. No other servers in the cabinet are experiencing packet loss.
2. The gateway switch and bad server can ping each other without issue.
3. If I log into another server in the cabinet and attempt to ping the bad server, then I do get the packet loss.
4. The routing table on the bad server is fine: the default route points to the proper gateway, and no other entries exist (except for local IPv4 assignments).
5. Firewalls have been disabled.
6. No VPN setup is in effect (i.e., routing table on the bad server just has the default route).
7. CPU load and network traffic are both very low.
8. Server has been power cycled.
9. Speed and duplex settings are set to auto-neg and are the same on both the switch and server.
10. Forced 100 Mbit full duplex on both ends, still had the packet loss.
11. There are no port errors (no drops, collisions, FCS etc.) recorded on the switch.
12. CPU utilization on the switch is low (http://pastebin.com/q24QSqEz).

Anyone have any ideas where I should look next? The results of #2, #3, and #11 in particular are really throwing me for a loop…

Best Answer

This ended up being a failing switch. A couple days later we started having issues on ports 37-48. The FESX648-PREM is powered by port ASICs which control port regions. Those regions are: 1-12, 13-24, 25-36 and 37-48. One of the failure modes on this box is that a port ASIC can die and cause forwarding problems.

The "bad server" above was the only server we had in use in the 37-48 region. So when we switched the port and re-tested, we got the same result, because the failing ASIC affected multiple ports.
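As a small illustration of why the port swap did not help, here is a sketch that maps a port number to its region, assuming only the four 12-port regions listed above. The function names `port_region` and `same_region` are made up for this example and are not part of any Brocade tooling.

```python
# Port regions as listed above: 1-12, 13-24, 25-36 and 37-48.
PORTS_PER_REGION = 12
TOTAL_PORTS = 48


def port_region(port: int) -> range:
    """Return the range of ports that share a region with `port`."""
    if not 1 <= port <= TOTAL_PORTS:
        raise ValueError(f"port must be between 1 and {TOTAL_PORTS}")
    start = ((port - 1) // PORTS_PER_REGION) * PORTS_PER_REGION + 1
    return range(start, start + PORTS_PER_REGION)


def same_region(port_a: int, port_b: int) -> bool:
    """True if both ports sit in the same region (hypothetical helper)."""
    return port_region(port_a) == port_region(port_b)
```

When re-testing a suspect port, picking a replacement for which `same_region` returns False would have separated a single bad port from a bad region.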

We replaced the entire switch and that resolved the issue.

Related Solutions

Cisco – Finding transparent firewall packet loss

Interface Internal-Data0/0 "", is up, line protocol is up
     2749335943 input errors, 0 CRC, 0 frame, 2749335943 overrun, 0 ignored, 0 abort
                                              ^^^^^^^^^^^^^^^^^^
         0 output errors, 0 collisions, 0 interface resets

You show overruns on the Internal-Data interfaces, so you are dropping traffic through the ASA. With that many drops, it's not hard to imagine that this is contributing to the problem. Overruns happen when the internal Rx FIFO queues overflow (normally because of some problem with load).

EDIT to respond to a question in the comments:

I don't understand why the firewall is overloaded, it is not close to using 10Gbps. Can you explain why we are seeing overruns even when the CPU and bandwidth are low? The CPU is about 5% and the bandwidth either direction never goes much higher than 1.4Gbps.

I have seen this happen over and over when a link is seeing traffic microbursts that exceed the bandwidth, connections-per-second, or packets-per-second horsepower of the device. So many people quote 1- or 5-minute statistics as if the traffic were relatively constant across that timeframe, but an average over minutes hides bursts that last only milliseconds.
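To put rough numbers on that, here is a tiny calculation. The figures are illustrative and are not measurements from the firewall in question; `average_gbps` is a made-up helper name.

```python
def average_gbps(burst_gbps: float, burst_ms: float, window_ms: float) -> float:
    """Average rate over a window in which the link runs at `burst_gbps`
    for `burst_ms` and is idle for the rest of the window."""
    return burst_gbps * burst_ms / window_ms


# A 10 Gbps link saturated for 140 ms out of every second still
# averages only 1.4 Gbps over that second.
print(average_gbps(10, 140, 1000))  # prints 1.4
```

A graph built from 1- or 5-minute averages would show the same 1.4 Gbps while the Rx queues overflow during each burst.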

I would take a look at your firewall by running these commands every two or three seconds (run term pager 0 to avoid paging issues)...

show clock
show traffic detail | i ^[a-zA-Z]|overrun|packets dropped
show asp drop

Now graph out how much traffic you're seeing every few seconds vs drops; if you see massive spikes in policy drops or overruns when your traffic spikes, then you're closer to finding the culprit.
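One possible way to prepare that graph is sketched below. It assumes you have already extracted a cumulative byte counter and a cumulative overrun counter from each poll. The `Sample` fields and the `per_interval` function are hypothetical names, and parsing the actual command output is deliberately left out.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    t: float        # poll time in seconds (e.g. derived from "show clock")
    bytes_in: int   # cumulative input bytes at that time (assumed field)
    overruns: int   # cumulative overrun count at that time (assumed field)


def per_interval(samples: List[Sample]) -> List[Tuple[float, float, int]]:
    """Turn cumulative counters into (time, Mbps, new_overruns) per interval."""
    rows = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip duplicate or out-of-order polls
        mbps = (cur.bytes_in - prev.bytes_in) * 8 / dt / 1e6
        rows.append((cur.t, mbps, cur.overruns - prev.overruns))
    return rows


if __name__ == "__main__":
    # Made-up numbers purely to show the shape of the output.
    polls = [
        Sample(0.0, 0, 0),
        Sample(2.0, 250_000_000, 0),
        Sample(4.0, 2_750_000_000, 5000),
    ]
    for t, mbps, drops in per_interval(polls):
        print(f"{t:6.1f}s  {mbps:10.1f} Mbps  {drops} new overruns")
```

Intervals where the overrun delta jumps together with the rate are the ones worth capturing traffic for.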

Don't forget that you can sniff directly on the ASA with this if you need help identifying what's killing the ASA... you have to be quick to catch this sometimes.

capture FOO circular-buffer buffer <buffer-size> interface <intf-name>

Netflow on your upstream switches could help as well.

Measure one-way latency/jitter/packet-loss

One way to do this is with ICMP Timestamp, which reports time as milliseconds since midnight UTC. It has the added benefit that you don't necessarily need to control both ends: as long as the far end is not firewalled, there is a good chance it will work.

However, to get reliable one-way measurements, you need reliably the same time at both ends. Because ICMP timestamp only has a precision of 1 ms (which is not nearly enough for many applications, but sufficient for this), it's reasonably easy to find even non-cooperating hosts where ICMP timestamp will provide useful data.

If you control both ends, be sure that both synchronize via NTP to only one server, and to the same server. The absolute clock is not very important; what matters is that both ends experience as nearly the same time as possible.
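A minimal sketch of the arithmetic behind an ICMP Timestamp measurement follows, using the message layout from RFC 792 (type 13 request, type 14 reply, three 32-bit millisecond timestamps). The function names are made up for this example. Actually sending the request needs a raw socket and elevated privileges, which is left out here, and the results are only meaningful if the two clocks agree.

```python
import struct
import time

ICMP_TIMESTAMP_REQUEST = 13
ICMP_TIMESTAMP_REPLY = 14
MS_PER_DAY = 24 * 60 * 60 * 1000
LAYOUT = "!BBHHHIII"  # type, code, checksum, id, seq, originate, receive, transmit


def ms_since_midnight_utc(now=None) -> int:
    """Milliseconds since midnight UTC, as carried in ICMP timestamps."""
    now = time.time() if now is None else now
    return int(now * 1000) % MS_PER_DAY


def checksum(data: bytes) -> int:
    """Standard Internet checksum (one's complement of the 16-bit sum)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF


def build_request(ident: int, seq: int, originate_ms: int) -> bytes:
    """Build an ICMP Timestamp request (without the IP header)."""
    fields = (ident, seq, originate_ms, 0, 0)
    unsummed = struct.pack(LAYOUT, ICMP_TIMESTAMP_REQUEST, 0, 0, *fields)
    return struct.pack(LAYOUT, ICMP_TIMESTAMP_REQUEST, 0, checksum(unsummed), *fields)


def parse_reply(icmp: bytes):
    """Return (originate, receive, transmit) from an ICMP Timestamp reply."""
    kind, _code, _csum, _ident, _seq, orig, recv, xmit = struct.unpack(LAYOUT, icmp[:20])
    if kind != ICMP_TIMESTAMP_REPLY:
        raise ValueError("not an ICMP Timestamp reply")
    return orig, recv, xmit


def _wrap(delta_ms: int) -> int:
    """Fold a difference into +/- half a day so midnight rollover is handled."""
    half = MS_PER_DAY // 2
    return (delta_ms + half) % MS_PER_DAY - half


def one_way_ms(orig: int, recv: int, xmit: int, arrival: int):
    """Forward and return delay in ms; negative values point to clock skew."""
    return _wrap(recv - orig), _wrap(arrival - xmit)
```

For example, a request stamped 1000 that the far end received at 1012, answered at 1012, and that arrived back at 1030 gives 12 ms forward and 18 ms on the return path.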

If ICMP timestamp is not sufficient, it's very easy to write 10 lines of ruby/perl/python or even C to do measurements when you control both ends.
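As a rough idea of what such a script could look like, here is a sketch in Python using UDP probes that carry a sequence number and the sender's clock. The packet layout and the function names are made up for this example, and the delays are only as good as the clock synchronization between the two hosts.

```python
import socket
import struct
import time

PROBE = struct.Struct("!Id")  # sequence number, sender's time.time()


def send_probes(sock, addr, count, interval=0.01):
    """Sender side: emit `count` timestamped probes to `addr`."""
    for seq in range(count):
        sock.sendto(PROBE.pack(seq, time.time()), addr)
        time.sleep(interval)


def receive_probes(sock, expected, timeout=2.0):
    """Receiver side: return {seq: one_way_delay_ms} for probes that arrive."""
    sock.settimeout(timeout)
    delays = {}
    while len(delays) < expected:
        try:
            data, _ = sock.recvfrom(64)
        except socket.timeout:
            break  # anything still missing is counted as lost
        arrival = time.time()
        seq, sent = PROBE.unpack(data)
        delays[seq] = (arrival - sent) * 1000.0
    return delays


def summarize(delays, expected):
    """Loss ratio, mean one-way delay and mean delay variation (jitter)."""
    values = [delays[seq] for seq in sorted(delays)]
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "loss": 1 - len(values) / expected,
        "mean_ms": sum(values) / len(values) if values else None,
        "jitter_ms": sum(diffs) / len(diffs) if diffs else 0.0,
    }
```

On the receiving host you would bind a UDP socket and call `receive_probes`, and on the sending host call `send_probes` with the receiver's address; running the same pair in the opposite direction gives the other one-way figures.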

I can't really suggest software for doing ICMP timestamp measurements unidirectionally. hping2 supports sending ICMP timestamp requests but for some reason does not output unidirectional values, so I wrote a patch for hping2 to display one-way latencies.
