Cisco – Finding transparent firewall packet loss

cisco, cisco-asa, firewall, packet-loss, switch

We use a Cisco ASA 5585 in layer 2 transparent mode. The configuration is just two 10GE links between our business-partner DMZ and our inside network. A simple map looks like this.

10.4.2.9/30                    10.4.2.10/30
core01-----------ASA1----------dmzsw

The ASA runs 8.2(4) on an SSP-20. The switches are Catalyst 6500s with Sup2T running IOS 12.2. There are no packet drops on any switch or ASA interface! Our maximum traffic is about 1.8 Gbps between the switches, and the CPU load on the ASA is very low.
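For reference, a minimal 8.2 transparent-mode setup for this topology looks roughly like the following (approximate, not our exact config; interface names match the output further down, security levels are illustrative):

firewall transparent
!
interface TenGigabitEthernet0/8
 nameif Inside
 security-level 100
 no shutdown
!
interface TenGigabitEthernet0/9
 nameif DMZ
 security-level 50
 no shutdown
!
! Device management is on Management0/0 ("Mgmt"); the 10.4.2.8/30 transit
! subnet only has room for the two switch addresses.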

We have a strange problem. Our NMS admin sees very bad packet loss that started sometime in June. The packet loss is growing very fast even though traffic through the firewall has remained constant, and we do not know why. These are the Nagios ping failures we see through the firewall. Nagios sends 10 pings to every server; some failures lose all ten pings, but not all of them do.

[screenshot: graph of Nagios ping failures through the firewall]

The strange thing is that if we use mtr from the Nagios server, the packet loss isn't very bad.

                             My traceroute  [v0.75]
nagios    (0.0.0.0)                                    Fri Jul 19 03:43:38 2013
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                  Packets               Pings
 Host                           Loss%   Snt Drop   Last  Best   Avg  Wrst StDev
 1. 10.4.61.1                    0.0%  1246    0    0.4   0.3   0.3  19.7   1.2
 2. 10.4.62.109                  0.0%  1246    0    0.2   0.2   0.2   4.0   0.4
 3. 10.4.62.105                  0.0%  1246    0    0.4   0.4   0.4   3.6   0.4
 4. 10.4.62.37                   0.0%  1246    0    0.5   0.4   0.7  11.2   1.7
 5. 10.4.2.9                     1.3%  1246   16    0.8   0.5   2.1  64.8   7.9
 6. 10.4.2.10                    1.4%  1246   17    0.9   0.5   3.5 102.4  11.2
 7. dmz-server                   1.1%  1246   13    0.6   0.5   0.6   1.6   0.2

When we ping between the switches, we don't lose many packets, but it is obvious that the problem starts somewhere between them.

core01#ping ip 10.4.2.10 repeat 500000

Type escape sequence to abort.
Sending 500000, 100-byte ICMP Echos to 10.4.2.10, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (499993/500000), round-trip min/avg/max = 1/2/6 ms
core01#

How can we have so many ping failures and no packet drops on the interfaces? How can we find where the problem is? Cisco TAC is going in circles on this; they keep asking for a show tech from so many different switches, even though it is obvious that the problem is between core01 and dmzsw. Can someone help?

Update July 30, 2013

Thank you to everyone for helping me find the problem. It was a misbehaving application that sent a lot of small UDP packets for about 10 seconds at a time. These packets were denied by the firewall. It looks like my manager wants to upgrade our ASA so we don't have this problem again.
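For anyone chasing something similar: the denied bursts show up in the ASP drop counters and in the ACL-deny syslogs (if buffered logging is enabled), for example (commands from memory):

show asp drop | include acl-drop
! syslog 106023 = "Deny <protocol> src ... dst ... by access-group"
show logging | include 106023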

More Information

From questions in the comments:

ASA1# show inter detail | i ^Interface|overrun|error
Interface GigabitEthernet0/0 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/1 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/2 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/3 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/4 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/5 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/6 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface GigabitEthernet0/7 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface Internal-Data0/0 "", is up, line protocol is up
     2749335943 input errors, 0 CRC, 0 frame, 2749335943 overrun, 0 ignored, 0 abort
         0 output errors, 0 collisions, 0 interface resets
           RX[00]: 156069204310 packets, 163645512578698 bytes, 0 overrun
           RX[01]: 185159126458 packets, 158490838915492 bytes, 0 overrun
           RX[02]: 192344159588 packets, 197697754050449 bytes, 0 overrun
           RX[03]: 173424274918 packets, 196867236520065 bytes, 0 overrun
Interface Internal-Data1/0 "", is up, line protocol is up
    26018909182 input errors, 0 CRC, 0 frame, 26018909182 overrun, 0 ignored, 0 abort
    0 output errors, 0 collisions, 0 interface resets
           RX[00]: 194156313803 packets, 189678575554505 bytes, 0 overrun
           RX[01]: 192391527307 packets, 184778551590859 bytes, 0 overrun
           RX[02]: 167721770147 packets, 179416353050126 bytes, 0 overrun
           RX[03]: 185952056923 packets, 205988089145913 bytes, 0 overrun
Interface Management0/0 "Mgmt", is up, line protocol is up
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface Management0/1 "", is administratively down, line protocol is down
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface TenGigabitEthernet0/8 "Inside", is up, line protocol is up
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets
Interface TenGigabitEthernet0/9 "DMZ", is up, line protocol is up
        0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
        0 output errors, 0 collisions, 0 interface resets

ASA1#

Best Answer

Interface Internal-Data0/0 "", is up, line protocol is up
     2749335943 input errors, 0 CRC, 0 frame, 2749335943 overrun, 0 ignored, 0 abort
                                              ^^^^^^^^^^^^^^^^^^
         0 output errors, 0 collisions, 0 interface resets

You show overruns on the Internal-Data interfaces, so you are dropping traffic through the ASA. With that many drops, it's not hard to imagine that this is contributing to the problem. Overruns happen when the internal Rx FIFO queues overflow (normally because of some problem with load).

EDIT to respond to a question in the comments:

I don't understand why the firewall is overloaded when it is nowhere close to using 10 Gbps. Can you explain why we are seeing overruns even when the CPU and bandwidth are low? The CPU is about 5%, and the bandwidth in either direction never goes much higher than 1.4 Gbps.

I have seen this happen over and over when a link carries traffic microbursts that exceed either the bandwidth, connections-per-second, or packets-per-second horsepower of the device. So many people quote 1- or 5-minute statistics as if traffic were relatively constant across that timeframe.

I would take a look at your firewall by running these commands every two or three seconds (run terminal pager 0 first to avoid paging issues)...

show clock
show traffic detail | i ^[a-zA-Z]|overrun|packets dropped
show asp drop

Now graph how much traffic you're seeing every few seconds versus drops; if you see massive spikes in policy drops or overruns when your traffic spikes, then you're closer to finding the culprit.
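It can also help to zero the counters between samples so the deltas jump out; something along these lines (syntax from memory, double-check on your code level):

terminal pager 0
clear traffic
clear asp drop
! wait a few seconds, then take a timestamped sample
show clock
show traffic detail | i ^[a-zA-Z]|overrun|packets dropped
show asp drop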

Don't forget that you can sniff directly on the ASA with the capture command below if you need help identifying what's killing it... you sometimes have to be quick to catch it.

capture FOO circular-buffer buffer <buffer-size> interface <intf-name>
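For example, something like this captures only UDP on the inside interface into a circular buffer and lets you copy it off-box for analysis (the ACL name and buffer size are just examples; the interface name is from your output above):

! hypothetical ACL used only to filter the capture
access-list CAP-UDP extended permit udp any any
capture FOO access-list CAP-UDP buffer 33554432 circular-buffer interface Inside
! inspect on-box, or export as pcap
show capture FOO
copy /pcap capture:FOO tftp://<tftp-server>/FOO.pcap
! remove the capture when finished
no capture FOO interface Inside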

NetFlow on your upstream switches could help as well.
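On a Sup2T that means Flexible NetFlow; a quick sketch with hypothetical exporter/monitor names and a placeholder collector (apply it to the uplink toward the ASA):

flow exporter NMS-EXPORT
 destination <collector-ip>
 transport udp 2055
!
flow monitor FW-MON
 record netflow ipv4 original-input
 exporter NMS-EXPORT
!
interface TenGigabitEthernetX/Y
 ! the 10GE link toward the ASA
 ip flow monitor FW-MON input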
