Cisco – High latency/drops between Cisco switches in two locations. How to troubleshoot

I'm receiving user complaints about poor network application performance between two parts of a large warehouse facility. The software is a curses-based terminal application running on a Linux server. The clients are PCs running a telnet or SSH client. The trouble started a day ago with no recent (known) changes to the environment.

The core switch is a Cisco Catalyst 4507R-E in the MDF, linked to a 4-member stack of Cisco Catalyst 2960 switches in the IDF… They are connected via multimode fiber. The servers are in the MDF. The clients impacted are in the IDF.

Pinging from the Linux application server to the 2960 stack's management address across the building shows high variance and a lot of latency:

--- shipping-2960.mdmarra.local ping statistics ---
864 packets transmitted, 864 received, 0% packet loss, time 863312ms
rtt min/avg/max/mdev = 0.521/5.317/127.037/8.698 ms

However, pings to client computers from the application server are a bit more consistent:

--- charles-pc.mdmarra.local ping statistics ---
76 packets transmitted, 76 received, 0% packet loss, time 75001ms
rtt min/avg/max/mdev = 0.328/0.481/1.355/0.210 ms
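
To put numbers on the difference between the two runs, here is a minimal Python sketch that parses the two `rtt min/avg/max/mdev` summary lines shown above and compares them. The helper name `parse_rtt` is illustrative only, and the regular expression assumes the Linux iputils summary format seen in this question.

```python
import re

# Summary lines copied from the two ping runs above.
SWITCH_MGMT = "rtt min/avg/max/mdev = 0.521/5.317/127.037/8.698 ms"
CLIENT_PC = "rtt min/avg/max/mdev = 0.328/0.481/1.355/0.210 ms"


def parse_rtt(line):
    """Return min/avg/max/mdev (in ms) from a ping rtt summary line."""
    m = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms", line)
    if m is None:
        raise ValueError("not a ping rtt summary line: %r" % line)
    keys = ("min", "avg", "max", "mdev")
    return dict(zip(keys, map(float, m.groups())))


switch = parse_rtt(SWITCH_MGMT)
client = parse_rtt(CLIENT_PC)

for name, stats in (("switch mgmt", switch), ("client PC", client)):
    print(f"{name:12s} avg={stats['avg']:.3f} ms  "
          f"max={stats['max']:.3f} ms  mdev={stats['mdev']:.3f} ms")

# How much worse the management address looks than a host behind the same stack.
print(f"avg ratio:  {switch['avg'] / client['avg']:.1f}x")
print(f"mdev ratio: {switch['mdev'] / client['mdev']:.1f}x")
```

On these figures the management address shows roughly 11 times the average round-trip time and about 41 times the deviation of a client PC reached through the same stack.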

None of the relevant Linux interfaces or switchports show errors (see bottom of question).

How can I troubleshoot this?

Is there an easy method to determine port activity?
Is the ping variance on the management IP of the switch the wrong thing to measure?
Could this be the result of a rogue PC?
Since the problem is isolated to one part of the building, is there anything else I should be checking? Other users in the warehouse are fine and haven't had any issues.

Edit:

I later discovered that the Cisco 2960 CPU utilization is extremely high due to the bug detailed here.

From the 2960 stack…

shipping-2960#sh int GigabitEthernet1/0/52
GigabitEthernet1/0/52 is up, line protocol is up (connected) 
  Hardware is Gigabit Ethernet, address is b414.894a.09b4 (bia b414.894a.09b4)
  Description: TO_MDF_4507
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec, 
     reliability 255/255, txload 13/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive not set
  Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseSX SFP
  input flow-control is off, output flow-control is unsupported 
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 441
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 3053000 bits/sec, 613 packets/sec
  5 minute output rate 51117000 bits/sec, 4815 packets/sec
     981767797 packets input, 615324451566 bytes, 0 no buffer
     Received 295141786 broadcasts (286005510 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 286005510 multicast, 0 pause input
     0 input packets with dribble condition detected
     6372280523 packets output, 8375642643516 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
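
As a rough sanity check on the uplink, the counters above can be turned into utilization and drop percentages. This is a minimal sketch using only values copied from the output; the variable names and the `percent` helper are illustrative.

```python
# Values copied from the "sh int GigabitEthernet1/0/52" output above.
BANDWIDTH_KBIT = 1_000_000       # BW 1000000 Kbit
OUTPUT_RATE_BPS = 51_117_000     # 5 minute output rate
INPUT_RATE_BPS = 3_053_000       # 5 minute input rate
TXLOAD = 13                      # txload 13/255
RXLOAD = 1                       # rxload 1/255
OUTPUT_DROPS = 441               # Total output drops
PACKETS_OUTPUT = 6_372_280_523   # packets output

link_bps = BANDWIDTH_KBIT * 1000


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


tx_util = percent(OUTPUT_RATE_BPS, link_bps)
rx_util = percent(INPUT_RATE_BPS, link_bps)
drop_pct = percent(OUTPUT_DROPS, PACKETS_OUTPUT)

print(f"tx utilization: {tx_util:.2f}% "
      f"(txload {TXLOAD}/255 = {percent(TXLOAD, 255):.2f}%)")
print(f"rx utilization: {rx_util:.2f}% "
      f"(rxload {RXLOAD}/255 = {percent(RXLOAD, 255):.2f}%)")
print(f"output drops:   {drop_pct:.7f}% of packets sent")
```

On these figures the uplink is sending at about 5% of its 1 Gb/s capacity, and the output drops are a negligible fraction of the packets sent, so the link itself does not look saturated. Two caveats: the counters have never been cleared, so the drop total covers the whole uptime, and 5-minute averages can hide short bursts.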

Additional output:

Cisco 4507R-E CPU utilization – sorted.

Cisco 2960 CPU utilization – sorted.

TCAM utilization of the 2960 stack (this command is not available on the 4507):

shipping-2960# show platform tcam utilization

CAM Utilization for ASIC# 0                      Max            Used
                                             Masks/Values    Masks/values

 Unicast mac addresses:                       8412/8412        335/335   
 IPv4 IGMP groups + multicast routes:          384/384           1/1     
 IPv4 unicast directly-connected routes:       320/320          28/28    
 IPv4 unicast indirectly-connected routes:       0/0            28/28    
 IPv6 Multicast groups:                        320/320          11/11    
 IPv6 unicast directly-connected routes:       256/256           1/1     
 IPv6 unicast indirectly-connected routes:       0/0             1/1     
 IPv4 policy based routing aces:                32/32           12/12    
 IPv4 qos aces:                                384/384          42/42    
 IPv4 security aces:                           384/384          33/33    
 IPv6 policy based routing aces:                16/16            8/8     
 IPv6 qos aces:                                 60/60           31/31    
 IPv6 security aces:                           128/128           9/9

Cisco 2960 CPU utilization history…

shipping-2960#show processes cpu history

    3333333444443333344444444443333333333444443333344444444443
    9977777111119999966666222229999977777555559999911111000008
100                                                           
 90                                                           
 80                                                           
 70                                                           
 60                                                           
 50                  *****               *****                
 40 **********************************************************
 30 **********************************************************
 20 **********************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5    
               CPU% per second (last 60 seconds)

    4488887787444454444787888444444454677774444444447888544444
    6401207808656506776708000447546664789977697589953201636647
100                                                           
 90                                                           
 80   *###*##*         *#*##*          *#**          ###      
 70   #######*         *#####         *###*         *###      
 60   #######*         *#####       * *####         *###*     
 50 * ########*********######  ** *** *####*********####* ** *
 40 ##########################################################
 30 ##########################################################
 20 ##########################################################
 10 ##########################################################
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5    
               CPU% per minute (last 60 minutes)
              * = maximum CPU%   # = average CPU%

    8889888888888888988888889888888888888888888888888888888888888888898889
    2322334378633453364454472653323431254225563228261399243233354222402310
100                                                                       
 90    *    ***   * **  *  ****        *   ***   * *  **       *     *   *
 80 *#############################*********************************#******
 70 *#####################################################################
 60 *#####################################################################
 50 ######################################################################
 40 ######################################################################
 30 ######################################################################
 20 ######################################################################
 10 ######################################################################
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0 
                   CPU% per hour (last 72 hours)
                  * = maximum CPU%   # = average CPU%
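
The two rows of digits above each graph are read vertically: the top row holds the tens digit and the second row the ones digit of the value for that column. The following is a minimal Python sketch that decodes the "CPU% per second" rows copied from the output above. The function name `decode_history` is illustrative, and the sketch assumes every value is below 100 so that only two digit rows are present.

```python
# Digit rows copied from the "CPU% per second (last 60 seconds)" graph above.
TENS = "3333333444443333344444444443333333333444443333344444444443"
ONES = "9977777111119999966666222229999977777555559999911111000008"


def decode_history(tens_row, ones_row):
    """Combine the vertical digit rows into one integer per graph column."""
    tens_row, ones_row = tens_row.strip(), ones_row.strip()
    if len(tens_row) != len(ones_row):
        raise ValueError("digit rows must have the same length")
    return [int(t + o) for t, o in zip(tens_row, ones_row)]


samples = decode_history(TENS, ONES)

print("first columns:", samples[:10])
print("min:", min(samples), "max:", max(samples))
```

Decoded this way, the per-second samples sit between 37% and 46%. The same function can be applied to the digit rows of the per-minute and per-hour graphs, where the digits give the maximum CPU% for each column.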

Best Answer

Cisco switches put ICMP at the bottom of the priority list. We get the same results if we ping a busy 3750-X.

You need to look at the CPU utilization on the switches; I suspect they are so busy that they are processing packets in software. Are you running any kind of layer 3 services on them?

There is a quite serious bug in IOS 12.2.53:

CSCth24278 (Catalyst 2960-S switches)

The CPU utilization on the switch remains high (50 to 60 percent) when the switch is not being accessed by a telnet or a console session. When you telnet or console into the switch, the CPU utilization goes down.

There is no workaround.

Upgrade to 12.2.58-SE1 or later to fix this situation.
