I'm receiving user complaints about poor network application performance between two parts of a large warehouse facility. The software is a curses-based terminal application running on a Linux server. The clients are PCs running a telnet or SSH client. The trouble started a day ago with no recent (known) changes to the environment.
The core switch is a Cisco Catalyst 4507R-E in the MDF, linked over multimode fiber to a 4-member stack of Cisco Catalyst 2960 switches in the IDF. The servers are in the MDF. The impacted clients are in the IDF.
Pinging from the Linux application server to the 2960 stack's management address across the building shows high variance and a lot of latency:
--- shipping-2960.mdmarra.local ping statistics ---
864 packets transmitted, 864 received, 0% packet loss, time 863312ms
rtt min/avg/max/mdev = 0.521/5.317/127.037/8.698 ms
However, pings to client computers from the application server are a bit more consistent:
--- charles-pc.mdmarra.local ping statistics ---
76 packets transmitted, 76 received, 0% packet loss, time 75001ms
rtt min/avg/max/mdev = 0.328/0.481/1.355/0.210 ms
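To put the two runs side by side, here is a small Python sketch that pulls the figures out of ping's summary line and compares the spread relative to the average. It is only an illustration, and `parse_rtt` is a made-up helper name rather than part of any tool:

```python
import re

# Matches the summary line printed by Linux ping, e.g.
# "rtt min/avg/max/mdev = 0.521/5.317/127.037/8.698 ms"
RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)

def parse_rtt(summary):
    """Return min/avg/max/mdev (in ms) from a ping summary line."""
    m = RTT_RE.search(summary)
    if m is None:
        raise ValueError("no rtt summary line found")
    lo, avg, hi, mdev = (float(x) for x in m.groups())
    return {"min": lo, "avg": avg, "max": hi, "mdev": mdev}

# The two summary lines quoted above.
switch = parse_rtt("rtt min/avg/max/mdev = 0.521/5.317/127.037/8.698 ms")
client = parse_rtt("rtt min/avg/max/mdev = 0.328/0.481/1.355/0.210 ms")

for name, s in (("switch mgmt", switch), ("client PC", client)):
    print(
        f"{name:12s} avg={s['avg']:.3f} ms  "
        f"max/avg={s['max'] / s['avg']:.1f}  "
        f"mdev/avg={s['mdev'] / s['avg']:.2f}"
    )
```

For the switch management address the worst ping is roughly 24 times the average, while for the client PC it is under 3 times.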
None of the relevant Linux interfaces or switchports show errors (see bottom of question).
How can I troubleshoot this?
- Is there an easy method to determine port activity?
- Is the ping variance on the management IP of the switch the wrong thing to measure?
- Could this be the result of a rogue PC?
- Since the problem is isolated to one part of the building, is there anything else I should be checking? Other users in the warehouse are fine and haven't had any issues.
Edit:
I later discovered that the Cisco 2960 CPU utilization is extremely high due to the bug detailed here.
From the 2960 stack…
shipping-2960#sh int GigabitEthernet1/0/52
GigabitEthernet1/0/52 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is b414.894a.09b4 (bia b414.894a.09b4)
Description: TO_MDF_4507
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 13/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive not set
Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseSX SFP
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:00, output 00:00:01, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 441
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 3053000 bits/sec, 613 packets/sec
5 minute output rate 51117000 bits/sec, 4815 packets/sec
981767797 packets input, 615324451566 bytes, 0 no buffer
Received 295141786 broadcasts (286005510 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 286005510 multicast, 0 pause input
0 input packets with dribble condition detected
6372280523 packets output, 8375642643516 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
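On the port activity question, the 5-minute rates in this output already give a rough utilization figure. A minimal Python sketch, using only numbers copied from the output above (the function name is made up for illustration):

```python
# Figures copied from "sh int GigabitEthernet1/0/52" above.
BW_KBIT = 1_000_000            # "BW 1000000 Kbit"
IN_BPS = 3_053_000             # 5 minute input rate, bits/sec
OUT_BPS = 51_117_000           # 5 minute output rate, bits/sec
TXLOAD, RXLOAD = 13, 1         # "txload 13/255, rxload 1/255"
OUTPUT_DROPS = 441             # "Total output drops: 441"
PACKETS_OUT = 6_372_280_523    # "6372280523 packets output"

def utilization(bps, bw_kbit):
    """Fraction of the configured bandwidth used by a rate in bits/sec."""
    return bps / (bw_kbit * 1000)

print(f"output: {utilization(OUT_BPS, BW_KBIT):.2%} (txload: {TXLOAD / 255:.2%})")
print(f"input:  {utilization(IN_BPS, BW_KBIT):.2%} (rxload: {RXLOAD / 255:.2%})")
# Counters were never cleared, so this ratio covers the whole uptime.
print(f"output drops per packet sent: {OUTPUT_DROPS / PACKETS_OUT:.1e}")
```

That works out to about 5% of the gigabit uplink outbound and well under 1% inbound, which agrees with the txload/rxload values, so the uplink itself does not look saturated.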
Additional output:
Cisco 4507R-E CPU utilization – sorted.
Cisco 2960 CPU utilization – sorted.
TCAM utilization of the 2960 stack (this command is not available on the 4507):
shipping-2960# show platform tcam utilization
CAM Utilization for ASIC# 0                 Max             Used
                                            Masks/Values    Masks/values
Unicast mac addresses:                      8412/8412       335/335
IPv4 IGMP groups + multicast routes:        384/384         1/1
IPv4 unicast directly-connected routes:     320/320         28/28
IPv4 unicast indirectly-connected routes:   0/0             28/28
IPv6 Multicast groups:                      320/320         11/11
IPv6 unicast directly-connected routes:     256/256         1/1
IPv6 unicast indirectly-connected routes:   0/0             1/1
IPv4 policy based routing aces:             32/32           12/12
IPv4 qos aces:                              384/384         42/42
IPv4 security aces:                         384/384         33/33
IPv6 policy based routing aces:             16/16           8/8
IPv6 qos aces:                              60/60           31/31
IPv6 security aces:                         128/128         9/9
Cisco 2960 CPU utilization history…
shipping-2960#show processes cpu history
    3333333444443333344444444443333333333444443333344444444443
    9977777111119999966666222229999977777555559999911111000008
100
 90
 80
 70
 60
 50                  *****               *****
 40 **********************************************************
 30 **********************************************************
 20 **********************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
           CPU% per second (last 60 seconds)
4488887787444454444787888444444454677774444444447888544444
6401207808656506776708000447546664789977697589953201636647
100
90
80 *###*##* *#*##* *#** ###
70 #######* *##### *###* *###
60 #######* *##### * *#### *###*
50 * ########*********###### ** *** *####*********####* ** *
40 ##########################################################
30 ##########################################################
20 ##########################################################
10 ##########################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%
    8889888888888888988888889888888888888888888888888888888888888888898889
    2322334378633453364454472653323431254225563228261399243233354222402310
100
 90    *    ***   * **  *  ****        *   ***   * *  **       *     *   *
 80 *#############################*********************************#******
 70 *#####################################################################
 60 *#####################################################################
 50 ######################################################################
 40 ######################################################################
 30 ######################################################################
 20 ######################################################################
 10 ######################################################################
   0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
             0    5    0    5    0    5    0    5    0    5    0    5    0
           CPU% per hour (last 72 hours)
           * = maximum CPU%   # = average CPU%
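The two rows of digits above each graph are read vertically, tens digit over units digit, giving one value per column. As far as I know, in the per-minute and per-hour graphs those digits are the maximum for the interval. A short Python sketch to decode them, assuming no column reached 100% (which would need a third digit row); `decode_history` is a made-up name:

```python
def decode_history(tens, units):
    """Combine the two digit rows into one integer percentage per column."""
    if len(tens) != len(units):
        raise ValueError("digit rows must be the same length")
    return [int(t + u) for t, u in zip(tens, units)]

# Digit rows copied from the per-second graph above.
per_second = decode_history(
    "3333333444443333344444444443333333333444443333344444444443",
    "9977777111119999966666222229999977777555559999911111000008",
)

# Digit rows copied from the per-hour graph above (hourly maximum).
per_hour_max = decode_history(
    "8889888888888888988888889888888888888888888888888888888888888888898889",
    "2322334378633453364454472653323431254225563228261399243233354222402310",
)

print(f"per second: min {min(per_second)}%, max {max(per_second)}%")
print(f"hourly max: min {min(per_hour_max)}%, max {max(per_hour_max)}%")
```

Decoded this way, the last 60 seconds sit between 37% and 46%, and the hourly maximum never dropped below 81% in the 72 hours shown.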
Best Answer
Cisco switches put ICMP at the bottom of the priority list, so pings addressed to the switch itself tell you little about how well it is forwarding traffic. We get the same results if we ping a busy 3750-X.
You need to look at the system utilization on the switches, as I suspect they are so busy that they are doing software processing of packets. Are you running any kind of layer 3 services on these?
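As a rough illustration of what to look for: the header line of `show processes cpu` reports five-second CPU as total%/interrupt%, and a large interrupt share is usually a hint that the CPU is spending its time handling packets. The sample line in this Python sketch is invented to show the format and is not taken from the switches in the question; `parse_cpu_header` is a hypothetical helper:

```python
import re

# Header line format of "show processes cpu" on IOS.
HEADER_RE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def parse_cpu_header(line):
    """Split the header into total, interrupt-level and process-level CPU."""
    m = HEADER_RE.search(line)
    if m is None:
        raise ValueError("not a 'show processes cpu' header line")
    total, interrupt, one_min, five_min = map(int, m.groups())
    return {
        "total_5s": total,
        "interrupt_5s": interrupt,
        "process_5s": total - interrupt,  # remainder is process-level CPU
        "one_min": one_min,
        "five_min": five_min,
    }

# Invented sample line, for illustration only.
sample = "CPU utilization for five seconds: 47%/12%; one minute: 44%; five minutes: 45%"
print(parse_cpu_header(sample))
```

If most of the load is at process level instead, the sorted process list shows which process is responsible.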
There is a quite serious bug in IOS 12.2.53:
Upgrade to 12.2.58-SE1 or later to fix this situation.