Cisco Router – How to Overcome Microbursts in Cisco or Linux Machines

cisco linux multicast packet-loss switch

I have Linux machines multicasting UDP streams from Java applications. The multicasts go through a Cisco router, and I see a lot of packet loss at the receivers. Checking the ports on the router shows the overruns counter increasing. Searching on the internet suggests this happens because of microbursts from the transmitting machine. I suspect the same, because the drops increase over time; maybe the Java app becomes unstable over time and starts outputting bursts of packets, i.e. it's buggy. Is there something I can change on the router?
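
If the Java sender really is the source of the bursts, one application-level mitigation worth trying is to pace the sends so the same average bit rate leaves the machine with a roughly constant inter-packet gap instead of in clumps. Below is a minimal sketch of such pacing with a plain MulticastSocket; everything in it (class name, group address, port, rate, payload size) is a placeholder for illustration, not taken from the question.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// Hypothetical paced sender: the class name, group address, rate and payload
// size are placeholders, not taken from the question.
public class PacedSender {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("239.1.1.1"); // example group
        int port = 5000;                                        // example port
        int packetsPerSecond = 2000;                            // target steady rate
        long gapNanos = 1_000_000_000L / packetsPerSecond;
        byte[] payload = new byte[1316];                        // example payload size

        try (MulticastSocket socket = new MulticastSocket()) {
            long next = System.nanoTime();
            while (true) {
                socket.send(new DatagramPacket(payload, payload.length, group, port));
                next += gapNanos;                               // schedule the next send
                long sleepNanos = next - System.nanoTime();
                if (sleepNanos > 0) {
                    Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
                } else {
                    next = System.nanoTime();                   // fell behind; reset schedule
                }
            }
        }
    }
}

The idea is simply to spread a fixed per-second budget of packets evenly across each second rather than emitting them back to back whenever the application happens to produce them.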

EDIT

Following is Cisco hardware:

Cisco WS-C6504-E with FAN-MOD-4HS (empty 4-slot 6500 Enhanced Chassis) x 1
Cisco PWR-2700-AC/4 2700W AC power supply for 7604/6504-E x 2
Cisco Catalyst 6500/7600 Supervisor 720 module (WS-SUP720-3BXL with WS-F6K-PFC3BXL) x 1
Cisco WS-X6548-GE-TX 48-port 1G copper Ethernet module

Following is the CPU utilization:

RHE-001#show fabric utilization all
 slot    channel      speed    Ingress %     Egress %
    1          0        20G            0            0
    2          0         8G           12            2
    3          0         8G            9           14
    4          0         8G            0           13

RHE-001#

Following are the interface statistics:

RHE-001#show int GigabitEthernet 2/4
GigabitEthernet2/4 is up, line protocol is up (connected)
  Hardware is C6k 1000Mb 802.3, address is 0023.04dd.0d00 (bia 0023.04dd.0d00)
  Internet address is 10.0.1.13/30
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 7/255, rxload 21/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
  input flow-control is off, output flow-control is off
  Clock mode is auto
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:27, output 00:00:05, output hang never
  Last clearing of "show interface" counters 5d08h
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 83389000 bits/sec, 7648 packets/sec
  5 minute output rate 30360000 bits/sec, 2786 packets/sec
  L2 Switched: ucast: 32 pkt, 2048 bytes - mcast: 0 pkt, 0 bytes
  L3 in Switched: ucast: 0 pkt, 0 bytes - mcast: 3542591539 pkt, 4825009676118 bytes mcast
  L3 out Switched: ucast: 0 pkt, 0 bytes mcast: 2879819193 pkt, 3922313740866 bytes
     3542548642 packets input, 4830700273704 bytes, 0 no buffer
     Received 3542548610 broadcasts (3542458124 IP multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 4243199 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     1276819687 packets output, 1738995021346 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
RHE-001#

As you can see, the overrun counter keeps increasing.

Buffers:

RHE-001#show buffers
Buffer elements:
     499 in free list (500 max allowed)
     623919821 hits, 0 misses, 0 created

Public buffer pools:
Small buffers, 104 bytes (total 1024, permanent 1024):
     1021 in free list (128 min, 2048 max allowed)
     183749828 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Medium buffers, 256 bytes (total 3000, permanent 3000):
     2999 in free list (64 min, 3000 max allowed)
     22779465 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 512, permanent 512):
     510 in free list (64 min, 1024 max allowed)
     5814462 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Big buffers, 1536 bytes (total 1000, permanent 1000):
     999 in free list (64 min, 1000 max allowed)
     2009529750 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10):
     10 in free list (0 min, 100 max allowed)
     363 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Large buffers, 9240 bytes (total 8, permanent 8):
     8 in free list (0 min, 10 max allowed)
     57 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 2, permanent 2):
     2 in free list (0 min, 4 max allowed)
     41 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)

Interface buffer pools:
Syslog ED Pool buffers, 600 bytes (total 150, permanent 150):
     118 in free list (150 min, 150 max allowed)
     10421 hits, 10168 misses
LI Middle buffers, 600 bytes (total 512, permanent 256, peak 512 @ 7w0d):
     256 in free list (256 min, 768 max allowed)
     171 hits, 85 fallbacks, 0 trims, 256 created
     0 failures (0 no memory)
     256 max cache size, 256 in cache
     0 hits in cache, 0 misses in cache
EOBC0/0 buffers, 1524 bytes (total 2400, permanent 2400):
     1200 in free list (0 min, 2400 max allowed)
     1200 hits, 0 fallbacks
     1200 max cache size, 680 in cache
     2369496864 hits in cache, 0 misses in cache
LI Big buffers, 1536 bytes (total 512, permanent 256, peak 512 @ 7w0d):
     256 in free list (256 min, 768 max allowed)
     171 hits, 85 fallbacks, 0 trims, 256 created
     0 failures (0 no memory)
     256 max cache size, 256 in cache
     0 hits in cache, 0 misses in cache
IPC buffers, 4096 bytes (total 2352, permanent 2352):
     2242 in free list (784 min, 7840 max allowed)
     333747144 hits, 0 fallbacks, 0 trims, 0 created
     0 failures (0 no memory)
LI Very Big buffers, 4520 bytes (total 257, permanent 128, peak 257 @ 7w0d):
     129 in free list (128 min, 384 max allowed)
     85 hits, 43 fallbacks, 4101 trims, 4230 created
     0 failures (0 no memory)
     128 max cache size, 128 in cache
     0 hits in cache, 0 misses in cache
Private Huge IPC buffers, 18024 bytes (total 2, permanent 2):
     2 in free list (1 min, 4 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Private Huge buffers, 65280 bytes (total 2, permanent 2):
     2 in free list (1 min, 4 max allowed)
     787 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)

Header pools:


RHE-001#

EDIT 2

There are no switches. The multicast transmitting machines and the receiving machines are directly connected to the Cisco router.

Best Answer

First of all, show fabric utilization all shows fabric utilization, not CPU utilization. The fabric has no CPU component per se, and you can take it all the way up to 100% utilization without the kind of adverse effects a CPU shows as it approaches full load.

Next, the WS-X6548-GE-TX is an 8 Gbit/s card, i.e. an "old" fabric-attached line card with an 8 Gbit/s channel. Internally it shares buffers per group of 8 ports on the card, and since you're getting 'overrun' errors, which typically point to a problem with receiving traffic in a timely manner and handing it over to other ports, the first thing I'd do is isolate the incoming ports in their own 8-port group on the card. In other words, if there is a specific port or group of ports receiving high-volume multicast traffic, I'd move it to a separate group on the card, and please remember, each consecutive block of 8 ports is one "group":

http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/hardware/Module_Installation/Mod_Install_Guide/6500-emig/02ethern.html#wp1043307

This means, among other things, that the 8 Gbit/s interface to the fabric is statically sliced into 6 groups of 8 ports, and each group gets 1 Gbit/s at most. So if any given port group (of 8 x 10/100/1000 ports) contains ports receiving traffic at over 1 Gbit/s in aggregate, you will hit exactly the problems you're encountering. That's why my proposal is to move every other port out of the 8-port group that holds the interface receiving the massive amount of multicast traffic (it appears to be GigabitEthernet 2/4 in your case; the small example after the quote below shows which ports share its group). You can find this stated literally in the release notes:

http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/15-1SY/release_notes.html#pgfId-4909956

"The aggregate bandwidth of each set of 8 ports (1–8, 9–16, 17–24, 25–32, 33–40, and 41–48) is 1 Gbps."
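
To make the grouping rule concrete, here is a tiny, purely illustrative Java helper that maps a front-panel port number to its 8-port group on the WS-X6548-GE-TX; it only encodes the rule quoted above and is not based on any Cisco tooling.

// Purely illustrative: encodes the "each consecutive 8 ports is one group"
// rule quoted above; not a Cisco tool.
public class PortGroup {
    // Returns {firstPort, lastPort} of the 8-port group containing 'port'.
    static int[] groupOf(int port) {
        int first = ((port - 1) / 8) * 8 + 1;
        return new int[] { first, first + 7 };
    }

    public static void main(String[] args) {
        int[] g = groupOf(4); // GigabitEthernet2/4 from the question
        System.out.printf("Gi2/4 shares its 1 Gbit/s fabric allocation with ports %d-%d%n", g[0], g[1]);
        // Prints ports 1-8: keep Gi2/1-Gi2/8 free of other busy links, or move
        // the multicast-heavy link so it sits alone in its group.
    }
}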

For better utilization of the physical ports, I'd recommend you take a look at the WS-X6748-GE-TX card, which also has 48 10/100/1000 ports but comes with two 20 Gbit/s fabric connections. Those 20 Gbit/s fabric channels are split between ports 1-24 and 25-48, so you still get oversubscription, but only 24 Gbit/s over the 20 Gbit/s supported by the channel, not 8 Gbit/s over 1 Gbit/s as on the 6548 (so, effectively, 1.2:1 oversubscription on the 6748 vs 8:1 on the 6548). This should give you headroom for the sending station to burst traffic over the link and for the system to distribute it.
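
As a quick sanity check of those ratios, here is a small Java snippet that reproduces the arithmetic using only the figures stated above (8 x 1 Gbit/s ports behind a 1 Gbit/s group allocation on the 6548, 24 x 1 Gbit/s ports behind each 20 Gbit/s channel on the 6748).

// Reproduces the oversubscription arithmetic stated above; no other data.
public class Oversubscription {
    static double ratio(int ports, double portGbps, double uplinkGbps) {
        return (ports * portGbps) / uplinkGbps;
    }

    public static void main(String[] args) {
        System.out.printf("WS-X6548-GE-TX, per 8-port group: %.1f:1%n", ratio(8, 1.0, 1.0));   // 8.0:1
        System.out.printf("WS-X6748-GE-TX, per 24-port half: %.1f:1%n", ratio(24, 1.0, 20.0)); // 1.2:1
    }
}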
