Cisco 7609-S Line card error counter is regularly exceeding threshold

cisco cisco-7600 errors switch

I am receiving the following syslog messages on a 7609-S.

Jun 17 11:52:27.560 BST: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: ID:47 IN:4 PO:255 RE:169 RM:255 DV:5 EG:2 CF:10 TF:472
Jun 17 11:52:27.560 BST: %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 6 Error counter exceeds threshold, system operation continue.

The card in slot 6 is as follows:

router1#show module 6 
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  6   48  SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-TX     XXXXXXXX

Mod MAC addresses                       Hw    Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
  6  000e.d771.8550 to 000e.d771.857f  10.1   7.2(1)       8.7(0.22)FW2 Ok

Mod  Online Diag Status 
---- -------------------
  6  Pass

router1#show ver
Cisco IOS Software, c7600rsp72043_rp Software (c7600rsp72043_rp-ADVENTERPRISEK9-M), Version 12.2(33)SRE3, RELEASE SOFTWARE (fc1)

Timeline of the messages so far:

  • 2013-06-02: I received this message for the first time, once
  • 2013-06-06: I received the message again, only once
  • 2013-06-11: I received the message again, only once
  • 2013-06-17: I have received this message three times today, within a 2-hour period

Searching the Internet, I see other people reporting this issue, and it seems to indicate a hardware failure on the horizon. Has anyone experienced this error before? To the best of my knowledge, it simply means that the line card is seeing errors above a certain threshold, which causes the system to log a syslog message. Should I be worried about this line card?

I do have some graphs showing interface error counters, traffic, etc., which I will post here when I get some time over the next day or two, although I'm not finding much correlation at this point.

Best Answer

Worst case, your hardware has gone bad.
Best case, it's a cosmetic failure due to a software defect. Luckily you are on SRE, which will be supported until 2015, so consider upgrading to the latest rebuild.

There are two bug IDs that can cause this error in a very benign way:

  • CSCsk03373, due to large packets, fixed in SXH
  • CSCsw32280, due to CRC errors, fixed in SXH

You should probably check 'show diag events'; its output should correlate with these messages.

GOLD gives us a description of 'TestErrorCounterMonitor', which helps in understanding the fields of the message:

ID -- Asic Identification
IN -- Asic Instance
PO -- Asic Port Number
RE -- Register Identification
RM -- Register Identification More
EG -- Error Group
DV -- Delta Value
CF -- Consecutive Failure
TF -- Total Failure
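
As a rough illustration of the legend above, the short codes in the ERROR_COUNTER_DATA line can be mapped back to their GOLD descriptions. This is only a sketch: the function name `parse_error_counter_data` and the regular expression are my own assumptions about the message layout, based solely on the single sample line in the question.

```python
import re

# Field legend as given in the GOLD description of TestErrorCounterMonitor.
FIELD_NAMES = {
    "ID": "Asic Identification",
    "IN": "Asic Instance",
    "PO": "Asic Port Number",
    "RE": "Register Identification",
    "RM": "Register Identification More",
    "EG": "Error Group",
    "DV": "Delta Value",
    "CF": "Consecutive Failure",
    "TF": "Total Failure",
}

# Assumed layout: two capital letters, a colon, then a decimal number.
PAIR = re.compile(r"\b([A-Z]{2}):(\d+)\b")


def parse_error_counter_data(line):
    """Return {description: value} for an ERROR_COUNTER_DATA syslog line.

    Returns an empty dict if the line is not an ERROR_COUNTER_DATA message.
    """
    _, marker, payload = line.partition("ERROR_COUNTER_DATA:")
    if not marker:
        return {}
    return {
        FIELD_NAMES[code]: int(value)
        for code, value in PAIR.findall(payload)
        if code in FIELD_NAMES
    }


# Sample line copied from the question.
SAMPLE = (
    "Jun 17 11:52:27.560 BST: %CONST_DIAG-SP-4-ERROR_COUNTER_DATA: "
    "ID:47 IN:4 PO:255 RE:169 RM:255 DV:5 EG:2 CF:10 TF:472"
)

if __name__ == "__main__":
    for name, value in parse_error_counter_data(SAMPLE).items():
        print(f"{name}: {value}")
```

For the line in the question this gives ASIC ID 47, instance 4, port 255, with 10 consecutive failures and 472 total failures.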

Unfortunately I don't have any CEF256 cards, so I can't check which ASIC it was, but you should be able to do it with:

remote command switch show platform hardware asic-versions | i 47

IN tells you which instance of the ASIC it is. Since there are at least 4 of them, I'm guessing it is the 'pinnacle' ASIC, which is the port ASIC in CEF256, as I don't think CEF256 has 4 of any other ASIC.

If it is pinnacle, you should be able to use 'sh int capabilities module X' and 'sh int X capabilities' to determine which ports are sharing the 4th port ASIC.

However, the 'Asic Port Number' of 255 seems to contradict it being 'pinnacle', as no physical port would have this number.
There are some special ports on the card, like EOBC, RBUS, DBUS and fabric. Unfortunately I don't know what 255 means; it might refer to one of these special ports, or it might just be a placeholder value.

If 'Total Failure' (TF) correlates with interface CRC errors, it might be CSCsw32280. On the other hand, CSCsw32280 should show a sensible PO number.
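
One rough way to test that correlation is to check whether the interface CRC counter rose around the time of each syslog message. The sketch below assumes you already poll the cumulative CRC counter (for example via SNMP) into a list of timestamped samples. The function name, the 5-minute window and the sample values are all hypothetical and not taken from the router in question.

```python
from datetime import datetime, timedelta


def crc_increase_around(event_time, samples, window=timedelta(minutes=5)):
    """Return the CRC counter increase across samples near event_time.

    samples: (timestamp, cumulative CRC count) tuples, sorted by time.
    Returns None when fewer than two samples fall inside the window.
    """
    nearby = [count for ts, count in samples if abs(ts - event_time) <= window]
    if len(nearby) < 2:
        return None
    return nearby[-1] - nearby[0]


# Hypothetical polled values, for illustration only.
samples = [
    (datetime(2013, 6, 17, 11, 45), 100),
    (datetime(2013, 6, 17, 11, 50), 100),
    (datetime(2013, 6, 17, 11, 55), 130),
    (datetime(2013, 6, 17, 12, 0), 130),
]

# Timestamp of the syslog message from the question.
event = datetime(2013, 6, 17, 11, 52, 27)

if __name__ == "__main__":
    print(crc_increase_around(event, samples))
```

If the counter does not move around the messages on any port of the module, CRC errors are an unlikely trigger, which would point away from CSCsw32280.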

If everything else fails, buy SMARTnet for the card for a year. I'd be curious to hear the root cause, so please answer your own question once you solve this, especially if you can find out what port 255 is.