Cisco – Cat6K parity failure and long SSO failover

cisco, cisco-6500, cisco-ios, failover, switch

On a Cat6K, the RP crashinfo indicates a parity error. Guidance seems to be to not do anything unless this happens more than once in a 12-month period. At what point do you push Cisco for hardware or RAM replacement?

Cache error detected!
  CP0_ECC     (reg 26/0): 0x00000064
  CP0_CACHERI (reg 27/0): 0x20000000
  CP0_CAUSE   (reg 13/0): 0x00000400

Real cache error detected.  System will be halted.

Error: Primary instr cache, fields: data,
Actual physical addr 0x00000000,
virtual address is imprecise.

 Imprecise Data Parity Error

 Imprecise Data Parity Error

Interrupt exception, CPU signal 20, PC = 0x41AAE2DC
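
(To gauge whether this is happening more than once in a 12-month period, each crash should leave a timestamped crashinfo file on the Supervisor's local flash, and show version reports the last reload reason. The filesystem name below is a guess for a Sup720; it may be disk0: or sup-bootdisk: rather than bootflash: depending on the setup.)

dir bootflash: | include crashinfo
show version | include returned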

It also took exceedingly long to fail over from the active SUP720-PFC3B to the hot standby — 18 minutes — with SSO. From my research, it appears the long failover time could be due to crashdumps, but I don't see coredumps configured. And without crashdumps, wouldn't root-cause analysis be difficult, if not impossible?

Cisco states the following for failover time. Without coredumps configured — no exception-type commands — why would SSO take so long (18 minutes)? I was completely down during this period; even my HSRP VIPs that were active seem to have stayed alive on a dead SUP instead of moving to another Cat6K; I need more log analysis to know for sure.

The time required by the device to switch over from the active RP to the standby RP is between zero and three seconds.

Although the newly active processor takes over almost immediately following a switchover, the time required for the device to begin operating again in full redundancy (SSO) mode can be several minutes, depending on the platform. The length of time can be due to a number of factors including the time needed for the previously active processor to obtain crash information, load code and microcode, and synchronize configurations between processors.

s-oc4-n2-agg1#sh redundancy states
       my state = 13 -ACTIVE
     peer state = 8  -STANDBY HOT
           Mode = Duplex
           Unit = Secondary
        Unit ID = 6
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured)  = sso
Redundancy State              = sso

     Split Mode = Disabled
   Manual Swact = Enabled
 Communications = Up

   client count = 62
 client_notification_TMR = 30000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 0
    keep_alive threshold = 19
           RF debug mask = 0x0
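
(For reference, a quick way to confirm that no core-dump-related commands are present in the running config — assuming nothing is hiding under a different keyword — is a filter like the following.)

show running-config | include exception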

Best Answer

At what point do you push Cisco for hardware or RAM replacement?

Whenever you feel you can't afford the risk of the hardware being bad. Single-bit parity errors happen because Cisco used non-ECC memory in some components of the Supervisor. As long as an occasional 3-second SSO failover is tolerable, just roll with Cisco's recommendation, because solar radiation causes most parity-bit failures in IOS.
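
If the crashes do start recurring and you want to make the hardware-replacement case to TAC, the crashinfo file plus the on-board (GOLD) diagnostics output are usually the evidence they want to see — something along these lines, where module 5 is only a placeholder for wherever the Supervisor sits in your chassis:

show module
show diagnostic result module 5 detail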

From my research, it appears my only option to reduce failover time is to disable crashdumps altogether; is this correct? And without crashdumps, root-cause will be difficult if not impossible.

Having a core dump is great, but I would only enable FTP core dumps when you must figure out the cause of a recurring bug in your network, for these reasons (a sample configuration follows the list):

  • A core dump sends the entire contents of RAM to disk. Cisco IOS writes from the MSFC to the network very slowly, and as you noticed, a core dump from an RP with half a gigabyte (or more) of DRAM takes a long time.
  • Only Cisco IOS developers will care about the core dump (TAC will just attach it to the case, but only a developer can analyze it since it requires special skills). Getting a developer's attention is pointless if the root cause of the bug is already known... and suffering 15-minute outages for every solar-flare event is too heavy a price to pay.
  • Many failover events are caused by soft parity errors from cosmic radiation. Waiting all that time for a core dump is useless because there is no bug to be fixed.
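
If you ever do need a core dump for a recurring crash, the classic IOS knobs look roughly like the sketch below — the server address, credentials, and filename are placeholders, and the exact syntax varies a bit by release:

ip ftp username coredump
ip ftp password example-password
exception protocol ftp
exception dump 192.0.2.10
exception core-file cat6k-sup720-core

Just remember that with this enabled, the crashing Supervisor streams its DRAM to the FTP server, which is where the multi-minute delays discussed above come from.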

BTW, a bug's root cause can be found without a core dump; Cisco IOS developers do this all the time based on the stack trace you find in the IOS crashinfo. Also see this example crashinfo. I diagnosed crashes for several years in Cisco Advanced Services, and crashinfo is how we isolated most Cisco bugs.
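
(The crashinfo files themselves live on the Supervisor's local flash, so the stack trace is available without any core-dump configuration at all. The filename below is only an example of the naming pattern — list the filesystem to find the real one.)

dir bootflash:
more bootflash:crashinfo_20100101-000000-UTC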