From a Cat6K, the RP crashinfo indicates a parity error. Cisco's guidance seems to be to do nothing unless this happens more than once in a 12-month period. At what point do you push Cisco for hardware or RAM replacement?
```
Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000064
  CPO_CACHERI (reg 27/0): 0x20000000
  CP0_CAUSE   (reg 13/0): 0x00000400
Real cache error detected.  System will be halted.

Error: Primary instr cache, fields: data,
Actual physical addr 0x00000000, virtual address is imprecise.

Imprecise Data Parity Error
Imprecise Data Parity Error

Interrupt exception, CPU signal 20, PC = 0x41AAE2DC
```
It also took exceedingly long (18 minutes) to fail over from the active SUP720-PFC3B to the hot standby with SSO. From my research, it appears the long failover time could be due to crashdumps, but I don't see coredumps configured. And without crashdumps, wouldn't root cause be difficult, if not impossible, to determine?
Cisco states the following about failover time. Without coredumps configured (no exception-type commands in the running config; a quick check is sketched after the quote below), why would SSO take so long (18 minutes)? I was completely down during this period; even my HSRP VIPs that were active seemed to have stayed alive on the dead SUP instead of moving to another Cat6K. I need more log analysis to know for sure.
The time required by the device to switch over from the active RP to the standby RP is between zero and three seconds.
Although the newly active processor takes over almost immediately following a switchover, the time required for the device to begin operating again in full redundancy (SSO) mode can be several minutes, depending on the platform. The length of time can be due to a number of factors including the time needed for the previously active processor to obtain crash information, load code and microcode, and synchronize configurations between processors.
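To rule out core dumps as the cause, a quick check of the running config; a minimal sketch using standard IOS output filters, run on the hostname from the output below:

```
s-oc4-n2-agg1#show running-config | include exception
```

If that returns nothing, no core dump is configured, so writing a core file shouldn't be what stretched the switchover out.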
```
s-oc4-n2-agg1#sh redundancy states
       my state = 13 -ACTIVE
     peer state = 8  -STANDBY HOT
           Mode = Duplex
           Unit = Secondary
        Unit ID = 6

Redundancy Mode (Operational) = sso
Redundancy Mode (Configured)  = sso
Redundancy State              = sso
     Split Mode = Disabled
   Manual Swact = Enabled
 Communications = Up

   client count = 62
 client_notification_TMR = 30000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 0
    keep_alive threshold = 19
          RF debug mask = 0x0
```
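For that further log analysis, my starting point will be the Redundancy Facility event history, which timestamps each step of the switchover and should show where the 18 minutes went (assuming this image supports the command, which to my knowledge the SUP720 SSO images do):

```
s-oc4-n2-agg1#show redundancy history
```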
Best Answer
Whenever you feel you can't afford the risk of the hardware being bad. Single-bit parity errors happen because Cisco used non-ECC memory in some components of the Supervisor; most of them are soft errors caused by cosmic radiation rather than failing hardware, which is why Cisco recommends waiting for a recurrence. As long as an occasional 3-second SSO failover is tolerable, just roll with that recommendation.
Having a core dump is great, but I would only enable FTP core dumps when you must figure out the cause of a recurring bug in your network... the reasons...
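For reference, a minimal sketch of what enabling core dumps over FTP looks like on IOS; the server address, credentials, and filename here are placeholders, not values from your network:

```
! Placeholder FTP credentials and server, for illustration only
ip ftp username coreuser
ip ftp password corepass
! Write a core dump over FTP to 192.0.2.10 on a crash
exception protocol ftp
exception dump 192.0.2.10
! Optional: base filename for the core file
exception core-file s-oc4-n2-agg1-core
```

Keep in mind that dumping the Supervisor's entire DRAM to an FTP server is exactly the kind of "obtain crash information" step that Cisco's quoted text warns can slow recovery, so leave it disabled unless you are actively chasing a recurring bug.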
BTW, a bug's root cause can often be found without a core dump; Cisco IOS developers do this all the time based on the stack trace in the IOS crashinfo. Also see this example crashinfo. I diagnosed crashes for several years in Cisco Advanced Services, and crashinfo is how we isolated most Cisco bugs.
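If you want to pull that data from your own box, a sketch; the crashinfo filename below is a placeholder, since the real one carries a timestamp:

```
s-oc4-n2-agg1#dir bootflash: | include crashinfo
s-oc4-n2-agg1#more bootflash:crashinfo_<timestamp>
s-oc4-n2-agg1#show stacks
```

`show stacks` prints the stack trace from the most recent crash, which is what TAC decodes against your exact IOS image to pin down the bug.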