Linux – What to check/replace first in response to mcelog “Memory address parity error” / MEMORY CONTROLLER AC_CHANNEL0_ERR messages

hardware, linux

I have a server that kernel panics every few days.

mcelog tells me:

Hardware event. This is not a software error.
MCE 0
CPU 6 BANK 8 
MISC 0 
TIME 1317928482 Thu Oct  6 15:14:42 2011
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 21763
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea1540c0008000b0 MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 44
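
For reference, the decoded fields above can be cross-checked against the raw STATUS value. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name and dictionary keys are made up for illustration and are not part of mcelog. The model-specific bits that carry the "Memory address parity error" detail are deliberately left undecoded.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTION = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Return the flags, corrected error count and MCA code of a status word."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        # Bits 52:38 hold the corrected error count when the CPU reports it.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory controller errors match the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        result["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # Channel value 1111 means "channel not specified".
        result["channel"] = None if channel == 0xF else channel
    return result


decoded = decode_mci_status(0xEA1540C0008000B0)
for key, value in decoded.items():
    print(key, value)
```

For the STATUS value in this log, the sketch reports the overflow, uncorrected, MISC-valid and context-corrupt flags, an address/command transaction on channel 0, and the same corrected error count that mcelog prints.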

I'm going to try a BIOS update. After that, I'm not sure what to try next. Disabling the 2nd CPU will probably keep me up and running for now.

Best Answer

If this really is a CPU error, the CPU is probably faulty.

You could try an Intel microcode update first.
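
To confirm that a microcode update actually took effect, one option is to compare the revision reported per logical CPU before and after. This is a rough sketch, assuming a kernel whose /proc/cpuinfo exposes a "microcode" field (older kernels may not); the function name is hypothetical.

```python
import os


def microcode_revisions(cpuinfo_text):
    """Map each logical processor number to its reported microcode revision."""
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "microcode" and current is not None:
            revisions[current] = value
    return revisions


# Only read the real file where it exists (i.e. on Linux).
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as handle:
        for cpu, revision in sorted(microcode_revisions(handle.read()).items()):
            print(f"cpu {cpu}: microcode {revision}")
```

If the revision is unchanged after the update and a reboot, the new microcode was probably not loaded.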