Interpreting mcelog output for bad DIMM

memory

I'm getting streams of mcelog errors on a machine to which I don't have physical access. It seems like a bad DIMM, but I'm having a hard time determining exactly which one.

The mcelog output looks like this:

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 5 
MISC 21402a2a86 ADDR a8c35dcc0 
TIME 1452026764 Tue Jan  5 12:46:04 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Transaction: Memory read error
STATUS cc0000c000010093 MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 45

Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 11 
MISC 90840000000208c ADDR a089ddac0 
TIME 1452026764 Tue Jan  5 12:46:04 2016
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL3_ERR
Transaction: Memory scrubbing error
STATUS 8c000050000800c3 MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 45

There's a lot of inconsistency between my output and the mcelog documentation. My first problem is that the machine has two 8-core Xeons, and normally I would assume that they are numbered 0 and 1. However, some posts I've read suggest that mcelog might label the "first" CPU as CPU 0-7 and the "second" as CPU 8-15.

The second problem is that I can't figure out what BANK 5 means. It's not synonymous with the DIMM slots, because right now we are only using slots 1-4. dmidecode helpfully reports "Bank Locator: Not Specified" on every DIMM.

Also, MEMORY CONTROLLER MS_CHANNEL3_ERR makes me think that the error is coming in on channel 3. According to the motherboard diagram, channel 3 serves slots 4, 8, and 12, which would mean the DIMM in slot 4 is the culprit, but I'm not sure how to verify that.

I have tried mcelog with the --dmi switch, but it fails and suggests an update. This machine is badly out of date (Ubuntu 12.04, and not even the latest packages for that release), but updating it opens another can of worms. I'd like to get this memory problem fixed before I try anything else drastic.

I'm grateful for any help in interpreting this and figuring out what to replace before I send someone on the long drive to the data center.

Best Answer

I never did find a clear interpretation of the mcelog data, but my best guess worked out, and I figured I should follow up for posterity.

  • I assumed CPU 1 meant the second CPU, helpfully labeled as 2 on the motherboard diagram.
  • I assumed MEMORY CONTROLLER MS_CHANNEL3_ERR indicated channel 3 on that CPU's memory controller. As above, that channel serves slots 4, 8 and 12, and only slot 4 had a DIMM in it.
  • I ignored everything else.

I had someone swap out that DIMM, and, presto! No more streams of Machine Check errors.
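For anyone repeating this on a box that logs many events, the guesswork above can be sketched as a small Python script. This is only an illustration under stated assumptions, not an official way to decode mcelog: the regular expressions assume the field layout shown in the question, the function names (parse_records, tally, suspect_slots) are made up for this example, the channel-to-slot table contains only what the question states about this particular motherboard, and treating SOCKETID as the physical package is an assumption that happened to agree with the CPU 1 guess here.

```python
import re
from collections import Counter

# Abridged from the two records in the question; only the lines the
# parser actually reads are kept.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 5
MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
MCGCAP 1000c14 APICID 20 SOCKETID 1
Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 11
MCA: MEMORY CONTROLLER MS_CHANNEL3_ERR
MCGCAP 1000c14 APICID 20 SOCKETID 1
"""

RECORD_HEADER = "Hardware event. This is not a software error."

# From the question's motherboard diagram: channel 3 serves slots 4, 8, 12.
# The other channels are not given in the post, so they are left out.
CHANNEL_SLOTS = {3: {4, 8, 12}}
# The question says only slots 1-4 are populated.
POPULATED_SLOTS = {1, 2, 3, 4}


def parse_records(text):
    """Split mcelog text into records and pull out CPU, bank, socket, channel."""
    records = []
    for chunk in text.split(RECORD_HEADER):
        cpu = re.search(r"^CPU (\d+) BANK (\d+)", chunk, re.M)
        chan = re.search(r"MCA: MEMORY CONTROLLER ([A-Z]+)_CHANNEL(\d+)_ERR", chunk)
        sock = re.search(r"SOCKETID (\d+)", chunk)
        if not (cpu and chan):
            continue  # empty chunk or not a memory-controller channel error
        records.append({
            "cpu": int(cpu.group(1)),
            "bank": int(cpu.group(2)),
            "kind": chan.group(1),  # e.g. RD (read) or MS (scrub)
            "channel": int(chan.group(2)),
            "socket": int(sock.group(1)) if sock else None,
        })
    return records


def tally(records):
    """Count events per (socket, channel) pair."""
    return Counter((r["socket"], r["channel"]) for r in records)


def suspect_slots(channel):
    """Populated slots on the given channel, per the assumed slot table."""
    return CHANNEL_SLOTS.get(channel, set()) & POPULATED_SLOTS


if __name__ == "__main__":
    for (socket, channel), count in tally(parse_records(SAMPLE)).most_common():
        slots = sorted(suspect_slots(channel))
        print(f"socket {socket} channel {channel}: {count} events, candidate slots {slots}")
```

To run it against a real log, replace SAMPLE with the contents of the mcelog output file. If one (socket, channel) pair dominates the tally and only one populated slot sits on that channel, that DIMM is the first candidate to swap.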
