Interpreting mcelog output for bad DIMM

memory

I'm getting streams of mcelog errors on a machine to which I don't have physical access. It seems like a bad DIMM, but I'm having a hard time determining exactly which one.

mcelog output looks like

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 5 
MISC 21402a2a86 ADDR a8c35dcc0 
TIME 1452026764 Tue Jan  5 12:46:04 2016
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Transaction: Memory read error
STATUS cc0000c000010093 MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 45

Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 11 
MISC 90840000000208c ADDR a089ddac0 
TIME 1452026764 Tue Jan  5 12:46:04 2016
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL3_ERR
Transaction: Memory scrubbing error
STATUS 8c000050000800c3 MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 45

There's a lot of inconsistency between my output and the mcelog documentation. My first problem is that there are two 8-core Xeons, and normally I would assume that they are numbered 0 and 1. However, some posts I've read suggest that mcelog might label the "first" CPU as CPU 0-7 and the "second" as CPU 8-15.

The second problem is that I can't figure out what BANK 5 means. It's not synonymous with the DIMM slots, because right now we are only using slots 1-4. dmidecode helpfully reports "Bank Locator: Not Specified" on every DIMM.

Also, MEMORY CONTROLLER MS_CHANNEL3_ERR makes me think that the error is coming in on channel 3. According to the motherboard diagram, channel 3 is for slots 4, 8, and 12, which would mean the DIMM in slot 4 is the culprit, but I'm not sure how to verify that.

I have tried mcelog with the --dmi switch, but it fails and suggests an update. This machine is badly out of date (Ubuntu 12.04, and not even the latest packages for that release), but updating it opens another can of worms. I'd like to get this memory problem fixed before I try anything else drastic.

I'm grateful for any help in interpreting this and figuring out what to replace before I send someone on the long drive to the data center.

Best Answer

I never did find a clear interpretation of the mcelog data, but my best guess worked out, and I figured I should follow up for posterity.

I assumed CPU 1 meant the second CPU, helpfully labeled as 2 on the motherboard diagram.
I assumed MEMORY CONTROLLER MS_CHANNEL3_ERR indicated channel 3 on that CPU's memory controller. As above, that channel controls slots 4, 8 and 12, and only slot 4 had a DIMM in it.
I ignored everything else.
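
The two fields used in these assumptions can be pulled out of the raw mcelog text mechanically. The following is a minimal Python sketch, not part of mcelog: the summarize helper and the abbreviated SAMPLE record are illustrative, and it assumes that SOCKETID is the physical socket and that the digit in CHANNELn is the memory channel, which is the same guess made above.

```python
import re

# Abbreviated copy of the second record from the question.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 11
MCA: MEMORY CONTROLLER MS_CHANNEL3_ERR
Transaction: Memory scrubbing error
STATUS 8c000050000800c3 MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
"""


def summarize(record):
    """Return (socket, channel); a field that is absent comes back as None."""
    socket = re.search(r"SOCKETID (\d+)", record)
    channel = re.search(r"MEMORY CONTROLLER [A-Z]+_CHANNEL(\d+)_ERR", record)
    return (
        int(socket.group(1)) if socket else None,
        int(channel.group(1)) if channel else None,
    )


print(summarize(SAMPLE))  # (1, 3)
```

Mapping the resulting (socket, channel) pair to a physical slot still needs the motherboard diagram; nothing in the record names the slot directly.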

I had someone swap out that DIMM, and, presto! No more streams of Machine Check errors.
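
As a cross-check, the low 16 bits of the STATUS value appear to encode the same channel. The sketch below decodes them assuming the IA32_MCi_STATUS layout and the memory-controller error code format (000F 0000 1MMM CCCC) as I understand them from the Intel Software Developer's Manual. The decode_status helper is hypothetical, so verify the bit positions against the manual for your CPU.

```python
# Flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of the memory-controller error code 000F 0000 1MMM CCCC.
TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status):
    """Decode the flag bits and, if present, the memory-controller error code."""
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,
    }
    code = result["mca_code"]
    # Ignore the F bit (bit 12) and require the 0000 0000 1xxx xxxx pattern.
    if code & 0xEF80 == 0x0080:
        result["transaction"] = TRANSACTIONS.get((code >> 4) & 0x7, "unknown")
        channel = code & 0xF
        # 0b1111 means the channel was not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


for raw in ("cc0000c000010093", "8c000050000800c3"):
    print(raw, decode_status(int(raw, 16)))
```

For the two STATUS values in the question this reports channel 3 for both, with a memory read error for the first and a memory scrubbing error for the second, which matches what mcelog printed.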

Meaning of the values

The first line means:

total: Your total (physical) RAM, excluding a small amount that the kernel permanently reserves for itself at startup; that is why it shows about 11.7 GiB and not the 12 GiB you probably have.
used: memory in use by the OS.
free: memory not in use.
shared / buffers / cached: memory used for specific purposes; these values are included in the value for used.

The second line gives adjusted versions of the first-line values: the original value for used minus the sum buffers+cached, and the original value for free plus the sum buffers+cached, hence its title. These adjusted values are often more meaningful than those of the first line.
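
The adjustment is plain arithmetic, which a short Python sketch makes concrete. The function name and the numbers (in MiB) are hypothetical and chosen only for illustration; they are not taken from real free output.

```python
def adjust_for_cache(used, free, buffers, cached):
    """Return the (used, free) pair shown on the -/+ buffers/cache: line."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Hypothetical first-line values in MiB.
used, free, buffers, cached = 11000, 980, 300, 7200

adj_used, adj_free = adjust_for_cache(used, free, buffers, cached)
print(adj_used, adj_free)  # 3500 8480
```

Note that the two adjusted values still add up to the same total as used plus free on the first line, because the buffers and cache are only moved from one column to the other.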

The last line (Swap:) gives information about swap space usage (i.e. memory contents that have been temporarily moved to disk).

Background

To actually understand what the numbers mean, you need a bit of background about the virtual memory (VM) subsystem in Linux. The short version: Linux (like most modern operating systems) will always try to use free RAM for caching, so Mem: free will almost always be very low. That is why the -/+ buffers/cache: line is shown: it tells you how much memory is free when caches are ignored. Caches are freed automatically if memory gets scarce, so they do not really matter.

A Linux system is really low on memory if the free value in the -/+ buffers/cache: line gets low.

For more details about the meaning of the numbers, see e.g. the questions:

Changes in procps 3.3.10

Note that the output of free was changed in procps 3.3.10 (released in 2014). The columns reported are now "total", "used", "free", "shared", "buff/cache", "available", and the meanings of some of the values changed, mainly to better account for the Linux kernel's slab cache.
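
As far as I know, the "available" column is taken from the MemAvailable field of /proc/meminfo on kernels that provide it. The Python sketch below parses that file format; the parse_meminfo helper and the sample values are hypothetical, and the sample is embedded so the snippet runs on any system.

```python
# Hypothetical excerpt in the /proc/meminfo format (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       12268544 kB
MemFree:          512000 kB
MemAvailable:    8192000 kB
Buffers:          204800 kB
Cached:          7168000 kB
"""


def parse_meminfo(text):
    """Map each field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])
    return fields


info = parse_meminfo(SAMPLE_MEMINFO)
print(info["MemAvailable"])  # 8192000

# On a real Linux system, read the live values instead:
# with open("/proc/meminfo") as f:
#     info = parse_meminfo(f.read())
```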

See Debian Bug report #565518 for the motivation, and What do the changes in free output from 14.04 to 16.04 mean? for more detailed information.

Mysql – Where is the free memory? (Solaris 10)

Can you please paste the output of the following commands?

prtdiag -v
prstat -a

Related Solutions

Meaning of Buffers/Cache Line in Free Command Output – Linux Memory Usage
