Linux Memory – How to Get Notified of ECC Errors

ecc linux

How do I get notified when a Linux machine equipped with ECC memory detects a memory failure? I'm interested in both correctable and uncorrectable errors.

If a message is written to dmesg or the syslog, that is already fine, but I'd love to know what to look for.
Installing additional daemons (like smartmontools for hard drives) is acceptable.
Nagios/Icinga monitoring would be another way to go.
Not all machines to be monitored have IPMI.

The systems of interest have Supermicro boards (X9SCM-F). I'm also curious about an HP N54L Microserver, but don't care too much about it. All systems run Debian or Ubuntu Linux.

Best Answer

The Linux kernel supports the error detection and correction (EDAC) features of some chipsets. On a supported system with ECC the status of your memory controller is accessible via sysfs:

/sys/devices/system/edac/mc

The directory tree under that location should correspond to your hardware, e.g.:

/sys/devices/system/edac/mc/mc0/csrow2/power
/sys/devices/system/edac/mc/mc0/csrow0/power
/sys/devices/system/edac/mc/mc0/dimm2/power
/sys/devices/system/edac/mc/mc0/dimm0/power
/sys/devices/system/edac/mc/mc1/power
...
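
As a rough sketch of reading that tree directly: each mcN directory exposes ce_count and ue_count files holding the controller's correctable and uncorrectable error totals. The function name edac_sysfs_counts and the optional root argument are assumptions made for illustration, not part of any package.

```shell
#!/bin/sh
# edac_sysfs_counts: hypothetical helper, not part of edac-utils.
# Prints "<controller> CE=<n> UE=<n>" for every memory controller found
# under the EDAC sysfs root (defaults to the path shown above).
edac_sysfs_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        # Skip entries without readable counters (also covers an unmatched glob).
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        printf '%s CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
```

Calling edac_sysfs_counts without arguments reads the live sysfs tree. It prints nothing if no EDAC driver is loaded.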

Depending on your hardware, you might have to explicitly load the right edac driver, cf.:

find /lib/modules/$(uname -r) -name '*edac*'

The edac-utils package provides a command line frontend and a library for accessing that data, e.g.:

edac-util -rfull          
mc0:csrow0:mc#0memory#0:CE:0
mc0:csrow2:mc#0memory#2:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0
mc1:noinfo:all:UE:0
mc1:noinfo:all:CE:0

You can set up a cron job that periodically calls edac-util and feeds the results into your monitoring system, where you can then configure notifications.
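
One possible shape for such a check, as a sketch only: the function below reads edac-util -rfull output on stdin, sums the counters, and returns the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The name check_edac and the warn-on-any-correctable-error policy are assumptions. It also assumes every line ends in ":CE:<count>" or ":UE:<count>", as in the sample output above.

```shell
#!/bin/sh
# check_edac: hypothetical Nagios-style check, not shipped with edac-utils.
# Reads `edac-util -rfull` output on stdin; assumes each line ends in
# ":<CE|UE>:<count>" as in the sample output above.
check_edac() {
    totals=$(awk -F: '
        NF >= 2 && $(NF-1) == "CE" { ce += $NF }
        NF >= 2 && $(NF-1) == "UE" { ue += $NF }
        END { printf "%d %d", ce, ue }
    ')
    ce=${totals% *}   # first number: correctable errors
    ue=${totals#* }   # second number: uncorrectable errors
    if [ "$ue" -gt 0 ]; then
        echo "CRITICAL: $ue uncorrectable, $ce correctable ECC errors"
        return 2
    elif [ "$ce" -gt 0 ]; then
        echo "WARNING: $ce correctable ECC errors"
        return 1
    fi
    echo "OK: no ECC errors reported"
    return 0
}
```

A cron job or a Nagios/Icinga command definition could then run edac-util -rfull | check_edac and act on the exit status.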

In addition to that, running mcelog is generally a good idea. Depending on the system, both correctable and uncorrectable ECC errors are likely to be reported as machine check exceptions (MCEs) as well. Even brief periods of CPU throttling due to high temperature are reported as MCEs.
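
For the "what to look for" part, a minimal sketch of a kernel log filter follows. The helper name ecc_log_filter is hypothetical, and the pattern is an assumption based on the "EDAC" and "[Hardware Error]" prefixes the kernel commonly uses for these reports. It is not an exhaustive list, and the exact message wording varies by driver and kernel version.

```shell
#!/bin/sh
# ecc_log_filter: hypothetical helper. Prints kernel log lines from stdin
# that look like EDAC or machine-check reports. The pattern is an
# assumption and may need adjusting for a given kernel or driver.
ecc_log_filter() {
    grep -E 'EDAC|\[Hardware Error\]|[Mm]achine [Cc]heck'
}
```

Typical use would be dmesg | ecc_log_filter. As with any grep, the exit status is 1 when nothing matches, which in this case means no such messages were found.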

Related Solutions

Linux – ECC chipkill errors: which DIMM

In addition to using the EDAC codes, you can use the CLI-only HP utilities to determine this while the machine is online. The CLI versions are far more lightweight than the web-based ones and do not require you to open ports or keep a daemon running.

hpasmcli will give you the cartridge and module numbers of the failed modules, which is a little quicker than analyzing EDAC output.

Example:

hpasmcli -s "show dimm"

DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 2
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 3
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 4
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Status will change for failed modules.
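
To spot a changed status without reading the whole listing, the output can be filtered. The sketch below assumes the "Key: value" layout shown in the example. The helper name failed_dimms is hypothetical, and it simply treats any Status other than "Ok" as a problem.

```shell
#!/bin/sh
# failed_dimms: hypothetical helper, not part of the HP utilities.
# Reads `hpasmcli -s "show dimm"` output on stdin and prints the cartridge
# and module number of every entry whose Status is not "Ok".
failed_dimms() {
    awk -F': *' '
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }   # trim stray whitespace
        $1 == "Cartridge #" { cart = $2 }
        $1 == "Module #"    { mod = $2 }
        $1 == "Status" && $2 != "Ok" {
            printf "cartridge %s module %s: %s\n", cart, mod, $2
        }
    '
}
```

Run as hpasmcli -s "show dimm" | failed_dimms, it prints nothing while all modules report Ok.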

Linux Bash – How to Sort du -h Output by Size

As of GNU coreutils 7.5, released in August 2009, sort supports a -h option, which compares values with the human-readable size suffixes produced by du -h:

du -hs * | sort -h

If your sort does not support -h, you can install GNU coreutils. For example, on an older Mac OS X:

brew install coreutils
du -hs * | gsort -h

From the sort manual:

-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
