How do I get notified when a Linux machine equipped with ECC memory detects a memory failure? I'm interested in both correctable and uncorrectable errors.
- if a message is written to dmesg/the syslog, this is already fine, but I'd love to know what to look for
- installing additional daemons (like smartmontools for hard drives) is acceptable
- Nagios/Icinga monitoring would be another way to go
- not all machines to be monitored have IPMI
The systems of interest have Supermicro boards (X9SCM-F). I'm also curious about an HP N54L Microserver, but don't care too much about it. All systems run Debian or Ubuntu Linux.
Best Answer
The Linux kernel supports the error detection and correction (EDAC) features of some chipsets. On a supported system with ECC the status of your memory controller is accessible via sysfs:
The directory tree under that location should correspond to your hardware, e.g.:
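As a rough sketch of reading those counters from a script: the snippet below assumes the EDAC tree lives under /sys/devices/system/edac/mc and that each memory controller directory (mc0, mc1, ...) exposes ce_count and ue_count files for correctable and uncorrectable errors. The function name read_edac_counts is made up for illustration, and the exact layout may differ on your kernel and hardware.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller tree in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for every mc* directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter file missing or unreadable: skip this controller.
            continue
        counts[mc_dir.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (driver not loaded?)")
    for name, c in result.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

The root directory is a parameter so the function can be pointed at a copy of the tree for testing.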
Depending on your hardware, you might have to explicitly load the right edac driver, cf.:
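One way to check from a script whether an EDAC driver is loaded at all is to look at /proc/modules. The sketch below assumes that the relevant driver has "edac" in its module name (for example sb_edac or edac_core), which holds for the drivers I know of but is not guaranteed. The helper name loaded_edac_modules is hypothetical, and drivers compiled directly into the kernel will not show up in this list.

```python
def loaded_edac_modules(proc_modules="/proc/modules"):
    """Return names of loaded kernel modules whose name contains 'edac'."""
    names = []
    with open(proc_modules) as fh:
        for line in fh:
            fields = line.split()
            # The first whitespace-separated field is the module name.
            if fields and "edac" in fields[0]:
                names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        modules = loaded_edac_modules()
    except OSError:
        modules = []
    if modules:
        print("loaded EDAC modules:", ", ".join(modules))
    else:
        print("no EDAC module loaded (or built into the kernel)")
```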
The edac-utils package provides a command line frontend and a library for accessing that data, e.g.:
You can set up a cron job that periodically calls edac-util and feeds the results into your monitoring system, where you can then configure notifications.
In addition to that, running mcelog is generally a good idea. Depending on the system, correctable and uncorrectable ECC errors are likely reported as machine check exceptions (MCEs) as well. Even brief periods of CPU throttling due to high temperature are reported as MCEs.
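Since Nagios/Icinga was mentioned in the question, here is a minimal sketch of how the periodic check could look as a plugin that reads the counters straight from sysfs instead of parsing edac-util output. It assumes the /sys/devices/system/edac/mc layout with per-controller ce_count and ue_count files, and it uses the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the policy (any correctable error is a warning, any uncorrectable error is critical) are my own assumptions, so adjust them to taste.

```python
from pathlib import Path

# Nagios/Icinga plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sum_counts(root="/sys/devices/system/edac/mc"):
    """Sum ce_count and ue_count over all mc* directories under root.

    Returns (ce_total, ue_total, controllers_seen).
    """
    ce_total = ue_total = seen = 0
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters cannot be read.
            continue
        ce_total += ce
        ue_total += ue
        seen += 1
    return ce_total, ue_total, seen


def evaluate(ce_total, ue_total, seen):
    """Map counters to (exit_code, status_line); the policy is an assumption."""
    if seen == 0:
        return UNKNOWN, "ECC UNKNOWN - no EDAC memory controller found"
    if ue_total > 0:
        return CRITICAL, (
            f"ECC CRITICAL - {ue_total} uncorrectable, "
            f"{ce_total} correctable errors"
        )
    if ce_total > 0:
        return WARNING, f"ECC WARNING - {ce_total} correctable errors"
    return OK, "ECC OK - no errors recorded"


def main(root="/sys/devices/system/edac/mc"):
    """Print the status line and return the plugin exit code."""
    code, message = evaluate(*sum_counts(root))
    print(message)
    return code

# When installed as a plugin, finish the script with: raise SystemExit(main())
```

Keep in mind that the counters are cumulative, so a single correctable error keeps the check in WARNING until the counters are reset (as far as I know, that happens on reboot or when the driver is reloaded). If that is too noisy, compare against the previous reading instead.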