Monitoring physical RAM errors on Linux


I would like to monitor the RAM of two Linux systems (Ubuntu and Red Hat). I realize I can run memtest86 from boot to diagnose bad RAM, but are there any solutions for monitoring RAM while the system is still running? I'm thinking of something like a daemon that writes to and reads back from random unused memory. Has anybody seen something like this before?

Best Answer

Most modern servers of any reasonable quality have an IPMI module that will report bad RAM, usually via SBE (single-bit error) messages from ECC RAM. (You are using ECC RAM in your servers, right?) The IPMI module also monitors and reports on a range of other useful hardware data, such as temperatures, fan speeds, and power supply status.

You can monitor the IPMI module using a variety of network monitoring systems (if you have a management network for the IPMI NICs) or using ipmitool, which is available on most Unix systems. Many vendors (certainly Dell and IBM) also have specialized tools that interrogate the IPMI module for online diagnostics. Check with your hardware vendor for more details.
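As a rough illustration of the ipmitool route, here is a minimal sketch that reads the BMC's System Event Log (SEL) with `ipmitool sel list` and keeps entries that look memory-related. It assumes ipmitool is installed, the local IPMI kernel drivers are loaded (usually requires root), and SEL lines follow the common `id | date | time | sensor | event | direction` layout. The function names `read_sel` and `parse_memory_events` are hypothetical, and matching on a "Memory" sensor or "ECC" in the event text is an assumption, since event wording differs between BMC vendors.

```python
import subprocess
from typing import List


def read_sel() -> str:
    """Return the raw System Event Log text from the local BMC via ipmitool.

    Assumes ipmitool is on PATH and the IPMI drivers are loaded
    (typically needs root). Raises CalledProcessError if ipmitool fails.
    """
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_memory_events(sel_text: str) -> List[List[str]]:
    """Split SEL lines on '|' and keep entries that look memory/ECC related.

    Assumed column order: id, date, time, sensor, event, direction.
    The keyword match is a heuristic; vendors word events differently.
    """
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        sensor, description = fields[3], fields[4]
        if sensor.startswith("Memory") or "ECC" in description:
            events.append(fields)
    return events
```

A cron job or monitoring check could call `parse_memory_events(read_sel())` and raise an alert whenever the returned list is non-empty. Keeping the parsing separate from the subprocess call makes it easy to test against saved SEL output without access to a BMC.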
