What does ECC RAM failure look like?

ecc

For non-ECC memory I have a decent idea of what a failure looks like: random things start going wrong (e.g. a PNG checksum fails validation once and then passes the next time), that sort of thing. But I'm relatively new to ECC RAM. What should I expect when ECC RAM fails? I know that a single flipped bit should just be corrected automatically, but how would I know if there are more serious issues or if the module needs to be replaced?

I found one report that suggested that the system might spontaneously shut off or fail to power on, but it's not clear to me why that would be the case.

Best Answer

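Linux kills any program using a memory page that contains bit flips beyond recovery (for example, two flipped bits within a single ECC word, which a single-error-correcting code can detect but not correct) by sending it a SIGBUS signal. It then blacklists that page so that it won't be reused.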
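To make the "one flip is corrected, two flips are only detected" distinction concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code. It uses an extended Hamming code over 4 data bits purely for illustration; real ECC modules protect much wider words and may use different codes, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    bits = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2            # overall parity over positions 1..7
    return bits


def decode(bits):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    overall = sum(bits) % 2          # 0 for a valid word

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two flips, location unknown.
        return "uncorrectable", None

    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[6] ^= 1
    two_flips = list(word)
    two_flips[2] ^= 1
    two_flips[5] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

The "uncorrectable" branch is the situation in which the OS has no good data to hand back, which is why the affected process gets a SIGBUS instead of silently corrupted memory.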

When corrected faults recur on the same page (typically not the case with transient flips, but common with hard faults that persist after correction), the page's contents are transparently migrated to another physical page while keeping the same virtual addresses. This is driven by a "leaky bucket" counter that counts ECC errors per page over the last X units of time.
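The leaky-bucket idea can be sketched as follows. This is a generic illustration, not the actual kernel or mcelog implementation: the class name `PageErrorTracker`, the threshold, and the drain time are all assumptions chosen for the example. Each corrected error adds one unit to the page's bucket, the bucket drains steadily over time, and a page is flagged only when errors arrive faster than they drain.

```python
class PageErrorTracker:
    """Toy per-page leaky bucket for corrected ECC errors (illustrative only)."""

    def __init__(self, threshold, drain_seconds):
        self.threshold = threshold          # bucket level that triggers offlining
        self.drain_seconds = drain_seconds  # time for one error to leak out
        self.buckets = {}                   # page address -> (level, last_seen)

    def record_error(self, page, now):
        """Record one corrected error; return True if the page should be offlined."""
        level, last_seen = self.buckets.get(page, (0.0, now))
        # Drain the bucket for the time elapsed since the previous error.
        leaked = (now - last_seen) / self.drain_seconds
        level = max(0.0, level - leaked) + 1.0
        self.buckets[page] = (level, now)
        return level >= self.threshold


if __name__ == "__main__":
    tracker = PageErrorTracker(threshold=3, drain_seconds=60)
    # A burst of errors on one page fills the bucket quickly...
    print([tracker.record_error(0x1000, t) for t in (0, 1, 2, 3)])
    # ...while the same number of errors spread over hours never does.
    print([tracker.record_error(0x2000, t) for t in (0, 3600, 7200, 10800)])
```

The point of the draining is that an occasional transient flip never accumulates, whereas a cell with a hard fault keeps refilling the bucket until the page is retired.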

These approaches are called hard and soft page offlining, respectively. You can read more and access error statistics and logs through mcelog, a userspace tool that decodes the machine check events reported by the kernel; kernel support for it goes back to the 2.6 series. Note that you can also set things up so that your kernel panics and reboots the machine on each error, if you so wish.

This also exists under the name of memory page retirement on Solaris systems, and other OSes undoubtedly have their own version of it, though I don't know the names or references off the top of my head.

In short, the hardware reports the errors and the OS mitigates their effects. So chances are you won't get a lot of symptoms, but you may ask your OS or tools for statistics.
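As one way of asking the OS for those statistics, here is a minimal sketch that reads the per-memory-controller error counters Linux exposes through the EDAC sysfs interface. It assumes an EDAC driver for your memory controller is loaded, in which case files such as `/sys/devices/system/edac/mc/mc0/ce_count` (corrected) and `ue_count` (uncorrected) exist; on systems without EDAC it simply returns nothing. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable: skip this controller
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    stats = read_edac_counts()
    if not stats:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, c in stats.items():
        print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

A corrected count that keeps climbing between checks is the usual hint that a module has a hard fault and is worth replacing, while any uncorrected error deserves immediate attention.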
