I've seen a discussion about ECC RAM use on servers. Why is it better?
ECC RAM – What is ECC RAM and Why is it Better?
Related Solutions
As data is written to RAM, sits there, and is eventually read back, some corruption naturally occurs (theories vary, but the one with the most weight right now is EMI from the computer itself). ECC is a feature of RAM and motherboards that allows this corruption to be detected and corrected.
The corruption is usually pretty minor (typical ECC can correct a single flipped bit and detect, but not correct, two flipped bits per 64-bit "word" - and that's way beyond the typical error rates), but it becomes more frequent as the density of the RAM increases. Your average workstation/PC will never notice it. On a server where you're running high-density RAM 24/7 in a high-demand environment serving critical services, you take every step you possibly can to prevent stuff from breaking.
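To make the "correct one bit, detect two" behaviour concrete, here is a minimal sketch of a textbook extended Hamming (SECDED) code in Python. It is only an illustration: the function names `encode` and `decode` are made up for this example, and real memory controllers implement their own codes in hardware, which may differ from this one. With 64 data bits the sketch produces a 72-bit codeword.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits into an extended Hamming codeword.

    Index 0 holds an overall parity bit; indices that are powers of two
    hold Hamming parity bits; all other indices hold data bits.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):               # parity bit 2**i covers positions with bit i set
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity
    overall = 0
    for bit in code[1:]:
        overall ^= bit
    code[0] = overall                # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, data_bits); data_bits is None if uncorrectable."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        code[syndrome] ^= 1          # odd parity: treat as one flip and fix it
        status = "corrected"
    else:
        return "uncorrectable", None  # e.g. two flips: detected, not fixable
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


if __name__ == "__main__":
    word = [1, 0] * 32               # 64 data bits
    stored = encode(word)            # 72 bits, like an ECC memory word
    stored[10] ^= 1                  # simulate a single bit flip
    print(decode(stored)[0])         # status is "corrected"
    stored[20] ^= 1                  # a second flip in the same word
    print(decode(stored)[0])         # status is "uncorrectable"
```

The overall parity bit is what separates the two cases: a single flip makes the total parity odd, while two flips leave it even but with a non-zero syndrome.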
Also note that ECC RAM must be supported by your motherboard, and the average workstation/PC does not support it.
ECC RAM is more expensive than non-ECC, is much more sensitive to clock speeds, and can incur a small (1-2%) performance hit. If it helps, an analogy that works is RAM to RAID controllers. On your PC, that hardware-assisted software RAID built into your chipset is great protection against single disk failures. On a server, that would never be enough. You need high-end, battery-backed fully hardware RAID with onboard RAM to ensure that you don't lose data due to a power outage, disk failure, or whatever.
So no, you don't really need ECC RAM in your workstation. The benefit simply will not justify the price.
Linux kills programs that are using memory pages with bits flipped beyond recovery (that is, an ECC word with two flipped bits) by sending them a SIGBUS signal. It then blacklists the page so that it won't be reused.
When corrected faults keep recurring (typically not the case with transient flips, but with hard faults that persist after correction), the page contents are transparently migrated to another physical page while keeping the same virtual addresses. This is driven by a "leaky bucket" counter that counts ECC errors per page over the last X units of time.
These approaches are respectively called hard and soft page offlining. You can read more and access error statistics/logs through mcelog, a userspace tool that reads the machine check events the kernel has exposed since the 2.6 series. Note that you can set it so that your kernel will panic and reboot the machine at each error, if you so wish.
This also exists under the name of memory page retirement in Solaris systems, and other OSs undoubtedly have their own version of it, though I don't know the names or references off the top of my head.
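The leaky bucket idea can be sketched in a few lines of Python. This is not the kernel's or mcelog's actual code; the class name `LeakyBucket`, the `threshold` and `leak_interval` parameters, and the timestamps are all assumptions chosen for illustration. Each corrected error adds one unit to the page's bucket, the bucket drains steadily over time, and the page becomes a candidate for soft offlining only when errors arrive faster than they drain.

```python
class LeakyBucket:
    """Per-page counter of corrected errors that drains over time (illustrative)."""

    def __init__(self, threshold, leak_interval):
        self.threshold = threshold          # level above which we react
        self.leak_interval = leak_interval  # seconds needed to drain one error
        self.level = 0.0
        self.last_time = None

    def record_error(self, now):
        """Account for one corrected error at time `now` (in seconds).

        Returns True if the bucket overflows, meaning the page looks
        persistently faulty and should be migrated.
        """
        if self.last_time is not None:
            leaked = (now - self.last_time) / self.leak_interval
            self.level = max(0.0, self.level - leaked)
        self.last_time = now
        self.level += 1.0
        return self.level > self.threshold


if __name__ == "__main__":
    buckets = {}                            # physical page number -> bucket
    # Hypothetical event stream: (timestamp in seconds, page number).
    events = [(0, 42), (1, 42), (2, 42), (3, 42), (500, 7)]
    for timestamp, page in events:
        bucket = buckets.setdefault(page, LeakyBucket(threshold=3, leak_interval=10.0))
        if bucket.record_error(timestamp):
            print(f"page {page}: repeated corrected errors, candidate for soft offlining")
```

An isolated transient flip drains away long before the next one arrives, so only a page with a burst of corrected errors crosses the threshold.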
In short, the hardware reports the errors and the OS mitigates their effects. So chances are you won't get a lot of symptoms, but you may ask your OS or tools for statistics.
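As one way of asking the OS for those statistics, here is a small Python sketch that reads the per-memory-controller error counters exposed by the Linux EDAC subsystem in sysfs. This is an alternative to mcelog, not something from the discussion above, and it assumes an EDAC driver for your chipset is loaded so that `/sys/devices/system/edac/mc` exists; on machines without ECC or without the driver it simply returns nothing. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC sysfs.

    `root` is a parameter so the function can be pointed at another
    directory; the default is the usual EDAC location on Linux.
    """
    base = Path(root)
    if not base.is_dir():
        return {}                      # no EDAC driver loaded, or not Linux
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue                   # skip controllers we cannot read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    stats = read_edac_counts()
    if not stats:
        print("No EDAC memory controllers found")
    for name, c in stats.items():
        print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

A corrected count that keeps growing is the kind of symptom you would otherwise never notice, which is why it is worth polling from a monitoring job.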
Best Answer
ECC RAM can recover from small bit errors by using parity bits. Since servers are a shared resource where up-time and reliability are important, ECC RAM is generally used there, at only a modest difference in price. ECC RAM is also used in CAD/CAM workstations, where small bit errors could cause calculation mistakes that become more significant problems when a design goes to manufacturing.