Electronic – the probability of a bit error occurring in modern computers

error correction, memory, microcontroller, ram, signal processing

What is the probability of a bit error occurring when reading from or writing to the latest storage and memory technologies (SSD, HDD, RAM) in modern computers? If the same terminology applies here as in networking and communication systems, then my question can be rephrased as:

What is the bit error rate (BER) when reading from or writing to memory in modern computers?

As a follow-up question, how do computers deal with such bit errors?

Best Answer

Computers handle this in one of two ways. Either the software is designed to fail in a controlled manner (for example, a blue screen) when critical data structures are corrupted, or the machine uses ECC memory, which stores data with some redundancy so that a single-bit error can be corrected and a double-bit error can be detected. A detected double-bit error cannot be repaired, so there is no recourse other than a hard shutdown.

There is some literature analyzing the actual numbers in the context of datacenters. One study tracks the number of detected-and-corrected errors, showing that, at scale, 2% of machines encountered a recoverable (typically single-bit) ECC fault, while a tiny number encountered unrecoverable (double-bit) faults.

Google's tests, shown on slide 11 of this slide deck, reported that 32% of machines would report a correctable error per year, while around 1.3% reported uncorrectable errors.

However, it is almost certainly incorrect to treat these bit errors as a randomly distributed error rate (as you would for an AWGN communication channel). Because DRAM may have manufacturing imperfections, in practice 20% of servers contribute 90% of the errors, and errors are also likely clustered in time. Google reports strong clustering and age-dependence for their correctable errors, while uncorrectable errors show anomalous behavior (according to the paper, Google HW ops will pull RAM for replacement as soon as an uncorrectable error arises).

This makes it difficult to compute a meaningful bit error rate that could be expected of any one random computer. There are also other issues with extrapolating these values, since enterprise customers with ECC RAM are going to have very different supply chains, parts, and workloads than home computers. Furthermore, if ECC RAM is made with a different grade of chips than non-ECC RAM, then there will be a systematic bias, because most of these large-scale studies come from ECC-RAM-based datacenter fleets.
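As a rough illustration of why a "fraction of machines affected" figure does not pin down an error rate, the sketch below does two things. It first computes the mean number of events per machine-year that the 32% figure would imply if errors were independent Poisson events, which is exactly the assumption argued against above. It then compares two fleets that show the same 32% incidence but have very different total error counts. Both fleets are invented for this example and are not taken from any study.

```python
import math


def mean_events_if_poisson(fraction_affected: float) -> float:
    """Mean events per machine-year implied by P(at least one event),
    under the (unrealistic) assumption of independent Poisson errors."""
    return -math.log(1.0 - fraction_affected)


def fraction_affected(counts: list[int]) -> float:
    """Share of machines that saw at least one error."""
    return sum(1 for c in counts if c > 0) / len(counts)


# 32% of machines affected  ->  about 0.39 events per machine-year if Poisson
lam = mean_events_if_poisson(0.32)
print(f"implied Poisson mean: {lam:.3f} events per machine-year")

# Two hypothetical 100-machine fleets (made-up per-machine error counts).
uniform_fleet = [1] * 32 + [0] * 68                # every affected machine: 1 error
clustered_fleet = [1] * 26 + [500] * 6 + [0] * 68  # a few bad machines dominate

for name, fleet in [("uniform", uniform_fleet), ("clustered", clustered_fleet)]:
    print(name, fraction_affected(fleet), sum(fleet))
```

Both fleets report the same incidence, yet the total number of errors differs by roughly two orders of magnitude, so any per-bit rate derived from incidence alone depends entirely on the distribution you assume.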