A Hamming code is a particular kind of error-correcting code (ECC) that allows single-bit errors in code words to be corrected. Such codes are used in data transmission or data storage systems in which it is not feasible to use retry mechanisms to recover the data when errors are detected. This type of error recovery is also known as forward error correction (FEC).

**Constructing a Hamming code to protect, say, a 4-bit data word**

Hamming codes are relatively easy to construct because they're based on parity logic. Each check bit is a parity bit for a particular subset of the data bits, and they're arranged so that the pattern of parity errors directly indicates the position of the bit error.

It takes three check bits to protect four data bits (the reason for this will become apparent shortly), giving a total of 7 bits in the encoded word. If you number the bit positions of an 8-bit word in binary, you see that there is one position that has no "1"s in its column, three positions that have a single "1" each, and four positions that have two or more "1"s.

If the four data bits are called A, B, C and D, and our three check bits are X, Y and Z, we place them in the columns such that the check bits are in the columns with one "1" and the data bits are in the columns with more than one "1". The bit in position 0 is not used.

```
Bit position:  7 6 5 4 3 2 1 0
  in binary:   1 1 1 1 0 0 0 0
               1 1 0 0 1 1 0 0
               1 0 1 0 1 0 1 0
        Bit:   A B C X D Y Z -
```

The check bit X is set or cleared so that all of the bits with a "1" in the top row — A, B, C and X — have even parity. Similarly, the check bit Y is the parity bit for all of the bits with a "1" in the second row (A, B and D), and the check bit Z is the parity bit for all of the bits with a "1" in the third row (A, C and D).
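The encoding step above can be sketched in a few lines of Python (the function name is mine, not from any standard library; bits are passed as integers 0 or 1):

```python
def hamming74_encode(a, b, c, d):
    """Encode data bits A, B, C, D into a 7-bit codeword.

    Check bits use even parity over the groups from the table:
      X covers A, B, C
      Y covers A, B, D
      Z covers A, C, D
    Returns the bits in transmission order A B C D X Y Z.
    """
    x = a ^ b ^ c
    y = a ^ b ^ d
    z = a ^ c ^ d
    return [a, b, c, d, x, y, z]
```

For all-zero data this produces the all-zero codeword, consistent with the example further down.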

Now all seven bits — the codeword — are transmitted (or stored), usually reordered so that the data bits appear in their original sequence: A B C D X Y Z. When they're received (or retrieved) later, the data bits are put through the same encoding process as before, producing three new check bits X', Y' and Z'. If the new check bits are XOR'd with the received check bits, an interesting thing occurs. If there's no error in the received bits, the result of the XOR is all zeros. But if there's a single bit error in any of the seven received bits, the result of the XOR is a nonzero three-bit number called the "syndrome" that directly indicates the position of the bit error as defined in the table above. If the bit in this position is flipped, then the original 7-bit codeword is perfectly reconstructed.

A couple of examples will illustrate this. Let's assume that the data bits are all zero, which also means that all of the check bits are zero as well. If bit "B" is set in the received word, then the recomputed check bits X'Y'Z' (and the syndrome) will be 110, which is the bit position for B. If bit "Y" is set in the received word, then the recomputed check bits will be "000", and the syndrome will be "010", which is the bit position for Y.
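The syndrome computation and the correction step can be sketched as follows (a minimal illustration; the function names and the position-to-index map are my own, derived from the bit-position table above):

```python
def hamming74_syndrome(a, b, c, d, x, y, z):
    """Recompute the check bits from the received data bits and XOR
    them with the received check bits. The 3-bit result X'Y'Z' is the
    bit position of a single-bit error, or 0 if no error occurred."""
    xp = a ^ b ^ c          # X covers A, B, C
    yp = a ^ b ^ d          # Y covers A, B, D
    zp = a ^ c ^ d          # Z covers A, C, D
    return ((xp ^ x) << 2) | ((yp ^ y) << 1) | (zp ^ z)

# Map the syndrome (a bit position from the table) to an index in the
# transmitted order A B C D X Y Z.
POSITION_TO_INDEX = {7: 0, 6: 1, 5: 2, 3: 3, 4: 4, 2: 5, 1: 6}

def hamming74_correct(word):
    """Flip the bit named by the syndrome, if any, and return the
    repaired 7-bit codeword as a list."""
    s = hamming74_syndrome(*word)
    fixed = list(word)
    if s:
        fixed[POSITION_TO_INDEX[s]] ^= 1
    return fixed
```

Running it on the examples above: the all-zeros codeword with bit B flipped gives a syndrome of 0b110 = 6 (B's position), and with check bit Y flipped it gives 0b010 = 2, exactly as described.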

Hamming codes get more efficient with larger codewords. Basically, you need enough check bits that the syndrome can enumerate all of the data bits, plus the check bits themselves, plus one "no error" pattern. Therefore, four check bits can protect up to 11 data bits, five check bits can protect up to 26 data bits, and so on. Eventually you get to the point where, if you have 8 bytes of data (64 bits) with a parity bit on each byte, those eight parity bits are enough to do ECC on the 64 bits of data instead.
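That counting rule — 2^N syndromes must cover the data bits, the check bits, and the no-error case — can be checked with a tiny helper (hypothetical name, written for this answer):

```python
def check_bits_needed(m):
    """Smallest N such that 2**N >= m + N + 1: the 2**N syndrome
    values must name every data bit, every check bit, and the
    "no error" case."""
    n = 1
    while 2 ** n < m + n + 1:
        n += 1
    return n
```

It reproduces the figures above: 3 check bits for 4 data bits, 4 for 11, 5 for 26, and only 7 for 64.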

**Different (but equivalent) Hamming codes**

Given a specific number N of check bits, there are 2^{N} equivalent Hamming codes that can be constructed by arbitrarily choosing each check bit to have either "even" or "odd" parity within its group of data bits. As long as the encoder and the decoder use the same definitions for the check bits, all of the properties of the Hamming code are preserved.

Sometimes it's useful to define the check bits so that an encoded word of all-zeros or all-ones is always detected as an error.
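As a sketch of that idea, here is the same syndrome computation with Z redefined to use odd parity over its group (an arbitrary but legal choice; the function name is mine):

```python
def syndrome_odd_z(a, b, c, d, x, y, z):
    """Syndrome for a variant code in which Z uses ODD parity
    over A, C, D; X and Y keep even parity as before."""
    xp = a ^ b ^ c
    yp = a ^ b ^ d
    zp = a ^ c ^ d ^ 1   # odd parity: complement of the even-parity bit
    return ((xp ^ x) << 2) | ((yp ^ y) << 1) | (zp ^ z)
```

With this definition the all-zeros and all-ones 7-bit words both produce a nonzero syndrome, so they are always flagged as errors rather than silently accepted; the valid encoding of all-zero data becomes 0 0 0 0 0 0 1.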

**What happens when multiple bits get flipped in a Hamming codeword**

Multiple bit errors in a Hamming code cause trouble. Two bit errors will always be detected as an error, but the wrong bit will get flipped by the correction logic, resulting in gibberish. If there are more than two bits in error, the received codeword may appear to be a valid one (but different from the original), which means that the error may or may not be detected.

In any case, the error-correcting logic can't tell the difference between single bit errors and multiple bit errors, and so the corrected output can't be relied on.

**Extending a Hamming code to detect double-bit errors**

Any single-error correcting Hamming code can be extended to reliably detect double bit errors by adding one more parity bit over the entire encoded word. This type of code is called a SECDED (single-error correcting, double-error detecting) code. It can always distinguish a double bit error from a single bit error, and it detects more types of multiple bit errors than a bare Hamming code does.

It works like this: All valid code words are (a minimum of) Hamming distance 3 apart. The "Hamming distance" between two words is defined as the number of bits in corresponding positions that are different. Any single-bit error is distance one from a valid word, and the correction algorithm converts the received word to the nearest valid one.
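The distance definition is trivial to express in code (a throwaway helper, not from any library):

```python
def hamming_distance(u, v):
    """Number of positions in which two equal-length bit sequences differ."""
    return sum(a != b for a, b in zip(u, v))
```

For instance, the (7,4) codewords for data 0000 and 0001 — 0000000 and 0001011 — differ in 3 positions, the minimum distance of the code.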

If a double error occurs, the parity of the word is not affected, but the correction algorithm still corrects the received word, which is distance two from the original valid word, but distance one from some other valid (but wrong) word. It does this by flipping one bit, which may or may not be one of the erroneous bits. Now the word has either one or three bits flipped, and the original double error is now detected by the parity checker.

Note that this works even when the parity bit itself is involved in a single-bit or double-bit error. It isn't hard to work out all the combinations.
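The usual SECDED decision table — a sketch of the logic described above, not taken verbatim from any particular implementation — looks like this:

```python
def secded_classify(syndrome, overall_parity_ok):
    """Classify a received SECDED word.

    `syndrome` is the inner Hamming code's syndrome (0 = clean);
    `overall_parity_ok` says whether the extra whole-word parity
    bit checks out.
    """
    if syndrome == 0 and overall_parity_ok:
        return "no error"
    if syndrome != 0 and not overall_parity_ok:
        return "single-bit error (correctable via the syndrome)"
    if syndrome == 0 and not overall_parity_ok:
        return "error in the overall parity bit itself"
    return "double-bit error (detected, not correctable)"
```

The last case is the key one: a double error leaves the overall parity intact but yields a nonzero syndrome, so it is reported rather than mis-corrected.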

## Best Answer

Computers handle this by either using software that's designed to blue screen in a controlled manner when critical data structures are corrupted, or by using ECC memory, which stores data with some redundancy to allow a single bit error to be corrected and a double-bit error to be detected with no recourse other than a hard shutdown.

There is some literature analyzing the actual numbers in the context of datacenters. One study tracks the number of detected-and-corrected errors, showing that, at scale, about 2% of machines encountered a recoverable (typically single-bit) ECC fault, while a tiny number encountered unrecoverable (double-bit) faults.

Google's measurements, shown on slide 11 of this slide deck, reported that about 32% of machines saw a correctable error per year, while around 1.3% reported uncorrectable errors.

However, it is almost certainly incorrect to model these bit errors as a randomly-distributed error rate (as you would for an AWGN communication channel). Because DRAM may have manufacturing imperfections, in practice 20% of servers contribute 90% of the errors, and errors are also likely to cluster in time. Google reports strong clustering and age-dependence for their correctable errors, while uncorrectable errors behave anomalously (according to the paper, Google's hardware ops teams pull RAM for replacement as soon as an uncorrectable error arises).

This makes it difficult to compute a meaningful bit error rate that could be expected of any one random computer. There are also other issues with extrapolating these values, since enterprise customers with ECC RAM are going to have very different supply chains, parts, and workloads than home computers. Furthermore, if ECC RAM is made with a different grade of chips than non-ECC RAM, then there will be a systematic bias, because most of these large-scale studies come from ECC-RAM-based datacenter fleets.