Why have apparent memory bitflips in non-ECC memory not increased?

ecc, error-correction, memory, ram

Back in the early 2000s I remember asking about why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense – if each cell has a 10^-20 probability of suffering a bitflip per second, then 10^9 cells have a 10^-11 probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.

Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
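
To put rough numbers on that scaling, here is a minimal sketch; the per-bit probability is the hypothetical figure from above rather than a measured rate, and the memory sizes are just representative examples:

```
# Back-of-the-envelope sketch of how the expected bitflip rate scales with
# memory size. The per-bit upset probability is a placeholder, not a
# measured value.

PER_BIT_FLIP_PROB_PER_SEC = 1e-20  # hypothetical per-cell probability per second

def expected_flips_per_second(ram_bytes: int) -> float:
    """Expected bitflips per second, assuming cells flip independently."""
    bits = ram_bytes * 8
    return bits * PER_BIT_FLIP_PROB_PER_SEC

old = expected_flips_per_second(1 * 2**30)   # ~1 GB, early-2000s system
new = expected_flips_per_second(64 * 2**30)  # ~64 GB workstation

print(f"old: {old:.2e}/s  new: {new:.2e}/s  ratio: {new / old:.0f}x")
# With the per-cell physics held constant, the expected rate scales
# linearly with the number of cells - here a factor of 64.
```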

The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:

  • Lower operating voltages mean a smaller distinction between a 0 and a 1.
  • Lower gate charge means less overall energy required to flip a bit.
  • More densely packed gates increase the likelihood of being affected by cosmic rays.
  • Refresh timings don't seem to have improved. DDR2 tRFC was around 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh cycle, there isn't much of a trend – it's effectively flat, just with a wider spread either way as time goes on (see the sketch after this list).
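
Here is a minimal sketch of that clocks-to-wall-time conversion. The clock counts are picked from the ranges above and the divisor frequencies are assumed typical bus clocks for each generation, so the resulting nanosecond figures are only illustrative, not datasheet values:

```
# Convert a tRFC value in clock cycles to wall time. Clock counts come from
# the ranges listed above; clock frequencies are assumed typical per
# generation and are illustrative only.

examples = {
    # generation: (tRFC in clocks, clock frequency in MHz)
    "DDR2-800":  (50,  400),
    "DDR3-1600": (110, 800),
    "DDR4-3200": (350, 1600),
}

for gen, (trfc_clocks, clock_mhz) in examples.items():
    # clocks / (cycles per microsecond) gives microseconds; x1000 gives ns
    trfc_ns = trfc_clocks / clock_mhz * 1000
    print(f"{gen}: {trfc_clocks} clocks at {clock_mhz} MHz ~= {trfc_ns:.0f} ns")
```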

But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.

So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.

Best Answer

Single-event upsets (SEU) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles or by high-energy neutrons (caused by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.

Over the years, manufacturers have greatly reduced the threat posed by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. Much of this is likely to be trade secret rather than public information.

And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.