ECC RAM can recover from small bit errors by utilizing parity bits. Since servers are a shared resource where up-time and reliability are important, ECC RAM is generally used, with only a modest difference in price. ECC RAM is also used in CAD/CAM workstations, where small bit errors could cause calculation mistakes which become much more significant problems when a design goes to manufacturing.
It's really, really, really hard. It requires a very complete audit. If you're very sure the departed admin left something behind that'll go boom, or that will require their re-hire because they're the only one who can put out a fire, then it's time to assume you've been rooted by a hostile party. Treat it like a group of hackers came in and stole stuff, and you have to clean up after their mess. Because that's what it is.
- Audit every account on every system to ensure it is associated with a specific entity.
  - Accounts that appear to be associated with systems but that no one can account for should be mistrusted.
  - Accounts that aren't associated with anything need to be purged (this needs to be done anyway, but it is especially important in this case).
- Change any and all passwords they might conceivably have come into contact with.
  - This can be a real problem for utility accounts, as those passwords tend to get hard-coded into things.
  - If they were a helpdesk type responding to end-user calls, assume they have the password of anyone they assisted.
  - If they had Enterprise Admin or Domain Admin rights in Active Directory, assume they grabbed a copy of the password hashes before they left. These can be cracked so fast now that a company-wide password change will need to be forced within days.
  - If they had root access to any *nix boxes, assume they walked off with the password hashes.
  - Review all public-key SSH key usage to ensure their keys are purged, and audit whether any private keys were exposed while you're at it.
  - If they had access to any telecom gear, change any router/switch/gateway/PBX passwords. This can be a royal pain, as it can involve significant outages.
- Fully audit your perimeter security arrangements.
  - Ensure all firewall holes trace to known, authorized devices and ports.
  - Ensure all remote access methods (VPN, SSH, BlackBerry, ActiveSync, Citrix, SMTP, IMAP, WebMail, whatever) have no extra authentication tacked on, and fully vet them for unauthorized access methods.
  - Ensure remote WAN links trace to fully employed people, and verify it. Especially wireless connections. You don't want them walking off with a company-paid cell-modem or smartphone. Contact all such users to ensure they have the right device.
- Fully audit internal privileged-access arrangements. These are things like SSH/VNC/RDP/DRAC/iLO/IPMI access to servers that general users don't have, or any access to sensitive systems like payroll.
- Work with all external vendors and service providers to ensure contacts are correct.
  - Ensure they are eliminated from all contact and service lists. This should be done anyway after any departure, but is extra-important now.
  - Validate that all contacts are legitimate and have correct contact information; this is to find ghosts that can be impersonated.
- Start hunting for logic bombs.
  - Check all automation (task schedulers, cron jobs, UPS call-out lists, or anything that runs on a schedule or is event-triggered) for signs of evil. By "all" I mean all. Check every single crontab. Check every single automated action in your monitoring system, including the probes themselves. Check every single Windows Task Scheduler, even on workstations. Unless you work for the government in a highly sensitive area you won't be able to afford "all"; do as much as you can.
  - Validate key system binaries on every server to ensure they are what they should be. This is tricky, especially on Windows, and nearly impossible to do retroactively on one-off systems.
  - Start hunting for rootkits. By definition they're hard to find, but there are scanners for this.
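The SSH-key sweep above lends itself to simple scripting. Below is a minimal, hypothetical Python sketch: it flags `authorized_keys` entries whose comment field isn't on an approved list. The function name and the approved-list approach are my own illustration, not part of any standard tool, and matching on the comment is deliberately coarse (comments are trivially forged) — a real audit should compare key fingerprints against an inventory.

```python
def unexpected_keys(authorized_keys_text, approved_comments):
    """Return authorized_keys lines whose trailing comment isn't approved.

    A coarse first pass for the SSH-key audit: real audits should match
    key fingerprints, since the comment field can be set to anything.
    """
    suspects = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments in the file itself
        fields = line.split()
        # key lines are: <type> <base64-key> [comment]
        comment = fields[-1] if len(fields) >= 3 else "<no comment>"
        if comment not in approved_comments:
            suspects.append(line)
    return suspects
```

Feed it the contents of each user's `~/.ssh/authorized_keys` (and the system-wide file, if configured) along with the set of comments you expect, and review everything it returns by hand.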
The decision to kick off an audit of this incredible scope needs to be made at a very high level. The decision to treat this as a potential criminal case will be made by your Legal team. If they elect to do some preliminary investigation first, go for it. Start looking.
If you find any evidence, stop immediately.
- Notify your legal team as soon as you find something likely.
- The decision to treat it as a criminal case will be made at that time.
- Further action by untrained hands (you) can spoil evidence and you don't want that, not unless you want the perp to walk free.
- If outside security experts are retained, you are their local expert. Work with them, to their direction. They understand the legal requirements for evidence, you do not.
- There will be a lot of negotiation between the security experts, your management, and legal counsel. That's expected, work with them.
But really, how far do you have to go? This is where risk management comes into play. Simplistically, it is the practice of balancing the expected cost of a loss against the cost of mitigating it. Sysadmins do this when we decide where to put off-site backups: a bank safety-deposit box vs. an out-of-region datacenter. Figuring out how much of this list needs following is an exercise in risk management.
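To make that balancing act concrete, it reduces to a back-of-the-envelope expected-loss calculation. All numbers below are hypothetical, chosen only to illustrate the comparison:

```python
def expected_loss(probability, damage):
    """Expected cost of an incident: P(incident) times the cost if it occurs."""
    return probability * damage

# Hypothetical figures for illustration only.
audit_cost = 50_000       # full forensic audit by outside consultants
p_evil     = 0.02         # estimated chance the departed left a logic bomb
damage     = 5_000_000    # estimated cost if the logic bomb goes off

# Audit when the expected loss exceeds what the audit costs.
decision = "audit" if expected_loss(p_evil, damage) > audit_cost else "accept the risk"
```

With these made-up numbers, the expected loss (0.02 × $5M = $100k) exceeds the $50k audit, so the full circus pays for itself; drop the probability to 0.002 and it no longer does.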
In this case the assessment will start with a few things:
- The expected skill level of the departed
- The access of the departed
- The expectation that evil was done
- The potential damage of any evil
- Regulatory requirements for reporting perpetrated evil vs preemptively found evil. Generally you have to report the former, but not the latter.
The decision of how far down the above rabbit-hole to dive will depend on the answers to these questions. For routine admin departures where expectation of evil is very slight, the full circus is not required; changing admin-level passwords and re-keying any external-facing SSH hosts is probably sufficient. Again, corporate risk-management security posture determines this.
For admins who were terminated for cause, or where evil cropped up after their otherwise normal departure, the circus becomes more necessary. The worst-case scenario is a paranoid BOFH-type who has been notified that their position will be made redundant in 2 weeks, as that gives them plenty of time to get ready; in circumstances like these Kyle's idea of a generous severance package can mitigate all kinds of problems. Even paranoids can forgive a lot of sins after a check containing four months' pay arrives. That check will probably cost less than the security consultants needed to ferret out their evil.
But ultimately, it comes down to the cost of determining if evil was done versus the potential cost of any evil actually being done.
The CMU-Intel paper you cited shows (on page 5) that the error rate depends heavily on the part number / manufacturing date of the DRAM module and varies by a factor of 10-1000. There are also some indications that the problem is much less pronounced in recently (2014) manufactured chips.
The number '9.4x10^-14' that you cited was used in the context of a proposed theoretical mitigation mechanism called "PARA" (that might be similar to an existing mitigation mechanism pTRR (pseudo Target Row Refresh)) and is irrelevant to your question, because PARA has nothing to do with ECC.
A second CMU-Intel paper (page 10) mentions the effects of different ECC algorithms on error reduction (factor 10^2 to 10^5, possibly much more with sophisticated memory tests and "guardbanding").
ECC effectively turns the Row Hammer exploit into a DoS attack. 1-bit errors will be corrected by ECC, and as soon as a non-correctable 2-bit error is detected the system will halt (assuming SECDED ECC).
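That correct-one/detect-two behaviour is easy to see in a toy model. The sketch below is an extended Hamming(8,4) code in Python — the same SECDED idea real ECC DIMMs apply at (72,64) scale, not any vendor's actual implementation:

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (SECDED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for bit in code:
        overall ^= bit               # extra parity bit enables double-error *detection*
    return code + [overall]

def decode(codeword):
    """Return (status, nibble): status is 'ok', 'corrected', or 'uncorrectable'."""
    c = list(codeword[:7])
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single-bit error
    overall = 0
    for bit in codeword:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd parity: a single, correctable flip
        if syndrome:
            c[syndrome - 1] ^= 1
        status = "corrected"
    else:                            # even parity but nonzero syndrome: two flips
        status = "uncorrectable"     # this is where a real SECDED system halts
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return status, nibble
```

Flip one bit of a codeword and decoding reports `"corrected"` with the original data intact; flip two and it reports `"uncorrectable"` — on a real server, that is the halt described above, hence the DoS.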
A solution is to buy hardware that supports pTRR or TRR. See a recent Cisco blog post about Row Hammer. At least some manufacturers seem to have one of these mitigation mechanisms built into their DRAM modules, but keep it deeply hidden in their specs. To answer your question: ask the vendor.
Faster refresh rates (32ms instead of 64ms) and aggressive Patrol Scrub intervals help, too, but would have a performance impact. But I don't know of any server hardware that actually allows fine-tuning these parameters.
I guess there's not much you can do on the operating-system side except terminating suspicious processes with constantly high CPU usage and high cache-miss rates.