Server Hardware – Is it Necessary to Burn-In RAM for Server-Class Hardware?

Tags: hardware, hp, memory, stress-testing, supermicro

Considering the fact that many server-class systems are equipped with ECC RAM, is it necessary or useful to burn-in the memory DIMMs prior to their deployment?

I've encountered an environment where all server RAM is put through a lengthy burn-in/stress-testing process. This has delayed system deployments on occasion and impacts hardware lead time.

The server hardware is primarily Supermicro, so the RAM is sourced from a variety of vendors rather than directly from the manufacturer, as it would be with a Dell PowerEdge or HP ProLiant.

Is this a useful exercise? In my past experience, I simply used vendor RAM out of the box. Shouldn't the POST memory tests catch DOA memory? I've responded to ECC errors long before a DIMM actually failed, as the ECC thresholds were usually the trigger for warranty replacement.
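To make the "ECC threshold" idea concrete, here is a minimal sketch of a corrected-error check. It assumes a Linux host whose EDAC driver exposes per-controller counters under /sys/devices/system/edac/mc/mc*/ce_count; the function names and the threshold of 100 are hypothetical and not taken from any vendor's warranty policy.

```python
from pathlib import Path

# Assumed location of the Linux EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_corrected_counts(root=EDAC_ROOT):
    """Return {controller_name: corrected_error_count} for each mc* directory."""
    counts = {}
    for ce_file in sorted(Path(root).glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed counters
    return counts


def over_threshold(counts, threshold):
    """Return the names whose corrected-error count meets or exceeds threshold."""
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    CE_THRESHOLD = 100  # hypothetical value, not a vendor figure
    flagged = over_threshold(read_corrected_counts(), CE_THRESHOLD)
    print("controllers over threshold:", flagged or "none")
```

On a host without EDAC support the directory is simply absent, and the script reports that nothing is over the threshold.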

  • Do you burn-in your RAM?
  • If so, what method(s) do you use to perform the tests?
  • Has it identified any problems ahead of deployment?
  • Has the burn-in process resulted in any additional platform stability versus not performing that step?
  • What do you do when adding RAM to an existing running server?

Best Answer

I found a document by Kingston detailing how they work with server memory. I believe this process would normally be the same for most well-known manufacturers. Memory chips, like all semiconductor devices, follow a particular reliability/failure pattern known as the Bathtub Curve:

[Figure: the Bathtub Curve]

Time is represented on the horizontal axis, beginning with the factory shipment and continuing through three distinct time periods:

  • Early Life Failures: Most failures occur during the early usage period. However, as time goes on, the number of failures diminishes quickly. The Early Life Failure period, shown in yellow, is approximately 3 months.

  • Useful Life: During this period, failures are extremely rare. The useful life period is shown in blue and is estimated to be 20+ years.

  • End-of-Life Failures: Eventually, semiconductor products wear out and fail. The End-of-Life period is shown in green.
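One common way to model a curve of this shape (not something the Kingston document itself specifies) is to add three hazard terms: a Weibull hazard with shape below 1 for early-life failures, a constant rate for the useful-life period, and a Weibull hazard with shape above 1 for wear-out. The sketch below uses made-up parameters chosen only to produce the bathtub shape; they are not Kingston's figures.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_months):
    """Illustrative bathtub-shaped hazard; every parameter here is an assumption."""
    early_life = weibull_hazard(t_months, shape=0.5, scale=50.0)   # decreasing
    useful_life = 0.001                                            # constant
    wear_out = weibull_hazard(t_months, shape=5.0, scale=300.0)    # increasing
    return early_life + useful_life + wear_out


if __name__ == "__main__":
    for month in (1, 3, 12, 60, 240, 300):
        print(f"month {month:>3}: hazard = {bathtub_hazard(month):.5f}")
```

With these parameters the rate starts high, drops into a long flat stretch, and rises again late in life, which is the pattern the three periods above describe.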

Because Kingston found that failure rates are highest during the first three months (after which a unit is considered good until its end of life, about 15-20 years later), they designed a test using a unit called the KT2400. It brutally tests server memory modules for 24 hours at 100 degrees Celsius and at high voltage, continuously exercising every cell of every DRAM chip. This level of stress testing has the effect of aging the modules by at least three months, which, as noted above, is the critical period in which most modules show failures.
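To see why aging a module past the early-life period helps, consider the chance that a unit fails during its first year in service, given that it already survived some amount of equivalent aging. The sketch below assumes a Weibull lifetime with a decreasing hazard (shape below 1). The shape and scale values are illustrative assumptions, not data from Kingston, and the model ignores wear-out entirely.

```python
import math


def weibull_survival(t, shape, scale):
    """Probability that a unit is still working at time t."""
    return math.exp(-((t / scale) ** shape))


def failure_prob_in_service(service_months, burn_in_months, shape=0.5, scale=5000.0):
    """P(fail within service_months | survived burn_in_months of equivalent aging)."""
    survived_burn_in = weibull_survival(burn_in_months, shape, scale)
    survived_both = weibull_survival(burn_in_months + service_months, shape, scale)
    return 1.0 - survived_both / survived_burn_in


if __name__ == "__main__":
    without = failure_prob_in_service(12, burn_in_months=0)
    with_burn_in = failure_prob_in_service(12, burn_in_months=3)
    print(f"first-year failure probability, no burn-in:       {without:.4f}")
    print(f"first-year failure probability, 3 months' aging:  {with_burn_in:.4f}")
```

Under these assumptions the burned-in unit has a lower first-year failure probability, because the weakest units have already failed during the test rather than in production. How large the reduction is depends entirely on the assumed parameters.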

The results were:

In March 2004, Kingston began a six-month trial in which 100 percent of its server memory was tested in the KT2400. Results were closely monitored to measure the change in failures. In September 2004, after all the test data was compiled and analyzed, results showed that failures were reduced by 90 percent. These results exceeded expectations and represent a significant improvement for a product line that was already at the top of its class.

So why is burning in server memory not useful? Simply because your manufacturer has already done it!