ECC memory errors causing random server reboots

ecc memory supermicro

I'm running Ubuntu Server 14.04 on a Supermicro X10SLM-F with a Xeon E3-1271 v3.

Memory: SuperTalent 32GB DDR3 1600 ECC

About every 4 days, the logs on Ubuntu will show this:

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:  fru_text: CorrectedErr
{1}[Hardware Error]:   section_type: memory error
[Firmware Warn]: error section length is too small

Immediately after this the server reboots itself in a "power-cycle" fashion.

When I look in the BIOS event log, I see this:

DATE            TIME           ERROR CODE      SEVERITY
06/13/15      13:13:38      Smbios 0x02         P1-DIMMB2

And the description of the error is:

Single Bit ECC Memory Error

ipmitool in Ubuntu shows this:

ipmitool sel elist
...
...
  1a | 06/13/2015 | 13:13:39 | Memory | Correctable ECC | Asserted | CPU 0 DIMM 8
  1b | 06/13/2015 | 13:13:39 | Memory | Uncorrectable ECC | Asserted | CPU 0 DIMM 8

A few questions:

  1. If the ECC memory is self correcting, why does the machine reboot itself?

  2. Am I, perhaps, missing some setting in the BIOS that will stop the box from rebooting itself?

  3. Is this obviously a memory stick issue or can this be a slot issue or a CPU issue?

  4. How do I stop the server from rebooting?

Thank you for any advice.

Best Answer

The system should not reboot on a correctable memory error. Do you see any additional information or pattern in the output of ipmitool sel elist? The BMC watchdog could be rebooting the system; check whether it is enabled with ipmitool mc watchdog get. Since you already know the location of the bad memory module, replace it. If the problem appears again, the memory slot could be at fault.
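
The two checks above could be scripted roughly as follows. This is only a sketch: it assumes ipmitool is installed and can talk to the local BMC (usually as root), and the helper name ecc_events is made up for illustration. It does nothing more than filter the event log for lines that mention ECC.

```shell
#!/bin/sh
# Hypothetical helper (not part of ipmitool): keep only the SEL lines
# that mention ECC. Reads `ipmitool sel elist` output on stdin.
ecc_events() {
    grep -i 'ECC'
}

# Query the BMC only when ipmitool is actually available.
if command -v ipmitool >/dev/null 2>&1; then
    # `|| true` keeps the script going when no ECC lines are found.
    ipmitool sel elist | ecc_events || true
    # Show the BMC watchdog state (timer use, action, countdown).
    ipmitool mc watchdog get || true
fi
```

If the filtered output keeps naming the same DIMM, that supports swapping that module first.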

For the X10SLM-F, the RAM you are using is not on the list of tested memory modules. If possible, replace all the memory modules in a 'problem' system with equivalent Supermicro-tested ones. Also, check the list of supported operating systems for your Ubuntu version.

As for the CMOS settings, you could use Supermicro SUM (provided you have the SUM keys installed) to dump the BIOS settings from all the systems, then vimdiff the dumps to see whether any CMOS parameter differs between the systems that regularly reboot and the system(s) that do not.

sum -i <IP Address of the BMC> -u <BMC user> -p <BMC password> -c GetCurrentBiosCfg --file myconf.conf
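
Once each system's settings have been dumped to a file with the command above, the comparison could look like this sketch. The helper name bios_cfg_diff and the file names good.conf and bad.conf are hypothetical, and plain diff is used here as a non-interactive stand-in for vimdiff.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that differ between two BIOS
# config dumps as a unified diff. Prints nothing when they are identical.
# Note: diff exits with status 1 when the files differ.
bios_cfg_diff() {
    diff -u "$1" "$2"
}

# Example usage (file names are placeholders for two dumped configs):
# bios_cfg_diff good.conf bad.conf
```

Lines starting with - come from the first dump and lines starting with + from the second, so any setting that appears in both forms is a candidate for closer inspection.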