I think one of the memory modules in my server has errors, and I am wondering how I can find out which one it is.
Server model: Supermicro 6072R-EN3RFT
RAM: 128 GB
CentOS 7 with latest updates installed
The mcelog says the following:
:[ 883.230897] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 883.230904] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc0001c7000800c1
:[ 883.230906] EDAC sbridge MC0: TSC 0
:[ 883.230908] EDAC sbridge MC0: ADDR b71b18000
:[ 883.230909] EDAC sbridge MC0: MISC 908401000200e8c
:[ 883.504829] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612575 SOCKET 0 APIC 0
:[ 883.504841] mce: [Hardware Error]: Machine check events logged
:[ 883.606151] EDAC MC0: 7 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb71b18 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[ 899.306134] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 899.306143] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000207000800c1
:[ 899.306145] EDAC sbridge MC0: TSC 0
:[ 899.306148] EDAC sbridge MC0: ADDR c71b19000
:[ 899.306150] EDAC sbridge MC0: MISC 908410000200e8c
:[ 899.306153] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612590 SOCKET 0 APIC 0
:[ 899.306172] mce: [Hardware Error]: Machine check events logged
:[ 899.644814] EDAC MC0: 8 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xc71b19 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[ 901.190512] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
:[ 901.190528] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
:[ 901.190533] {1}[Hardware Error]: event severity: corrected
:[ 901.190538] {1}[Hardware Error]: Error 0, type: corrected
:[ 901.190541] {1}[Hardware Error]: fru_text: CorrectedErr
:[ 901.190546] {1}[Hardware Error]: section_type: memory error
:[ 901.190549] [Firmware Warn]: error section length is too small
:[ 4916.540282] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 4916.540290] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000287000800c1
:[ 4916.540292] EDAC sbridge MC0: TSC 0
:[ 4916.540294] EDAC sbridge MC0: ADDR b743ff000
:[ 4916.540296] EDAC sbridge MC0: MISC 908400800240e8c
:[ 4916.540298] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469616606 SOCKET 0 APIC 0
:[ 4916.540313] mce: [Hardware Error]: Machine check events logged
:[ 4916.540340] EDAC MC0: 10 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb743ff offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
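For reference, the location fields in the "EDAC MC0: ... CE ..." lines can be pulled out with a short script. This is only a minimal sketch that assumes the exact line layout shown in the log above; the function name parse_edac_line and the dictionary keys "mc", "count" and "label" are made up for illustration, and other kernels or EDAC drivers may format these lines differently.

```python
import re

# Assumed layout (taken from the log above):
#   EDAC MC<n>: <number> CE <description> on <label> (<key:value fields>)
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) \((?P<fields>.*)\)"
)

def parse_edac_line(line):
    """Return a dict of the fields in one EDAC CE line, or None if it does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    info = {
        "mc": int(m.group("mc")),        # memory controller index
        "count": int(m.group("count")),  # number printed before "CE"
        "label": m.group("label"),       # e.g. CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
    }
    # Remaining fields are space-separated key:value tokens; tokens without
    # a colon (such as "-" or "OVERFLOW") are skipped.
    for token in m.group("fields").split():
        key, sep, value = token.partition(":")
        if sep:
            info[key] = value
    return info
```

Applied to the lines above, every match reports mc 0, channel 0 and slot 0, which is consistent with the ch0_ce_count counter on mc0 being the only non-zero one.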
I tried the following:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:669
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
Does this mean that I have 8 slots with 16 GB in each, and that the module in the first slot is the one with errors?
Any ideas which one is the memory module with errors? I am not a system administrator so I don't really know how to proceed…
Kind regards
Best Answer
I would expect your DIMM slots to perhaps be labelled BANK A DIMM 0, BANK A DIMM 1, etc. up to BANK B DIMM 3. You could make the assumption that BANK A DIMM 0 is the problem one, and so try swapping it with one of the other 7 (assuming they're all equal) and repeat your tests until it generates an error again. If a different /sys/devices/system/edac/mc/mc?/csrow0/ch?_ce_count counter is incremented then you can be reasonably sure you've found the problem DIMM.
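To make the before/after comparison less error-prone, the counters can be snapshotted and diffed with a small script. This is a rough sketch rather than a finished tool: it assumes the sysfs layout shown in the question (mc*/csrow*/ch*_ce_count), and the function names read_ce_counts and changed_counters are invented here for illustration.

```python
import glob
import os

def read_ce_counts(base="/sys/devices/system/edac/mc"):
    """Return {relative_path: count} for every ch*_ce_count file under base.

    The path pattern mirrors the grep command in the question; base is a
    parameter so the function can also be run against a copy of the tree.
    """
    counts = {}
    pattern = os.path.join(base, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            counts[os.path.relpath(path, base)] = int(fh.read().strip())
    return counts

def changed_counters(before, after):
    """Return {path: increase} for counters that grew between two snapshots."""
    return {
        path: after[path] - before.get(path, 0)
        for path in after
        if after[path] > before.get(path, 0)
    }
```

Take one snapshot with read_ce_counts() after the swap and the reboot, when the counters start from zero again, and take another once errors have been logged. Then pass both to changed_counters(). If the counter that grows is no longer mc0/csrow0/ch0_ce_count, the error followed the DIMM you moved.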