Linux – EDAC memory errors after upgrading a SuperMicro server running CentOS 7: are these errors caused by the motherboard, the OS, or a broken RAM module?

centos7 linux memory supermicro

I have a server on a SuperMicro MBD-X9DRD-EF motherboard. It ran well under CentOS 7 for a year with one CPU (Intel Original Xeon X6 E5-2620v2) and 128 GB (8×16 GB) of LVDDR memory (1600MHz Crucial ECC Reg RTL (PC3-12800)). Last month we upgraded this server by adding a second CPU and an additional 128 GB of memory, both identical to the existing parts.
But after 3-4 days of intensive use, we started to receive errors like these very frequently:

[root@GBserver log]# dmesg
[614781.869098] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[614781.869104] EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010090
[614781.869106] EDAC sbridge MC1: TSC 0
[614781.869108] EDAC sbridge MC1: ADDR 38126a6c40
[614781.869110] EDAC sbridge MC1: MISC 14066ca86
[614781.869112] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1473082855 SOCKET 1 APIC 20
[614782.595676] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)
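The last line of this output already names the affected DIMM label, and its page and offset fields recombine into the ADDR value reported a few lines earlier. The following is a minimal Python sketch of that reading, assuming 4 KiB pages (the usual x86_64 default). The function name `parse_ce_line` and the regular expression are illustrative only and are not part of any EDAC tool.

```python
import re

# Sample line copied from the dmesg output above.
LINE = ("[614782.595676] EDAC MC1: 1 CE memory read error on "
        "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 "
        "offset:0xc40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 "
        "socket:1 ha:0 channel_mask:1 rank:1)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Hypothetical pattern written against the line format shown above.
CE_RE = re.compile(
    r"EDAC MC(\d+): (\d+) CE .* on (\S+) \("
    r".*?page:(0x[0-9a-f]+) offset:(0x[0-9a-f]+)"
)

def parse_ce_line(line):
    """Return controller, count, DIMM label and address, or None if no match."""
    m = CE_RE.search(line)
    if m is None:
        return None
    mc, count, label, page, offset = m.groups()
    # Physical address = page frame number shifted up, plus offset in the page.
    addr = (int(page, 16) << PAGE_SHIFT) | int(offset, 16)
    return {"mc": int(mc), "count": int(count), "label": label, "addr": addr}

info = parse_ce_line(LINE)
print(info["label"], hex(info["addr"]))
```

The recombined address equals the `ADDR 38126a6c40` value from the machine check record, so both messages describe the same event on memory controller 1.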

And the output of edac-util:

[root@GBserver log]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

Are these errors caused by a motherboard, CPU, or OS fault, or do we have a broken memory chip? What should we do? How can we find the broken memory module?

Best Answer

After 3 weeks, about 11M corrected errors had been logged. I found the broken memory module by looking at the BIOS log. This answers my question.
Next, I will remove the broken module and replace it with another one.
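The per-DIMM counters printed by `edac-util -v` can also be used to narrow down the suspect module before opening the BIOS log. Below is a rough Python sketch that lists the labels with a non-zero corrected error count. The sample text is an excerpt of the output in the question; the function name `dimms_with_errors` and the regular expression are my own assumptions about that output format, not an official interface.

```python
import re

# Excerpt of the `edac-util -v` output from the question.
SAMPLE = """\
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
"""

# Matches only the per-DIMM lines: "<mc>: <csrow>: <label>: <n> Corrected Errors"
LINE_RE = re.compile(r"^(mc\d+): (csrow\d+): (\S+): (\d+) Corrected Errors$")

def dimms_with_errors(text):
    """Return (label, count) pairs with count > 0, highest count first."""
    hits = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m.group(4)) > 0:
            hits.append((m.group(3), int(m.group(4))))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

for label, count in dimms_with_errors(SAMPLE):
    print(f"{label}: {count} corrected errors")
```

On a live system the text could come from something like `subprocess.run(["edac-util", "-v"], capture_output=True, text=True).stdout`. The label only identifies the socket, channel and slot as the kernel sees them; which physical slot that corresponds to depends on the board, so the motherboard manual or the BIOS event log is still needed to pick the module to pull.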