Linux – Finding the source of a (memory read) hardware error

linux mcelog memory

When logging into my server, I'm seeing lots of these errors:

Message from syslogd@****** at May 31 20:06:59 ...
 kernel:[500570.908383] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1622484419 SOCKET 0 APIC 0 microcode 71a

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908155] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: c01d8a8000010091

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908278] mce: [Hardware Error]: TSC 0 

Message from syslogd@****** at May 31 20:10:11 ...
 kernel:[500762.908299] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1622484611 SOCKET 0 APIC 0 microcode 71a

Message from syslogd@****** at May 31 20:11:10 ...
 kernel:[500821.884806] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: c01ec00000010091

Message from syslogd@****** at May 31 20:11:10 ...
 kernel:[500821.885130] mce: [Hardware Error]: TSC 0 

And the syslog shows some memory read errors:

May 31 20:35:18 ****** kernel: [502269.884160] EDAC sbridge MC0: MISC 20403aba86 
May 31 20:35:18 ****** kernel: [502269.884166] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1622486118 SOCKET 0 APIC 0
May 31 20:35:18 ****** kernel: [502269.884228] EDAC MC0: 16682 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x170c7a offset:0xa00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:1)
May 31 20:35:19 ****** kernel: [502270.908292] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 31 20:35:19 ****** kernel: [502270.908349] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: cc12b44000010091
May 31 20:35:19 ****** kernel: [502270.908356] EDAC sbridge MC0: TSC 0 
May 31 20:35:19 ****** kernel: [502270.908359] EDAC sbridge MC0: ADDR 3ef245d00 
May 31 20:35:19 ****** kernel: [502270.908363] EDAC sbridge MC0: MISC 20404c4c86 
May 31 20:35:19 ****** kernel: [502270.908366] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1622486119 SOCKET 0 APIC 0
May 31 20:35:19 ****** kernel: [502270.908567] EDAC MC0: 19153 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x3ef245 offset:0xd00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:4)

It seems I could have a faulty RAM module, but memtest86 shows everything OK. Could this be my CPU's fault?

Best Answer

"but memtest86 shows everything OK. Could this be my CPU's fault?"

It could be, but something else is more likely: you have ECC memory, and it is doing its job.

ECC corrects single-bit errors transparently and signals each correction. The OS intercepts that signal and logs it, which is what the "CE" (correctable error) entries in your syslog are.
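If you want to check this against the raw numbers, the 64-bit value after "Bank 5:" is the machine-check status word, and it can be picked apart by hand. Below is a minimal Python sketch, assuming the bit layout from Intel's machine-check documentation as I understand it (bit 61 = uncorrected, bits 52:38 = corrected error count, low 16 bits = error code in the form 0000 0000 1MMM CCCC for memory controller errors). The function and field names are my own, not part of any tool.

```python
# Sketch only: decode a few architectural fields of a machine-check status
# word. Bit positions are assumed from Intel's MCA documentation; the helper
# name and dictionary keys are hypothetical.

MEMORY_OPS = {
    0b000: "generic/undefined",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),      # more errors than were logged
        "uncorrected": bool((status >> 61) & 1),   # False means ECC fixed it
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": status & 0xFFFF,
    }
    code = fields["mca_code"]
    # Memory controller errors use the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        fields["operation"] = MEMORY_OPS.get((code >> 4) & 0b111, "reserved")
        channel = code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields


# Status word taken from the "Machine Check Event" line in the question.
info = decode_mci_status(0xCC12B44000010091)
print(info["uncorrected"], info["corrected_count"],
      info.get("operation"), info.get("channel"))
```

For that value, `uncorrected` comes out false and `corrected_count` is 19153, the same number the EDAC line right after it reports, with the operation decoding to a memory read on channel 1.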

Memtest is too primitive for this: it does not intercept the notification. All it sees is that the test passes, because ECC has already fixed the errors.
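Since memtest will not show corrected errors, the kernel log is the place to look for which module they come from. Here is a rough Python sketch that tallies the "CE" lines per DIMM label. It assumes the exact message format shown in the question; the regular expression and the function name are my own and may need adjusting for other EDAC drivers.

```python
import re
from collections import defaultdict

# Assumption: lines look like the "EDAC MC0: <count> CE ... on <label> (" 
# entries in the question. Other drivers may format them differently.
EDAC_CE = re.compile(r"EDAC MC(\d+): (\d+) CE (.+?) on (\S+) \(")


def summarize_ce(lines):
    """Return {dimm_label: {"reports": n, "max_count": m}} (hypothetical helper)."""
    summary = defaultdict(lambda: {"reports": 0, "max_count": 0})
    for line in lines:
        match = EDAC_CE.search(line)
        if not match:
            continue  # not a corrected-error report
        count = int(match.group(2))
        label = match.group(4)
        entry = summary[label]
        entry["reports"] += 1
        entry["max_count"] = max(entry["max_count"], count)
    return dict(summary)


# Two lines abbreviated from the syslog excerpt in the question.
sample = [
    "kernel: [502269.884228] EDAC MC0: 16682 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x170c7a)",
    "kernel: [502270.908567] EDAC MC0: 19153 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x3ef245)",
]
for label, stats in summarize_ce(sample).items():
    print(label, stats)
```

Feeding it the whole syslog instead of the two sample lines should show whether the reports concentrate on one slot or spread across a channel.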