Fedora Server 34 Crashes on HP ProLiant DL380e G8 – Troubleshooting Guide

fedora hp-proliant hpe memory

My HP ProLiant DL380e G8 server running Fedora Server 34 keeps crashing. I suspect memory errors or a failing DIMM, but I'm not sure.

Feedback is very welcome!

I ran journalctl -r. A snippet of the output that looks out of the ordinary is in this Pastebin link: https://pastebin.com/KPUZHceD

All help and ideas are appreciated!

Kind regards

Edit:
In response to @Michael Hampton's comment, here is the relevant output:

<27>Sep  7 17:03:51 mcelog: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1304]: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1303]: <27>Sep  7 17:03:51 mcelog: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1303]: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1304]: <27>Sep  7 17:03:51 mcelog: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1304]: Location: SOCKET:0 CHANNEL:3 DIMM:1 []
Sep 07 17:03:51 turbo mcelog[1303]: <27>Sep  7 17:03:51 mcelog: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1303]: corrected DIMM memory error count exceeded threshold: 10 in 24h
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 2 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 1 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 7
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 3 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 13 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 6
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 0 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 0 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 5
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: Running trigger `dimm-error-trigger' (reporter: memdb)
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 6 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c80000c400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d22131295c834800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 3 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 4
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID a SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801c00400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d2213fa689118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 5 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 3
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 5 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801bd8400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d2213f0649118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 14 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 2
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 1 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c801bec400800093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: MemCtrl:
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: MCi_MISC register valid
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: MISC d221196e09118800
Sep 07 17:03:51 turbo mcelog[1067]: CPU 12 BANK 11
Sep 07 17:03:51 turbo mcelog[1067]: MCE 1
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: CPUID Vendor Intel Family 6 Model 45 Step 7
Sep 07 17:03:51 turbo mcelog[1067]: MICROCODE 71a
Sep 07 17:03:51 turbo mcelog[1067]: MCGCAP 1000812 APICID 0 SOCKETID 0
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c0107b4000010093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: STATUS c0107b4000010093 MCGSTATUS 0
Sep 07 17:03:51 turbo mcelog[1067]: Transaction: Memory read error
Sep 07 17:03:51 turbo mcelog[1067]: MCA: MEMORY CONTROLLER RD_CHANNEL3_ERR
Sep 07 17:03:51 turbo mcelog[1067]: Corrected error
Sep 07 17:03:51 turbo mcelog[1067]: Error overflow
Sep 07 17:03:51 turbo mcelog[1067]: MCi status:
Sep 07 17:03:51 turbo mcelog[1067]: MCG status:
Sep 07 17:03:51 turbo mcelog[1067]: TIME 1631027031 Tue Sep  7 17:03:51 2021
Sep 07 17:03:51 turbo mcelog[1067]: CPU 0 BANK 5
Sep 07 17:03:51 turbo mcelog[1067]: MCE 0
Sep 07 17:03:51 turbo mcelog[1067]: Hardware event. This is not a software error.
Sep 07 17:03:51 turbo mcelog[1067]: mcelog: mcelog read: Input/output error
Sep 07 17:03:51 turbo kernel: ERST: [Firmware Warn]: Firmware does not respond in time.
Sep 07 17:03:51 turbo kernel: mce: [Hardware Error]: Machine check events logged
Sep 07 17:03:51 turbo kernel: mce: [Hardware Error]: Machine check events logged
Sep 07 17:03:51 turbo kernel: mce_notify_irq: 6 callbacks suppressed
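The STATUS values in the log can be cross-checked by hand against what mcelog prints. Below is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) and the memory-controller error code format `000F 0000 1MMM CCCC` in the low 16 bits. The function name `decode_status` and the returned dictionary keys are made up for illustration; this is not how mcelog itself is implemented.

```python
# Flag bits in the upper part of the 64-bit machine-check status register.
FLAG_BITS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # error overflow: an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory-controller error code.
MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status_hex):
    """Decode a STATUS value as printed by mcelog (hex string without 0x)."""
    status = int(status_hex, 16)
    flags = {name for bit, name in FLAG_BITS.items() if (status >> bit) & 1}
    code = status & 0xFFFF  # MCA error code
    result = {"flags": flags, "corrected": "UC" not in flags, "mca_code": code}

    # Memory-controller errors match 000F 0000 1MMM CCCC; bit 12 (F) is ignored.
    if code & 0xEF80 == 0x0080:
        channel = code & 0xF
        result["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        result["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return result


if __name__ == "__main__":
    for value in ("c80000c400800093", "c0107b4000010093"):
        print(value, decode_status(value))
```

For the two STATUS values decoded here, the result agrees with the log: a corrected memory read error on channel 3 with the overflow flag set, and MCi_MISC valid only for the first value.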

Best Answer

The problem was fixed by removing two faulty RAM sticks from the server and reseating the CPU, which was not making good contact either.

Thanks for all the help!
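For anyone trying to work out which sticks to pull, the Location lines that mcelog writes to the journal can be tallied per slot. The following is a rough sketch, assuming journal lines in the same format as the output in the question; `tally_dimm_reports` is a hypothetical helper name. It counts threshold report lines, not individual memory errors, and it deliberately skips the echoed `<27>` syslog copies so each report is counted once.

```python
import re
import sys
from collections import Counter

# Matches e.g. "turbo mcelog[1304]: Location: SOCKET:0 CHANNEL:3 DIMM:1 []".
# Lines where "<27>..." follows the PID are echoed copies and do not match.
LOCATION = re.compile(
    r"mcelog\[\d+\]: Location: SOCKET:(\d+) CHANNEL:(\d+) DIMM:(\d+)"
)


def tally_dimm_reports(lines):
    """Count mcelog Location report lines per (socket, channel, dimm)."""
    counts = Counter()
    for line in lines:
        match = LOCATION.search(line)
        if match:
            counts[tuple(int(group) for group in match.groups())] += 1
    return counts


if __name__ == "__main__":
    # Reads journal text from standard input.
    for (socket, channel, dimm), n in tally_dimm_reports(sys.stdin).most_common():
        print(f"socket {socket} channel {channel} dimm {dimm}: {n} report(s)")
```

It could be fed with something like `journalctl -r | python3 tally.py`, where `tally.py` is whatever name the script is saved under. The slot numbering mcelog reports still has to be mapped to the physical DIMM labels on the board, which this sketch does not attempt.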
