CentOS – How to find which memory module has CE errors

centos, hardware, memory

I think one of the memory modules in my server has errors, and I am wondering how I can find out which one it is.

Server model: Supermicro 6072R-EN3RFT

RAM: 128 GB

CentOS 7 with latest updates installed

The mcelog says the following:

:[  883.230897] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[  883.230904] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc0001c7000800c1
:[  883.230906] EDAC sbridge MC0: TSC 0 
:[  883.230908] EDAC sbridge MC0: ADDR b71b18000 
:[  883.230909] EDAC sbridge MC0: MISC 908401000200e8c 
:[  883.504829] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612575 SOCKET 0 APIC 0
:[  883.504841] mce: [Hardware Error]: Machine check events logged
:[  883.606151] EDAC MC0: 7 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb71b18 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[  899.306134] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[  899.306143] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000207000800c1
:[  899.306145] EDAC sbridge MC0: TSC 0 
:[  899.306148] EDAC sbridge MC0: ADDR c71b19000 
:[  899.306150] EDAC sbridge MC0: MISC 908410000200e8c 
:[  899.306153] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612590 SOCKET 0 APIC 0
:[  899.306172] mce: [Hardware Error]: Machine check events logged
:[  899.644814] EDAC MC0: 8 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xc71b19 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[  901.190512] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
:[  901.190528] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
:[  901.190533] {1}[Hardware Error]: event severity: corrected
:[  901.190538] {1}[Hardware Error]:  Error 0, type: corrected
:[  901.190541] {1}[Hardware Error]:  fru_text: CorrectedErr
:[  901.190546] {1}[Hardware Error]:   section_type: memory error
:[  901.190549] [Firmware Warn]: error section length is too small
:[ 4916.540282] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 4916.540290] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000287000800c1
:[ 4916.540292] EDAC sbridge MC0: TSC 0 
:[ 4916.540294] EDAC sbridge MC0: ADDR b743ff000 
:[ 4916.540296] EDAC sbridge MC0: MISC 908400800240e8c 
:[ 4916.540298] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469616606 SOCKET 0 APIC 0
:[ 4916.540313] mce: [Hardware Error]: Machine check events logged
:[ 4916.540340] EDAC MC0: 10 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb743ff offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
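
The location fields in the `EDAC MC0: ... CE` lines above (label, channel, slot, rank) can be pulled out mechanically. What follows is only a minimal Python sketch: the function name `parse_ce_line` and the regular expression are illustrative assumptions, tested against nothing but the line format shown in this log. The shift by 12 bits assumes 4 KiB pages, which is consistent with the `ADDR` values logged next to each `page` value.

```python
import re

# One CE line copied from the log above (timestamp prefix trimmed).
LINE = (
    "EDAC MC0: 7 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0xb71b18 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW "
    "area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)"
)

# Hypothetical pattern matching the line layout seen in this log only.
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) \((?P<details>.*)\)"
)


def parse_ce_line(line):
    """Return a dict of location fields from one EDAC CE line, or None."""
    match = CE_RE.search(line)
    if match is None:
        return None
    info = {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "label": match.group("label"),
    }
    # key:value pairs inside the parentheses; err_code keeps its inner colon.
    for key, value in re.findall(r"(\w+):(\S+)", match.group("details")):
        info[key] = value
    return info


if __name__ == "__main__":
    fields = parse_ce_line(LINE)
    print(fields["label"], "channel", fields["channel"], "slot", fields["slot"])
    # Assuming 4 KiB pages, the page number maps back to the logged ADDR.
    print(hex(int(fields["page"], 16) << 12))
```

Every CE line in the log above carries the same label and the same channel, slot, and rank values, so the parsed fields point at a single location rather than several.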

I tried the following:

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:669
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0

Does this mean that I have 8 slots with 16 GB in each, and that the first slot contains the faulty module?

Any ideas on which memory module has the errors? I am not a system administrator, so I don't really know how to proceed.

Kind regards

Best Answer

I would expect your DIMM slots to be labelled something like BANK A DIMM 0, BANK A DIMM 1, and so on up to BANK B DIMM 3. You could assume that BANK A DIMM 0 is the faulty one, swap it with one of the other seven (assuming they are all identical), and repeat your tests until an error is logged again. If a different /sys/devices/system/edac/mc/mc?/csrow0/ch?_ce_count counter is incremented this time, the errors have followed the module to its new slot, and you can be reasonably sure you have found the problem DIMM.
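
To make the before/after comparison less error-prone, the counters can be snapshotted and diffed. The following is a rough Python sketch, assuming the sysfs layout shown in the question (`mc*/csrow*/ch*_ce_count`); the function names `read_ce_counts` and `changed_counters` are made up for this illustration.

```python
import glob
import os

# Path taken from the question; other kernels may lay out EDAC sysfs differently.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Map 'mcN/csrowN/chN' to its corrected-error count."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            value = int(handle.read().strip())
        mc, csrow, filename = path.split(os.sep)[-3:]
        channel = filename[: -len("_ce_count")]
        counts[f"{mc}/{csrow}/{channel}"] = value
    return counts


def changed_counters(before, after):
    """Return the counters that grew between two snapshots, with their increase."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }
```

One way to use it: save the result of `read_ce_counts()` before swapping the DIMMs, rerun the workload afterwards, and pass both snapshots to `changed_counters`. Note that the counters may reset on reboot, in which case the "before" snapshot is effectively all zeros.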

Related Solutions

Linux – ECC chipkill errors: which DIMM

In addition to using the EDAC codes, you can use the CLI-only HP utilities to determine this while the machine is online. The CLI versions are far more lightweight than the web-based ones and do not require you to open ports or keep a daemon constantly running.

hpasmcli will give you the cartridge and module numbers of the failed modules, which is a little quicker than analyzing the EDAC output.

Example:

hpasmcli -s "show dimm"

DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 2
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 3
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 4
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

For a failed module, the Status field will show something other than Ok.
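
On a machine with many modules, scanning this output by eye gets tedious. Here is a small Python sketch that filters it, assuming the output keeps the `Key: value` layout shown above with a blank line between modules; `parse_show_dimm` and `modules_not_ok` are hypothetical helper names, and the exact text a failed module reports is not known here, so the check is simply "anything other than Ok".

```python
# Two module records copied from the example output above.
SAMPLE = """\
DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 2
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok
"""


def parse_show_dimm(text):
    """Split the 'show dimm' output into one dict per module."""
    modules, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current module record.
            if current:
                modules.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:  # header and separator lines have no colon and are skipped
            current[key.strip()] = value.strip()
    if current:
        modules.append(current)
    return modules


def modules_not_ok(modules):
    """Return (cartridge, module, status) for present modules whose status is not Ok."""
    return [
        (m.get("Cartridge #"), m.get("Module #"), m.get("Status"))
        for m in modules
        if m.get("Present") == "Yes" and m.get("Status") != "Ok"
    ]


if __name__ == "__main__":
    print(modules_not_ok(parse_show_dimm(SAMPLE)))
```

In practice, the text would come from running `hpasmcli -s "show dimm"` and capturing its standard output instead of the embedded sample.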

Windows Memory – How to Find Memory Usage of Individual Windows Services

There is an easy way to get the information you are asking for (but it does require a slight change to your system):

Configure each service to run in its own SVCHOST.EXE process, and the service consuming the CPU cycles will be easily visible in Task Manager or Process Explorer (the space after "=" is required):

SC Config Servicename Type= own

Run this in a command-line window or put it in a BAT script. Administrative privileges are required, and the computer must be restarted before the change takes effect.

The original state can be restored by:

SC Config Servicename Type= share

Example: to make Windows Management Instrumentation run in a separate SVCHOST.EXE:

SC Config winmgmt Type= own

This technique has no ill effects, except perhaps increasing memory consumption slightly. Apart from CPU usage, it also makes it easy to observe the page fault delta, disk I/O read rate, and disk I/O write rate for each service. In Process Explorer, use the menu View/Select Columns: tab Process Memory/Page Fault Delta, tab Process Performance/IO Delta Read Bytes, and tab Process Performance/IO Delta Write Bytes, respectively.

On most systems there is only one SVCHOST.EXE process that has a lot of services. I have used this sequence (it can be pasted directly into a command line window):

rem  1. "Automatic Updates"
SC Config wuauserv Type= own

rem  2. "COM+ Event System"
SC Config EventSystem Type= own

rem  3. "Computer Browser"
SC Config Browser Type= own

rem  4. "Cryptographic Services"
SC Config CryptSvc Type= own

rem  5. "Distributed Link Tracking"
SC Config TrkWks Type= own

rem  6. "Help and Support"
SC Config helpsvc Type= own

rem  7. "Logical Disk Manager"
SC Config dmserver Type= own

rem  8. "Network Connections"
SC Config Netman Type= own

rem  9. "Network Location Awareness"
SC Config NLA Type= own

rem 10. "Remote Access Connection Manager"
SC Config RasMan Type= own

rem 11. "Secondary Logon"
SC Config seclogon Type= own

rem 12. "Server"
SC Config lanmanserver Type= own

rem 13. "Shell Hardware Detection"
SC Config ShellHWDetection Type= own

rem 14. "System Event Notification"
SC Config SENS Type= own

rem 15. "System Restore Service"
SC Config srservice Type= own

rem 16. "Task Scheduler"
SC Config Schedule Type= own

rem 17. "Telephony"
SC Config TapiSrv Type= own

rem 18. "Terminal Services"
SC Config TermService Type= own

rem 19. "Themes"
SC Config Themes Type= own

rem 20. "Windows Audio"
SC Config AudioSrv Type= own

rem 21. "Windows Firewall/Internet Connection Sharing (ICS)"
SC Config SharedAccess Type= own

rem 22. "Windows Management Instrumentation"
SC Config winmgmt Type= own

rem 23. "Wireless Configuration"
SC Config WZCSVC Type= own

rem 24. "Workstation"
SC Config lanmanworkstation Type= own

rem End.
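
Since every service in the list above needs a matching `Type= share` line to be restored later, it can help to generate both sets of commands from a single list. The following Python sketch prints the restore commands; the service names are copied from the sequence above, while the helper name `sc_config_lines` is an assumption made for illustration.

```python
# Service names copied from the sequence above, in the same order.
SERVICES = [
    "wuauserv", "EventSystem", "Browser", "CryptSvc", "TrkWks", "helpsvc",
    "dmserver", "Netman", "NLA", "RasMan", "seclogon", "lanmanserver",
    "ShellHWDetection", "SENS", "srservice", "Schedule", "TapiSrv",
    "TermService", "Themes", "AudioSrv", "SharedAccess", "winmgmt",
    "WZCSVC", "lanmanworkstation",
]


def sc_config_lines(services, service_type):
    """Build one 'SC Config' line per service for the given type (own or share)."""
    if service_type not in ("own", "share"):
        raise ValueError("service_type must be 'own' or 'share'")
    # The space after '=' is required by SC, so it is kept in the template.
    return [f"SC Config {name} Type= {service_type}" for name in services]


if __name__ == "__main__":
    # Print the lines that undo the sequence above; paste them into a BAT script.
    print("\n".join(sc_config_lines(SERVICES, "share")))
```

The output is plain text meant to be pasted into an elevated command-line window or saved as a BAT script; the script itself does not change any service configuration.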
