Linux – ECC chipkill errors: which DIMM

ecc, hardware, linux, memory

We often get DIMMs in our servers going bad with the following errors in syslog:

May  7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
May  7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
May  7 09:15:31 nolcgi303 kernel: MC0: CE - no information available: k8_edac Error Overflow set
May  7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

We can use the HP SmartStart CD to determine which DIMM has the error, but that requires taking the server out of production. Is there a cunning way to work out which DIMM is bust while the server is up? All our servers are HP hardware running RHEL 5.

Best Answer

In addition to using the EDAC codes, you can use the CLI-only HP utilities to determine this while the machine is online. The CLI versions are far more lightweight than the web-based ones and do not require you to open ports or keep a daemon running constantly.
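If you do want to go the EDAC route, the corrected-error (CE) line in syslog already names the memory controller, chip-select row and channel. The following is a minimal sketch, assuming the line format shown in the question; the regular expression and the `parse_ce_line` helper are illustrative names, not part of any EDAC tool. Note that mapping a row/channel pair to a physical slot depends on the board layout, which the log does not tell you.

```python
import re

# Matches the k8_edac corrected-error line as it appears in the question.
# Assumption: other drivers or kernel versions may format this line differently.
CE_LINE = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce_line(line):
    """Return controller, row, channel and syndrome from a CE line, or None."""
    match = CE_LINE.search(line)
    if match is None:
        return None
    return {
        "mc": int(match.group("mc")),
        "row": int(match.group("row")),
        "channel": int(match.group("channel")),
        "syndrome": match.group("syndrome"),
    }


# The CE line from the question, used here as sample input.
sample = (
    'May  7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, '
    'grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac'
)

print(parse_ce_line(sample))
```

Feeding every matching syslog line through this and counting hits per (mc, row, channel) shows which row is accumulating errors, but you still need the server's documentation to turn that into a slot number.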

hpasmcli reports the cartridge and module numbers of failed modules, which is a little quicker than decoding the EDAC messages.

Example:

hpasmcli -s "show dimm"

DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 2
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 3
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 4
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

For a failed module, the Status field will show something other than Ok.
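If you want to check this across many servers, the output is easy to parse. Here is a rough sketch, assuming the "Key: value" layout shown above with a blank line between modules. The function names are made up for illustration, and the non-Ok status string in the sample is a placeholder I invented, since the real wording for a failed module is not shown here.

```python
import subprocess


def read_show_dimm():
    """Run the same command as above and return its output as text."""
    return subprocess.check_output(
        ["hpasmcli", "-s", "show dimm"], universal_newlines=True
    )


def parse_dimms(text):
    """Split `show dimm` output into one dict per module."""
    dimms, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current module block.
            if current:
                dimms.append(current)
                current = {}
            continue
        if ":" not in line:
            continue  # skips the title and the dashed underline
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        dimms.append(current)
    return dimms


def failed_dimms(text):
    """Return the modules whose Status is anything other than Ok."""
    return [
        d for d in parse_dimms(text)
        if "Status" in d and d["Status"] != "Ok"
    ]


# Sample input: the second Status value is a hypothetical placeholder.
SAMPLE = """DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Status: Ok

Cartridge #: 0
Module #: 2
Status: PLACEHOLDER-NOT-OK
"""

for dimm in failed_dimms(SAMPLE):
    print("Cartridge %s, module %s: %s"
          % (dimm["Cartridge #"], dimm["Module #"], dimm["Status"]))
```

On a real server you would call `failed_dimms(read_show_dimm())` instead of using the sample text. That needs hpasmcli installed and usually root privileges.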