ECC CE (Correctable Error) occurring exactly every 5 minutes

ecc hardware memory

On one of our computing nodes I am getting ECC CEs (correctable errors). What is a little peculiar is that the errors do not come in bursts: there is just a single occurrence exactly every 5 minutes.

messages.log:

May  7 11:43:37 armada9 kernel: [22220081.676263] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x41daad offset:0xc30 grain:0 syndrome:0x2254)
May  7 11:48:37 armada9 kernel: [22220381.919057] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x407bb8 offset:0x150 grain:0 syndrome:0x33a8)
May  7 11:53:37 armada9 kernel: [22220682.161798] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x41e6bd offset:0x6a0 grain:0 syndrome:0x33a8)
May  7 11:58:37 armada9 kernel: [22220982.404501] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x427c14 offset:0x880 grain:0 syndrome:0x33a8)
May  7 12:03:37 armada9 kernel: [22221282.647210] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x426e88 offset:0x830 grain:0 syndrome:0x33a8)
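To double-check the "exactly every 5 minutes" impression, the five EDAC lines above can be parsed and the spacing between their kernel timestamps computed. This is only a rough sketch: the regular expression and the names parse_edac and EDAC_RE are made up for this message layout and are not part of any EDAC tooling.

```python
import re

# The five EDAC lines quoted above, verbatim.
LOG = """\
May  7 11:43:37 armada9 kernel: [22220081.676263] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x41daad offset:0xc30 grain:0 syndrome:0x2254)
May  7 11:48:37 armada9 kernel: [22220381.919057] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x407bb8 offset:0x150 grain:0 syndrome:0x33a8)
May  7 11:53:37 armada9 kernel: [22220682.161798] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x41e6bd offset:0x6a0 grain:0 syndrome:0x33a8)
May  7 11:58:37 armada9 kernel: [22220982.404501] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x427c14 offset:0x880 grain:0 syndrome:0x33a8)
May  7 12:03:37 armada9 kernel: [22221282.647210] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x426e88 offset:0x830 grain:0 syndrome:0x33a8)
"""

# Hypothetical pattern, written only for the message layout shown above.
EDAC_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] EDAC MC(?P<mc>\d+): \d+ CE .*?"
    r"csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-f]+)"
)


def parse_edac(text):
    """Return one dict per EDAC CE line found in the given log text."""
    events = []
    for line in text.splitlines():
        m = EDAC_RE.search(line)
        if m:
            events.append({
                "uptime": float(m["uptime"]),  # kernel timestamp in seconds
                "location": (int(m["mc"]), int(m["csrow"]), int(m["channel"])),
                "page": int(m["page"], 16),
            })
    return events


events = parse_edac(LOG)
# Seconds between consecutive errors, taken from the kernel timestamps.
intervals = [b["uptime"] - a["uptime"] for a, b in zip(events, events[1:])]
locations = {e["location"] for e in events}
pages = {e["page"] for e in events}

print("intervals (s):", [round(dt, 3) for dt in intervals])
print("distinct (mc, csrow, channel):", locations)
print("distinct pages:", len(pages))
```

On this excerpt the spacing comes out at roughly 300.24 seconds each time, always at MC1, csrow 4, channel 1, but on a different page in every entry.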

syslog example entry:

May  7 12:03:37 armada9 kernel: [22221282.647114] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
May  7 12:03:37 armada9 kernel: [22221282.647210] EDAC MC1: 1 CE on unknown memory (csrow:4 channel:1 page:0x426e88 offset:0x830 grain:0 syndrome:0x33a8)
May  7 12:03:37 armada9 kernel: [22221282.647215] [Hardware Error]: Error Status: Corrected error, no action required.
May  7 12:03:37 armada9 kernel: [22221282.647299] [Hardware Error]: CPU:6 (10:8:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc54400033080813
May  7 12:03:37 armada9 kernel: [22221282.647393] [Hardware Error]: MC4_ADDR: 0x0000000426e88830
May  7 12:03:37 armada9 kernel: [22221282.647443] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
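The page and offset in the EDAC line appear to be just the MC4_ADDR value split at the page boundary. A small sketch of that arithmetic, assuming 4 KiB pages (the page size is my assumption, and split_address is a made-up helper name):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, i.e. the low 12 bits are the offset


def split_address(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset in page)."""
    page = addr >> page_shift
    offset = addr & ((1 << page_shift) - 1)
    return page, offset


mc4_addr = 0x0000000426e88830  # MC4_ADDR from the syslog entry above
page, offset = split_address(mc4_addr)
print(f"page:{page:#x} offset:{offset:#x}")
```

Under that assumption the result matches the page:0x426e88 offset:0x830 pair reported by EDAC for the same event.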

Another thing that baffles me is that cat /sys/devices/system/edac/mc/mc*/csrow*/ce_count
shows 4x 0. dmidecode -t memory | grep Size reports that 8x 2GB modules are installed.
But cat /sys/devices/system/edac/mc/mc*/csrow*/size_mb shows 4x 4096. I am guessing that the modules are single-ranked and that pairs of them are coupled into one csrow. Is this thinking right? Even so, it does not explain why the error count is 0.
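For repeated checks, the two cat commands can be folded into a small script that prints the counter and the size of each csrow side by side. This is a minimal sketch that reads only the two sysfs files named above; the helper names read_csrows and total_mb are invented here, and a missing file is reported as None rather than treated as an error.

```python
from pathlib import Path

# Same sysfs location as in the cat commands above.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_csrows(root=EDAC_ROOT):
    """Collect ce_count and size_mb for every mc*/csrow* directory."""
    root = Path(root)
    rows = []
    if not root.is_dir():
        return rows  # EDAC sysfs tree not present on this machine
    for csrow in sorted(root.glob("mc*/csrow*")):
        entry = {"mc": csrow.parent.name, "csrow": csrow.name}
        for field in ("ce_count", "size_mb"):
            path = csrow / field
            entry[field] = int(path.read_text().strip()) if path.is_file() else None
        rows.append(entry)
    return rows


def total_mb(rows):
    """Sum the reported csrow sizes, skipping rows without a size."""
    return sum(r["size_mb"] for r in rows if r["size_mb"] is not None)


rows = read_csrows()
for row in rows:
    print(row)
print("total MB:", total_mb(rows))
```

At least the capacity adds up either way: 4 x 4096 MB and 8 x 2048 MB are both 16384 MB, which is consistent with the guess that modules are paired per csrow.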

This has been going on for about 2-3 days already. Every error so far was reported as corrected, but this is pretty annoying and probably not safe.

Is a RAM module dying, and am I just lucky that only some system process (as opposed to a computation) happened to be placed in the affected memory? I don't think I have anything running every 5 minutes, but maybe some logging tool does.

Or could the cause be something else?

Best Answer

A similar problem happened when I installed new DIMMs in my PowerEdge R815. I thought one of the DIMMs was bad, but didn't know which of the 32 DIMMs it might be. It turned out that the hardware's LCD panel (and the hardware log) reported the failure, and provided the DIMM slot id. When I reseated the DIMM, the error went away -- so it wasn't an error that could be corrected by ECC after all.