What to do in response to repeated DRAM ECC error notifications for the same memory location

ecc hardware memory

I woke up this morning to a first for me: one of my systems had logged DRAM ECC error notifications. Three of them, in fact, and as far as I can tell all for the exact same memory location (obviously, the system isn't actually named localhost):

Aug 31 05:00:46 localhost kernel: [719099.816034] [Hardware Error]: CPU:0   MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c6c40006b080a13
Aug 31 05:00:46 localhost kernel: [719099.816046] [Hardware Error]:         MC4_ADDR: 0x0000000641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816051] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Aug 31 05:00:46 localhost kernel: [719099.816059] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816070] EDAC MC0: CE page 0x641f49, offset 0xd20, grain 0, syndrome 0x6bd8, row 2, channel 0, label "": amd64_edac
Aug 31 05:00:46 localhost kernel: [719099.816075] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

The above was followed by an identical notification at system time 05:10:46 (719699.8160) and then one more at 05:20:46 (720299.8160), which also had Over set on the CPU:0 MC4_STATUS line (status 0xdc6c40006b080813). So far the system has been stable since, with no further errors logged. System activity is normal, and the system in question has been running with ECC RAM since 2014 without ever logging an ECC error until now.

I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) between the logged errors could simply be due to RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting. However, three consecutive errors in the same memory location (same value for CE ERROR_ADDRESS) do have me a little bit concerned.
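The arithmetic behind those two observations can be checked with a short sketch. It uses only the address and kernel timestamps quoted above (the second and third timestamps are truncated in the question, so the gaps are rounded to milliseconds), and it assumes 4 KiB pages, which is what the page/offset split in the EDAC line suggests. The function names are illustrative only.

```python
# Values copied from the log excerpt and the timestamps quoted in the question.
ERROR_ADDRESS = 0x641F49D20
KERNEL_TIMESTAMPS = [719099.816034, 719699.8160, 720299.8160]  # seconds since boot

PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on typical x86-64 Linux


def split_address(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset within page)."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


def intervals(timestamps):
    """Return the gaps in seconds between consecutive timestamps, rounded to ms."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]


page, offset = split_address(ERROR_ADDRESS)
print(f"page {page:#x}, offset {offset:#x}")  # page 0x641f49, offset 0xd20
print("intervals (s):", intervals(KERNEL_TIMESTAMPS))  # [600.0, 600.0]
```

The page and offset match the "CE page 0x641f49, offset 0xd20" line in the log, and both gaps come out at 600 seconds, which is consistent with (but does not prove) a periodic scrub finding the same bad cell each pass.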

Update: The host in question has logged several more errors since I originally posted this question, all with the same value for CE ERROR_ADDRESS.
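One way to confirm that every report really does share one address is to tally the CE ERROR_ADDRESS values in the kernel log. The following is a minimal sketch, assuming the log lines look like the amd64_edac line in the excerpt above; other EDAC drivers format their messages differently, and the `count_ce_addresses` helper is hypothetical.

```python
import re
from collections import Counter

# Matches the EDAC line format shown in the excerpt above, e.g.
# "EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20"
CE_ADDRESS = re.compile(r"EDAC amd64 MC\d+: CE ERROR_ADDRESS=\s*(0x[0-9a-fA-F]+)")


def count_ce_addresses(log_text):
    """Count correctable-error reports per reported physical address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CE_ADDRESS.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# The one EDAC address line quoted verbatim in the question.
sample = (
    "Aug 31 05:00:46 localhost kernel: [719099.816059] "
    "EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20\n"
)
for address, count in count_ce_addresses(sample).most_common():
    print(f"{address:#x}: {count}")
```

Fed the full kernel log instead of the single sample line, a single address with a steadily growing count would point at one persistently failing cell rather than scattered transient errors.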

How seriously should I take this? What's a good response: order replacement RAM right away and schedule its installation ASAP, treat this as just a momentary glitch, or take no specific action right now but stay on my toes to replace the RAM if it happens again?

Best Answer

ECC RAM tends to be used on critical servers, and the system is reporting a hardware failure. If it's not a critical system and you don't mind everything going through it potentially being corrupted, then sure, wait and see what happens. But if you care about your data more than the cost of the RAM, replace the faulty RAM ASAP.
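If you do choose to wait and see, it helps to watch the counters rather than rely on spotting log lines. The sketch below assumes the Linux EDAC driver is loaded and exposes its usual per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`; the `read_error_counts` helper is hypothetical, and the path may differ or be absent on your system.

```python
from pathlib import Path

# Assumption: the Linux EDAC driver is loaded and exposes per-controller
# counters under this directory (ce_count = corrected, ue_count = uncorrected).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_error_counts(root=EDAC_MC_ROOT):
    """Return {controller name: (ce_count, ue_count)} for each mc* directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc_dir.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_error_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run periodically (from cron, for example), a rising corrected count or any nonzero uncorrected count is a reasonable trigger to stop waiting and swap the module.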