If you have not yet installed the PAE kernel, which kernel are you currently running?
Memtest may not catch the errors because the memory is ECC: single-bit errors are corrected in hardware before the test ever sees them.
Try running edac-util -v.
If it reports any uncorrectable errors, the output will show which memory rows are affected.
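If edac-util is not installed, the same counters can usually be read straight from sysfs. The following is a minimal sketch, assuming a Linux kernel with an EDAC driver loaded and the legacy csrow layout under /sys/devices/system/edac/mc (newer kernels may expose dimm* or rank* directories instead). The function names are made up for illustration.

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux; assumes the legacy csrow layout is enabled.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_int(path):
    """Return the integer stored in a sysfs file, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect corrected (ce) and uncorrected (ue) error counts per csrow."""
    results = []
    root = Path(root)
    if not root.is_dir():
        return results  # no EDAC driver loaded, or chipset not supported
    for csrow in sorted(root.glob("mc*/csrow*")):
        ce = _read_int(csrow / "ce_count")
        ue = _read_int(csrow / "ue_count")
        if ce is None and ue is None:
            continue  # not a csrow directory with counters
        results.append({
            "mc": csrow.parent.name,
            "csrow": csrow.name,
            "ce": ce or 0,
            "ue": ue or 0,
        })
    return results


if __name__ == "__main__":
    rows = read_edac_counts()
    if not rows:
        print("No EDAC data found (is an EDAC driver loaded?)")
    for r in rows:
        flag = "  <-- check this row" if (r["ce"] or r["ue"]) else ""
        print(f'{r["mc"]}/{r["csrow"]}: corrected={r["ce"]} uncorrected={r["ue"]}{flag}')
```

A corrected count that keeps climbing on one row is the early warning; any uncorrected count is the point where I would pull the module.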
Well, this isn't a fully-integrated system like an HP, Dell or IBM server, so the monitoring and reporting of such a failure isn't going to be present or consistent.
With the systems I've managed, disks fail the most often, followed by RAM, power supplies, fans, system boards and CPUs.
Memory can fail... There isn't much you can do about it.
See: Is it necessary to burn-in RAM for server-class hardware?
Since you can't really prevent ECC errors and RAM failure, just be prepared for it. Keep spares. Have physical access to your systems and maintain the warranty of your components. I definitely wouldn't introduce "precautionary replacement" into an environment. Some of this is a function of your hardware... Do you have IPMI? Sometimes hardware logs will end up there.
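If the board does have IPMI, the System Event Log is where those entries would show up. Here is a rough sketch that shells out to ipmitool sel elist and keeps only the lines that mention memory or ECC. It assumes ipmitool is installed and can reach the BMC; the keyword list and function names are my own guesses, and the wording of SEL entries varies by vendor.

```python
import shutil
import subprocess

# Assumed keywords; adjust to however your BMC words its memory events.
KEYWORDS = ("memory", "ecc")


def filter_memory_events(sel_text):
    """Return SEL lines that mention memory or ECC (case-insensitive)."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if any(k in line.lower() for k in KEYWORDS)
    ]


def read_sel():
    """Run `ipmitool sel elist` and return its output, or '' if unavailable."""
    if shutil.which("ipmitool") is None:
        return ""
    proc = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout if proc.returncode == 0 else ""


if __name__ == "__main__":
    events = filter_memory_events(read_sel())
    if not events:
        print("No memory-related SEL entries found (or ipmitool unavailable)")
    for line in events:
        print(line)
```

Run from cron, something like this at least gives you a record that survives a crash, since the SEL lives on the BMC rather than on the host's disk.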
This is one of the value-adds of better server hardware. Here's a snippet from an HP ProLiant DL580 G4 server where the ECC threshold on the RAM was exceeded, then progressed to the DIMM being disabled... then finally the server crashing (ASR) and rebooting itself with the bad DIMM deactivated.
0004 Repaired 22:21 12/01/2008 22:21 12/01/2008 0001
LOG: Corrected Memory Error threshold exceeded (Slot 1, Memory Module 1)
0005 Repaired 20:41 12/06/2008 20:43 12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0006 Repaired 21:37 12/06/2008 21:41 12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0007 Repaired 02:58 12/07/2008 02:58 12/07/2008 0001
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0008 Repaired 19:31 12/08/2009 19:31 12/08/2009 0001
LOG: ASR Detected by System ROM
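If you save a log like the one above to a text file, a few lines of Python will pull out the memory-related entries so you can see the progression at a glance. This is only a sketch: the column meanings (entry number, status, first-seen time and date, last-update time and date, count) are my reading of the output above rather than a documented format, and the function names are hypothetical.

```python
import re

# Header line, e.g. "0004 Repaired 22:21 12/01/2008 22:21 12/01/2008 0001".
# Column meanings are assumed from the sample, not taken from vendor documentation.
HEADER = re.compile(
    r"^(\d{4})\s+(\w+)\s+(\d\d:\d\d)\s+(\d\d/\d\d/\d{4})"
    r"\s+(\d\d:\d\d)\s+(\d\d/\d\d/\d{4})\s+(\d{4})$"
)


def parse_entries(text):
    """Pair each header line with the 'LOG:' line that follows it."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            current = {
                "id": m.group(1),
                "status": m.group(2),
                "first_seen": f"{m.group(4)} {m.group(3)}",
                "count": int(m.group(7)),
                "message": "",
            }
            entries.append(current)
        elif line.startswith("LOG:") and current is not None:
            current["message"] = line[4:].strip()
    return entries


def memory_events(entries):
    """Keep only entries whose message mentions memory."""
    return [e for e in entries if "memory" in e["message"].lower()]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as fh:
        for e in memory_events(parse_entries(fh.read())):
            print(e["first_seen"], f'x{e["count"]}', e["message"])
```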
Best Answer
You can try reseating and/or rearranging the RAM, then test thoroughly. If it passes, the problem can be chalked up to seating or alignment, but if the same module is still flagged, you need to seriously consider replacing it. If the same slot comes back up with a different module in its place, then you may want to take a closer look at the slot/motherboard itself.
A lot of RAM manufacturers offer good warranty periods, quite a few with lifetime terms, so it's worth looking into that as well.