It does make a difference, but it will only make sense if you require the RAS (Reliability, Availability, and Serviceability) features on x4 or x8 devices and understand the trade-offs for your needs. More details are explained in the Dell white paper Dell™ PowerEdge™ Servers 2009 - Memory.
Also, configuration and layout details specific to the R710 are available in the Technical Guidebook for the PowerEdge R710 (Google it; I don't have the reputation to post another link).
The important thing to note is the difference between the ECC on the chip and the "Advanced ECC" provided by Dell's BIOS for Single Device Data Correction (SDDC). Both carry a performance impact. Plain ECC will recover from single-bit errors during reads and writes to the chip. SDDC goes a step further and organizes the bits so that an entire chip can fail and the data is still recoverable. See an example and details in the SDDC E7500 Chipset document.
The issue is whether performance or reliability is of the utmost concern in your specific usage of the machine. If a chip failure would cause a loss of critical data or service, and the machine is non-redundant in its implementation, Advanced ECC may be a great way to go. However, it comes at a performance cost, which may be more important to you.
I've implemented both in the field on Dell PowerEdge servers for single Microsoft SQL Server implementations. If I can be of more help, just comment to let me know.
Hope that helps.
EDIT: Coverage gap / ECC implementations
Yes, there is a coverage gap even if you implement both. Since you are specifically using a cluster of high-availability servers, IMHO you should use Advanced ECC. The performance impact is minimal compared to the benefits for the clustered devices: according to Crucial, ECC memory in general costs only about a 2% decrease in performance.
The gap is specific to the types of errors that occur and how each scheme handles them. In your situation it shouldn't translate to data loss, since an Enterprise DBMS manages errors, concurrency issues, etc. at the software level in order to prevent data loss. A properly configured DBMS keeps a detailed history of changes, and the software that uses it can typically be set up to roll back the transaction if a severe error occurs.
ECC Implementations
ECC will attempt to correct any bit errors during memory reads and writes. However, if the error spans more bits than the code can correct, then not even ECC will be able to recover, causing potential loss of data. There is more discussion of ECC at ServerFault/What is ECC ram and why is it better?
According to Wikipedia on ECC_Memory
ECC memory maintains a memory system effectively free from single-bit errors...
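To make the single-bit/multi-bit distinction concrete, here is a toy SECDED (single-error-correct, double-error-detect) sketch in Python. It's my own illustration on 8-bit words; real ECC DIMMs use the same Hamming-style construction, just on 64-bit words with 8 check bits:

```python
# Toy SECDED code: 8 data bits -> 13-bit codeword
# (overall parity at index 0, Hamming parity at indices 1, 2, 4, 8).

DATA_POS = [p for p in range(1, 13) if p not in (1, 2, 4, 8)]

def encode(data_bits):
    code = [0] * 13
    for pos, bit in zip(DATA_POS, data_bits):
        code[pos] = bit
    for p in (1, 2, 4, 8):                  # parity p covers indices with bit p set
        code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
    code[0] = sum(code[1:]) % 2             # overall parity enables double detection
    return code

def decode(code):
    syndrome = 0
    for p in (1, 2, 4, 8):
        if sum(code[i] for i in range(1, 13) if i & p) % 2:
            syndrome |= p                   # syndrome = index of a single flipped bit
    parity_even = sum(code) % 2 == 0
    if syndrome == 0 and parity_even:
        return "clean"
    if not parity_even and syndrome <= 12:  # odd flip count: assume single-bit error
        code[syndrome] ^= 1                 # repair in place (index 0 = parity bit)
        return "corrected"
    return "uncorrectable"                  # e.g. two flips: detected but not fixed

word = [1, 0, 1, 1, 0, 0, 1, 0]
cw = encode(word)
cw[5] ^= 1
print(decode(cw))            # corrected  (single-bit error, like plain ECC handles)
cw = encode(word)
cw[5] ^= 1; cw[9] ^= 1
print(decode(cw))            # uncorrectable  (multi-bit error: the coverage gap)
```

Flip one bit and the decoder repairs it; flip two and it can only report the failure. That boundary is exactly the coverage gap discussed above.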
SDDC
Refer to the E7500 chipset document above, which describes SDDC and how it's made possible (the 55xx/56xx documents from Intel require a login/partnership, but the idea is similar, which is why I didn't link them originally). Basically, it uses a technique for organizing the words written to memory that ensures a failed device contributes at most a single bit error to any one word, i.e. each word remains recoverable from its single bit error (as above). That guarantee is per word, so it could potentially recover from up to 4 bit errors on x4 devices (1 per word) and up to 8 bit errors on x8 devices (still 1 per word) by error-correcting each word.
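Here's a rough Python model of that bit-steering idea. It is my own simplification, not the actual E7500 wiring: I shrink the ECC word to 18 bits, one per chip, so one rank of x4 devices fills exactly 4 words:

```python
import random

CHIPS, BITS_PER_CHIP = 18, 4     # one rank of x4 DRAM devices
WORDS = BITS_PER_CHIP            # steer each of a chip's 4 bits into its own word

def steered_words(chip_bits):
    """SDDC-style layout: word w takes bit w from every chip, so a dead
    chip contributes at most one error bit per ECC word."""
    return [[chip_bits[c][w] for c in range(CHIPS)] for w in range(WORDS)]

def naive_words(chip_bits):
    """Contiguous layout: one chip's 4 bits can all land in the same word."""
    flat = [b for bits in chip_bits for b in bits]
    step = len(flat) // WORDS
    return [flat[w * step:(w + 1) * step] for w in range(WORDS)]

random.seed(1)
data = [[random.randint(0, 1) for _ in range(BITS_PER_CHIP)] for _ in range(CHIPS)]
good = {"steered": steered_words(data), "naive": naive_words(data)}

data[5] = [b ^ 1 for b in data[5]]   # chip 5 dies: every bit it stores flips
bad = {"steered": steered_words(data), "naive": naive_words(data)}

for layout in ("steered", "naive"):
    errs = [sum(g != b for g, b in zip(gw, bw))
            for gw, bw in zip(good[layout], bad[layout])]
    print(layout, "bit errors per word:", errs)
# steered bit errors per word: [1, 1, 1, 1]  -> every word single-bit correctable
# naive   bit errors per word: [0, 4, 0, 0]  -> a 4-bit burst no SECDED word survives
```

With steering, the dead chip shows up as four separate single-bit errors, each fixable by ordinary single-bit correction; without it, one word absorbs an uncorrectable burst.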
Additional errors, more bit errors, total memory failure, channel failure, bus failure, etc. can still all cause horrible problems but that's why you have a cluster and an Enterprise DBMS.
In short, even with everything enabled, if there are too many bit errors for the error-correction algorithms to handle, you will still have an error, i.e. an error coverage gap. Such cases can be exceptionally rare, though.
This indicates that a single-bit error (SBE) has occurred on DIMM 6 with such a frequency that the system is no longer logging the error until it is rebooted. (See https://support.quest.com/SolutionDetail.aspx?id=SOL60022 for background.)
It's a bit perplexing that you're seeing the same error after replacing the motherboard, but it is possible that the replacement board has the same defect as the first board. Since you moved the DIMMs around and the problem hasn't followed the DIMM, I'm less inclined to suspect the DIMM.
I would use the appropriate Dell MpMemory diagnostic for that server rather than memtest+. The Dell tool is going to be aware of any Dell-specific hardware features.
The system should not reboot upon a correctable memory error. Do you see additional information or a pattern via ipmitool sel elist? The BMC watchdog could reboot the system; check whether it is enabled via ipmitool mc watchdog get. As you already have the information on the location of the bad memory module, replace it, and if the problem manifests again, the memory slot could be at fault.

On the X10SLM-F, the RAM that you use is not on the list of tested memory modules. If you have the possibility, replace all the memory modules in a 'problem' system with equivalent Supermicro-tested ones. Also, check the list of supported operating systems for your Ubuntu version.
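If you'd rather script those checks across several hosts, here's a minimal Python sketch. It assumes ipmitool is installed and can reach the BMC in-band with sufficient privileges; sel elist and mc watchdog get are standard ipmitool subcommands, but the "Memory"/"ECC" substring filter is my guess at the SEL wording, so adjust it to your actual log lines:

```python
import subprocess

def ipmi(*args):
    """Run an ipmitool subcommand and return its stdout."""
    return subprocess.run(["ipmitool", *args], capture_output=True,
                          text=True, check=True).stdout

# Scan the event log for memory-related entries to look for a pattern
for line in ipmi("sel", "elist").splitlines():
    if "Memory" in line or "ECC" in line:
        print(line)

# A BMC watchdog that is armed (and firing) would explain surprise reboots
print(ipmi("mc", "watchdog", "get"))
```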
Related to the CMOS settings, you could use Supermicro SUM, provided you have the SUM keys installed, to dump the BIOS settings from all the systems, then vimdiff them to see if any CMOS parameter differs between the systems that regularly reboot and the system(s) that do not.
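If you'd rather script the comparison than eyeball it in vimdiff, here's a rough Python sketch. GetCurrentBiosCfg is SUM's documented BIOS-settings dump command, but the host IPs, credentials, and exact flags below are placeholders; check the manual for your SUM version before relying on them:

```python
import difflib, subprocess

HOSTS = ["10.0.0.11", "10.0.0.12"]   # e.g. one rebooting node, one stable node

def dump_bios(host):
    """Dump the current BIOS settings of one host to <host>.cfg via SUM."""
    cfg = f"{host}.cfg"
    subprocess.run(["sum", "-i", host, "-u", "ADMIN", "-p", "PASSWORD",
                    "-c", "GetCurrentBiosCfg", "--file", cfg], check=True)
    with open(cfg) as f:
        return f.read().splitlines()

a, b = (dump_bios(h) for h in HOSTS)
# Same idea as vimdiff, just scripted: print only the settings that differ
print("\n".join(difflib.unified_diff(a, b, *HOSTS, lineterm="")))
```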