Non-ECC RAM for virtualization

ecc virtual-machines xeon

I'm building a virtualization server, and I have a question: should I stick with non-ECC RAM for this server or not?

I ask because I found a Xeon CPU that fits the budget I was given. A CPU that supports ECC RAM, however, would push me over that budget.

The server will run around 10 virtual machines 24/7, with Linux and Windows virtual machines mixed.

Any opinions about this?

Best Answer

That depends on whether you're fine with a higher risk of in-memory corruption.

ECC in no way guarantees that all errors will be corrected or detected - but it does a pretty good job of detecting and even correcting quite a few types of failures. This is especially relevant if your stack is running on a single node rather than HA/replicated across more than one. If you only have one pool of memory that acts as a single source of truth, you had better make it a good one.
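To make the "corrects some errors, but not all" point concrete, here is a toy sketch of a Hamming(7,4) code in Python. This is only an illustration of the idea, not what a memory controller actually runs: ECC memory typically uses a wider SECDED code over 64-bit words, and the function names below are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is the 1-based position the decoder flipped back,
    or 0 if the syndrome indicated no error.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        # The syndrome points at the single flipped bit; flip it back.
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)

# One flipped bit: the decoder recovers the original data.
one_flip = list(stored)
one_flip[4] ^= 1
recovered, position = hamming74_decode(one_flip)

# Two flipped bits: this toy code "corrects" the wrong bit and
# silently returns bad data.
two_flips = list(stored)
two_flips[0] ^= 1
two_flips[1] ^= 1
miscorrected, _ = hamming74_decode(two_flips)
```

With a single flipped bit, `recovered` equals `word`. With two flipped bits, `miscorrected` does not. SECDED codes add one more parity bit so that double-bit errors are at least detected, but errors beyond that can still slip through.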

That said, it's all about the use case. Let's say a module goes bad (or it's fine and you live close to a star) and you start to corrupt data silently (we're not using ECC here). Does it affect your business if some data gets lost or mangled before the condition is detected? In most cases, it does, so it's worth spending a bit more money on hardware to mitigate the possibility in those cases.

In general, applications and their developers rely heavily on the reliability of the datapath. Is a less reliable stack going to waste a significant amount of administrator and developer time? That could end up being more expensive than just buying better hardware.

Some of this is mitigated if your infrastructure is clustered and replicated, since there are many storage and application systems out there that can checksum a dataset that spans multiple hardware nodes. One bad node doesn't necessarily spoil the bunch in these systems, so at some scales you can afford to reduce per-node redundancy and error checking. It doesn't sound like this is the situation, though.
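As a rough sketch of the cross-node checksumming idea, the Python below hashes each replica of a block and flags any node that disagrees with the majority. The node names and the `find_divergent_replicas` helper are hypothetical, and real replicated systems do this in their own ways; the point is only that a node with a flipped bit stands out once you compare against its peers.

```python
import hashlib
from collections import Counter


def find_divergent_replicas(replicas):
    """Return the sorted names of nodes whose data differs from the majority.

    replicas maps a (hypothetical) node name to the bytes that node holds
    for the same logical block. Raises ValueError when no strict majority
    exists, because then there is no way to tell which copy is good.
    """
    digests = {
        node: hashlib.sha256(data).hexdigest()
        for node, data in replicas.items()
    }
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no strict majority among replicas")
    return sorted(
        node for node, digest in digests.items() if digest != majority_digest
    )


good = b"customer ledger block 42"
flipped = bytearray(good)
flipped[3] ^= 0x01  # simulate a single bit flip on one node

suspects = find_divergent_replicas({
    "node-a": good,
    "node-b": bytes(flipped),
    "node-c": good,
})
```

Here `suspects` contains only `node-b`. With a single server there is nothing to compare against, which is why this kind of mitigation doesn't help in the setup described in the question.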
