A provider (a data center) recommended I go with 1TB SSDs in software RAID 1 over hardware RAID 10 with mechanical drives.
Their quote:
Typically SSDs are more reliable than RAID cards and since you have
fewer parts, there are fewer points of failure. There won't be much of a
CPU load since RAID1 is extremely simple storage.
How true is that, and is software RAID 1 really ideal for running virtual machines? They say it is.
Some more details:
I plan to run Xen/Xen-HVM/KVM. In other words, Linux will run as the host, and I want a setup where the guests can run anything from Windows to Linux and can compile their own kernels.
What I want to accomplish:
I want to recognize a drive failure quickly and have a replacement swapped in with little to no downtime or performance hit.
Best Answer
In RAID10, any one of your drives can fail and the array will survive, the same as RAID1. A four-drive RAID10 can also survive four of the six possible "two drives failed at once" combinations, but the main reason to use RAID10 with four drives instead of RAID1 with two is performance rather than extra reliability, and the SSDs will give you a greater performance jump.
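The "four of the six" figure can be checked by enumeration. This is only a sketch, and it assumes the common four-drive layout of two mirrored pairs striped together; the drive numbering and the names MIRROR_PAIRS and array_survives are made up for illustration.

```python
from itertools import combinations

# Assumed layout: drives 0-3, with (0, 1) and (2, 3) as the two mirrored pairs.
MIRROR_PAIRS = [(0, 1), (2, 3)]

def array_survives(failed):
    """The array survives while every mirror pair keeps at least one working drive."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in MIRROR_PAIRS)

# Every way two of the four drives can fail at once.
two_drive_failures = list(combinations(range(4), 2))
survivable = [f for f in two_drive_failures if array_survives(f)]

print(len(survivable), "of", len(two_drive_failures))  # prints: 4 of 6
```

The two fatal cases are exactly the ones where both halves of the same mirror fail together.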
Early SSDs had reliability issues, but most properly run tests I've seen suggest that those days are long gone. SSDs now tend to be no more likely to fail than spinning-metal drives: overall reliability has increased and wear-levelling tricks are getting very intelligent.
I'm assuming you are running the RAID array on the host. In that case, unless your VMs have a specific load pattern (one that would be a problem on direct physical hardware too), the difference between software RAID and hardware RAID will not depend on the use of VMs. If you are running RAID inside the VMs then you are likely doing something wrong (unless the VMs are for learning or testing RAID management, of course).
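With Linux software RAID on the host, a failed mirror member shows up in /proc/mdstat, which helps with the goal of spotting a drive failure quickly. Below is a rough sketch of detecting a degraded array from that text. The sample content and device names are invented for illustration, and degraded_arrays is a hypothetical helper, not part of mdadm.

```python
import re

# Invented sample in the general format of /proc/mdstat; device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      104320 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status map shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends with something like "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(3):
            degraded.append(current)
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))  # prints: ['md1']
```

On a real host the same function could be fed the contents of /proc/mdstat. For actual alerting, mdadm's own monitor mode is the more usual route; this sketch only shows what a degraded state looks like.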
The key advantages of hardware RAID are:
The key advantage of good software RAID (i.e. Linux's mdadm managed arrays) is:
SSDs over-provision space for two reasons. First, it leaves plenty of blocks free to be remapped if a block goes bad (traditional drives do this too). Second, it avoids the write performance hole (except under huge write-heavy loads) even where TRIM is not used, because the extra blocks can cycle through the wear-levelling pool along with all the others, and the controller can pre-wipe them ready for next use at its leisure. Consumer-grade drives only really under-allocate enough for remapping plus a small amount of performance protection, so it is useful to under-allocate manually (partitioning only 200GiB of a 240GB drive, for instance), which has a similar effect. See reports like this one for details on this (that report is released by a controller manufacturer but seems a general description of the matter rather than a sales pitch; you'll no doubt find manufacturer-neutral reports on the same subject if you look for them). Enterprise-grade drives tend to over-provision by much larger amounts, for both of the above reasons: reliability and performance.
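To put a number on the manual under-allocation example, here is a small calculation. It assumes the drive's labelled capacity is in decimal gigabytes and the partition size is in binary gibibytes, and it ignores whatever spare area the drive already reserves internally; spare_fraction is a name made up for this sketch.

```python
GB = 10**9    # decimal gigabyte, as typically used on drive labels
GIB = 2**30   # binary gibibyte, as used by most partitioning tools

def spare_fraction(drive_bytes, partitioned_bytes):
    """Fraction of the advertised capacity left unpartitioned."""
    return (drive_bytes - partitioned_bytes) / drive_bytes

drive = 240 * GB        # advertised capacity of the example drive
partition = 200 * GIB   # space actually partitioned

print(f"{spare_fraction(drive, partition):.1%} left unpartitioned")  # prints: 10.5% left unpartitioned
```

So partitioning 200GiB of a 240GB drive leaves roughly a tenth of the advertised capacity free for the controller to use, on top of its built-in reserve.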