Linux – Hot spare host vs cold spare host

failover, hardware, linux, redundancy

We have several hosts for which we keep an identical hot spare host, patched and updated so it stays very close to having the same software and config. In case of failure, the network cable is switched and the DHCP server is updated with the new MAC address. That is the best case; usually there is a bit more that needs modification.
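
To give an idea of the DHCP part: it is essentially just repointing the host reservation at the spare's NIC. With ISC dhcpd that would be something like the following (host name, MAC and address are made-up examples):

    host appserver01 {
        # swap in the spare host's MAC address when failing over
        hardware ethernet 00:16:3e:aa:bb:cc;
        fixed-address 192.0.2.10;
    }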

I feel it is a waste of electricity to have a hot spare host and a waste of time to maintain it, and since config modifications are needed anyway in case of failover, I'd like to ask the following:

Are hot spare hosts old school and there are better ways now?

Instead of having a hot spare host, would it make sense to make it a cold spare, take the hard drives and put them in the primary host and change the RAID from 1 to 1+1. In case of failure all I would have to do is change network cables, update the DHCP server, take the hard drives and insert them in the cold spare and power on. The benefit, as I see it, is that the 2×2 disks are always in sync, so only one host to maintain and no config changes are needed when failing over.
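
To make the idea concrete, the nested mirror I have in mind would look roughly like this with Linux md (device and array names are only placeholders):

    # two inner RAID 1 mirrors, one per pair of disks
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
    # outer RAID 1 over the two inner mirrors, i.e. RAID 1+1
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md0 /dev/md1

On failover, the cold spare would assemble one inner mirror from the pulled disks and run the outer array degraded on top of it.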

Is that a good idea?

Best Answer

Sobrique explains how the manual intervention makes your proposed solution sub-optimal, and ewwhite talks about the probability of failure of various components. Both of those IMO make very good points and should be strongly considered.

There is however one issue that nobody seems to have commented on at all so far, which surprises me a little. You propose to:

make [the current hot spare host] a cold spare, take the hard drives and put them in the primary host and change the RAID from 1 to 1+1.

This doesn't protect you against anything the OS does on disk.

It only really protects you against disk failure, and by moving from mirrors (RAID 1) to mirrors of mirrors (RAID 1+1) you have already greatly reduced the impact of that. You could get the same result by increasing the number of disks in each mirror set (going from a 2-disk RAID 1 to a 4-disk RAID 1, for example), which would quite likely also improve read performance during ordinary operations.
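
As a rough sketch, growing an existing Linux md RAID 1 from two disks to four would look something like this, assuming the array is /dev/md0 and the new disks are /dev/sdc1 and /dev/sdd1:

    # add the two new disks to the existing mirror as spares
    mdadm /dev/md0 --add /dev/sdc1 /dev/sdd1
    # grow the mirror so all four disks carry a full copy of the data
    mdadm --grow /dev/md0 --raid-devices=4

Every disk then holds the complete data set, so any single surviving disk can keep the array going.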

Well then, let's look at some ways this could fail.

  • Let's say you are installing system updates, and something causes the process to fail half-way; maybe there's a power and UPS failure, or maybe you have a freak accident and hit a crippling kernel bug (Linux is pretty reliable these days, but there's still the risk).
  • Maybe an update introduces a problem that you didn't catch during testing (you do test system updates, right?), requiring a failover to the secondary system while you fix the primary.
  • Maybe a bug in the file system code causes spurious, invalid writes to disk.
  • Maybe a fat-fingered (or even malicious) administrator does rm -rf ../* or rm -rf /* instead of rm -rf ./*.
  • Maybe a bug in your own software causes it to massively corrupt the database contents.
  • Maybe a virus manages to sneak in.

Maybe, maybe, maybe... (and I'm sure there are plenty more ways your proposed approach could fail.) However, in the end this boils down to your "the two sets are always in sync" "advantage". Sometimes you don't want them to be perfectly in sync.

Depending on what exactly has happened, that's when you want either a hot or cold standby ready to be switched on and over to, or proper backups. Either way, RAID mirrors of mirrors (or plain RAID mirrors) don't help you if the failure mode involves much of anything other than a hardware storage device failure (a disk crash). Something like ZFS' raidzN can likely do a little better in some regards, but no better at all in others.

To me, this would make your proposed approach a no-go from the beginning if the intent is any sort of disaster failover.