How to configure UPS to restart servers in the right sequence

ups

We have several servers, and almost every one of them has a dedicated UPS. There are dependencies between them, so they must be switched on in the correct sequence. Lately we have been having serious problems with the power supply, so the servers are shut down and then restarted in a random order when power is restored. It is not a problem if the servers are switched off during a blackout, but it is important that they work correctly without any human intervention once power is restored.

Our UPSes are quite cheap, and the only configuration parameter useful for my goal is "power the load xx seconds after power is restored". In theory, by setting the right delay on each UPS I can fix the order in which the servers restart, but I don't trust the UPSes to behave as expected.
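
To make the delay idea concrete, here is a minimal sketch of how the per-UPS delays could be derived from the dependencies. The server names, boot times, and the 30-second safety margin are made-up assumptions for illustration, not measurements from our setup, and it needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def restart_delays(depends_on, boot_seconds, margin=30):
    """Return the "power the load after N seconds" value for each server's UPS.

    depends_on   maps a server to the servers that must be up before it.
    boot_seconds maps a server to its estimated boot time (an assumption).
    margin       is extra slack in seconds added after each dependency.
    """
    delays = {}
    # static_order() yields every server after all of its dependencies.
    for server in TopologicalSorter(depends_on).static_order():
        deps = depends_on.get(server, ())
        # Power on only once the slowest dependency has had time to boot.
        delays[server] = max(
            (delays[d] + boot_seconds[d] + margin for d in deps),
            default=0,
        )
    return delays


# Hypothetical example: a domain controller, a database, an application server.
example = restart_delays(
    {"dc": [], "db": ["dc"], "app": ["db", "dc"]},
    {"dc": 120, "db": 90, "app": 60},
)
print(example)
```

With these example numbers the domain controller gets a delay of 0 s, the database 150 s, and the application server 270 s. This only works if each UPS actually honours its configured delay, which is exactly what I am unsure about.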

Is this the right way to go?
Do high-end UPSes offer other options to fix the restart sequence?
One final note: my UPSes are in the range of 1000 – 2200 VA.

Best Answer

The standard answer for this is "not at all". Fix the software so that it handles restarts in random order. If you really need SOME servers to start first (for example, Active Directory), put them on UPSes that will survive a LOT longer. A low-power Atom-based server is good enough as an Active Directory controller and will survive a day on a small UPS.

Do high-end UPSes offer other options to fix the restart sequence?

No. I would say it is generally assumed programmers are competent enough to work around the issue properly.

What you COULD do is:

  • Have servers start "randomly". Except for DHCP / Active Directory, there is nothing that really demands an order and cannot be fixed.
  • Have a control server wait some time (say 5 minutes) and then start the services on the various machines in the correct order.
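
A minimal sketch of the control-server idea, assuming Python 3.9+ on the control machine. The function names and the way "up" is detected (a TCP port accepting connections) are my own illustrative choices, not something your UPS or any particular tool provides.

```python
import socket
import time
from graphlib import TopologicalSorter


def wait_for_port(host, port, timeout=600.0, interval=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Machine is still booting or the service is not listening yet.
            time.sleep(interval)
    return False


def start_in_order(depends_on, is_up, start):
    """Start services so that every dependency is started first.

    depends_on maps a service to the services it needs.
    is_up(name) should block until the machine is reachable (or give up).
    start(name) should start the service, e.g. over SSH.
    """
    started = []
    for name in TopologicalSorter(depends_on).static_order():
        if not is_up(name):
            raise RuntimeError(f"{name} is not reachable; stopping here")
        start(name)
        started.append(name)
    return started
```

In practice `is_up` could call `wait_for_port` against each machine's SSH port, and `start` could run whatever command starts the service on that machine. The machines themselves can then power up in any order; only the services are sequenced.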

I would say that this type of setup is a lot more common. I would call any software that REQUIRES servers to start in a particular order (outside of pure infrastructure) broken and unfit for business.

Just as a note: our own setup is a low-cost 20 kVA UPS (low-cost because we got one used) for the servers, with a slaved 2000 VA UPS for a machine serving as the "root" of the network (and as the backup machine). Slaved means that the small UPS sits behind the big one, so it only switches to battery when the large one (which lasts between half an hour and 8 hours, depending on how much of our computing grid is online) is going into terminal shutdown.