Cluster Management – How to Shut Down Nodes During Low Load

cluster

I'm developing software for the energy consulting business. While monitoring energy use in datacenters, I've noticed that the typical electric load "pattern" of a datacenter is just a flat line, because all the gear runs 24/7. If you compare this to the actual usage pattern (network load, CPU usage, etc.), as we did, you regularly see long stretches with little usage while the full capacity stays available.

These patterns are very predictable in many cases. To save energy, it would be great to turn off part of the equipment (servers, switches, storage) on a regular schedule or under low-load conditions. However, I can think of several aspects that would have to be looked at, including:

  • handling peak loads or sudden spikes
  • data consistency among nodes
  • long startup (and, possibly, synchronization) times compared to average uptime of a node

There's probably more. Is there software that handles such a scenario, and what else should I look out for? Is this a viable suggestion to make?

For my purposes, a cluster doesn't necessarily mean machines clustered at the OS level. Identical hosts that receive requests via a load balancer (i.e. application-level clustering) would also count. I'm not sure how MySQL Cluster or similar products work, but I'd probably count those as well.

I'm looking for advice for any operating system.

See also my post on energy efficiency over at Stack Overflow that brought up this question.

Best Answer

Power

Use switched PDUs so that you can turn servers and switches on and off out-of-band. This is OS- and device-independent, which greatly simplifies the configuration and logic that powers things on and off. If your servers all have network-enabled IPMI interfaces, you can use those instead. I would recommend against trying to turn things on and off using higher-level mechanisms like Wake-on-LAN.
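As a rough illustration of the IPMI route, here is a minimal Python sketch that wraps the `ipmitool` CLI. It assumes `ipmitool` is installed and that the BMCs accept the `lanplus` interface; the function names, host address, user name, and password file path are all hypothetical, so adapt them to your environment.

```python
import subprocess

# Chassis power actions this sketch allows; "soft" asks the OS to shut
# down via ACPI, "off" cuts power immediately.
ALLOWED_ACTIONS = {"on", "off", "soft", "status"}


def ipmi_power_command(host, user, password_file, action):
    """Build the ipmitool argument list for a chassis power action.

    All parameter values are supplied by the caller; nothing here is
    specific to any particular vendor's BMC.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host,
        "-U", user,
        "-f", password_file,  # read the password from a file, not argv
        "chassis", "power", action,
    ]


def run_power_action(host, user, password_file, action, timeout=30):
    """Run the command and return (exit code, trimmed stdout)."""
    cmd = ipmi_power_command(host, user, password_file, action)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout.strip()
```

Preferring `soft` over `off` for shutdowns gives the OS a chance to flush data, at the cost of having to confirm afterwards (via `status`) that the node really powered down.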

Power up/down Logic

This could take many forms. Some clustering software (such as Moab) has a solution for this built in. Otherwise, you can write a script along the lines of the following pseudocode:

  1. Check overall cluster load
  2. If cluster load > threshold1, turn on some nodes
  3. If cluster load < threshold2, turn off some nodes

Put that in cron and have it run every half hour.
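The steps above can be sketched in Python roughly as follows. This is only an illustration of the threshold logic: the threshold values, node limits, step size, and the function name `decide_action` are assumptions, and how you measure "cluster load" is left to the caller. Keeping the two thresholds apart gives some hysteresis, so the cluster doesn't flap between powering nodes on and off.

```python
def decide_action(load, nodes_on, high=0.75, low=0.30,
                  min_nodes=2, max_nodes=16, step=1):
    """Return how many nodes to add (positive) or remove (negative).

    load      -- average utilisation of powered-on nodes, 0.0 to 1.0
    nodes_on  -- number of nodes currently powered on
    high/low  -- threshold1 and threshold2 from the pseudocode
                 (the default values are assumed, not recommended)
    """
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    if load > high and nodes_on < max_nodes:
        # Never exceed the number of nodes that physically exist.
        return min(step, max_nodes - nodes_on)
    if load < low and nodes_on > min_nodes:
        # Always keep a minimum number of nodes up for sudden spikes.
        return -min(step, nodes_on - min_nodes)
    return 0  # load is between the thresholds: leave things alone
```

A cron job would call this with the current load and then power the chosen number of nodes on or off. Note that a half-hour interval reacts slowly to spikes, so `min_nodes` should be sized to absorb whatever load can arrive between two runs.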

Clustering Software Stack

Obviously, you'll need to make sure your clustering software stack can deal with these devices going up and down all the time. Do a lot of testing here. Consider obscure timing issues (booting takes time) and any race conditions that may crop up in the power up/down logic you use.
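One way to handle the boot-time issue is to treat a node as available only after it passes a readiness check, and to give up after a deadline. The sketch below shows that idea in Python; the function name, the default timeout and polling interval, and the `is_ready` probe (for example a health check against your load balancer) are assumptions to be replaced with whatever fits your stack.

```python
import time


def wait_until_ready(is_ready, timeout=600.0, interval=10.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or the timeout expires.

    is_ready -- caller-supplied callable returning True once the node
                can take work (hypothetical; e.g. a health-check probe)
    clock/sleep are injectable so the logic can be tested without
    actually waiting.
    Returns True if the node became ready, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # caller should alert or power-cycle the node
        sleep(interval)
```

The power logic should not count a node toward cluster capacity until this returns `True`. Otherwise the next scheduled run may see the load still high and power on even more nodes than needed.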
