A plan to mitigate PDU failures

electrical-power infrastructure power-distribution-unit power-supply-unit

A client just experienced a complete failure of an APC AP7911A switched/metered rack power distribution unit (PDU). This took all of the connected equipment down with it. The equipment is fine, as are the upstream UPS units.

In situations where it's not possible to balance devices across multiple power feeds/PDUs/UPS units (e.g. switches with single power supplies, lack of high-line power feeds, etc.), how do you mitigate failures like this? This was a single rack installation in a less-than-ideal computer room, but typical for most small/medium businesses. Should one plan for individual PDU failure, or is it just something that gets dealt with when it happens?

Best Answer

Multiple PSUs in servers are OK, but they are not a magic bullet. Often, when a power component fails, it takes out other things around it, e.g. the backplane that both of your redundant PSUs connect to. You are far more likely to keep running if you have two servers on separate UPSs.
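To put rough numbers on that comparison, here is a minimal sketch of the usual series/parallel availability arithmetic. Every availability figure in it is a made-up placeholder, not a measurement of any real PSU, backplane, PDU, or UPS, and the helper names `series` and `parallel` are just for illustration. It also assumes failures are independent and that the service can run on either server, which is exactly what a shared backplane or shared PDU breaks.

```python
def series(*availabilities):
    """Availability of a chain where every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability when any one of several independent paths is enough."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


# Hypothetical availabilities, chosen only to illustrate the shape of the result.
PSU = 0.999
BACKPLANE = 0.9995
PDU = 0.9995
UPS = 0.999

# One server: dual PSUs, but a single shared backplane, PDU, and UPS.
single_server = series(parallel(PSU, PSU), BACKPLANE, PDU, UPS)

# Two servers: one PSU each, each on its own PDU and its own UPS.
one_chain = series(PSU, BACKPLANE, PDU, UPS)
two_servers = parallel(one_chain, one_chain)

print(f"one server, redundant PSUs : {single_server:.6f}")
print(f"two servers, separate feeds: {two_servers:.6f}")
```

With these placeholder figures the redundant PSUs barely help, because the shared backplane, PDU, and UPS still sit in series behind them. Duplicating the whole chain removes those single points of failure.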

Best of all is to build redundancy into your application or platform layer, so that machines or racks can go down without causing a problem. When you don't have the budget for that, you can still reduce the risk by keeping spares of any non-redundant equipment ready to swap in, and by keeping things simple. A fancy managed PDU is far more likely to go down than a dumb power bar.

It is also worth bearing in mind that many small businesses simply can't do things the proper way, or choose to do things the cheapest way and live with the consequences if they happen. I've seen inexperienced admins go out of their way to avoid an approach that has been slated here or on similar sites, only to put something worse in place. A less-than-ideal solution is often better than nothing.