Datacenter – Assessing Equipment Damage Following a Lightning Strike

datacenter · disaster · electrical-power · networking

One of my client's sites received a direct lightning hit last week (coincidentally on Friday the 13th!).

I was remote to the site, but working with someone onsite, I discovered a strange pattern of damage. Both internet links were down and most servers were inaccessible. Much of the damage occurred in the MDF, but one fiber-connected IDF also lost 90% of the ports on a switch stack member. Enough spare switch ports were available to redistribute cabling elsewhere and reprogram, but there was downtime while we chased down affected devices.

This was a new building/warehousing facility, and a lot of planning went into the design of the server room. The main server room runs off an APC SmartUPS RT 8000VA double-conversion online UPS, backed by a generator. There was proper power distribution to all connected equipment. Offsite data replication and systems backups were in place.

In all, the damage (that I'm aware of) was:

  • Failed 48-port line card on a Cisco 4507R-E chassis switch.
  • Failed Cisco 2960 switch in a 4-member stack. (oops… loose stacking cable)
  • Several flaky ports on a Cisco 2960 switch.
  • HP ProLiant DL360 G7 motherboard and power supply.
  • Elfiq WAN link balancer.
  • One Multitech fax modem.
  • WiMax/Fixed-wireless internet antenna and power-injector.
  • Numerous PoE-connected devices (VoIP phones, Cisco Aironet access points, IP security cameras).

Most of the issues were tied to losing an entire switch blade in the Cisco 4507R-E. That blade carried some of the VMware NFS storage networking and the uplink to the site's firewall. A VMware host failed, but HA took care of the VMs once storage networking connectivity was restored. I was forced to reboot/power cycle a number of devices to clear funky power states. So the time to recovery was short, but I'm curious about what lessons should be learned…

  • What additional protections should be implemented to protect equipment in the future?
  • How should I approach warranty and replacement? Cisco and HP are replacing items under contract. The expensive Elfiq WAN link balancer has a blurb on its website that basically says "too bad, use a network surge protector". (It seems they expect this type of failure.)
  • I've been in IT long enough to have encountered electrical storm damage in the past, but with very limited impact: e.g., a cheap PC's network interface or a destroyed mini switch.
  • Is there anything else I can do to detect potentially flaky equipment, or do I simply have to wait for odd behavior to surface?
  • Was this all just bad luck, or something that should really be accounted for in disaster recovery?

With enough $$$, it's possible to build all sorts of redundancies into an environment, but what's a reasonable balance of preventative/thoughtful design and effective use of resources here?

Best Answer

A couple of jobs ago, one of the datacenters at a place I worked was one floor below a very large aerial. This large, thin, metal item was the tallest thing in the area and was hit by lightning every 18 months or so. The datacenter itself was built around 1980, so I wouldn't call it the most modern thing around, but they had long experience dealing with lightning damage (the serial-comms boards had to be replaced every time, which is a trial when the comms boards are in a system that hasn't had any new parts made in 10 years).

One thing the old hands brought up is that all that spurious current can find a way around almost anything, and once it bridges in it can spread through a common ground. It can even bridge in across air gaps. Lightning is an exceptional case: normal safety standards aren't good enough to prevent arcs, and the current will go as far as its energy carries it, and it has a lot. If there is enough energy, it can arc from a suspended-ceiling grid (perhaps one of the suspension wires hangs from a loop connected to a building girder in the cement) to the top of a 2-post rack and from there into the networking goodies.

As with hackers, there is only so much you can do. Your power feeds all have breakers and surge protection on them that clamp spurious voltages, but your low-voltage networking gear almost never does, and it represents a common path for an extremely energetic current to take.


Detecting potentially flaky kit is something I know how to do in theory, but not in practice. Probably your best bet is to put the suspect gear in one area, deliberately bring the room temperature up to the high end of the equipment's operating range, and see what happens. Run some tests and load the heck out of it. Leave it there for a couple of days. The added thermal stress on top of any pre-existing electrical damage may weed out some time bombs.
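
If you want the burn-in to be repeatable, a script that pins every core and logs temperatures is enough to keep the stress applied and leave a record. A minimal sketch, assuming a Linux host with psutil installed; the 48-hour duration and 85°C warning threshold are arbitrary placeholders, not recommendations:

```python
# burn_in.py - keep all CPU cores busy for a fixed period while logging
# package temperatures, so a marginal machine has a chance to fail here
# rather than back in production. Assumes Linux + psutil; the 48-hour
# duration and 85 C warning threshold are arbitrary examples.
import multiprocessing
import time

import psutil

DURATION_HOURS = 48        # assumption: tune to your own burn-in window
WARN_TEMP_C = 85.0         # assumption: check your hardware's rated limit
LOG_INTERVAL_SEC = 60


def busy_loop(stop_time: float) -> None:
    """Spin on floating-point math until stop_time to load one core."""
    x = 0.0001
    while time.time() < stop_time:
        x = (x * 1.0000001) % 1000.0


def log_temperatures() -> None:
    """Print whatever temperature sensors psutil can see on this host."""
    readings = psutil.sensors_temperatures()
    for chip, entries in readings.items():
        for entry in entries:
            flag = " WARN" if entry.current >= WARN_TEMP_C else ""
            print(f"{time.ctime()} {chip}/{entry.label or 'temp'}: "
                  f"{entry.current:.1f}C{flag}")


if __name__ == "__main__":
    stop_time = time.time() + DURATION_HOURS * 3600
    workers = [multiprocessing.Process(target=busy_loop, args=(stop_time,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    while time.time() < stop_time:
        log_temperatures()
        time.sleep(LOG_INTERVAL_SEC)
    for w in workers:
        w.join()
```

The point isn't the specific load, just that the machine runs hot and busy long enough for marginal parts to fail while it's still in quarantine.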

The strike definitely shortened the lifespan of some of your devices, but finding out which ones is hard. The power-conditioning circuitry inside power supplies may have compromised components and be delivering dirty power to the server, something you could only detect with specialized devices designed to test power supplies.
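
For the networking side, rather than waiting for odd behaviour to surface, you can poll the error counters on the suspect switches and flag anything that keeps climbing. A rough sketch, assuming the Netmiko library and IOS-based switches; the host, credentials, and polling interval are placeholders, and the parsing is deliberately loose:

```python
# poll_errors.py - periodically pull per-port error counters from a
# Catalyst switch and print any port whose counters are non-zero, as a
# cheap way to spot ports going flaky after the strike.
# Assumes the netmiko package and an IOS-based switch; the host,
# credentials, and polling interval below are placeholders.
import re
import time

from netmiko import ConnectHandler

SWITCH = {
    "device_type": "cisco_ios",
    "host": "10.0.0.10",       # placeholder management address
    "username": "admin",       # placeholder credentials
    "password": "changeme",
}
POLL_INTERVAL_SEC = 300


def nonzero_error_ports(output: str) -> list[str]:
    """Return lines from 'show interfaces counters errors' that have any
    non-zero counter column (very loose parse of the tabular output)."""
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        # Expect a port name followed by numeric counter columns.
        if len(fields) > 1 and re.match(r"^(Gi|Fa|Te|Po)", fields[0]):
            counters = [f for f in fields[1:] if f.isdigit()]
            if any(int(c) > 0 for c in counters):
                flagged.append(line)
    return flagged


if __name__ == "__main__":
    while True:
        conn = ConnectHandler(**SWITCH)
        output = conn.send_command("show interfaces counters errors")
        conn.disconnect()
        for line in nonzero_error_ports(output):
            print(f"{time.ctime()} {line}")
        time.sleep(POLL_INTERVAL_SEC)
```

Whether you script it or just re-run `show interfaces counters errors` by hand, the useful part is the baseline: counters that keep incrementing on an otherwise quiet port are a good reason to swap that hardware before it surprises you.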


Lightning strikes aren't something I've considered for DR, beyond having a DC in a facility with a giant lightning rod on the roof. Generically, a strike is one of those things that happens so infrequently that it gets shuffled under 'act of god' and moved along.

But... you've had one now. It shows your facility had the right conditions at least once. It's time to get an assessment of how prone your facility is, given the right conditions, and to plan accordingly. If you're only now thinking about the DR impact of lightning, I think that's appropriate.
