Probability of Blade Chassis Failure: What to Know

blade-chassis, hardware, networking, redundancy, storage

In my organisation we are considering buying blade servers instead of rack servers. Of course, the technology vendors make them sound very attractive. A concern I read very often in different forums is that there is a theoretical possibility of the chassis going down, which would in consequence take all the blades down with it, because of the shared infrastructure.

My reaction to this risk would be to add redundancy and buy two chassis instead of one (very costly, of course).

Some people (including, for example, HP resellers) try to convince us that the chassis is very unlikely to fail because of its many redundancies (redundant power supplies, etc.).

Another concern on my side is that if something does fail, spare parts will be required, which is difficult to arrange in our location (Ethiopia).

So I would like to ask experienced administrators who have managed blade servers: What is your experience? Do they go down as a whole, and what is the sensitive shared infrastructure that might fail?

The question could be extended to shared storage. Again, I would say that we need two storage units instead of only one, and again the vendors say that these things are so rock solid that no failure is to be expected.

Well, I can hardly believe that such critical infrastructure can be truly reliable without redundancy, but maybe you can tell me whether you have run successful blade-based projects that work without redundancy in their core parts (chassis, storage, …).

At the moment we are looking at HP, as IBM looks far too expensive.

Best Answer

There's a low probability of complete chassis failure...

You'll likely encounter issues in your facility before sustaining a full failure of a blade enclosure.
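To put rough numbers on that, here's a back-of-envelope availability sketch in Python. The MTBF and repair-time figures are invented for illustration (they are not vendor data), but they show why the repair window, i.e. how long spare parts take to reach a remote site, matters at least as much as the raw failure rate when deciding between one chassis and two:

```python
# Back-of-envelope availability sketch (illustrative numbers only, not vendor MTBF data).
# Steady-state availability of one unit: A = MTBF / (MTBF + MTTR).
# With two independent chassis, service is down only if both are down: 1 - (1 - A)**2.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single unit is up, assuming independent failures."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

HOURS_PER_YEAR = 24 * 365

# Assumed numbers: a chassis that fails outright once every ~20 years,
# and a repair time dominated by shipping spares to a remote site (2 weeks).
mtbf = 20 * HOURS_PER_YEAR
mttr = 14 * 24

a_single = availability(mtbf, mttr)
a_dual = 1 - (1 - a_single) ** 2   # either chassis can carry the load on its own

for label, a in [("single chassis", a_single), ("two chassis", a_dual)]:
    downtime_hours = (1 - a) * HOURS_PER_YEAR
    print(f"{label}: availability {a:.6f}, ~{downtime_hours:.1f} h expected downtime/year")
```

With those assumed figures a single chassis loses roughly 17 hours a year to the long spare-parts window, while a redundant pair loses only minutes; shorten the repair time and the gap shrinks dramatically.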

My experience is primarily with HP C7000 and HP C3000 blade enclosures. I've also managed Dell and Supermicro blade solutions. Vendor matters a bit. But in summary, the HP gear has been stellar, Dell has been fine, and Supermicro was lacking in quality and resiliency and was just poorly designed. The Supermicro platform had serious outages that forced us to abandon it. On the HPs and Dells, I've never encountered a full chassis failure, but I have seen plenty of lesser incidents:

  • I've had thermal events. The air-conditioning failed at a co-location facility, sending temperatures to 115°F/46°C for 10 hours.
  • Power surges and line failures: Losing one side of an A/B feed. Individual power supply failures. There are usually six power supplies in my blade setups, so there's ample warning and redundancy.
  • Individual blade server failures. One server's issues do not affect the others in the enclosure.
  • An in-chassis fire...

I've seen a variety of environments and have had the benefit of installing in ideal data center conditions, as well as some rougher locations. On the HP C7000 and C3000 side, the main thing to consider is that the chassis is entirely modular. The components are designed to minimize the chance of a single component failure affecting the entire unit.

Think of it like this... The main C7000 chassis is composed of front, (passive) midplane and backplane assemblies. The structural enclosure simply holds the front and rear components together and supports the system's weight. Nearly every part can be replaced... believe me, I've disassembled many. The main redundancies are in fans/cooling, power, networking and management. The management processors (HP's Onboard Administrator) can be paired for redundancy; however, the servers can run without them.


Fully-populated enclosure - front view. The six power supplies at the bottom run the full depth of the chassis and connect to a modular power backplane assembly at the rear of the enclosure. Power supply modes are configurable: e.g. 3+3 or N+1. So the enclosure definitely has power redundancy.
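To make that power-redundancy claim concrete, here is a small probability sketch. The per-supply failure probability is made up for illustration (it is not an HP specification), and failures are assumed independent, but it shows why six supplies with N+1 or 3+3 sparing almost never take the enclosure down:

```python
# Rough sketch of why six supplies with N+1 or 3+3 redundancy rarely cause an outage.
# Assumes independent failures with a small per-supply failure probability within the
# window before a failed supply is replaced; numbers are illustrative, not HP data.
from math import comb

def prob_outage(n_supplies: int, n_tolerated: int, p_fail: float) -> float:
    """Probability that more than n_tolerated supplies fail in the same window."""
    return sum(
        comb(n_supplies, k) * p_fail**k * (1 - p_fail)**(n_supplies - k)
        for k in range(n_tolerated + 1, n_supplies + 1)
    )

p = 0.01  # assumed chance a given supply dies before a failed one is swapped out
print("no redundancy (any failure hurts):", prob_outage(6, 0, p))
print("N+1 (one spare):                  ", prob_outage(6, 1, p))
print("3+3 (half can be lost):           ", prob_outage(6, 3, p))
```

Under those assumptions the chance of losing power drops from a few percent with no sparing to roughly one in ten million with 3+3, which is why a failed supply is a warning light rather than an outage.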

Fully-populated enclosure - rear view. The Virtual Connect networking modules in the rear have an internal cross-connect, so I can lose one side or the other and still maintain network connectivity to the servers. There are six hot-swappable power supplies and ten hot-swappable fans.
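That cross-connected design only protects you if the blades' NICs are actually teamed across both modules. As a quick sanity check on Linux hosts that use the kernel bonding driver (an assumption about the OS setup, not something the enclosure does for you), you can parse /proc/net/bonding to confirm both uplinks are still healthy:

```python
# Minimal check that a Linux bonding interface still has all its slave links up,
# so losing one Virtual Connect module would not drop connectivity.
# Assumes the host uses the kernel bonding driver (e.g. bond0); adjust names as needed.
from pathlib import Path
import sys

def slave_status(bond: str = "bond0") -> dict:
    """Return {slave interface: MII status} parsed from /proc/net/bonding/<bond>."""
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    status, current = {}, None
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current:
            status[current] = line.split(":", 1)[1].strip()
            current = None
    return status

if __name__ == "__main__":
    slaves = slave_status()
    print(slaves)
    # Warn if redundancy is already degraded (only one path left).
    if list(slaves.values()).count("up") < 2:
        sys.exit("WARNING: fewer than two active uplinks - no failover headroom")
```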

Empty enclosure - front view. Note that there's really nothing to this part of the enclosure. All connections are passed through to the modular midplane.

Midplane assembly removed. Note the six power feeds for the midplane assembly at the bottom.

Midplane assembly. This is where the magic happens. Note the 16 separate downplane connections: one for each of the blade servers. I've had individual server sockets/bays fail without killing the entire enclosure or affecting the other servers.

Power supply backplane(s). Three-phase (3φ) unit below the standard single-phase module. I changed power distribution at my data center and simply swapped the power supply backplane to deal with the new method of power delivery.

Chassis connector damage. This particular enclosure was dropped during assembly, breaking the pins off a ribbon connector. This went unnoticed for days, resulting in the running blade chassis catching FIRE...

Here are the charred remains of the midplane ribbon cable. This controlled some of the chassis temperature and environment monitoring. The blade servers within continued to run without incident. The affected parts were replaced at my leisure during scheduled downtime, and all was well.
