Users can't get to their e-mail, the CEO can't get to the company's home page, and your pager just went off with a "911" code. What do you do when everything blows up?
Disaster Recovery – Checklist for When Everything Blows Up
Related Solutions
It's a good thing that you're thinking about what questions to ask your hosting company, but I think you're approaching it backwards. First figure out your requirements, and then ask each company how their infrastructure will meet them.
When they're explaining how their infrastructure meets your needs, don't be afraid to ask questions. If you aren't satisfied with the answers you're getting, insist on a good explanation from someone more senior -- you are giving the hosting company good money, and if their sales guy can't explain things to your satisfaction, ask for a network engineer or someone from their Datacenter Operations team.
In addition to what everyone else has mentioned, some other things to consider (geared toward colocation - hosting your hardware at someone else's facility):
General
- Is the facility clean?
- Is the house cabling neat and orderly?
- If they use cable trays, look up -- Things should be neat, and strapped down with velcro ties.
(NO plastic zip ties, NO tape)
- If they route all cabling through the floor, ask to look in the underfloor.
(They may say no. If they say yes, stick your head down there and look around. Again, all cabling should be neat and bundled with velcro ties. Suspended cable trays are important here to allow airflow. See cooling.)
Network
- What providers do they have for their uplinks?
- Where (physically) are their network uplinks?
- Is their network core redundant?
- Do they provide redundant access drops to your rack?
Power
- What kind of UPS systems does the facility have?
- How long can they hold the load?
- What kind of generators does the facility have?
- How often are the generators tested?
- How much fuel is on site?
- What are the fueling provisions in the event of an emergency?
- How much power can you draw in your rack? (Circuit capacity, cooling capacity.)
- Do they provide diverse redundant power to your rack (circuits from separate UPS banks)?
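To make the rack power questions concrete, here is a rough sketch of the arithmetic. It assumes single-phase circuits and the 80% continuous-load derating that is commonly applied to breakers in North America; the voltage, breaker size, and load figures are made-up examples, so ask the facility for their real numbers.

```python
def usable_watts(volts: float, breaker_amps: float, derate: float = 0.8) -> float:
    """Continuous power available on one single-phase circuit.

    derate=0.8 is an assumption (the common 80% continuous-load rule);
    the facility may apply a different limit.
    """
    return volts * breaker_amps * derate


def fits_on_one_feed(load_watts: float, volts: float, breaker_amps: float,
                     derate: float = 0.8) -> bool:
    """With redundant A/B feeds, the whole load must fit on ONE feed,
    otherwise losing a feed overloads the survivor."""
    return load_watts <= usable_watts(volts, breaker_amps, derate)


if __name__ == "__main__":
    # Hypothetical rack: two 208 V / 30 A feeds, 4 kW of equipment.
    print(round(usable_watts(208, 30)))       # usable watts per feed
    print(fits_on_one_feed(4000, 208, 30))    # does the load survive a feed loss?
```

The point of the second function is that "redundant power" only helps if you size your load against a single circuit, not the sum of both.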
Cooling
- Is the cooling truly N+1 Redundant?
(If they lose an air conditioner will the room stay at temperature?)
- Are they using a Hot-Aisle/Cold-Aisle layout? (They should be. If not, worry.)
- Bonus points if they have containment to keep the hot (or cold) air where it belongs.
- Is there adequate pressure in the floor?
(Assuming they use a traditional down-flow cooling system that blows cold air into the floor, stand at a perforated tile near an air conditioner, then at one as far from the AC unit as you can get -- The breeze should be relatively even.)
- Does the room feel hot? (Obviously standing in hot aisles it will, but how is it near the door? In cold aisles?)
- Does the room feel wet?
- Do they take advantage of "free" cooling?
(air-side economizers, heat wheels, etc?)
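The "truly N+1" question boils down to one check: with the largest cooling unit out of service, can the remaining units still carry the room's heat load? A minimal sketch of that check follows; the unit capacities and heat load are hypothetical figures, and real cooling capacity also depends on airflow and placement, which this ignores.

```python
def survives_single_failure(unit_capacities_kw: list[float],
                            heat_load_kw: float) -> bool:
    """True if the room stays within cooling capacity after losing
    its single largest unit (the worst-case single failure)."""
    if len(unit_capacities_kw) < 2:
        return False  # one unit (or none) cannot be N+1
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= heat_load_kw


if __name__ == "__main__":
    # Hypothetical room: three 30 kW units.
    print(survives_single_failure([30, 30, 30], 55))  # True: 60 kW left
    print(survives_single_failure([30, 30, 30], 70))  # False: 60 kW left
```

If the facility can tell you the unit capacities and the current heat load, you can run this check yourself instead of taking "N+1" on faith.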
Security and Access
- Do you have 24x7x365 access to the facility? (You should!)
- How is that access controlled (Thumbprint? Keycard? See the man at the desk?)
Monitoring
- Do they offer monitoring? (and do you want it?)
- Ask to see their facility monitoring system
(they might say no, but if it's a really slick system they might want to show off)
Managed Services (If you want them)
- What services are included? Typical selections include:
- Basic monitoring (ping)
- Advanced monitoring (SNMP, services, etc.)
- Remote Hands (you call and we type what you tell us to)
- How many hours of consulting/troubleshooting service are included in your base price?
- Is patching service included?
- If yes, how do they handle patching (software used, scheduling, etc.)?
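On the difference between "basic" and "advanced" monitoring: a ping only tells you the host answers, not that your service does. As an illustration, here is a minimal service-level check that tries to open a TCP connection to a port. The function name, host, and port are placeholders, and a real monitoring system would do much more (protocol checks, alerting, history).

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves something is listening, not that it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Placeholder target: replace with your own host and service port.
    print(tcp_service_up("127.0.0.1", 25))
```

If the provider's "monitoring" is ping-only, a check like this run from somewhere outside the facility covers part of the gap.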
Disaster recovery
I put this last because it's really a minimal concern -- Datacenters spend money making themselves very reliable and robust in the face of subsystem/component failures. Disaster Recovery in the sense of "what happens if my datacenter goes away" is best addressed by having another datacenter, so the questions I ask are along those lines:
- Do they have an off-site facility where customers can host cold- or warm-standby racks?
- Do they have enough bandwidth for you to host a cold/warm standby rack with another provider and replicate everything you need?
- Does the facility itself have solid plans to recover from system/component failures?
- Sometimes it's fun to play what-if: "What if the entire northeast lost power for a week?"
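For the "enough bandwidth to replicate" question, it helps to do the arithmetic before you talk to them. A rough sketch is below; the data size, link speed, and the 70% efficiency factor are all assumptions for illustration (real throughput depends on protocol overhead, latency, and what else shares the link).

```python
def replication_hours(data_gb: float, link_mbps: float,
                      efficiency: float = 0.7) -> float:
    """Hours needed to copy data_gb (decimal gigabytes) over a link of
    link_mbps megabits/second, assuming only `efficiency` of the link
    is usable for the transfer."""
    bits = data_gb * 8e9
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical: 500 GB of daily changes over a 100 Mbps link.
    print(round(replication_hours(500, 100), 1))
```

With those example numbers the copy takes most of a day, which tells you whether a warm standby can realistically stay current or whether you need a bigger pipe.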
Boot into the rescue system provided by Hetzner and check what damage you have done.
Transfer out any files to a safe location and redeploy the server afterwards.
I'm afraid that is the best solution in your case.
Best Answer
The first answer is stay calm! I learned the hard way that panicking often just makes things worse. Once that's achieved, the next thing is to actually ascertain what the problem is. Complaints from users and managers will be coming at you from all angles, telling you what THEY cannot do, but not what the problem is.
Once you know the problem you can start the plan to fix it and start giving your angry users a timescale!