Multi-site high availability

failover, fault-tolerance, high-availability, load balancing

We have a SaaS application that we need to be highly available. We already have an expensive, well-maintained Hyper-V failover cluster, but today the datacenter where we host that cluster had a five-hour power outage that knocked us completely offline. So now we're wondering if a better approach might be to use servers at two separate datacenters. Assuming we get all the back-end file replication and data replication working between these two sites, we're wondering how to handle the front-end routing. No matter how we approach the problem, we always wind up with the load balancer being a single point of failure.

So the question is … how can we set up load-balancing between two hosting sites such that the load balancer isn't the single point of failure? Is there a way to use two separate load balancers, one at each site? Should we be considering round-robin DNS?

Best Answer

To do this properly, you need to have:

  • Two separate instances in two datacenters (as you've already determined)
  • Synchronisation between the two datacenters (as you've already determined)
  • A way of re-directing clients from one to the other in the event of a failure

There are two common ways of doing this. One simple, one... not.

DNS

Round-robin DNS isn't quite what you want, because chances are you want all requests to go to the primary DC, with the second DC used only while the first is down.

What you can do, though, is set a very low TTL on your DNS records (say, somewhere between 30 seconds and 5 minutes). Then, if your primary DC does go down, you just update your DNS, and within 5 minutes or so your clients will be pointing at your other DC.

Because your two DCs will have different IP layouts, you need to account for this when you set up each datacenter.
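As a rough illustration of the DNS approach, here is a minimal Python sketch of the decision logic a small monitor could run: check the primary DC, and after a few consecutive failures, repoint the DNS record at the secondary. Everything named here is an assumption for illustration: the IP addresses come from the documentation ranges, the TCP check on port 443 is just one possible health check, and `update_dns` is a hypothetical callback that you would replace with whatever your DNS provider's API requires. The monitor itself would need to run somewhere outside the primary DC, otherwise it goes down with it.

```python
import socket
from dataclasses import dataclass
from typing import Callable

# Hypothetical addresses (RFC 5737 documentation ranges), one per datacenter.
PRIMARY_IP = "192.0.2.10"
SECONDARY_IP = "198.51.100.10"


def tcp_healthy(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass(frozen=True)
class FailoverState:
    active_ip: str = PRIMARY_IP       # where the DNS record currently points
    consecutive_failures: int = 0     # failed primary checks in a row


def next_state(state: FailoverState, primary_ok: bool,
               threshold: int = 3) -> FailoverState:
    """Decide where DNS should point after one health check of the primary."""
    if primary_ok:
        # Primary is reachable: all traffic should go (back) to it.
        return FailoverState(PRIMARY_IP, 0)
    failures = state.consecutive_failures + 1
    if failures >= threshold:
        # Enough failures in a row: send traffic to the secondary DC.
        return FailoverState(SECONDARY_IP, failures)
    # Not enough evidence yet: keep the record where it is.
    return FailoverState(state.active_ip, failures)


def run_once(state: FailoverState,
             check: Callable[[str], bool] = tcp_healthy,
             update_dns: Callable[[str], None] = print,
             threshold: int = 3) -> FailoverState:
    """Run one check cycle; call update_dns only when the target changes."""
    new_state = next_state(state, check(PRIMARY_IP), threshold)
    if new_state.active_ip != state.active_ip:
        # update_dns is a placeholder for your DNS provider's API call.
        update_dns(new_state.active_ip)
    return new_state


def worst_case_failover_seconds(check_interval: int, threshold: int,
                                ttl: int) -> int:
    """Rough upper bound: time to detect the outage plus one full TTL."""
    return check_interval * threshold + ttl
```

You would call `run_once` in a loop with a sleep between iterations. The estimate from `worst_case_failover_seconds` is only approximate: it assumes resolvers honour the TTL, which some resolvers and clients do not. This sketch also fails back to the primary automatically as soon as one check succeeds; you may prefer to fail back manually once replication has caught up.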

BGP

Basically, if you're asking this question, then this is out of your reach. In short, your IP addresses stay the same, but they are "moved" from one datacenter to the other. This involves expensive routers, expensive IP ranges, and expensive subscriptions to your local registry for AS numbers and IP ranges.

Your BGP routers stop advertising your IP ranges at your primary datacenter, and start advertising them at your secondary datacenter. Then the internet routes around the offline datacenter and sends traffic to your new DC.


If you are virtualised with ESXi and vSphere, VMware have a pretty good product that we trialled once called VMware Site Recovery Manager, which basically does everything for you. It keeps your VM configs in sync and powers the VMs up on the 2nd site when the 1st site goes offline. It is big bucks though.