By 'DNS failover' I take it you mean DNS Round Robin combined with some monitoring, i.e. publishing multiple IP addresses for a DNS hostname, and removing a dead address when monitoring detects that a server is down. This can be workable for small, less trafficked websites.
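In other words, the whole mechanism is a monitoring loop that rewrites your zone. A minimal sketch of that pattern, assuming a hypothetical `publish_a_records()` hook since the real call depends entirely on your DNS provider's API:

```python
import socket
import time

# The full set of addresses you publish for the hostname (hypothetical values).
ALL_ADDRESSES = ["192.0.2.10", "192.0.2.11"]

def is_alive(ip, port=80, timeout=3):
    """Treat a server as alive if it still accepts TCP connections."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def publish_a_records(addresses):
    """Hypothetical hook: push the surviving A records to your DNS
    provider's API (Route 53, a BIND zone you regenerate, etc.)."""
    print("publishing A records:", addresses)

while True:
    healthy = [ip for ip in ALL_ADDRESSES if is_alive(ip)]
    if healthy:  # never publish an empty record set
        publish_a_records(healthy)
    time.sleep(60)
```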
By design, when you answer a DNS request you also provide a Time To Live (TTL) for the response you hand out. In other words, you're telling other DNS servers and caches "you may store this answer and use it for x minutes before checking back with me".
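You can see the TTL on any answer with, for example, dnspython (a small sketch; example.com stands in for your own hostname):

```python
# pip install dnspython
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
for record in answer:
    print("address:", record.address)
# The number of seconds caches may keep reusing this answer.
print("TTL:", answer.rrset.ttl)
```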
The drawbacks come from this:

- With DNS failover, an unknown percentage of your users will have your DNS data cached with varying amounts of TTL left. Until the TTL expires, they may connect to the dead server. There are faster ways of completing failover than this.
- Because of the above, you're inclined to set the TTL quite low, say 5-10 minutes. But higher TTLs give a (very small) performance benefit and help DNS resolution keep working through short network glitches. So DNS-based failover pushes you toward low TTLs, even though high TTLs are part of DNS's design and can be useful (see the sketch below).
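To put rough numbers on that failover window (the figures below are assumptions, purely for illustration):

```python
# How long can users keep hitting the dead server? Illustrative numbers only.
ttl_seconds = 300        # the TTL you publish (5 minutes)
detect_seconds = 60      # monitoring interval before the failure is noticed
update_seconds = 30      # time to push the changed records to your DNS servers

# A resolver that cached the answer just before the failure is stuck for
# roughly the detection delay, plus the update delay, plus a full TTL.
worst_case = detect_seconds + update_seconds + ttl_seconds
print(f"worst case: ~{worst_case} seconds of connections to a dead server")
```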
The more common methods of getting good uptime involve:
- Placing servers together on the same LAN.
- Placing the LAN in a datacenter with highly available power and network planes.
- Using an HTTP load balancer to spread load and to fail over on individual server failures (a toy sketch follows this list).
- Getting the level of redundancy / expected uptime you require for your firewalls, load balancers and switches.
- Having a communication strategy in place for full-datacenter failures, and for the occasional failure of a switch / database server / other resource that cannot easily be mirrored.
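In spirit, the load balancer from the list above does something like the following (a toy sketch with hypothetical backend addresses; real balancers such as HAProxy or nginx add connection draining, weighting, and much more):

```python
import itertools
import socket

# Backend web servers behind the balancer (hypothetical addresses).
BACKENDS = [("10.0.0.11", 8080), ("10.0.0.12", 8080), ("10.0.0.13", 8080)]
_rotation = itertools.cycle(BACKENDS)

def is_healthy(backend, timeout=2):
    """A backend stays in rotation only while it accepts TCP connections."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend():
    """Round-robin over the pool, skipping backends that fail the check."""
    for _ in range(len(BACKENDS)):
        backend = next(_rotation)
        if is_healthy(backend):
            return backend
    raise RuntimeError("no healthy backends left")
```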
A very small minority of web sites use multi-datacenter setups, with 'geo-balancing' between datacenters.
When I use the term "DNS Round Robin" I generally mean it in the sense of the "cheap load balancing technique", as the OP describes it.
But that's not the only way DNS can be used for global high availability. Most of the time, it's just hard for people with different (technology) backgrounds to communicate well.
The best load balancing technique (if money is not a problem) is generally considered to be:
1. An Anycast'ed global network of 'intelligent' DNS servers,
2. and a set of globally spread-out datacenters,
3. where each DNS node implements Split Horizon DNS,
4. and monitoring of availability and traffic flows is available to the 'intelligent' DNS nodes in some fashion,
5. so that the user's DNS request flows to the nearest DNS server via IP Anycast,
6. and this DNS server hands out a low-TTL A record / set of A records for the nearest / best datacenter for this end user via 'intelligent' split horizon DNS (a toy model follows this list).
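To make step 6 concrete, here is a toy model of the decision an 'intelligent' DNS node makes. The region lookup and addresses are placeholders; real systems use GeoIP databases, BGP feeds and live latency measurements:

```python
# Which A records should this client get? Placeholder data throughout.
DATACENTERS = {
    "eu": ["198.51.100.10", "198.51.100.11"],
    "us": ["203.0.113.10", "203.0.113.11"],
}

def region_of(client_ip):
    """Stand-in for a real GeoIP / BGP-feed / latency-map lookup."""
    return "eu" if client_ip.startswith("81.") else "us"

def answer_query(client_ip, ttl=30):
    """Hand out a low-TTL set of A records for the best datacenter."""
    return {"ttl": ttl, "a_records": DATACENTERS[region_of(client_ip)]}

print(answer_query("81.2.3.4"))    # -> the EU datacenter's addresses
```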
Using anycast for DNS is generally fine, because DNS responses are stateless and almost always extremely short. So if the BGP routes change, it's highly unlikely to interrupt an in-flight DNS query.
Anycast is less suited to the longer and stateful HTTP conversations, which is why this system uses split horizon DNS. An HTTP session between a client and a server is kept to one datacenter; it generally cannot fail over to another datacenter without breaking the session.
As I indicated with "set of A Records" what I would call 'DNS Round Robin' can be used together with the setup above. It is typically used to spread the traffic load over multiple highly available load balancers in each datacenter (so that you can get better redundancy, use smaller/cheaper load balancers, not overwhelm the Unix network buffers of a single host server, etc).
> So, is it true that, with multiple data centers and HTTP traffic, the use of DNS RR is the ONLY way to assure high availability?
No it's not true, not if by 'DNS Round Robin' we simply mean handing out multiple A records for a domain. But it's true that clever use of DNS is a critical component in any global high availability system. The above illustrates one common (often best) way to go.
Edit: The Google paper "Moving Beyond End-to-End Path Information to Optimize CDN Performance" seems to me to be state-of-the-art in global load distribution for best end-user performance.
Edit 2: I read the article "Why DNS Based .. GSLB .. Doesn't Work" that the OP linked to, and it is a good overview -- I recommend reading it from the top.
In the section "The solution to the browser caching issue" it advocates DNS responses with multiple A Records pointing to multiple datacenters as the only possible solution for instantaneous fail over.
In the section "Watering it down" near the bottom, it expands on the obvious, that sending multiple A Records is uncool if they point to datacenters on multiple continents, because the client will connect at random and thus quite often get a 'slow' DC on another continent. Thus for this to work really well, multiple datacenters on each continent are needed.
This is a different solution from my steps 1-6. I can't give a definitive answer here; I think a DNS specialist from the likes of Akamai or Google is needed, because much of this boils down to practical know-how about the limitations of the DNS caches and browsers deployed today. AFAIK, my steps 1-6 are what Akamai does with their DNS (can anyone confirm this?).
My feeling -- coming from having worked as a PM on mobile browser portals (cell phones) -- is that the diversity and level of total brokenness of the browsers out there is incredible. I personally would not trust an HA solution that requires the end-user terminal to 'do the right thing'; thus I believe that global instantaneous failover without breaking a session isn't feasible today.
I think my steps 1-6 above are the best that are available with commodity technology. This solution does not have instantaneous fail over.
I'd love for one of those DNS specialists from Akamai, Google etc to come around and prove me wrong. :-)
Best Answer
A whole datacenter would need to go down or become unreachable for this to apply. Your backup at another datacenter would then be reached by routing the same IP addresses to it: the primary datacenter stops making its BGP route announcements, and the secondary announcements from the backup datacenter take over.
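One common way to automate that withdrawal is a health-check process feeding a BGP daemon. A sketch in the style of ExaBGP's process interface, where the daemon reads announce/withdraw commands from the script's stdout (the prefix, next hop and check target are all placeholders, and the exact command syntax depends on your ExaBGP version and configuration):

```python
import socket
import time

PREFIX = "192.0.2.0/24"    # your portable address block (placeholder)
NEXT_HOP = "10.0.0.1"      # this datacenter's edge router (placeholder)

def service_up(host="10.0.0.10", port=80, timeout=3):
    """The service is up if the local frontend accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

announced = False
while True:
    up = service_up()
    if up and not announced:
        print(f"announce route {PREFIX} next-hop {NEXT_HOP}", flush=True)
        announced = True
    elif not up and announced:
        print(f"withdraw route {PREFIX} next-hop {NEXT_HOP}", flush=True)
        announced = False
    time.sleep(10)
```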
Smaller businesses are generally not large enough to justify the expense of portable IP address allocations and their own autonomous system number to announce BGP routes with. In that case, a provider with multiple locations is the way to go.
You either have to be reachable via your original IP addresses, or via a change of IP address done through DNS. Since DNS is not designed to do this in the way "failover" requires (users can be out of reach for at least as long as your TTL, or the TTL imposed by some caching servers), reaching the backup site via the same IPs is the best solution.