When I use the term "DNS Round Robin" I generally mean it in the sense of the "cheap load balancing technique" that the OP describes.
But that's not the only way DNS can be used for global high availability. Most of the time, it's just hard for people with different (technology) backgrounds to communicate well.
The best load balancing technique (if money is not a problem) is generally considered to be:

1. an anycast'ed global network of 'intelligent' DNS servers,
2. with a set of globally spread out datacenters,
3. where each DNS node implements split horizon DNS,
4. and monitoring of availability and traffic flows is available to the 'intelligent' DNS nodes in some fashion,
5. so that the user's DNS request flows to the nearest DNS server via IP anycast,
6. and this DNS server hands out a low-TTL A record / set of A records for the nearest / best datacenter for this end user, via 'intelligent' split horizon DNS.
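The decision each 'intelligent' DNS node makes in steps 4-6 can be sketched roughly as follows. Everything here is invented for illustration (datacenter names, VIPs from the RFC 5737 documentation ranges, and a crude continent-matching heuristic); a real GSLB node would feed in actual latency and health measurements:

```python
import random

# Hypothetical inventory; names, VIPs, and health flags are made up
# (addresses come from the RFC 5737 documentation ranges).
DATACENTERS = {
    "us-east":  {"continent": "NA", "vips": ["198.51.100.10", "198.51.100.11"], "healthy": True},
    "eu-west":  {"continent": "EU", "vips": ["203.0.113.10", "203.0.113.11"], "healthy": True},
    "ap-south": {"continent": "AS", "vips": ["192.0.2.10"], "healthy": False},
}

def answer_for(client_continent, ttl=30):
    """Build a DNS answer for a client: nearest healthy DC, low TTL.

    'Nearest' is a crude continent match here; a real 'intelligent'
    DNS node would use latency and traffic-flow measurements instead.
    """
    healthy = [dc for dc in DATACENTERS.values() if dc["healthy"]]
    local = [dc for dc in healthy if dc["continent"] == client_continent]
    chosen = (local or healthy)[0]
    vips = list(chosen["vips"])
    random.shuffle(vips)  # spread load across the DC's load balancers
    return {"ttl": ttl, "a_records": vips}
```

The low TTL is what keeps clients coming back to ask again, so a failed datacenter drops out of rotation within seconds rather than hours.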
Using anycast for DNS is generally fine, because DNS responses are stateless and extremely short. So if the BGP routes change, it's highly unlikely to interrupt a DNS query.
Anycast is less suited to the longer, stateful HTTP conversations, which is why this system uses split horizon DNS. An HTTP session between a client and server is kept within one datacenter; it generally cannot fail over to another datacenter without breaking the session.
As I indicated with "set of A Records" what I would call 'DNS Round Robin' can be used together with the setup above. It is typically used to spread the traffic load over multiple highly available load balancers in each datacenter (so that you can get better redundancy, use smaller/cheaper load balancers, not overwhelm the Unix network buffers of a single host server, etc).
"So, is it true that, with multiple data centers and HTTP traffic, the use of DNS RR is the ONLY way to assure high availability?"
No, it's not true -- not if by 'DNS Round Robin' we simply mean handing out multiple A records for a domain. But it is true that clever use of DNS is a critical component of any global high availability system. The above illustrates one common (often the best) way to go.
Edit: The Google paper "Moving Beyond End-to-End Path Information to Optimize CDN Performance" seems to me to be state-of-the-art in global load distribution for best end-user performance.
Edit 2: I read the article "Why DNS Based .. GSLB .. Doesn't Work" that the OP linked to, and it is a good overview -- I recommend reading it from the top.
In the section "The solution to the browser caching issue" it advocates DNS responses with multiple A Records pointing to multiple datacenters as the only possible solution for instantaneous fail over.
In the section "Watering it down" near the bottom, it expands on the obvious, that sending multiple A Records is uncool if they point to datacenters on multiple continents, because the client will connect at random and thus quite often get a 'slow' DC on another continent. Thus for this to work really well, multiple datacenters on each continent are needed.
This is a different solution than my steps 1 - 6. I can't provide a perfect answer on this, I think a DNS specialist from the likes of Akamai or Google is needed, because much of this boils down to practical know-how on the limitations of deployed DNS caches and browsers today. AFAIK, my steps 1-6 are what Akamai does with their DNS (can anyone confirm this?).
My feeling -- coming from having worked as a PM on mobile browser portals (cell phones) -- is that the diversity and level of total brokenness of the browsers out there is incredible. I personally would not trust an HA solution that requires the end user terminal to 'do the right thing'; thus I believe that global instantaneous fail over without breaking a session isn't feasible today.
I think my steps 1-6 above are the best that are available with commodity technology. This solution does not have instantaneous fail over.
I'd love for one of those DNS specialists from Akamai, Google etc to come around and prove me wrong. :-)
Jeff, I disagree: load balancing does not imply redundancy; in fact, it's quite the opposite. The more servers you have, the more likely you are to have a failure at a given instant. That's why redundancy IS mandatory when doing load balancing, but unfortunately there are a lot of solutions which only provide load balancing without performing any health checks, resulting in a less reliable service.
DNS round robin is excellent for increasing capacity, by distributing the load across multiple points (potentially geographically distributed). But it does not provide fail-over. You must first describe what type of failure you are trying to cover. A server failure must be covered locally using a standard IP address takeover mechanism (VRRP, CARP, ...). A switch failure is covered by resilient links from the server to two switches. A WAN link failure can be covered by a multi-link setup between you and your provider, using either a routing protocol or a layer 2 solution (e.g. multi-link PPP). A site failure should be covered by BGP: your IP addresses are replicated over multiple sites and you announce them to the net only where they are available.
From your question, it seems that you only need to provide a server fail-over solution, which is the easiest solution since it does not involve any hardware nor contract with any ISP. You just have to setup the appropriate software on your server for that, and it's by far the cheapest and most reliable solution.
You asked "what if a haproxy machine fails?". It's the same principle. All the people I know who use haproxy for load balancing and high availability have two machines and run either ucarp, keepalived or heartbeat on them to ensure that one of them is always available.
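As a concrete illustration, a minimal keepalived VRRP stanza for such an active/backup haproxy pair might look like the sketch below; the interface name, router ID, password, and VIP are all placeholders to adapt to your network:

```
# keepalived.conf sketch for an active/backup haproxy pair (illustrative values).
vrrp_instance VI_1 {
    state MASTER            # the peer machine uses "state BACKUP"
    interface eth0
    virtual_router_id 51
    priority 101            # the peer gets a lower priority, e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass example
    }
    virtual_ipaddress {
        192.0.2.100/24      # the shared service address clients connect to
    }
}
```

When the MASTER stops sending VRRP advertisements, the BACKUP takes over the virtual IP within a few seconds, so clients keep connecting to the same address.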
Hoping this helps!
Best Answer
The short answer is that it varies.
When multiple address records are present in the answer set, a queried DNS server normally returns them in a randomized order, and the operating system will typically present the returned record set to the application in the order received. That said, there are options on both sides of the transaction (the nameserver and the OS) which can result in different behaviors; usually these are not employed. As an example, a little-known file called `/etc/gai.conf` controls this on glibc-based systems. The Zytrax book (DNS for Rocket Scientists) has a good summary of the history of this topic, and concludes that RFC 6724 is the current standard that applications and resolver implementations should adhere to.
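For the curious, a gai.conf fragment looks like the sketch below; the commented-out `precedence` line is the commonly cited override that makes glibc's sorting prefer IPv4-mapped addresses (see gai.conf(5) for the full syntax):

```
# /etc/gai.conf -- tunes RFC 6724 address sorting in glibc's getaddrinfo().
# All lines are optional; if the file is absent, built-in defaults apply.
#
# precedence ::ffff:0:0/96 100    # uncomment to prefer IPv4 over IPv6
```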
From here it's worth noting a choice quote from RFC 6724 (section 2):

"Well-behaved applications SHOULD iterate through the list of addresses returned from getaddrinfo() until they find a working address."
The standard encourages applications to not stop at the first address on failure, but it is neither a requirement nor the behavior that many casually written applications are going to implement. You should never rely solely on multiple address records for high availability unless you are certain that the greater (or at least most important) percentage of your consuming applications will play nicely. Modern browsers tend to be good about this, but remember that they are not the only consumers that you are dealing with.
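In code, "playing nicely" means iterating: a client honoring the recommendation tries every resolved address instead of dying on the first failure. A minimal sketch (the function name is mine):

```python
import socket

def connect_first_working(host, port, timeout=3.0):
    """Try each resolved address in order until one accepts a TCP connection.

    This is the behavior RFC 6724 recommends; many casually written
    applications instead give up after the first address fails.
    """
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock  # first working address wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("name resolved to no addresses")
```

Python's own `socket.create_connection` implements essentially this loop, which is one reason modern runtimes cope with multiple A records better than the "casually written" clients above.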
(also, as @kasperd notes below, it's important to distinguish between what this buys you in HA vs. load balancing)