DNS – Do Clients Implement Failover/Load-Balancing on Multiple A Records?

domain-name-system tcp

Load balancers such as Amazon's Elastic Load Balancer are typically published as a DNS record set containing multiple A records, one for each load balancer instance that can accept traffic:

$ dig +short my-fancy-elb.us-east-1.elb.amazonaws.com
10.0.1.1
10.0.1.2

If I curl this URL in verbose mode, curl appears to alternate between the two IP addresses:

$ curl -ivs http://my-fancy-elb.us-east-1.elb.amazonaws.com | grep -i 'connected'
* Connected to my-fancy-elb.us-east-1.elb.amazonaws.com (10.0.1.1)
$ curl -ivs http://my-fancy-elb.us-east-1.elb.amazonaws.com | grep -i 'connected'
* Connected to my-fancy-elb.us-east-1.elb.amazonaws.com (10.0.1.2)

Is this round-robin across the A records in the record set done by the curl binary itself, or is it something the Linux kernel does on its behalf?

TCP exists at layer 4 and DNS exists at layer 7, so I'd imagine that individual binaries and libraries have to implement their own load-balancing and failover: fetching the DNS record set for the given domain name and choosing an IP address from that set to open a TCP connection to.

Can I reasonably expect that programming languages, browsers, and libraries like curl will do load-balancing and failover on A records for me?

Best Answer

The short answer is that it varies.

When the answer set contains multiple address records, the queried DNS server normally returns them in a randomized order, and the operating system typically presents the record set to the application in the order it was received. That said, there are options on both sides of the transaction (the nameserver and the OS) that can change this behavior, although they are usually left at their defaults. As an example, a little-known file called /etc/gai.conf controls how getaddrinfo() sorts addresses on glibc-based systems.
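To see the order an application actually receives, you can ask the resolver directly. The following is a minimal sketch in Python; the helper name resolved_addresses is made up for illustration, and the result reflects whatever ordering the nameserver and the local resolver configuration produce on your system.

```python
import socket


def resolved_addresses(host, port):
    """Return the IP addresses for host in the order getaddrinfo() yields them.

    Hypothetical helper for illustration; it only wraps socket.getaddrinfo().
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address for both IPv4 and IPv6.
    return [sockaddr[0] for _family, _type, _proto, _canonname, sockaddr in infos]
```

Calling resolved_addresses("my-fancy-elb.us-east-1.elb.amazonaws.com", 80) a few times in a row shows whether the order changes between lookups on a given machine. A caching resolver in between may keep it stable for a while.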

The Zytrax book (DNS for Rocket Scientists) has a good summary of the history of this topic, and concludes that RFC 6724 is the current standard that applications and resolver implementations should adhere to.

From here it's worth noting a choice quote from RFC 6724:

   Well-behaved applications SHOULD NOT simply use the first address
   returned from an API such as getaddrinfo() and then give up if it
   fails.  For many applications, it is appropriate to iterate through
   the list of addresses returned from getaddrinfo() until a working
   address is found.  For other applications, it might be appropriate to
   try multiple addresses in parallel (e.g., with some small delay in
   between) and use the first one to succeed.

The standard encourages applications not to stop at the first address on failure, but this is neither a requirement nor the behavior that many casually written applications implement. You should never rely solely on multiple address records for high availability unless you are certain that most (or at least the most important) of your consuming applications will play nicely. Modern browsers tend to be good about this, but remember that they are not the only consumers you are dealing with.
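The "iterate until a working address is found" behavior from the RFC quote can be sketched in a few lines. This is an illustrative Python example, not what curl or any particular client does internally; the function names try_addresses and connect_with_failover and the 3 second timeout are assumptions chosen for the sketch.

```python
import socket


def try_addresses(candidates, timeout=3.0):
    """Connect to the first candidate that accepts a TCP connection.

    candidates: entries shaped like the tuples socket.getaddrinfo() returns.
    Raises the last connection error if every candidate fails.
    """
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in candidates:
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock  # first working address wins
        except OSError as exc:
            # Remember the failure and move on to the next address.
            last_error = exc
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("no addresses to try")
    raise last_error


def connect_with_failover(host, port, timeout=3.0):
    """Resolve host and try each returned address in order."""
    candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return try_addresses(candidates, timeout)
```

Python's own socket.create_connection() already loops over the getaddrinfo() results in this way, so in practice you would normally call that instead. Note that this gives you failover only: the client still takes addresses in whatever order the resolver hands them over, so any load distribution comes from the ordering of the answers, not from the loop.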

(Also, as @kasperd notes below, it's important to distinguish between what this buys you for high availability and what it buys you for load balancing.)