Load Balancing Best Practices for Persistence

Tags: domain-name-system, load-balancing, lvs, round-robin

We run a web application serving web APIs to an increasing number of clients. To start, the clients were generally on home, office, or other wireless networks, submitting chunked HTTP uploads to our API. We've now branched out into handling more mobile clients. The files range from a few KB to several GB; they are broken down into smaller chunks and reassembled on our API.

Our current load balancing is performed at two layers. First, we use round robin DNS to advertise multiple A records for our api.company.com address. At each IP, we host a Linux LVS (http://www.linuxvirtualserver.org/) load balancer that looks at the source IP address of a request to determine which API server to hand the connection to. These LVS boxes are configured with heartbeatd to take over external VIPs and internal gateway IPs from one another.

Lately, we've seen two new error conditions.

The first error is where clients are oscillating or migrating from one LVS to another, mid-upload. This in turn causes our load balancers to lose track of the persistent connection and send the traffic to a new API server, thereby breaking the chunked upload across two or more servers. Our intent was for the Round Robin DNS TTL value for our api.company.com (which we've set at 1 hour) to be honored by the downstream caching nameservers, OS caching layers, and client application layers. This error occurs for approximately 15% of our uploads.

We've seen the second error much less commonly. A client will initiate traffic to an LVS box and be routed to realserver A behind it. Thereafter, the client will come in via a new source IP address, which the LVS box does not recognize, so it routes the ongoing traffic to realserver B, also behind that LVS.
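
To make the second error concrete, here is a minimal Python sketch of source-IP persistence. It is not how LVS is implemented; the class name, the round-robin choice for unknown addresses, and the example IP addresses are assumptions made for illustration only.

```python
import itertools


class SourceIpPersistence:
    """Toy model: remember which realserver each source IP was sent to."""

    def __init__(self, realservers):
        self._next = itertools.cycle(realservers)  # used only for unknown IPs
        self._table = {}                           # source IP -> realserver

    def route(self, source_ip):
        # A source IP that is not in the table is treated as a brand new client.
        if source_ip not in self._table:
            self._table[source_ip] = next(self._next)
        return self._table[source_ip]


lvs = SourceIpPersistence(["realserver A", "realserver B"])
print(lvs.route("198.51.100.7"))  # first chunk of the upload
print(lvs.route("198.51.100.7"))  # same source IP, same realserver
print(lvs.route("203.0.113.42"))  # same client on a new IP, different realserver
```

Nothing in the table ties the second address back to the first, so the upload ends up split across two realservers.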

Given our architecture as described in part above, I'd like to hear what experience people have had with better approaches that would let us handle each of the error cases above more gracefully.

Edit 5/3/2010:

This looks like what we need. Weighted GSLB hashing on the source IP address.

http://www.brocade.com/support/Product_Manuals/ServerIron_ADXGlobalServer_LoadBalancingGuide/gslb.2.11.html#271674

Best Answer

The canonical solution to this is to not rely on end user IP address, but instead use a Layer 7 (HTTP/HTTPS) load balancer with "Sticky Sessions" via a cookie.

Sticky sessions means the load balancer will always direct a given client to the same backend server. Via cookie means the load balancer (which is itself a fully capable HTTP device) inserts a cookie (which the load balancer creates and manages automagically) to remember which backend server a given HTTP connection should use.
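
As a rough illustration of the cookie mechanism, here is a minimal Python sketch of the routing decision. It is not taken from any particular load balancer; the backend names, the SERVERID cookie name, and the round-robin choice for new clients are all assumptions.

```python
import itertools
from http.cookies import SimpleCookie

BACKENDS = ["api1", "api2", "api3"]  # hypothetical backend names
COOKIE_NAME = "SERVERID"             # hypothetical cookie name
_next_backend = itertools.cycle(BACKENDS)


def route(cookie_header):
    """Return (backend, Set-Cookie value or None) for one HTTP request."""
    cookie = SimpleCookie(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    # A valid cookie pins the request to the backend it names.
    if morsel is not None and morsel.value in BACKENDS:
        return morsel.value, None
    # No usable cookie: pick a backend and ask the client to remember it.
    backend = next(_next_backend)
    return backend, f"{COOKIE_NAME}={backend}; Path=/"


backend, set_cookie = route(None)           # first request, no cookie yet
print(backend, set_cookie)
print(route(f"{COOKIE_NAME}={backend}"))    # later request, any source IP
```

Because the decision depends only on the cookie, a client that changes IP address or arrives through a different front-end IP still reaches the same backend.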

The main downside to sticky sessions is that backend server load can become somewhat uneven. The load balancer can only distribute load fairly when new connections are made, but given that existing connections may be long-lived in your scenario, then in some time periods load will not be distributed entirely fairly.

Just about every Layer 7 load balancer should be able to do this. On Unix/Linux, some common examples are nginx, HAProxy, Apsis Pound, Apache 2.2 with mod_proxy, and many more. On Windows 2008+ there is Microsoft Application Request Routing. As appliances, Coyote Point, loadbalancer.org, Kemp and Barracuda are common in the low-end space; and F5, Citrix NetScaler and others in high-end.

Willy Tarreau, the author of HAProxy, has written a nice overview of load balancing techniques.

About the DNS Round Robin:

"Our intent was for the Round Robin DNS TTL value for our api.company.com (which we've set at 1 hour) to be honored by the downstream caching nameservers, OS caching layers, and client application layers."

It will not be. And DNS Round Robin isn't a good fit for load balancing. And if nothing else convinces you, keep in mind that modern clients may prefer one host over all others due to longest prefix match pinning, so if the mobile client changes IP address, it may choose to switch to another RR host.

Basically, it's okay to use DNS round robin as a coarse-grained load distribution, by pointing 2 or more RR records to highly available IP addresses, handled by real load balancers in active/passive or active/active HA. And if that's what you're doing, then you might as well serve those DNS RR records with long Time To Live values, since the associated IP addresses are highly available already.

Related Solutions

BIND9 DNS for Round Robin routing

Your forward zone is correct for DNS round-robin (one hostname, two addresses). You can confirm that your DNS server is returning both records by running dig www.mygateway.com. You should receive two A records.

Your reverse zone IS NOT configured correctly for round-robin DNS. What you've created there are entries for www.183.9.15.in-addr.arpa, which will both be returned, and one picked by the client's resolver library. This is definitely not what you want.
What you probably want are records like:

216    IN   PTR www.mygateway.com.
223    IN   PTR www.mygateway.com.

which will ensure that reverse DNS lookups for 15.9.183.216 and 15.9.183.223 return "www.mygateway.com" (and therefore match the forward A records).
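
If you want to double-check which owner names those PTR records expand to, Python's standard ipaddress module can derive them. This is only a quick sanity check, and the helper name ptr_owner is made up for the example.

```python
import ipaddress


def ptr_owner(addr):
    """Return the in-addr.arpa name that a reverse lookup for addr queries."""
    return ipaddress.ip_address(addr).reverse_pointer


for addr in ("15.9.183.216", "15.9.183.223"):
    print(addr, "->", ptr_owner(addr))
# 15.9.183.216 -> 216.183.9.15.in-addr.arpa
# 15.9.183.223 -> 223.183.9.15.in-addr.arpa
```

The leading labels (216 and 223) are the relative names used in the reverse zone for 183.9.15.in-addr.arpa.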

Remember that round-robin DNS doesn't guarantee even load distribution: The choice of which record to use is made by the client resolver library and may be decided randomly, by which record was received first, by which record was received last, or any other method some drunk programmer came up with while hacking together a resolver library.
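
A toy Python simulation shows how much the client-side choice matters. It assumes the server rotates the order of the two A records on every answer and that each client queries it directly (no caching); the three selection strategies are illustrative, not a survey of real resolver libraries.

```python
import random
from collections import Counter

RECORDS = ["15.9.183.216", "15.9.183.223"]  # the two A records from the zone


def simulate(pick, clients=1000):
    """Count which address each client ends up using."""
    counts = Counter()
    for i in range(clients):
        shift = i % len(RECORDS)
        answer = RECORDS[shift:] + RECORDS[:shift]  # server-side rotation
        counts[pick(answer)] += 1
    return counts


rng = random.Random(0)
strategies = {
    "first record": lambda answer: answer[0],
    "random record": lambda answer: rng.choice(answer),
    "client re-sorts": lambda answer: min(answer),  # ignores the server's order
}
for name, pick in strategies.items():
    print(name, dict(simulate(pick)))
```

With "first record" the load splits evenly, while a client that re-sorts the answer sends every request to the same address no matter how the server rotates the records.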

DNS round-robin is cheap and reasonably effective, but if you need good load balancing you may want to invest in load-balancing hardware (or software - pf, HAProxy, etc.).

Obligatory plug: Your question and some of the mistakes you made above imply a fundamental misunderstanding of some basic DNS concepts. I strongly suggest picking up a copy of DNS and BIND (electronically or from your local bookstore) and reading through it - at least chapters 1, 5 and 6, and in your case the relevant part of chapter 10.

The time you save by doing so will far outweigh the price of the book.

Per-packet round-robin load balancing for UDP

The requirement was satisfied as follows:

I've installed a more recent version of ipvsadm (and its kernel modules), the one that supports the --ops flag (1.26). Since keepalived does not expose this flag in its configuration file, you have to apply it manually. Luckily, you can do that after the "virtual service" is created (in terms of plain ipvsadm, you can first ipvsadm -A a virtual service without --ops, and then ipvsadm -E it to add one-packet scheduling).

Since keepalived creates the virtual service for you, all you have to do is to edit it after it is created, which happens when quorum is gained for this virtual server (basically, there is a sufficient number of working realservers). Here's how it looks in the keepalived.conf file:

virtual_server <VIP> <VPORT> {
    lb_algo rr
    lb_kind NAT
    protocol UDP
    ...

    # Enable one-packet scheduling when quorum is gained
    quorum_up "ipvsadm -E -u <VIP>:<VPORT> --ops -s rr"

    ... realserver definitions, etc ...
}

This works, but I've encountered a number of problems (kind of) with this setup:

1. There is a small time gap (less than a second, more like 1/10) between quorum going up and the script in quorum_up getting executed. Any datagrams that manage to go through the director during that time will create a connection entry in ipvsadm, and further datagrams from that source host / port will be stuck on the same realserver even after the --ops flag is added. You can minimize the chance of this happening by making sure that the virtual service is never deleted once it is created. You do that by specifying the inhibit_on_failure flag in your realserver definitions, so that they are not deleted when the corresponding realserver is down (when all realservers are deleted, the virtual service is also deleted); instead, their weight is set to zero and they stop receiving traffic. As a result, the only time datagrams can slip by is during keepalived startup (assuming you have at least one realserver up at that time, so that quorum will be gained immediately).
2. When --ops is active, the director does not rewrite the source host / port of the datagrams that the realservers send to the clients, so the source host / port are those of the realserver that has sent this particular datagram. This might be a problem (it was for my clients). You can amend that by SNAT'ing those datagrams with iptables.
3. I've noticed significant system CPU load when the director is under load. It turns out the CPU is hogged by ksoftirqd. It does not happen if you turn off --ops. Presumably, the problem is that the packet dispatching algorithm is fired on every datagram instead of just the first datagram in the "connection" (if that even applies to UDP..). I haven't actually found a way to "fix" that, but maybe I haven't tried hard enough. The system has some specific load requirements and under that load the processor usage does not max out; neither are there any lost datagrams, so this problem is not considered a show-stopper. It is still rather alarming though.

Summary: the setup definitely works (also under load), but the hoops one has to jump through and the problems I've encountered (especially №3.. maybe someone knows the solution?), mean that, given time, I would've used a userspace program (written in C, probably) for listening on a UDP socket and distributing the received datagrams between realservers, in conjunction with something that would check the health of realservers for me, SNAT in iptables to rewrite the source host / port and keepalived in VRRP mode for HA.
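
For what it's worth, the core of such a userspace distributor is small. Below is a rough Python sketch (not the C program mentioned above, which was never written) of per-datagram round-robin forwarding. The function names, the loopback addresses, and the fixed datagram count are assumptions for the demo, and it leaves out health checks, reply handling, and source rewriting.

```python
import itertools
import socket


def forward_round_robin(listen_sock, realservers, count):
    """Receive count datagrams and send each to the next realserver in turn."""
    targets = itertools.cycle(realservers)
    for _ in range(count):
        data, _client = listen_sock.recvfrom(65535)
        listen_sock.sendto(data, next(targets))


def udp_socket():
    """Bind a UDP socket to an ephemeral loopback port (demo only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    sock.settimeout(2.0)
    return sock


def demo():
    """Send four datagrams through the forwarder to two local 'realservers'."""
    director = udp_socket()
    realservers = [udp_socket(), udp_socket()]
    client = udp_socket()
    try:
        for i in range(4):
            client.sendto(str(i).encode(), director.getsockname())
        forward_round_robin(director, [s.getsockname() for s in realservers], 4)
        # Each realserver should have received two of the four datagrams.
        return [[s.recvfrom(65535)[0] for _ in range(2)] for s in realservers]
    finally:
        for s in [director, client, *realservers]:
            s.close()


if __name__ == "__main__":
    print(demo())
```

A real distributor would loop forever instead of stopping after a fixed count, and because the datagrams are re-sent from the director's own socket, the realservers see the director as the source rather than the original client.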
