Nginx – DNS Failover with multiple Nginx load balancers

domain-name-system, failover, nginx

Our application is hosted on EC2; however, because of the nature of the app, it requires extremely high availability. We keep an image of the app running on Linode as a failover.

However, doing a DNS flip to Linode would take some time. We came up with a strategy to minimize this downtime, but I would like some advice on how best to implement it.

The application is a Ruby on Rails (RoR) app. We're running 6 frontend nodes on EC2 and use Nginx as a load balancer with proxy_pass.
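
For context, the EC2 load balancer is a stock Nginx proxy_pass setup, roughly like this (upstream name and node addresses are placeholders, not our real config):

    # Hypothetical EC2 LB config: Nginx balancing across the six Rails nodes
    upstream rails_app {
        server 10.0.1.11:8080;
        server 10.0.1.12:8080;
        server 10.0.1.13:8080;
        server 10.0.1.14:8080;
        server 10.0.1.15:8080;
        server 10.0.1.16:8080;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://rails_app;
        }
    }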

Our load balancer on Linode, however, does not balance to the Linode nodes but to the EC2 nodes. This is so we can have the IP of our Linode LB in our DNS record: when a client connects, DNS round-robins it to either the EC2 LB or the Linode LB, and the chosen LB then proxies the request to one of the nodes on EC2. In case of an EC2 outage, we would simply change the config of the Linode LB to balance to its own nodes (plus other things like a DB flip, etc.).
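
In DNS terms, that means the record for the app lists both LB IPs, something like this zone excerpt (names and addresses are made up for illustration):

    ; round-robin A records: one per load balancer
    www.example.com.    300  IN  A  203.0.113.10   ; EC2 Nginx LB
    www.example.com.    300  IN  A  198.51.100.20  ; Linode Nginx LB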

I know this is not great for performance, but reliability is more important to us.

The issue we are having arises when, for whatever reason, the Linode LB cannot connect to EC2. Nginx will in that case return a 502 Bad Gateway error, which does not cause the client to fall back to the other DNS record.

We are hoping for a way to force the client to use the DNS fallback when that situation arises. Is there a way of doing this? Preferably with Nginx, but other solutions would be considered if it does not support this.

Thanks!

Best Answer

I love this approach; it is my favorite, and I will buy you a beer if you are ever in San Francisco!

Two answers. First, for your 502 issue: you should add this to your nginx config, so that if at least some nodes are still reachable nginx will retry them (by default, on a 502 it just gives up):

http://wiki.nginx.org/HttpProxyModule#proxy_next_upstream

proxy_next_upstream 

syntax: proxy_next_upstream [error|timeout|invalid_header|http_500|http_502|http_503|http_504|http_404|off];
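
As a sketch, something along these lines on the Linode LB (upstream name and addresses are placeholders) makes Nginx try the next EC2 node instead of handing the client a 502:

    upstream ec2_backend {
        server 10.0.1.11:8080;
        server 10.0.1.12:8080;
        # ... remaining EC2 nodes
    }

    server {
        listen 80;
        location / {
            proxy_pass http://ec2_backend;
            # on connection errors, timeouts or 5xx responses,
            # retry the request on the next upstream server
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        }
    }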

Secondly, for your 'back to DNS' part, you need to change the approach slightly. For these setups, what I've usually done is pull DNS all the way back to the app nodes themselves, which tests the connectivity all the way through the load balancer to the end node. As a bonus you can integrate DNS with your application and have it shut down the DNS server if the app is dead. The idea here is to have the client's DNS request 'test' that the entire path works, not just the connectivity to the LB.

Obviously you can't use Nginx for this. I've used pf rules for it, and you can do the same thing with iptables: just round-robin requests to the backend nodes and run BIND on your backend servers. The idea then is to make sure you have multiple NS entries, one for each 'LB' you have. The client takes care of trying each NS record; in testing I've done, the average failover time was 2 seconds, and it worked for 99% of the operating systems we looked at.

Let me know if that makes sense. It will work better than any scenario that tries to recover after the client has already made the first TCP request.
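
To make that concrete, here is a rough sketch of the two pieces (all names, addresses and interface details are placeholders, not a drop-in config). On each 'LB' box, packet-level round robin of incoming DNS queries to the backend nodes running BIND, e.g. with iptables:

    # round-robin incoming DNS queries (UDP 53) across two backend
    # app nodes that each run BIND (requires net.ipv4.ip_forward=1)
    iptables -t nat -A PREROUTING -p udp --dport 53 \
        -m statistic --mode nth --every 2 --packet 0 \
        -j DNAT --to-destination 10.0.2.11:53
    iptables -t nat -A PREROUTING -p udp --dport 53 \
        -j DNAT --to-destination 10.0.2.12:53
    # rewrite the source address so replies return via this box
    iptables -t nat -A POSTROUTING -p udp -d 10.0.2.0/24 --dport 53 -j MASQUERADE

And in the parent zone, one NS record per 'LB', so the client's resolver retries the other path by itself if one stops answering:

    ; delegation for the app's zone: one NS per LB (placeholder names/IPs)
    app.example.com.     IN  NS  ns1.app.example.com.   ; via EC2 LB
    app.example.com.     IN  NS  ns2.app.example.com.   ; via Linode LB
    ns1.app.example.com. IN  A   203.0.113.10
    ns2.app.example.com. IN  A   198.51.100.20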

With this solution I've built sites that maintain 100% availability according to Gomez and Keynote monitoring. As you already mentioned, it can add a small initial performance penalty for the DNS lookup, but the site always works and customers love that (as does my pager).