Nginx – DNS Failover with multiple Nginx load balancers

domain-name-system, failover, nginx

Our application is hosted on EC2; however, because of the nature of the app, it requires extremely high availability. We keep an image of the app running on Linode as a failover.

However, doing a DNS flip to Linode would take some time. We came up with a strategy to minimize this downtime, but I would like some advice on how best to implement it.

The application is a Ruby on Rails (RoR) app. We're running 6 frontend nodes on EC2 and use Nginx as a load balancer with proxy_pass.
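
For context, the EC2 load balancer is a stock Nginx proxy_pass setup, roughly like this (upstream name and node addresses are placeholders, not our real config):

    # Hypothetical EC2 LB config: Nginx balancing across the six Rails nodes
    upstream rails_app {
        server 10.0.1.11:8080;
        server 10.0.1.12:8080;
        server 10.0.1.13:8080;
        server 10.0.1.14:8080;
        server 10.0.1.15:8080;
        server 10.0.1.16:8080;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://rails_app;
        }
    }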

Our load balancer on Linode, however, does not balance to the Linode nodes but to the EC2 nodes. This is so we can have the IP of our Linode LB in our DNS record: when a client connects, DNS round-robins it to either the EC2 LB or the Linode LB, and the chosen LB then proxies the request to one of the nodes on EC2. In case of an EC2 outage, we would simply change the config of the Linode LB to balance to its own nodes (plus other things like a DB flip, etc.).
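
In DNS terms, that means the record for the app lists both LB IPs, something like this zone excerpt (names and addresses are made up for illustration):

    ; round-robin A records: one per load balancer
    www.example.com.    300  IN  A  203.0.113.10   ; EC2 Nginx LB
    www.example.com.    300  IN  A  198.51.100.20  ; Linode Nginx LB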

I know this is not great for performance, but reliability is more important to us.

The issue we are having arises when, for whatever reason, the Linode LB cannot connect to EC2. Nginx will in that case return a 502 Bad Gateway error, which does not cause the client to fall back to the other DNS record.

We are hoping for a way to force the client to use the DNS fallback when that situation arises. Is there a way of doing this? Preferably with Nginx, but other solutions would be considered if it does not support this.

Thanks!

Best Answer

I love this approach; it is my favorite, and I will buy you a beer if you are ever in San Francisco!

Two answers. First, for your 502 issue: you should add this to your nginx config, so that if at least some nodes are still reachable nginx will retry them (by default, on a 502 it just gives up):

http://wiki.nginx.org/HttpProxyModule#proxy_next_upstream

proxy_next_upstream 

syntax: proxy_next_upstream [error|timeout|invalid_header|http_500|http_502|http_503|http_504|http_404|off];
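
As a sketch, something along these lines on the Linode LB (upstream name and addresses are placeholders) makes Nginx try the next EC2 node instead of handing the client a 502:

    upstream ec2_backend {
        server 10.0.1.11:8080;
        server 10.0.1.12:8080;
        # ... remaining EC2 nodes
    }

    server {
        listen 80;
        location / {
            proxy_pass http://ec2_backend;
            # on connection errors, timeouts or 5xx responses,
            # retry the request on the next upstream server
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        }
    }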

Secondly, for your 'back to DNS' part, you need to change the approach slightly. For these setups, what I've usually done is pull DNS all the way back to the app nodes themselves, which tests the connectivity all the way through the load balancer to the end node. As a bonus you can integrate DNS with your application and have it shut down the DNS server if the app is dead. The idea here is to have the client's DNS request 'test' that the entire path works, not just the connectivity to the LB.

Obviously you can't use Nginx for this. I've used pf rules for it, and you can do the same thing with iptables: just round-robin requests to the backend nodes and run BIND on your backend servers. The idea then is to make sure you have multiple NS entries, one for each 'LB' you have. The client takes care of trying each NS record; in testing I've done, the average failover time was 2 seconds, and it worked for 99% of the operating systems we looked at.

Let me know if that makes sense. It will work better than any scenario that tries to recover after the client has already made the first TCP request.
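
To make that concrete, here is a rough sketch of the two pieces (all names, addresses and interface details are placeholders, not a drop-in config). On each 'LB' box, packet-level round robin of incoming DNS queries to the backend nodes running BIND, e.g. with iptables:

    # round-robin incoming DNS queries (UDP 53) across two backend
    # app nodes that each run BIND (requires net.ipv4.ip_forward=1)
    iptables -t nat -A PREROUTING -p udp --dport 53 \
        -m statistic --mode nth --every 2 --packet 0 \
        -j DNAT --to-destination 10.0.2.11:53
    iptables -t nat -A PREROUTING -p udp --dport 53 \
        -j DNAT --to-destination 10.0.2.12:53
    # rewrite the source address so replies return via this box
    iptables -t nat -A POSTROUTING -p udp -d 10.0.2.0/24 --dport 53 -j MASQUERADE

And in the parent zone, one NS record per 'LB', so the client's resolver retries the other path by itself if one stops answering:

    ; delegation for the app's zone: one NS per LB (placeholder names/IPs)
    app.example.com.     IN  NS  ns1.app.example.com.   ; via EC2 LB
    app.example.com.     IN  NS  ns2.app.example.com.   ; via Linode LB
    ns1.app.example.com. IN  A   203.0.113.10
    ns2.app.example.com. IN  A   198.51.100.20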

With this solution I've built sites that maintain 100% availability according to Gomez and Keynote monitoring. As you already mentioned, it can add a small initial performance penalty for the DNS lookup, but the site always works and customers love that (as does my pager).