Nginx – Zero downtime deployment (Tomcat), Nginx or HAProxy, behind hardware LB – how to “starve” old server

deployment, load balancing, nginx, tomcat

Currently we have the following setup.

  • Hardware Load Balancer (LB)
  • Box A running Tomcat on 8080 (TA)
  • Box B running Tomcat on 8080 (TB)

TA and TB are running behind LB.

Right now, taking Box A or Box B out of the LB to do a zero-downtime deployment is a complicated, manual job.

I am thinking of doing something like this:

  • Hardware Load Balancer (LB)
  • Box A running Nginx on 8080 (NA)
  • Box A running Tomcat on 8081 (TA1)
  • Box A running Tomcat on 8082 (TA2)
  • Box B running Nginx on 8080 (NB)
  • Box B running Tomcat on 8081 (TB1)
  • Box B running Tomcat on 8082 (TB2)

Basically, the LB will now direct traffic to NA and NB.
On the Nginx instances we'll have TA1, TA2 and TB1, TB2, respectively, configured as upstream servers.
Once an upstream's healthcheck page becomes unresponsive (e.g. the server is shut down), traffic goes to the other one (using the HttpHealthcheckModule module for Nginx).
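
To make the failover idea concrete, here is a minimal Python sketch of "probe each upstream's healthcheck page and only use the ones that answer". It is not how HttpHealthcheckModule is implemented; the `/healthcheck` path, the loopback addresses and the function names are assumptions chosen for illustration.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the healthcheck URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, 4xx/5xx, malformed reply: all count as down.
        return False


def healthy_upstreams(upstreams, probe=http_probe):
    """Return the names of all upstreams whose healthcheck currently passes."""
    return [name for name, url in upstreams if probe(url)]


# Hypothetical healthcheck URLs for Box A (ports taken from the list above).
BOX_A = [
    ("TA1", "http://127.0.0.1:8081/healthcheck"),
    ("TA2", "http://127.0.0.1:8082/healthcheck"),
]
```

Calling `healthy_upstreams(BOX_A)` would list whichever of TA1 and TA2 is answering at that moment; passing a different `probe` makes it easy to try the logic without running Tomcats.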

So the deploy process is simple.

Say TA1 is active with version 0.1 of the app, and the healthcheck on TA1 is OK.
We start TA2 with its healthcheck set to ERROR, so Nginx does not send traffic to it.
We deploy app version 0.2 to TA2 and make sure it works. Now we switch the healthcheck on TA2 to OK and the healthcheck on TA1 to ERROR. Nginx will start sending traffic to TA2 and will take TA1 out of rotation. Done!
Then we do the same on the other box.
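
The order of the two healthcheck flips matters, which a small Python model of the steps above can show. This is only a sketch of the sequence as described; the `rollout_steps` and `in_rotation` names are made up for the example, and the "OK"/"ERROR" strings stand in for whatever the healthcheck page returns.

```python
def rollout_steps():
    """Yield (description, health) after each step of the switch on Box A."""
    health = {"TA1": "OK", "TA2": "ERROR"}  # TA1 serves 0.1, TA2 is out of rotation
    yield "initial state", dict(health)
    # Deploying and testing 0.2 on TA2 does not change what Nginx sees.
    yield "0.2 deployed to TA2", dict(health)
    health["TA2"] = "OK"      # bring the new version into rotation first
    yield "TA2 healthcheck -> OK", dict(health)
    health["TA1"] = "ERROR"   # only then take the old version out
    yield "TA1 healthcheck -> ERROR", dict(health)


def in_rotation(health):
    """Names of the upstreams Nginx would consider usable."""
    return sorted(name for name, state in health.items() if state == "OK")


for step, health in rollout_steps():
    print(f"{step:28} in rotation: {in_rotation(health)}")
```

At every step at least one upstream is in rotation, which is the zero-downtime property. Between the last two steps both versions are in rotation, so for a short window requests may land on either one.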

While that all sounds nice, how do we "starve" the old upstream behind Nginx?
Say we have pending connections and some users on TA1. If we just turn it off, their sessions will break (we have cookie-based sessions). Not good.
Is there any way to starve traffic to one of the upstream servers with Nginx?
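
To pin down what "starving" means here, the following Python toy model shows the desired behaviour: a draining upstream gets no new sessions, existing sessions stay where they are, and the upstream is safe to stop once its session count reaches zero. This is not a description of what Nginx or HttpHealthcheckModule does; the `StickyRouter` class and its methods are hypothetical.

```python
class StickyRouter:
    """Toy model of session-sticky routing with a per-upstream drain flag."""

    def __init__(self, upstreams):
        self.upstreams = list(upstreams)
        self.draining = set()
        self.sessions = {}  # session id -> upstream name

    def drain(self, name):
        """Stop assigning new sessions to `name`; existing ones stay put."""
        self.draining.add(name)

    def route(self, session_id):
        """Return the upstream that should handle a request for this session."""
        if session_id in self.sessions:
            return self.sessions[session_id]
        candidates = [u for u in self.upstreams if u not in self.draining]
        if not candidates:
            raise RuntimeError("no upstream accepts new sessions")
        self.sessions[session_id] = candidates[0]
        return candidates[0]

    def end_session(self, session_id):
        """Forget a session, e.g. after logout or expiry."""
        self.sessions.pop(session_id, None)

    def active_sessions(self, name):
        return sum(1 for u in self.sessions.values() if u == name)

    def safe_to_stop(self, name):
        """True once a draining upstream has no sessions left."""
        return name in self.draining and self.active_sessions(name) == 0


router = StickyRouter(["TA1", "TA2"])
router.route("alice")   # existing user lands on TA1
router.drain("TA1")     # start starving TA1
router.route("bob")     # new user goes to TA2
router.route("alice")   # alice keeps her session on TA1
```

In this model TA1 can be shut down only after `router.end_session("alice")`, when `router.safe_to_stop("TA1")` becomes true.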


Best Answer

My opinion is that if you're willing to put in such a complicated setup just to get zero downtime, you should instead invest your time in getting a better load balancer. By going down the path you suggest, you are trading simplicity for marginal cost savings, which always ends up costing you more than you anticipate.

You're going to need downtime to implement this zero-downtime solution you propose. Take that downtime to instead figure out how to use your existing load balancer, or swap it out for something better.