HAProxy health checks: using httpchk and observe

haproxy

I'm using HAProxy 1.4.18 with the following backend configuration:

backend staging
  option httpchk HEAD /check.txt HTTP/1.0
  http-check disable-on-404
  default-server error-limit 1 on-error mark-down
  server staging01 x.x.x.x:80 check observe layer7
  server staging02 x.x.x.x:80 check observe layer7

The servers are running multiple applications on Apache/Passenger.

The combination of httpchk and disable-on-404 allows a graceful shutdown: a server can be removed from the load balancer quite easily while still being accessible directly (i.e. for testing).

I'm trying to set up observe in order to disable a server when an application is not working.
I've broken the application configuration on staging02 so that it always returns a 500.
The server is correctly marked DOWN after the first 500, but it is then marked UP again at the next httpchk.

Here's the log file:

Server staging/staging02 is DOWN, reason: Health analyze, info: "Detected 1 consecutive errors, last one was: Wrong http response". 1 active and 1 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
Server staging/staging02 is DOWN, reason: Health analyze, info: "Detected 1 consecutive errors, last one was: Wrong http response". 1 active and 1 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
Server staging/staging02 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 2 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.

Is there a way to combine those two checks?

Best Answer

As I understand it now, /check.txt actually returns a 200 response while all requests to the application return a 500. HAProxy sees the 500s coming back from the proxied requests and takes the server out of the pool, but then runs its own health check, receives a 200, and puts the server back in the pool.

The solution would be to do one of:

  1. Configure Apache, rather than the application, so that every request returns a 500 response, even the static file /check.txt.
  2. Change /check.txt into a Ruby app that contains just enough logic to choose between a 200 and a 500 response when appropriate.
  3. Set the inter value to something ridiculous like 3600s. (inter is in milliseconds unless a unit is given, so a bare 3600 would only be 3.6 seconds.) This should give you an hour to do your testing or (if the server went down on its own) figure out the problem and bring it back up.
  4. Set the inter value to something smaller like 60s but set the rise value to something higher like 60. This would also give you roughly an hour before the server was added back to the pool. (Note: options 3 and 4 are listed last because they're probably very bad ideas.)
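
As a rough sketch of the last option, applied to the backend from the question: the inter and rise values below are illustrative rather than tested, and they assume that inter is read as milliseconds unless a unit suffix such as s or h is given.

```
backend staging
  option httpchk HEAD /check.txt HTTP/1.0
  http-check disable-on-404
  default-server error-limit 1 on-error mark-down
  # One health check per minute; 60 consecutive successes are needed
  # before a DOWN server is marked UP again (roughly one hour).
  server staging01 x.x.x.x:80 check observe layer7 inter 60s rise 60
  server staging02 x.x.x.x:80 check observe layer7 inter 60s rise 60
  # For option 3 instead, replace "inter 60s rise 60" with "inter 1h".
```

Keep in mind that a longer inter also slows down the regular health check itself, so a server that fails /check.txt (or returns 404 for a graceful shutdown) will take correspondingly longer to be noticed.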