Elasticsearch node health check for HAProxy

elasticsearch haproxy

I have placed HAProxy in front of a three-node ES (Elasticsearch) cluster. So far I check each node in HAProxy using httpchk. Below is a snippet of my config:

backend elastic_nodes
    balance roundrobin
    option forwardfor
    option httpchk
    http-check expect status 200
    server elastic1 10.88.0.101:9200 check port 9200 fall 3 rise 3
    server elastic2 10.88.0.102:9200 check port 9200 fall 3 rise 3
    server elastic3 10.88.0.103:9200 check port 9200 fall 3 rise 3

So far this check works fine, but if the cluster turns red the response code is still 200 (which is correct, since HTTP-wise the node is accessible), so HAProxy keeps considering the backend server healthy.

On the other hand, if I check the cluster status and mark a node as down whenever the health status is "Red", every backend server will be marked down at once, which disables the ES service entirely. My problem with this approach is that my cluster has been Red in the past while still being usable, since only a single shard was missing (one day's logs). In other words, there are cases where a Red status is not a big issue and you still want to serve ES requests, instead of having HAProxy mark all backend nodes down and block the ES service.

Is there any other approach to this?

Best Answer

We use HAProxy to balance between two redundant clusters. During normal operation each receives ~50% of traffic; each is provisioned to take 100% when necessary.

We recently experienced a fault caused by a failure case we had not planned for: all client and master nodes stayed up, so our cluster still responded to REST requests, but all data nodes were temporarily offline. All indices appeared red and empty, and queries against them returned 0 results, yet still with a 200, following REST convention.

Our simple HAProxy health check failed us in this case; it merely checked for 200s.

I am now investigating the use of http-check expect ! string red with a URI that targets the index of interest directly. I haven't used the more advanced http-check features before.

It is a more expensive check, but it should correctly take the client nodes of a lobotomized cluster out of the pool.
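As a rough sketch of that negated-string check, reusing the backend name from the question and assuming a hypothetical index called logs-current (not part of the original setup), the two directives might be combined like this:

```haproxy
backend elastic_nodes
    # Ask Elasticsearch about one specific index instead of only probing the port.
    # "logs-current" is a hypothetical index name used purely for illustration.
    option httpchk GET /_cat/indices/logs-current
    # Mark the server as failed when the response body contains the substring "red".
    http-check expect ! string red
```

Note that string is a plain substring match, so an index whose name happens to contain "red" would also fail this check, and any response body without the word "red" in it would pass.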

UPDATE (2): I have switched us over to using

option httpchk GET /_cat/indices/<index of interest>
http-check expect rstring \b(green|yellow)\b

and it indeed seems like a better test.

(Second revision: now using an explicit check for green or yellow instead of just not-red; I belatedly thought about the case where the index is entirely missing from the _cat filter...)
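For reference, here is a minimal sketch of how this check could slot into the backend from the question. The server lines are copied from the question; the index name logs-current is a hypothetical placeholder, and the rest should be adapted to your own setup rather than taken as the exact configuration we run:

```haproxy
backend elastic_nodes
    balance roundrobin
    option forwardfor
    # Query the health of the index that matters ("logs-current" is hypothetical).
    option httpchk GET /_cat/indices/logs-current
    # Pass only when the body contains "green" or "yellow" as a whole word;
    # a red index, or a response that does not list the index, fails the check.
    http-check expect rstring \b(green|yellow)\b
    server elastic1 10.88.0.101:9200 check port 9200 fall 3 rise 3
    server elastic2 10.88.0.102:9200 check port 9200 fall 3 rise 3
    server elastic3 10.88.0.103:9200 check port 9200 fall 3 rise 3
```

Because the expression is matched against the whole response body, an index whose name contains the word green or yellow could produce a false pass. One possible refinement, which I have not tested in this setup, is to append ?h=health to the URI so that _cat/indices returns only the health column.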