I have placed HAProxy in front of a three-node Elasticsearch (ES) cluster. So far, the way I check each node in HAProxy is with an HTTP check. Below is a snippet of my config:
backend elastic_nodes
balance roundrobin
option forwardfor
option httpchk
http-check expect status 200
server elastic1 10.88.0.101:9200 check port 9200 fall 3 rise 3
server elastic2 10.88.0.102:9200 check port 9200 fall 3 rise 3
server elastic3 10.88.0.103:9200 check port 9200 fall 3 rise 3
So far this check works fine, but if the cluster turns red, the response code is still "200" (which is correct, since HTTP-wise the node is accessible), so HAProxy will still consider the backend server healthy.
On the other hand, if I check the status of the cluster and mark a node as down upon receiving health status "red", this marks all backend servers as down, disabling the ES service entirely. My problem with this approach is that in the past my cluster was indeed red but still usable, since only a single shard (a day's logs) was missing. In other words, there are cases where an ES red status is not a big issue and you still want to serve ES requests, instead of having HAProxy mark all backend nodes down and thus block the ES service.
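For reference, a minimal sketch of that cluster-status approach, assuming the standard /_cluster/health endpoint (which reports the overall cluster status from any node):

```
# Sketch: cluster-wide health check. Every node reports the same
# overall cluster status, so a red cluster takes ALL backends down -
# exactly the all-or-nothing behaviour described above.
option httpchk GET /_cluster/health
http-check expect ! string red
```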
Is there any other approach to this?
Best Answer
We use HAProxy to balance between two redundant clusters. During normal operation each receives ~50% of traffic; each is provisioned to take 100% when necessary.
We recently experienced a fault from a failure case we had not planned for: all client and master nodes stayed up, so our cluster was responsive over REST, but all data nodes were temporarily offline. All indices appeared red and empty, and queries against them returned 0 results, yet with a 200 status code, following REST convention.
Our simple HAProxy health check failed us in this case; it merely checked for 200s.
I am now investigating use of
http-check expect ! string red
with a URI that targets the index of interest directly. I haven't used the more advanced http-check features before. It is a more expensive check, but it should correctly take the client nodes of a lobotomized cluster out of the pool.
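Put together, such a backend might look like the sketch below. The index name logstash-current is a placeholder for whatever index you care about; /_cat/indices/&lt;index&gt; returns a plain-text line containing that index's health (green, yellow, or red):

```
backend elastic_nodes
    balance roundrobin
    # Check the health of the specific index rather than just the node:
    # fail the check only if its reported health is "red".
    option httpchk GET /_cat/indices/logstash-current
    http-check expect ! string red
    server elastic1 10.88.0.101:9200 check port 9200 fall 3 rise 3
    server elastic2 10.88.0.102:9200 check port 9200 fall 3 rise 3
    server elastic3 10.88.0.103:9200 check port 9200 fall 3 rise 3
```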
UPDATE (2): I have switched us over to using
and it indeed seems like a better test.
(Second revision: using an explicit check for green or yellow instead of just not-red; belatedly thought about the index being entirely missing from the _cat filter...)
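A sketch of that second revision, with the same placeholder index name; rstring matches the response body against a regular expression, so an index missing entirely from the _cat output (an empty body) now correctly fails the check as well:

```
# Pass only if the index health is explicitly green or yellow;
# "red" and an empty (missing-index) response both fail the check.
option httpchk GET /_cat/indices/logstash-current
http-check expect rstring (green|yellow)
```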