We're using HAProxy 1.6.3 to load balance and route HTTP traffic to hundreds of backend servers. We reload the config often (several times a day), both automatically when a server fails and manually for administrative reasons.
The problem is that running the reload command takes up to 3 minutes on one of our HAProxy servers (Ubuntu 16.04). It does not seem to matter whether the server has traffic or not. On our other servers with the same version of OS and HAProxy, the reload takes 1-5 seconds, regardless of load. We have a bunch of long-running requests but as I said it does not seem to matter if the server has traffic or not.
We can see that a new process is spawned, but then it takes a few minutes before it starts accepting traffic (or at least until the CPU usage of the new process climbs above 0%).
The question is: What can cause HAProxy to take so long to reload? What is it doing for so long? How can I find out (for instance, what level of logging do I need to enable and what would I look for in the log?)
We run the following command to reload:
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)
And our config file looks like this:
global
log 127.0.0.1 local0 notice
maxconn 20000
user haproxy
group haproxy
tune.ssl.default-dh-param 2048
defaults
log global
mode http
option httplog
option dontlognull
option http-keep-alive
option forwardfor
retries 3
option redispatch
timeout connect 5s
timeout check 5s
timeout client 60000
timeout server 60000
stats enable
stats uri /haproxy?stats
stats auth [REDACTED]
option httpchk GET / HTTP/1.0
balance roundrobin
default-server inter 10s fall 2 rise 2
frontend http-in
bind *:80
# Define hosts
acl host_1 hdr(host) -i somehost.somedomain.com
[hundreds of host header configurations]
## switches
use_backend 1 if host_1
[hundreds of if-clauses]
frontend https-in
bind *:443 ssl crt [REDACTED] crt [REDACTED] crt [REDACTED]
# Define hosts
acl host_1 hdr(host) -i somehost.somedomain.com
[hundreds of host header configurations]
## switches
use_backend 1 if host_1
[hundreds of if-clauses]
backend 1
server node1 [some IP] check
server node2 [some IP] check
[lots more backends]
Thanks!
UPDATE: The only difference we have found is that the slow server uses Ubuntu 16.04, while the fast server uses 16.04.1. Not sure if that is relevant.
UPDATE 2: Some other server we have are also running 16.04 and they have fast reloads. So it's probably not that. Next step is to re-install haproxy and see if it helps.
UPDATE 3: Reinstall did not help. We're currently running with strace to try to find what it is doing. It seems like it's trying to connect to all our backends but timing out often. Not yet clear why only this server gets timeouts, and why it refuses to take over until all the polling is done.
Best Answer
It turns out that the slow start of the new HAProxy process was due to it trying to resolve the DNS's of all the backend servers. And since we had a lot of servers and a slow (or unresponsive) DNS server caused a lot of query timeouts which caused a slow start of the process.
We changed to a better DNS server and now it takes 2 seconds instead.