Ubuntu – HAProxy reloads very slow (almost 3 minutes)

haproxyUbuntu

We're using HAProxy 1.6.3 to load balance and route HTTP traffic to hundreds of backend servers. We reload the config often (several times a day), both automatically when a server fails and manually for administrative reasons.

The problem is that running the reload command takes up to 3 minutes on one of our HAProxy servers (Ubuntu 16.04). It does not seem to matter whether the server has traffic or not. On our other servers with the same version of OS and HAProxy, the reload takes 1-5 seconds, regardless of load. We have a bunch of long-running requests but as I said it does not seem to matter if the server has traffic or not.

We can see that a new process is spawned, but then it takes a few minutes before it starts accepting traffic (or at least until the CPU usage of the new process climbs above 0%).

The question is: What can cause HAProxy to take so long to reload? What is it doing for so long? How can I find out (for instance, what level of logging do I need to enable and what would I look for in the log?)

We run the following command to reload:

haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf $(cat /var/run/haproxy.pid)

And our config file looks like this:

global
    log 127.0.0.1 local0 notice             
    maxconn 20000                           
    user haproxy
    group haproxy
    tune.ssl.default-dh-param 2048

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option http-keep-alive                  
    option forwardfor                       
    retries 3                               
    option redispatch                       
    timeout connect 5s                      
    timeout check 5s                        
    timeout client 60000                    
    timeout server 60000                    

    stats enable
    stats uri /haproxy?stats
    stats auth [REDACTED]

    option httpchk GET / HTTP/1.0           
    balance roundrobin                      
    default-server inter 10s fall 2 rise 2  

frontend http-in
        bind *:80

        # Define hosts
        acl host_1 hdr(host) -i somehost.somedomain.com
        [hundreds of host header configurations]

        ## switches
        use_backend 1 if host_1
        [hundreds of if-clauses]

frontend https-in
        bind *:443 ssl crt [REDACTED] crt [REDACTED] crt [REDACTED]

        # Define hosts
        acl host_1 hdr(host) -i somehost.somedomain.com
        [hundreds of host header configurations]

        ## switches
        use_backend 1 if host_1
        [hundreds of if-clauses]

backend 1
        server node1 [some IP] check
        server node2 [some IP] check

[lots more backends]

Thanks!

UPDATE: The only difference we have found is that the slow server uses Ubuntu 16.04, while the fast server uses 16.04.1. Not sure if that is relevant.

UPDATE 2: Some other server we have are also running 16.04 and they have fast reloads. So it's probably not that. Next step is to re-install haproxy and see if it helps.

UPDATE 3: Reinstall did not help. We're currently running with strace to try to find what it is doing. It seems like it's trying to connect to all our backends but timing out often. Not yet clear why only this server gets timeouts, and why it refuses to take over until all the polling is done.

Best Answer

It turns out that the slow start of the new HAProxy process was due to it trying to resolve the DNS's of all the backend servers. And since we had a lot of servers and a slow (or unresponsive) DNS server caused a lot of query timeouts which caused a slow start of the process.

We changed to a better DNS server and now it takes 2 seconds instead.