Linux – HAProxy using 10 GB of memory and 100% CPU with 50k connections

haproxy, linux, networking, tcp, ubuntu

On an Ubuntu 14.04 x64 server, HAProxy uses 3.3 GB of memory and 6.8 GB of swap while handling 52k connections. CPU usage also kept spiking to 100% before most of the traffic was redirected to another HAProxy box. The traffic is mainly persistent TCP connections.

pid = 3185 (process #1, nbproc = 1)
uptime = 0d 6h14m21s
system limits: memmax = unlimited; ulimit-n = 524341
maxsock = 524341; maxconn = 262144; maxpipes = 0
current conns = 54303; current pipes = 0/0
Running tasks: 1/54336

Memory usage shot up sharply at around 50k connections. ulimit -n is set to 1048576.

Question: Is this memory usage unusually high? How can we reduce it?
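A rough way to see where the memory is actually going is to compare HAProxy's own resident memory with the kernel's TCP socket-buffer memory, since the latter never shows up in the process's RSS. This is only a sketch using standard Linux tools, not something from the original post:

# HAProxy's own (user-space) memory, RSS/VSZ in kB
ps -C haproxy -o pid,rss,vsz,cmd

# Kernel-side TCP memory: the "mem" field is counted in pages
# (usually 4 kB each), so multiply by the page size to get bytes
cat /proc/net/sockstat

# Per-socket receive/send buffer usage (skmem) for a closer look
ss -tm | head -n 40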

I've also read the following from another question; is it relevant? How should I check whether the TCP settings are appropriate for persistent TCP connections, so that they don't cause a huge increase in memory usage?

At 54000 concurrent connections, you should be careful about your TCP settings. If running with default settings (87kB read buffer, 16kB write buffer), you can end up eating 10 gigs of memory just for the sockets. 
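For what it's worth, the quoted figure is easy to reproduce as back-of-the-envelope arithmetic. The numbers below assume the default 87 kB read / 16 kB write per-socket buffers and HAProxy's default 16 kB buffer size, and they are a worst case, since the kernel only allocates socket-buffer memory as data actually queues up:

54,000 proxied connections  ≈ 108,000 sockets (one client-side + one server-side each)
108,000 sockets x (87,380 B rmem + 16,384 B wmem)  ≈ 11 GB if every buffer fills
54,000 connections x 2 x 16,384 B (HAProxy's own buffers)  ≈ 1.7 GB inside the haproxy process

With mostly idle persistent connections, real usage is normally well below the first figure, but slow or stalled peers can push it toward that ceiling.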

sysctl.conf

net.core.wmem_max = 12582912
net.core.rmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912
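Note that the three numbers in tcp_rmem/tcp_wmem are min, default and max per socket: the max (12582912 here) only caps autotuning, while the middle value is what every new socket starts with, so it is the knob that matters most with tens of thousands of mostly idle persistent connections. A hedged example of lowering the defaults and applying the change live; the exact values are only a starting point to test, not a recommendation:

# drop the per-socket default to 16 kB while keeping the max,
# so busy sockets can still autotune upward
sysctl -w net.ipv4.tcp_rmem="4096 16384 12582912"
sysctl -w net.ipv4.tcp_wmem="4096 16384 12582912"

# persist the values in /etc/sysctl.conf, then reload
sysctl -p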

haproxy.conf

global
    log /dev/log    local0
    log /dev/log    local1 notice
    maxconn 262144
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode    tcp
    option  tcplog
    option  dontlognull
    option  redispatch
    retries 3
    maxconn 262144
    timeout connect 180000
    timeout client  180000
    timeout server  180000
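On the HAProxy side, most of the per-connection memory is the two tune.bufsize buffers each session carries (16 kB each by default). For a plain TCP proxy with many mostly idle connections, shrinking them is a common way to cut the footprint; the value below is only an illustration and trades per-connection throughput for memory, so it needs testing (and would be too small for HTTP mode, where a buffer must hold complete headers):

global
    # two buffers of this size are allocated per connection, so 8 kB
    # instead of the default 16 kB roughly halves HAProxy's own
    # per-connection memory
    tune.bufsize 8192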

Update

Restarting (not reloading) HAProxy lowered the CPU load to 30%. What could have caused the high CPU load previously?

Best Answer

CPU load on HAProxy will spike to 100% once you run out of source ports and it has to scan for available ones. Usually that happens at around the 30k mark, though. What do you have for sysctl net.ipv4.ip_local_port_range?

So, for example, if you have 30k connections to a single server in the backend, you will likely run out of source ports and hit the CPU problem.
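To check the range, and widen it if it is still at the old Linux default of 32768-61000 (roughly 28k usable ephemeral ports), something like the following; the values are only an example and should avoid ports your own services listen on:

# current ephemeral port range
sysctl net.ipv4.ip_local_port_range

# widen it (persist in /etc/sysctl.conf as well)
sysctl -w net.ipv4.ip_local_port_range="10000 65000"

Since the limit effectively applies per source IP and destination IP:port, adding more backend server addresses, or giving server lines explicit source addresses on additional IPs, also multiplies the number of ports available before HAProxy starts hunting for free ones.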