Linux – HAProxy tuning – can’t support > 50 concurrent users

haproxylinux

I'm investigating replacing a proprietary, software load balancer with HAProxy. As part of this investigation, I am attempting to test HAProxy under load. While my HAProxy configuration works fine when testing it as a single user, as soon as I put any load on it, the speed of the site starts to drop dramatically, and before long (~ 100 simulated users) our load testing tool starts to report failures.

It's quite a straight forward configuration, with only notable points being we're using HAProxy 1.5.4 with OpenSSL and PCRE support compiled in and used. We also have some ACLs to match on URLs, although that frontend isn't being used in this load test.

This is running on a CentOS 6.5 machine.

Our (sanitised) configuration for the frontend/backend combination in the load test, along with global and defaults:

global 
  daemon
  tune.ssl.default-dh-param 2048
  maxconn 100000
  maxsessrate 100000
  log /dev/log local6

defaults
  mode http
  option forwardfor
  option http-server-close
  timeout client 61s
  timeout server 61s
  timeout connect 13s  
  log global
  option httplog

frontend stats
  bind xxx.xxx.xxx.xxx:80
  default_backend stats-backend

backend stats-backend
  stats enable
  server stats 127.0.0.1:80

frontend portal-frontend
  bind xxx.xxx.xxx.xxx:80
  default_backend portal-backend

frontend portal-frontend-https 
  bind xxx.xxx.xxx.xxx:443 ssl crt /path/to/pem
  default_backend portal-backend

backend portal-backend
  redirect scheme https if !{ ssl_fc }
  appsession session len 140 timeout 4h request-learn
  server web1.example.com web1.example.com:80 check
  server web2.example.com web2.example.com:80 check

[...snip...]

During the load test, we're getting some information from the logs, but not huge amounts. Relevant snippets:

Sep  4 11:06:12 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:30983 [04/Sep/2014:11:05:42.984] portal-frontend-https~ portal-frontend-https/<NOSRV> -1/-1/-1/-1/28782 408 212 - - cR-- 1840/1840/0/0/0 0/0 "<BADREQ>"
...
Sep  4 11:06:03 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:61502 [04/Sep/2014:11:05:47.810] portal-frontend-https~ portal-frontend-https/<NOSRV> -1/-1/-1/-1/14345 400 187 - - CR-- 1715/1693/0/0/0 0/0 "<BADREQ>"
...
Sep  4 11:06:03 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:43939 [04/Sep/2014:11:05:59.553] portal-frontend portal-backend/<NOSRV> 314/-1/-1/-1/2602 302 181 - - LR-- 1719/22/223/0/3 0/0 "GET /mon/login.php?C=1&LID=15576783&TID=8145&PID=8802 HTTP/1.1"

On the basis of these log entries, we've tried things like adjusting timeout http-request, but without any improvement (the load test will run for longer before failures are reported by our tool, but the slow down occurs in a similar way).

I'm confident HAProxy is capable of doing far better than this, but I really don't know where to turn now to start diagnosing what the problem (or limitation) is.

Best Answer

Please run dmesg and ensure your iptables' conntrack table is not full... You may have many messages like this one: "ip_conntrack: table full, dropping packet"

If so, tune your sysctl: net.ipv4.netfilter.ip_conntrack_max Default value is very low. You can set it up to 50000, maybe more, depends on your workload.

Baptiste