High traffic websocket/haproxy tuning

haproxynode.jssocket

i have a pubsub application (mostly chat but some other goodies being pub-ed and sub-ed too) running on node & socket.io.

i'm load testing this app by spinning up some other, real large, boxes and running a node app i wrote for this purpose that spawns a ton of processes that connect using the socket.io-client package.

i found that i can get about 1k concurrent connections to a single 1gig rackspace cloud box. we need to support between 10k and 100k concurrent connections (for specific events, not all the time) though so i put a load balancer in front and figured before a big event i'd spin up more machines. but i've put a haproxy box in front and found that with 2 servers and 2k users i'm golden but with 4 servers even 3k users is a struggle!

i noticed that when my load tests start causing lots of disconnects the node servers are experiencing very high cpu usage (in the 90%) which i find odd because when 2 servers and 2k users i end up with a max of 70% which quickly diminishes.

here are some relevant lines from my haproxy config:

mode http
timeout client 86400000
timeout server 86400000
timeout connect 5000
maxconn 100000

i've also put some kernel net tuning into /etc/sysctl.conf on my haproxy and node boxes:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023 
net.ipv4.tcp_max_syn_backlog = 10240 
net.ipv4.tcp_max_tw_buckets = 400000 
net.ipv4.tcp_max_orphans = 60000 
net.ipv4.tcp_synack_retries = 3 
net.core.somaxconn = 50000 
net.core.netdev_max_backlog = 50000 
net.ipv4.tcp_rmem = 8192 87380 8388608 
net.ipv4.tcp_wmem = 8192  87380 8388608

and both the haproxy and node boxes have
ulimit -n 99999
in their relevant init scripts (before starting haproxy or node)

the haproxy box is consistently at single digit (or less) cpu usage.

what should my next steps be? does anything here stick out as an issue?

Best Answer

Turn on HAProxy HTTP logging if you haven't already and see if you can find the cause of the disconnects in either that or syslog of the server. The HAProxy HTTP log format includes the termination_state for requests which should help point you in the right direction.