HAProxy + nginx reaching max numtcpsock in about 24 hours

haproxy, nginx, node.js, tcp

I'm running a relatively simple VPS (a Media Temple (ve)) with a few PHP-based websites and (eventually) a few Node servers. In order to enable WebSockets support, I'm using HAProxy on port 80, which routes to either nginx or a particular Node process.

I've recently run into a problem, though, where over the course of about 24 hours, my server hits the maximum allowed number of open TCP connections (numtcpsock in Parallels Power Panel, which is set to 1,000). Running nginx alone does not cause this problem, and I currently have no active Node backend servers. Nginx connects to PHP via a UNIX domain socket (and again, the problem doesn't occur with nginx alone). Any thoughts on what could cause this? My configuration:

global
    ## 00-base
    maxconn     500
    nbproc      2
defaults
    ## 00-base
    mode        http
frontend all
    ## 00-ports
    bind 0.0.0.0:80
    ## 10-config
    timeout client 86400000    # 86400000 ms = 24 hours
    default_backend nginx
backend nginx
    ## 00-timeouts
    timeout http-keep-alive 5000
    timeout server 10000
    timeout connect 4000
    ## 10-servers
    server main localhost:8000

Thanks in advance!

UPDATE: after a little bit of lsofing, I was able to determine that more than 90% of the open TCP sockets are indeed owned by HAProxy, and the overwhelming majority of them are in the CLOSE_WAIT or FIN_WAIT2 states. Is this an HAProxy bug? It seems like a file descriptor leak of some kind, unless it's a misconfiguration on my part.
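
For reference, a one-liner along these lines will tally the TCP sockets owned by the haproxy processes by connection state (this is only a rough sketch of the kind of lsof invocation used here, and assumes the state is the last field of each output line):

    # count haproxy-owned TCP sockets per state (CLOSE_WAIT, FIN_WAIT2, ...)
    lsof -nP -iTCP -a -c haproxy | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn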

UPDATE 2: I've noticed a pattern in the lsof output. It seems that nginx closes its internal connection to HAProxy, but before HAProxy formally closes that side, it tries to close the external connection to the client (putting it into FIN_WAIT2). Because the client's FIN never arrives, the connection between nginx and HAProxy stays in CLOSE_WAIT forever. Now the only question is: why is this happening?

Best Answer

The issue is caused by your extremely large timeout. With a 24-hour timeout and a limit of 1,000 concurrent connections, you can clearly expect to fill it up with clients that disconnect the dirty way. Please use a more reasonable timeout, from minutes to a few hours at most; it really makes no sense to use one-day timeouts on the internet. As DukeLion said, the system is waiting for HAProxy to close the connection, because HAProxy never received the close from the client.
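
For example, a more conservative client timeout in the frontend could look like the sketch below. The 30-minute value is only an illustration, not a figure from this answer; HAProxy accepts time-unit suffixes such as s, m and h, so pick whatever matches how long your clients may legitimately stay idle:

frontend all
    ## 00-ports
    bind 0.0.0.0:80
    ## 10-config
    timeout client 30m
    default_backend nginx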

Since HAProxy works in tunnel mode for TCP and WebSocket traffic, it follows the usual four-way close:

- receive a close on side A
- forward the close to side B
- receive the close on side B
- forward the close to side A

In your case, I suppose side A was the server and side B the client. So nginx closed after some time and the socket went to CLOSE_WAIT; HAProxy forwarded the close to the client and that socket went to FIN_WAIT1; the client ACKed, moving the socket to FIN_WAIT2; and then nothing more happened, because the client had disappeared, which is very common on the net. And your timeout means you want things to remain that way for 24 hours.

After 24 hours, your sessions will finally start timing out on the client side, so HAProxy will kill them and forward the close to the nginx side, getting rid of those sockets too. But clearly you don't want to wait that long: WebSocket was designed so that an idle connection can be reopened transparently, so there is no reason to keep an idle connection open for 24 hours. No firewall along the way will keep it open that long anyway!