Unclosed TCP connections in CLOSE_WAIT for various processes

cassandra connection tcp

I have a cluster of several machines connected over a 10GBE network (the NICs are Intel 82599EB 10GBE SFI/SFP+) running Debian 6.0, and I am seeing TCP connections that hang in the CLOSE_WAIT state. I know that, in theory, a connection in CLOSE_WAIT should be explicitly closed by the application, but in my case at least two different applications produce these hung connections, so I suspect the problem lies elsewhere.


I first reproduced this problem with Cassandra running as a daemon under the 'jsvc' process. One Cassandra node (the "server") did not close a connection that had already been closed on the side of the other node, which had initiated it (the "client").
After that I ran the 'netperf' TCP_CRR test and got this error message:

netperf -H 172.15.2.166 -t TCP_CRR -l -5 -D TCP
Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.15.2.166 (172.15.2.166) port 0 AF_INET : demo
send_tcp_conn_rr: data recv error: Connection reset by peer

This left a TCP connection hanging in the CLOSE_WAIT state on the 172.15.2.166 machine, with a strange 1 byte in Recv-Q:

tcp 1 0 172.15.2.166:12865 172.15.2.161:42863 CLOSE_WAIT

I've updated the 'ixgbe' driver to the latest version, 3.9-NAPI, but the problem persists, and now I wonder what else could be causing it.

Best Answer

  1. Your notes indicate that the server saw a FIN followed by an RST from the client
    and, most likely, the server application has not closed its end of the connection properly
  2. If for any reason you are not sure which application the connection belongs to,
    use lsof -n | grep CLOSE.WAIT
  3. If this is Cassandra, you may want to check
    this StackOverflow question: cassandra too many open files
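
To illustrate point 1, here is a minimal Python sketch of what "closing properly" looks like on the server side. It is not Cassandra or netperf code; the names serve_once and run_demo are made up for this example, and it runs over loopback only. When recv() returns an empty bytes object, the peer has sent its FIN and the socket is in CLOSE_WAIT until the application calls close() on it.

```python
import socket
import threading


def serve_once(listener, events):
    """Accept one connection, echo data, and close once the peer closes."""
    conn, _ = listener.accept()
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # Empty read: the peer sent FIN, so this socket is now
                # in CLOSE_WAIT until we close it ourselves.
                events.append("eof")
                break
            conn.sendall(data)
    finally:
        # Closing sends our own FIN and takes the socket out of CLOSE_WAIT.
        conn.close()
        events.append("closed")


def run_demo():
    """Hypothetical loopback demo: one client connects, sends, and closes."""
    events = []
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    worker = threading.Thread(target=serve_once, args=(listener, events))
    worker.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"ping")
    reply = client.recv(4096)
    client.close()  # client side initiates the close (sends FIN)

    worker.join(timeout=5)
    listener.close()
    return reply, events


if __name__ == "__main__":
    print(run_demo())
```

If the conn.close() call is skipped (or never reached because the application is blocked elsewhere), the socket stays in CLOSE_WAIT for as long as the process keeps the descriptor open, which is the symptom that the lsof command in point 2 helps to track down.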