Unclosed TCP connections in CLOSE_WAIT for various processes

cassandra connection tcp

I have a cluster of several machines connected over a 10GbE network (the NICs are Intel 82599EB 10GbE SFI/SFP+) running Debian 6.0, and I am seeing TCP connections stuck in the CLOSE_WAIT state. I know that in theory a connection in CLOSE_WAIT should be closed explicitly by the application, but in my case at least two different applications leave such connections behind, so I suspect the problem lies elsewhere.

I first reproduced the problem with Cassandra running as a daemon under the 'jsvc' process. One Cassandra node (the "server") didn't close a connection that had already been closed on the side of the other node, the one that had initiated it (the "client").
After that I ran the 'netperf' TCP_CRR test and got this error message:

netperf -H 172.15.2.166 -t TCP_CRR -l -5 -D
TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.15.2.166 (172.15.2.166) port 0 AF_INET : demo
send_tcp_conn_rr: data recv error: Connection reset by peer

This left a TCP connection hanging in the CLOSE_WAIT state on the 172.15.2.166 machine, with a strange 1 byte in Recv-Q:

tcp 1 0 172.15.2.166:12865 172.15.2.161:42863 CLOSE_WAIT

I've updated the 'ixgbe' driver to the latest 3.9-NAPI, but the problem persists, and now I wonder what else could be causing it.

Best Answer

Your notes indicate that the server saw a FIN followed by an RST from the client and that, most likely, the server application has not closed its socket properly.
If you are not sure which application a connection belongs to, use lsof -n | grep CLOSE.WAIT
If this is Cassandra, you may want to check this StackOverflow question: cassandra too many open files
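
When many sockets are stuck, a per-process tally is easier to read than the raw lsof listing. The following is a minimal sketch, not part of the original answer: count_close_wait is a hypothetical helper name, and it assumes the usual lsof column layout (command in column 1, PID in column 2, TCP state in parentheses at the end of the line).

```shell
# Hypothetical helper: reads lsof-style output on stdin and prints
# "<count> <command> <pid>" for every process holding CLOSE_WAIT sockets,
# busiest process first.
count_close_wait() {
    awk '/\(CLOSE_WAIT\)/ { n[$1 " " $2]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Typical use (run as root to see sockets owned by other users):
#   lsof -n -iTCP | count_close_wait
```

A process whose count keeps growing between runs is the one that is not calling close() on its sockets.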

Related Solutions

CentOS open port 7000 [RESOLVED]

The default chain in CentOS isn't INPUT; it's RH-Firewall-1-INPUT.

Add the following rule to /etc/sysconfig/iptables, above any final REJECT rule in that chain (rules placed after it are never reached):

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 7000 -j ACCEPT

Then restart iptables with /etc/init.d/iptables restart

Monitoring TCP Connections – Finding Short-Lived TCP Connections Owner Process

You can use the auditd framework for this kind of thing. It is not very "user friendly" or intuitive, so it requires a bit of digging around on your part.

First make sure you have auditd installed and running, and that your kernel supports it.
On Ubuntu, for example, you can install it with apt-get install auditd.

Then add an audit rule that monitors all connect syscalls, like this:

auditctl -a exit,always -F arch=b64 -S connect -k MYCONNECT

If you are using a 32-bit installation of Linux, change b64 to b32.

This command inserts a rule into the audit framework, and every connect() syscall will now be logged to your audit log files (usually /var/log/audit/audit.log) for you to look at.

For example, a connection with netcat to news.ycombinator.com port 80 will result in something like this:

type=SYSCALL msg=audit(1326872512.453:12752): arch=c000003e syscall=42 success=no exit=-115 a0=3 a1=24e8fa0 a2=10 a3=7fff07a44cd0 items=0 ppid=5675 pid=7270 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts4 ses=4294967295 comm="nc" exe="/bin/nc.openbsd" key="MYCONNECT"
type=SOCKADDR msg=audit(1326872512.453:12752): saddr=02000050AE84E16A0000000000000000

Here you can see that the /bin/nc.openbsd application initiated a connect() call. If you get lots of connect calls and only want to grep out a certain IP or port, you have to do some conversion. The SOCKADDR line contains a saddr argument: it begins with 0200 (the address family, AF_INET), followed by the port number in hexadecimal (0050, which is 80), and then the IP in hex (AE84E16A), which is news.ycombinator.com's IP of 174.132.225.106.
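
The conversion can also be scripted instead of done by hand. This is a minimal sketch under stated assumptions: decode_saddr is a hypothetical helper name, it only handles IPv4 (saddr values starting with 0200), and it relies on the shell's printf accepting 0x-prefixed hex, which bash and dash both do.

```shell
# Hypothetical helper: decode an AF_INET saddr value from an audit
# SOCKADDR record into "ip:port". Plain POSIX sh, so variables are global.
decode_saddr() {
    saddr=$1
    # The first two bytes are the address family; 0200 means AF_INET.
    case $saddr in
        0200*) ;;
        *) echo "unsupported address family: $saddr" >&2; return 1 ;;
    esac
    # Bytes 3-4 are the port, bytes 5-8 the IPv4 address (network byte order).
    port=$(printf '%s' "$saddr" | cut -c5-8)
    o1=$(printf '%s' "$saddr" | cut -c9-10)
    o2=$(printf '%s' "$saddr" | cut -c11-12)
    o3=$(printf '%s' "$saddr" | cut -c13-14)
    o4=$(printf '%s' "$saddr" | cut -c15-16)
    # printf converts each 0x-prefixed hex field to decimal.
    printf '%d.%d.%d.%d:%d\n' "0x$o1" "0x$o2" "0x$o3" "0x$o4" "0x$port"
}
```

Running decode_saddr 02000050AE84E16A0000000000000000 prints 174.132.225.106:80, which matches the manual conversion above. Depending on your audit version, ausearch -k MYCONNECT -i may already show saddr in interpreted form, in which case no helper is needed.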

The audit framework can generate a lot of logs, so remember to disable it when you've accomplished your mission. To remove the above rule, simply replace -a with -d:

auditctl -d exit,always -F arch=b64 -S connect -k MYCONNECT

Good documentation on the auditd framework:
http://doc.opensuse.org/products/draft/SLES/SLES-security_sd_draft/part.audit.html

Convert IP addresses to/from hex, dec, binary, etc. at:
http://www.kloth.net/services/iplocate.php

General hex/dec converter:
http://www.statman.info/conversions/hexadecimal.html

A Brief Introduction to auditd, from the IT Security Stack Exchange. http://security.blogoverflow.com/2013/01/a-brief-introduction-to-auditd/

Edit 1:
Another quick'n'dirty (Swedish: fulhack) way to do it is to run a tight loop that dumps the connection data for you, like this:

while true; do
  ss -ntap -o state established '( dport = :80 )'
  sleep 1
done

This loop uses the ss command (socket statistics) to dump the currently established connections to port 80, including the process that initiated each one. If it's a lot of data, you can add | tee /tmp/output after done to show the output on the screen as well as write it to /tmp/output for later processing/digging. If it doesn't catch the quick haproxy connection, try removing sleep 1, but be cautious of extensive logging if it's a heavily utilized machine. Modify as needed!
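
A bounded variant of the same loop can be left running without worrying about it logging forever. This is a sketch only: sample_port and its four arguments (port, number of samples, interval in seconds, log file) are hypothetical names, and the ss invocation is the one from the loop above with the port made a parameter.

```shell
# Hypothetical helper: sample established connections to a destination port
# a fixed number of times, timestamp each sample, and tee everything to a log.
# usage: sample_port PORT COUNT INTERVAL LOGFILE
sample_port() {
    port=$1 count=$2 interval=$3 logfile=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        # Timestamp so each sample can be matched against other logs.
        date '+%Y-%m-%d %H:%M:%S'
        ss -ntap -o state established "( dport = :$port )"
        sleep "$interval"
        i=$((i + 1))
    done | tee "$logfile"
}
```

For example, sample_port 80 60 1 /tmp/output takes one sample per second for a minute. An interval of 0 gives the tightest loop, at the cost of more output.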
