Linux – tcp_tw_reuse and tcp_tw_recycle not working in specific environment

haproxylinuxlinux-networkingMySQLtcp

We are running HAProxy to direct mysql traffic from PHP to a set of mysql servers. However, due to an issue with the way the mysql-client sends TCP packets, we get a ton of connections in TIME_WAIT with concurrent port utilization at 50k+, and end up with SOCKERR messages on HAProxy due to port exhaustion under extremely heavy load as described here – http://blog.haproxy.com/2012/12/12/haproxy-high-mysql-request-rate-and-tcp-source-port-exhaustion/

The article above suggests enabling tcp_tw_reuse and tcp_tw_recycle on the HAProxy server, which we've done on one test environment and this practically solved our issue and kept TIME_WAIT connections below 1000 under heavy load. However, the same cannot be said for another environment we have where we also enabled these two TCP settings – the TIME_WAIT is still high and port utilization is still 50k+.

Both environments are on the same kernel, same haproxy version, and we can't figure out what could be contributing to this specific environment not accepting the tcp_tw_reuse and tcp_tw_recycle changes.

On both environments, we expanded the port_range to 1024-65535. This is on CentOS 6.4.

Please assist – we are spinning our heads here and if you need more information, I can provide. Thank you.

Best Answer

Found the root cause. tcp_timestamps must be enabled on the local server as well as whatever outbound server you are trying to reach. tcp_tw_reuse and tcp_tw_recycle depend on tcp_timestamps to determine which ports to reuse. See this - http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html

Related Solutions

Linux – How to Reduce Number of Sockets in TIME_WAIT

One thing you should do to start is to fix the net.ipv4.tcp_fin_timeout=1. That is way to low, you should probably not take that much lower than 30.

Since this is behind nginx. Does that mean nginx is acting as a reverse proxy? If that is the case then your connections are 2x (one to client, one to your web servers). Do you know which end these sockets belong to?

Update:
fin_timeout is how long they stay in FIN-WAIT-2 (From networking/ip-sysctl.txt in the kernel documentation):

tcp_fin_timeout - INTEGER
        Time to hold socket in state FIN-WAIT-2, if it was closed
        by our side. Peer can be broken and never close its side,
        or even died unexpectedly. Default value is 60sec.
        Usual value used in 2.2 was 180 seconds, you may restore
        it, but remember that if your machine is even underloaded WEB server,
        you risk to overflow memory with kilotons of dead sockets,
        FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1,
        because they eat maximum 1.5K of memory, but they tend
        to live longer. Cf. tcp_max_orphans.

I think you maybe just have to let Linux keep the TIME_WAIT socket number up against what looks like maybe 32k cap on them and this is where Linux recycles them. This 32k is alluded to in this link:

Also, I find the /proc/sys/net/ipv4/tcp_max_tw_buckets confusing. Although the default is set at 180000, I see a TCP disruption when I have 32K TIME_WAIT sockets on my system, regardless of the max tw buckets.

This link also suggests that the TIME_WAIT state is 60 seconds and can not be tuned via proc.

Random fun fact:
You can see the timers on the timewait with netstat for each socket with netstat -on | grep TIME_WAIT | less

Reuse Vs Recycle:
These are kind of interesting, it reads like reuse enable the reuse of time_Wait sockets, and recycle puts it into TURBO mode:

tcp_tw_recycle - BOOLEAN
        Enable fast recycling TIME-WAIT sockets. Default value is 0.
        It should not be changed without advice/request of technical
        experts.

tcp_tw_reuse - BOOLEAN
        Allow to reuse TIME-WAIT sockets for new connections when it is
        safe from protocol viewpoint. Default value is 0.
        It should not be changed without advice/request of technical
        experts.

I wouldn't recommend using net.ipv4.tcp_tw_recycle as it causes problems with NAT clients.

Maybe you might try not having both of those switched on and see what effect it has (Try one at a time and see how they work on their own)? I would use netstat -n | grep TIME_WAIT | wc -l for faster feedback than Munin.

Is it dangerous to change the value of /proc/sys/net/ipv4/tcp_tw_reuse

You can safely reduce the time down, but you may run into issues with inproperly closed connections on networks with packet loss or jitter. I wouldn't start tuning at 1 second, start at 15-30 and work your way down.

Also, you really need to fix your application.

RFC 1185 has a good explanation in section 3.2:

When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state ties up the socket pair for 4 minutes (see Section 3.5 of [Postel81]. Applications built upon TCP that close one connection and open a new one (e.g., an FTP data transfer connection using Stream mode) must choose a new socket pair each time. This delay serves two different purposes:

 (a)  Implement the full-duplex reliable close handshake of TCP. 

      The proper time to delay the final close step is not really 
      related to the MSL; it depends instead upon the RTO for the 
      FIN segments and therefore upon the RTT of the path.* 
      Although there is no formal upper-bound on RTT, common 
      network engineering practice makes an RTT greater than 1 
      minute very unlikely.  Thus, the 4 minute delay in TIME-WAIT 
      state works satisfactorily to provide a reliable full-duplex 
      TCP close.  Note again that this is independent of MSL 
      enforcement and network speed. 

      The TIME-WAIT state could cause an indirect performance 
      problem if an application needed to repeatedly close one 
      connection and open another at a very high frequency, since 
      the number of available TCP ports on a host is less than 
      2**16.  However, high network speeds are not the major 
      contributor to this problem; the RTT is the limiting factor 
      in how quickly connections can be opened and closed. 
      Therefore, this problem will no worse at high transfer 
      speeds. 

 (b)  Allow old duplicate segements to expire. 

      Suppose that a host keeps a cache of the last timestamp 
      received from each remote host.  This can be used to reject 
      old duplicate segments from earlier incarnations of the

*Note: It could be argued that the side that is sending a FIN knows what degree of reliability it needs, and therefore it should be able to determine the length of the TIME-WAIT delay for the FIN's recipient. This could be accomplished with an appropriate TCP option in FIN segments.

      connection, if the timestamp clock can be guaranteed to have 
      ticked at least once since the old conennection was open. 
      This requires that the TIME-WAIT delay plus the RTT together 
      must be at least one tick of the sender's timestamp clock. 

      Note that this is a variant on the mechanism proposed by 
      Garlick, Rom, and Postel (see the appendix), which required 
      each host to maintain connection records containing the 
      highest sequence numbers on every connection.  Using 
      timestamps instead, it is only necessary to keep one quantity 
      per remote host, regardless of the number of simultaneous 
      connections to that host.

Best Answer

Related Solutions

Linux – How to Reduce Number of Sockets in TIME_WAIT

Is it dangerous to change the value of /proc/sys/net/ipv4/tcp_tw_reuse

Related Topic