Linux – On Linux, is there a configurable socket timeout between kernel and user space

linux socket

I'm currently fighting with a crappy piece of custom server software that doesn't accept its connections properly (written in Java by a PHP programmer who had never touched sockets before, let alone threads). My guess is that a thread dies before the socket is properly accepted in the client thread. I can't be sure, and it doesn't matter much, since the software is currently being reimplemented; the old version has to be kept running as reliably as possible until the new version goes online, but without any time or money spent on debugging the old codebase.

The bug manifests itself in the following netstat output; some connections are never handed over from the kernel to user space (that's how I interpret it; better interpretations are welcome):

Proto Recv-Q Send-Q Local Address         Foreign Address         State       PID/Program name
tcp6     228      0 192.0.2.105:1988      46.23.248.10:7925       ESTABLISHED -               
tcp6       0      0 192.0.2.105:1988      221.130.33.37:9826      ESTABLISHED 14741/java      
tcp6       0      0 192.0.2.105:1988      46.23.248.2:5867        ESTABLISHED 14741/java      
tcp6    2677      0 192.0.2.105:1988      221.130.33.37:15688     ESTABLISHED -               
tcp6    3375      0 192.0.2.105:1988      221.130.33.36:3045      ESTABLISHED -               
tcp6   14742      0 192.0.2.105:1988      46.23.248.17:4679       ESTABLISHED -               
tcp6     774      0 192.0.2.105:1988      212.9.19.73:36064       ESTABLISHED -               
tcp6      92      0 192.0.2.105:1988      46.23.248.19:7164       ESTABLISHED -               
tcp6       0      0 192.0.2.105:1988      46.23.248.21:6322       ESTABLISHED 14741/java      
tcp6       0      0 192.0.2.105:1988      221.130.39.216:13937    ESTABLISHED 14741/java      
tcp6    3051      0 192.0.2.105:1988      211.139.145.104:31239   ESTABLISHED -               
tcp6     246      0 192.0.2.105:1988      46.23.248.10:5458       ESTABLISHED -               
tcp6     618      0 192.0.2.105:1988      212.9.19.73:20209       ESTABLISHED -               
tcp6    1041      0 192.0.2.105:1988      46.23.248.18:7424       ESTABLISHED -               
tcp6       0      0 192.0.2.105:1988      46.23.248.10:5065       ESTABLISHED 14741/java      
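
To make the pattern in this listing easier to spot, here is a minimal Python sketch that picks out the suspicious rows: ESTABLISHED connections with a non-zero Recv-Q and no owning process. The function name `find_unaccepted` and the sample text are made up for illustration, and the column layout is assumed to match the `netstat` output shown above. Keep in mind that `netstat` also prints `-` in the PID column when it is run without sufficient privileges, so the check is only meaningful when run as root.

```python
def find_unaccepted(netstat_output):
    """Return (recv_q, foreign_address) for rows with queued data and no owner.

    Assumes the column order shown above:
    Proto Recv-Q Send-Q Local Foreign State PID/Program
    """
    stale = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a TCP row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        recv_q, foreign, state, owner = fields[1], fields[4], fields[5], fields[6]
        # "-" means no process holds a descriptor for the connection.
        if state == "ESTABLISHED" and owner == "-" and int(recv_q) > 0:
            stale.append((int(recv_q), foreign))
    return stale


# Hypothetical sample: three rows copied from the listing above.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address         Foreign Address         State       PID/Program name
tcp6     228      0 192.0.2.105:1988      46.23.248.10:7925       ESTABLISHED -
tcp6       0      0 192.0.2.105:1988      221.130.33.37:9826      ESTABLISHED 14741/java
tcp6    2677      0 192.0.2.105:1988      221.130.33.37:15688     ESTABLISHED -
"""

for recv_q, peer in find_unaccepted(SAMPLE):
    print(f"{peer}: {recv_q} bytes queued, no owning process")
```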

When this happens and the clients reconnect, they tend to work. But they won't reconnect on their own until they run into a rather long timeout. Since the custom full-duplex protocol in its current incarnation doesn't acknowledge any data sent by the client, and the client doesn't expect regular incoming requests from the server, this can take days: the client happily keeps sending its data until the kernel's receive queue fills up. On the server (kernel) side it should be possible to detect stale sockets, since the clients send data regularly.

So, assuming my interpretation of the problem is correct, what I'm wondering is whether there is a kernel parameter I can tune that makes the kernel drop/close TCP connections with a RST if user space doesn't read from them in a timely manner.

Better explanations of what happens here are welcome as well.

Best Answer

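You can try tuning TCP keepalive to much shorter values. By default, a connection has to be idle for two hours before keepalive kicks in (net.ipv4.tcp_keepalive_time defaults to 7200 seconds), and keepalive probes are only sent on sockets that have SO_KEEPALIVE enabled.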

Exactly which values you should use really depends on what your application does and on what your users expect or how they interact with it.
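
As an illustration of what "shorter values" could look like on a single socket, here is a minimal Python sketch. The helper name `enable_keepalive` and the numbers (60 s idle, 10 s between probes, 5 probes) are assumptions chosen for the example, not recommendations. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are Linux socket options, so the sketch skips any that the platform does not expose. The system-wide defaults for sockets that only set `SO_KEEPALIVE` are the sysctls `net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl` and `net.ipv4.tcp_keepalive_probes`.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive for `sock` with per-socket timing overrides.

    idle:     seconds of inactivity before the first probe (illustrative value)
    interval: seconds between unanswered probes (illustrative value)
    probes:   unanswered probes before the connection is dropped
    """
    # Without SO_KEEPALIVE the kernel never sends probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide sysctl defaults (Linux).
    overrides = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:  # skip options this platform lacks
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

In Java the basic switch is `Socket.setKeepAlive(true)`. Note that keepalive probes are answered by the peer's kernel, not by the application, so they detect a dead or unreachable peer rather than a process that has stopped reading from its socket.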