Linux – Number of SSH connections rises and data stops flowing

connection, linux, redhat, ssh, ssh-tunnel

We have a client-server setup in which the client establishes an SSH tunnel and uses port forwarding to send data to the server:

ssh -N -L 5000:localhost:5500 user@serveraddress
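
For context, this is roughly what the client side of the setup looks like. The ServerAliveInterval/ServerAliveCountMax options are not part of our original command; they are an assumption about how a tunnel like this is often configured so the client detects and tears down a dead tunnel instead of leaving it hanging:

# Forward local port 5000 on the client to port 5500 on the server
# (localhost as seen from serveraddress). The keepalive options are an
# assumption, not in our original command: they make the client drop a
# dead tunnel after roughly 90 seconds.
ssh -N \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -L 5000:localhost:5500 user@serveraddress &

# The client application then writes to the local end of the tunnel,
# e.g. (hypothetical application name):
# some-client-app --connect localhost:5000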

The normal number of SSH connections on the server is around 150, and under normal conditions the server software processes incoming connections quickly (a few seconds at most).

Recently, however, we have noticed the number of SSH connections rising to 900+. At that point the server software still sees and accepts incoming connections, but no data comes through.

Has anyone seen such symptoms with SSH before? Any ideas on what the issue could be?

Server OS: Red Hat Linux 5.5
Firewall: Disabled
Key Exchange: Tested

EDIT: Adding excerpts from /var/log/secure on the server side.

The log contains a large number of entries like the following:

Apr 10 00:07:33 myserver sshd[15038]: fatal: Write failed: Connection timed out
Apr 10 00:12:01 myserver sshd[5259]: fatal: Read from socket failed: Connection reset by peer
Apr 10 00:44:48 myserver sshd[17026]: fatal: Write failed: No route to host
Apr 10 02:09:16 myserver sshd[10398]: fatal: Read from socket failed: Connection reset by peer
Apr 10 02:22:47 myserver sshd[24581]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:05:57 myserver sshd[12003]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:23:19 myserver sshd[22421]: fatal: Write failed: Connection timed out
Apr 10 08:13:43 myserver sshd[31993]: fatal: Read from socket failed: Connection reset by peer
Apr 10 08:36:39 myserver sshd[7759]: fatal: Read from socket failed: Connection reset by peer
Apr 10 09:02:32 myserver sshd[12470]: fatal: Write failed: Broken pipe
Apr 10 12:08:05 myserver sshd[728]: fatal: Write failed: Connection reset by peer
Apr 10 12:35:53 myserver sshd[6184]: fatal: Read from socket failed: Connection reset by peer
Apr 10 12:43:14 myserver sshd[2663]: fatal: Write failed: Connection timed out
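
To see which of these failure modes dominates, the fatal messages can be tallied with a one-liner along these lines (assuming sshd logs to the default /var/log/secure, as on RHEL):

# Count sshd "fatal:" entries by error text (log path assumed; adjust if
# syslog routes sshd messages elsewhere).
grep 'sshd\[' /var/log/secure | grep -o 'fatal: .*' | sort | uniq -c | sort -rn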

NOTE: After about 10-15 minutes at 900+ connections, the system recovers on its own: the number of connections drops back to the normal range and the server starts receiving data again. It looks like a DoS/DDoS, but this is on an internal network.

ADDENDUM: Checked the connection states based on @kranteg's question. We just had another outage; below are the results from a script I wrote that counts all incoming SSH connections by TCP state (a sketch of a similar script appears after the output):

===
Tue Apr 15 12:22:07 EDT 2014 -> Total SSH connections: 996
===
0 SYN_SENT
1 SYN_RECV
0 FIN_WAIT1
0 FIN_WAIT2
15 TIME_WAIT
0 CLOSED
760 CLOSE_WAIT
143 ESTABLISHED
77 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===
===
Tue Apr 15 12:22:17 EDT 2014 -> Total SSH connections: 977
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
15 TIME_WAIT
0 CLOSED
756 CLOSE_WAIT
127 ESTABLISHED
76 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===
===
Tue Apr 15 12:22:26 EDT 2014 -> Total SSH connections: 979
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
12 TIME_WAIT
0 CLOSED
739 CLOSE_WAIT
148 ESTABLISHED
77 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===

The jump is in the number of connections in CLOSE_WAIT; during normal operation, the CLOSE_WAIT count is zero or very close to it.
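
For reference, the per-state counts above can be reproduced with a loop along these lines. This is a sketch rather than our exact script; it assumes netstat is available and that sshd listens on port 22:

# Print per-state counts of inbound SSH connections every 10 seconds.
while true; do
    snapshot=$(netstat -ant | awk '$4 ~ /:22$/ && $6 != "LISTEN"')
    echo "==="
    echo "$(date) -> Total SSH connections: $(echo "$snapshot" | grep -c .)"
    echo "==="
    for state in SYN_SENT SYN_RECV FIN_WAIT1 FIN_WAIT2 TIME_WAIT CLOSED \
                 CLOSE_WAIT ESTABLISHED LAST_ACK LISTEN CLOSING UNKNOWN; do
        echo "$(echo "$snapshot" | awk -v s="$state" '$6 == s' | wc -l) $state"
    done
    echo "==="
    sleep 10
done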

Best Answer

I don't know if this is the correct solution, but it worked for us. Hopefully it will at least point others in the right direction, even if it doesn't solve the problem completely.

We noticed that every time we had an outage, CPU usage was near 100%. The cause was another application batch-processing certain files and consuming most of the CPU. We turned that process off and have not had a single outage since. I honestly don't know whether that was the root cause, but it has clearly helped us.
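
For anyone seeing similar symptoms, it may be worth logging CPU load next to the connection count before killing anything, to confirm the two actually move together. A minimal sketch (the log path is just an example):

# Record load average and inbound SSH connection count once a minute so a
# CPU spike can be matched against the connection pile-up.
while true; do
    load=$(cut -d' ' -f1-3 /proc/loadavg)
    conns=$(netstat -ant | awk '$4 ~ /:22$/ && $6 != "LISTEN"' | wc -l)
    echo "$(date '+%F %T')  load: $load  ssh connections: $conns"
    sleep 60
done >> /tmp/ssh_load.log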