SSH – Debugging flaky SSH tunnel

ssh tunneling

We have a dedicated SSH tunnel server, which supports a few dozen remotely located hosts. The hosts each create a reverse tunnel into the server with assigned port numbers, using autossh to keep the connections persistent. This gives us access to the remote hosts via the server. This has all worked great until recently…

Comcast required us to move from one connection to another. The old and new modems are the same model, but on different cable drops, and of course the new connection has a new IP address. We took the opportunity to replace the server hardware as well, but the new server box is running the same OS (Ubuntu 10.04 LTS) and OpenSSH (5.3p1) as the old. A new host key was generated and distributed to the remote hosts.

Since that change, all of the tunnel connections have become flaky and typically stay up for only 10 to 15 minutes at most. Autossh detects the drop and reconnects, but this makes interactive sessions quite frustrating to use. I can't figure out where the problem lies.

Looking at the log on the server, I see:
"Received disconnect from x.x.x.x: 11: disconnected by user"
and then the tunnel being reestablished. Even at log level DEBUG3 I don't see anything happening before the disconnect on the server end, just the expected keepalive messages.
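To see whether some remote hosts disconnect more often than others, the disconnect messages can be tallied per host. The following is only a rough sketch: it assumes the log lines contain the message in the form quoted above, and the sample lines, process IDs, and 192.0.2.x addresses are made up for illustration.

```python
import re
from collections import Counter

# Matches the message format quoted above, e.g.
# "Received disconnect from 192.0.2.10: 11: disconnected by user"
DISCONNECT_RE = re.compile(
    r"Received disconnect from (?P<host>\S+): (?P<code>\d+): (?P<reason>.*)"
)


def count_disconnects(lines):
    """Return a Counter mapping (host, reason code) to the number of disconnects."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            key = (match.group("host"), int(match.group("code")))
            counts[key] += 1
    return counts


# Hypothetical sample lines (not real log output); addresses are from the
# 192.0.2.0/24 documentation range.
sample = [
    "sshd[1234]: Received disconnect from 192.0.2.10: 11: disconnected by user",
    "sshd[1240]: Accepted publickey for tunnel from 192.0.2.10 port 40022 ssh2",
    "sshd[1301]: Received disconnect from 192.0.2.11: 11: disconnected by user",
    "sshd[1377]: Received disconnect from 192.0.2.10: 11: disconnected by user",
]

for (host, code), n in sorted(count_disconnects(sample).items()):
    print(f"{host} code={code} disconnects={n}")
```

For what it's worth, reason code 11 is SSH_DISCONNECT_BY_APPLICATION in RFC 4253, so the message indicates that the remote end sent the disconnect rather than the server timing the client out.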

The connections die regularly, whether they are in use or not, and they die even while data is actively flowing (for example, in the middle of a large sftp transfer). The connections don't all die at the same time – the timing appears to be randomly distributed.

On the server side we have ClientAliveInterval = 30, ClientAliveCountMax = 6, and TCPKeepAlive = yes.
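As a quick sanity check on those settings, the arithmetic below shows roughly how long the server waits before giving up on a client that has gone silent. It uses only the two values quoted above; the variable names are just for illustration.

```python
# Values taken from the server settings quoted above.
client_alive_interval = 30  # seconds between server keepalive probes
client_alive_count_max = 6  # unanswered probes tolerated before disconnecting

# sshd drops an unresponsive client after about interval * count seconds.
server_timeout = client_alive_interval * client_alive_count_max
print(f"server gives up on a silent client after about {server_timeout} s")
```

That works out to about three minutes of complete silence from a client before the server itself would tear the connection down.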

The remote sites are running OpenSSH 5.6p1.

I'm at my wits' end… Any ideas on where I should be looking?

Best Answer

A useful tool for debugging network connectivity here is mtr, which combines traceroute and ping. From your workstation, run "mtr {remote-server-ip}". The output is a table (rows and columns) showing the latency and packet loss at each hop between your machine and the remote server. I used this the other week to prove to our ISP that they were dropping ~40% of packets at our T1, which was preventing VPN connections from being established.
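If you want to compare runs over time, mtr can also produce a one-shot text report (for example "mtr --report --report-cycles 100 -n {remote-server-ip}"), which can be filtered for hops with packet loss. The sketch below assumes the usual report layout of hop number, host, Loss%, and Snt columns; the exact layout varies between mtr versions, and the sample report is invented for illustration rather than captured from a real run.

```python
import re

# One hop line of an mtr report: hop number, host, loss percentage, packets sent.
# The optional "|--" covers newer mtr versions; older ones print just "1. host".
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\.(?:\|--)?\s+(?P<host>\S+)\s+"
    r"(?P<loss>\d+(?:\.\d+)?)%?\s+(?P<sent>\d+)"
)


def parse_mtr_report(text):
    """Return a list of dicts (hop, host, loss_pct, sent) from mtr report text."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append({
                "hop": int(match.group("hop")),
                "host": match.group("host"),
                "loss_pct": float(match.group("loss")),
                "sent": int(match.group("sent")),
            })
    return hops


def lossy_hops(hops, threshold_pct=5.0):
    """Return the hops whose packet loss is at or above threshold_pct."""
    return [h for h in hops if h["loss_pct"] >= threshold_pct]


# Hypothetical report text (made-up numbers, documentation IP ranges).
SAMPLE_REPORT = """\
HOST: workstation                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.0.2.1                  0.0%   100    0.5   0.6   0.4   1.0   0.2
  2.|-- 198.51.100.1              40.0%   100   12.1  13.0  11.0  20.3   2.9
  3.|-- 203.0.113.7               41.0%   100   25.4  26.1  24.0  31.7   1.8
"""

for hop in lossy_hops(parse_mtr_report(SAMPLE_REPORT)):
    print(f"hop {hop['hop']} {hop['host']} loss={hop['loss_pct']}%")
```

Loss that shows up at one hop and carries through to the final hop is usually the interesting case; loss at a single intermediate hop that disappears further along is often just that router rate-limiting its replies.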