Linux – Remote server hangs, gets stuck. How to debug

Tags: linux, ssh, tcp, ubuntu

I have a VPS running on VMware ESX with Ubuntu 8.04 LTS.
It has been running smoothly for the past 3 months, but recently we've noticed two strange problems.

a. The server hangs; today was the second time. The nature of the hang is very strange.
I can ping the server and it responds fine. However, all other services (sshd, apache, mysql, etc.) do not respond at all.
When it is working, I see:

telnet servername 22
Escape character is '^]'.
SSH-2.0-OpenSSH_5.X Debian-5ubuntu1

The web services also run fine. When it's hung, I can still open TCP connections to ports 22 and 80, but I receive no response at all:

telnet servername 22
Escape character is '^]'.

How can I debug this problem? Are there any daemons I can run that will periodically log the system's status? Please tell me how to proceed.

b. The other strange problem is that lately I am unable to transfer files larger than around 100KB, while smaller files of around 1-2 KB work fine.

scp anotherserver:filename .

or

wget http://www.example.com/file

would get stuck. There is still around 6GB of disk space remaining, so I don't think that is the issue. Any pointers on where I should look?

Best Answer

I would suggest using sar from the sysstat (or atsar) package. It runs every 10 minutes as a cron job and records your server's vital statistics: memory usage, CPU utilization, disk activity, network activity, etc.

You use it like this:

Show processor activity (the default)
sar -u (or just sar)

Show memory ("ram") statistics
sar -r

Show the memory statistics from the 27th
sar -r -f /var/log/sysstat/sa27

Note that the path varies based on your installation. On Red Hat-based systems, the files are usually in /var/log/sa/, while if you have the atsar package installed, they'll be in /var/log/atsar/. In every case the file name ends in a number that represents the day of the month when the data was gathered.

Some versions (like atsar) allow you to simply specify the day: sar -n 27. Check the manpage that came with your installation to find out the correct syntax and what data you can retrieve.
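If you are not sure which directory your installation uses, the naming pattern can be used to search for the data files. The following is only a rough sketch: the function name find_sa_files and the SA_DIRS variable are made up for illustration, and the directory list is simply the three locations mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list candidate sar data files for a day of the month.
# SA_DIRS holds the locations mentioned in the answer; adjust for your system.
SA_DIRS=${SA_DIRS:-"/var/log/sysstat /var/log/sa /var/log/atsar"}

find_sa_files() {
    local day dir f
    # Zero-pad the day (7 -> 07); 10# stops "08" being read as octal.
    day=$(printf '%02d' "$((10#$1))")
    for dir in $SA_DIRS; do
        # Any file whose name ends in the day number is a candidate.
        # This can also match plain-text reports such as sar27 if present.
        for f in "$dir"/*"$day"; do
            [ -f "$f" ] && echo "$f"
        done
    done
    return 0
}
```

For example, find_sa_files 27 prints every matching file it finds, and you can pass the one you want to sar -r -f.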

Once you have this installed and running (and you probably do already!) you can use the information it gathers to get an idea of what was going on immediately before the hang. For example, if the report shows your memory exhausted and free swap space counting down to zero, then you'll have a pretty good idea of what to look for.

With the information in hand, you can set up additional reporting to give you a better idea of what's wrong. For example, you can write a short bash script that examines certain system statistics (such as the contents of /proc/meminfo or /proc/loadavg) and, if the trigger conditions are met, appends the appropriate debugging information (like the output of ps auwwxf) to a file or emails it to you.
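A minimal sketch of such a script might look like the one below. Everything specific in it is an assumption chosen for illustration: the function names, the thresholds (50 MB of free swap, a 1-minute load of 10), the --check argument, and the log path. Tune them to whatever sar shows happening before a hang.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog sketch. Names, thresholds, and the log path are
# illustrative assumptions, not something prescribed by the answer.
MEMINFO=${MEMINFO:-/proc/meminfo}
LOADAVG=${LOADAVG:-/proc/loadavg}
LOGFILE=${LOGFILE:-/var/tmp/hang-watch.log}
MIN_SWAP_FREE_KB=${MIN_SWAP_FREE_KB:-51200}   # trigger below 50 MB free swap
MAX_LOAD=${MAX_LOAD:-10}                      # trigger at 1-minute load >= 10

# Print the free swap in kB as reported by the meminfo file.
swap_free_kb() {
    awk '$1 == "SwapFree:" { print $2 }' "$MEMINFO"
}

# Print the integer part of the 1-minute load average.
load_1min() {
    awk '{ print int($1) }' "$LOADAVG"
}

# Succeed (exit 0) when either trigger condition is met.
# Note: a machine with no swap reports SwapFree as 0 and always triggers.
should_capture() {
    local swap load
    swap=$(swap_free_kb)
    load=$(load_1min)
    [ "${swap:-0}" -lt "$MIN_SWAP_FREE_KB" ] || [ "${load:-0}" -ge "$MAX_LOAD" ]
}

# Append a timestamp, the raw statistics, and the process tree to the log.
capture() {
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        cat "$LOADAVG"
        grep -E '^(MemFree|SwapTotal|SwapFree):' "$MEMINFO"
        ps auwwxf
        echo
    } >> "$LOGFILE"
}

# Run the check only when asked to, so the functions can also be sourced.
if [ "${1:-}" = "--check" ]; then
    if should_capture; then capture; fi
fi
```

Saved under a name of your choosing and run from cron every minute with the --check argument, this leaves a trail of snapshots in the log file leading up to the hang. Writing to a local file rather than emailing is deliberate here, since mail delivery may not work while the machine is struggling.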