Linux – server randomly becoming unresponsive

linux, ubuntu

I've been having a sporadic issue with one of our Ubuntu servers. At seemingly random times the server stops responding to connections on all services (SSH, HTTP, etc.); it will still answer pings, but everything else goes dead. The only way to get the system back up is to have the data center perform a hard reboot.

I've been trying to investigate the problem for almost a year now, but I have not been able to figure out what is causing this behavior. I installed an array of monitoring utilities, including Monit, and set them up to send me alerts whenever CPU usage, memory usage, or swap usage exceeds a certain threshold. I also wrote a script to send me a list of the currently running processes should any of those thresholds be hit.
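For reference, the Monit rules are along these lines (the thresholds, cycle counts, and helper script path below are only illustrative, not the exact config on the box):

    # /etc/monit/conf.d/system-alerts  (illustrative values)
    check system $HOST
        if cpu usage (user) > 90% for 3 cycles then alert
        if memory usage > 85% for 3 cycles then alert
        if swap usage > 50% then exec "/usr/local/bin/log-processes.sh"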

Unfortunately, whatever causes the server to become unresponsive apparently strikes so quickly that the monitoring utilities never get a chance to send an alert e-mail (that, or the cause of the problem has nothing to do with CPU or memory usage). A friend of mine suggested writing a simple bash script to dump the output of ps auxf to a log file every 5 minutes, so I set one up and added it to the crontab.
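In case the details matter, the snapshot script and crontab entry are roughly the following (the paths are just what I happened to pick):

    #!/bin/bash
    # /usr/local/bin/ps-snapshot.sh -- append a timestamped ps auxf snapshot
    LOG=/var/log/ps-snapshots.log

    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        ps auxf
        echo
    } >> "$LOG"

and the entry in root's crontab:

    */5 * * * * /usr/local/bin/ps-snapshot.sh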

This morning I woke up and discovered that the server had once again gone unresponsive, so I contacted the data center and asked them to perform another hard reboot. I then logged into the server and looked at the log file of ps auxf snapshots. The last recorded snapshot was at midnight, and there were no further snapshots written between then and when the server was rebooted, indicating that the server went unresponsive sometime around midnight and that the process-logging script stopped running at that point.

The last snapshot didn't contain anything indicative of why this has been happening: no processes were using large amounts of CPU time or memory. I did some googling and saw that other folks have posted about the same problem here. One such post had answers suggesting a look at /var/log/messages, but unfortunately on this server /var/log/messages has not been written to since 2011 (I have no idea why; other people have had access to this server and may have changed the log path).
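If it helps, this is how I checked where syslog is supposed to be writing on this box (assuming rsyslog, which is the Ubuntu default, where the catch-all log is normally /var/log/syslog rather than /var/log/messages):

    # Where does rsyslog think kernel/general messages should go?
    grep -R -E 'kern\.|/var/log/(messages|syslog)' /etc/rsyslog.conf /etc/rsyslog.d/ 2>/dev/null

    # Are the usual Ubuntu logs actually being written?
    ls -l /var/log/syslog /var/log/kern.log /var/log/messages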

My guess is that some kind of kernel panic is occurring, causing all services on the server to stop working, but I have no idea what would be causing the panic or whether that is even what's actually happening. Does anyone have any idea what might be causing this? It has been a real headache, and I've spent practically a year trying to figure it out.

Thanks!

Best Answer

Do not rely on cron jobs. Fork a daemon process with very high priority (I assume you can get root). What you want is what amounts to a realtime process (one that cannot be pre-empted) that periodically runs whatever scans you deem appropriate, using a timer to drive those scans. This sounds like a resource block, e.g., the available slots in the process table being used up, the kind of thing you see with a fork bomb. That is not very likely, but sudden freezing is a symptom of a resource that has gone away completely or been totally used up. Memory overuse would cause lots of swapping before the system became completely unusable.

Be careful! You can have this one process kill your system. If you do not feel comfortable with realtime programming, get some help.
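A minimal sketch of what I mean, assuming a stock Ubuntu box with root and the chrt utility from util-linux (the priority, interval, and log path are arbitrary placeholders):

    #!/bin/sh
    # watchdog.sh -- high-priority monitoring loop (sketch; run as root)
    # Start it under the SCHED_FIFO realtime policy so it stays runnable
    # even when ordinary processes are starved:
    #     chrt -f 90 /usr/local/bin/watchdog.sh &
    LOG=/var/log/watchdog.log            # example path

    while :; do
        {
            date
            ps auxf                      # full process tree
            cat /proc/loadavg            # run queue / load
            grep -E 'MemFree|SwapFree' /proc/meminfo
        } >> "$LOG" 2>&1
        sleep 10                         # scan interval
    done

The sleep is what keeps a SCHED_FIFO loop from monopolizing a CPU, which is exactly the "kill your system" failure mode to watch out for.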

What does syslog have to say? If syslogd is not running, turn the service/daemon on. What do the service logs say? Unless this has an onset time of less than a few ms, something has to be complaining. Somewhere.
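Something along these lines, assuming rsyslog on a reasonably recent Ubuntu:

    # Is a syslog daemon actually running?
    service rsyslog status

    # Was anything logged in the run-up to the hang?
    ls -l /var/log/syslog /var/log/kern.log
    tail -n 200 /var/log/kern.log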
