Linux – Ubuntu 10.04.2 LTS Server – Intermittently hangs with no indication of cause in log files


Quick Description:

I have recently started setting up and managing a Linux (Ubuntu 10.04.2 LTS) server in our data center (all other servers are Windows boxes). The server periodically hangs and becomes unresponsive, and I can't find anything in any log that indicates a specific cause. Sometimes it's up for hours, sometimes days (14 days at the longest). Plugging a monitor into the machine after a hang shows nothing at all. In an effort to troubleshoot the problem we've tried disabling APIC, more out of "educated desperation" than anything else. Unfortunately we are limited in the troubleshooting we can do: we have a single client website hosted on the box (the reason we set it up), so anything that involves significant downtime is a problem.

As this is our first attempt at setting up a Linux box, we are using a "well equipped" desktop-grade machine, not what I would call "server grade" hardware. This is a standalone box, not a VPS. We are using a hardware (not software) RAID array and have plenty of memory in the box.

Caveats / Background:

  • I am relatively new to Linux in general.
  • I spend much more time writing code than managing servers. I'm comfortable with working on the box, but I'm not really a sysadmin guy.
  • I'm comfortable with the command line but have more experience with OS X (BSD).
  • I am unsure of all of the tools / information / logs that may be available, though I try to be thorough in checking what I do know.
  • I did not physically configure the hardware so I'm not sure of all of the specs but I can get any info I need to troubleshoot.
  • I may be skipping very basic steps or missing obvious places to look for information without knowing it.

A little more detail:

  • Real memory: 8GB
  • Ubuntu 10.04.2 LTS
  • Hardware RAID 10
  • Managing sites with Webmin version 1.550
  • Server is in a remote data center. Hands-on troubleshooting is difficult.

We have attempted two Linux setups at this point. The first was on a hardware config identical to this one, but with no actual pieces of hardware reused. That attempt used CentOS, and we were trying to set up cPanel. We scrapped that install because of this same problem (periodic crashing / hanging).

The second attempt (this one) is showing the same behavior. The only thing I can really see in common is the hardware configuration (though CentOS & Ubuntu may have more in common than I think).

The box will run fine for hours, days, or even weeks, and then just stop responding entirely. I check all of the logs I know to check (primarily messages, syslog and kern.log) but I don't see anything that seems like an error to me. I do see lines that I don't understand that may or may not be problems, such as:

rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="814" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.

Most of our syslog entries seem to be logs of Webmin-related cron jobs running. My gut tells me that there is possibly some component in our configuration that Linux does not like or that needs a driver update (maybe the RAID card, for example), but I'm unsure how to track down or determine what that might be. Guess and check is expensive.

Another thought I've had is that one or more of the cron jobs that are running are tripping something up, but it doesn't appear to be reproducible on demand and, again, I'm at a loss on how to test that theory any further. The same cron job does not appear to be running each time the server goes down.
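One inexpensive way to narrow down when a hang happens, and what the machine looked like just before it, is a small heartbeat logger. What follows is only a minimal sketch, assuming Python 3.6+ is available on the box; the function names and the output path are hypothetical and do not come from the setup described here. Each record is flushed and fsync'ed so that the last line written is likely to survive a hard hang.

```python
import os
import time
from datetime import datetime


def heartbeat_record(now=None):
    """Build one heartbeat line: ISO timestamp plus 1/5/15-minute load averages."""
    now = now or datetime.now()
    load1, load5, load15 = os.getloadavg()
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}\n"


def append_durably(path, record):
    """Append a record and force it to disk before returning."""
    with open(path, "a") as handle:
        handle.write(record)
        handle.flush()
        os.fsync(handle.fileno())


def run(path, interval_seconds=60, iterations=None):
    """Write a heartbeat every interval_seconds; loop forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        append_durably(path, heartbeat_record())
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```

Calling run("/var/log/heartbeat.log") (a hypothetical path) from a long-running process would leave a trail whose final timestamp approximates the moment the machine stopped responding. If the storage layer itself is what stalls, the final record may never reach the disk, so the trail gives a lower bound rather than an exact time.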

This is a portion of the log just prior to our last hang:

Aug  8 11:00:01 linhost01 CRON[10771]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:00:01 linhost01 CRON[10772]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:01:01 linhost01 CRON[10799]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:05:01 linhost01 CRON[10898]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:06:01 linhost01 CRON[10924]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:09:01 linhost01 CRON[11007]: (root) CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
Aug  8 11:10:01 linhost01 CRON[11023]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:10:01 linhost01 CRON[11024]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:11:01 linhost01 CRON[11063]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:15:01 linhost01 CRON[11149]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:16:01 linhost01 CRON[11176]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:17:01 linhost01 CRON[11243]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug  8 11:20:01 linhost01 CRON[11279]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:20:01 linhost01 CRON[11280]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:21:01 linhost01 CRON[11307]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:25:01 linhost01 CRON[11392]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:26:01 linhost01 CRON[11432]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
[SERVER DOWN AFTER THIS POINT]
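To compare several hangs instead of eyeballing each excerpt, the syslog lines can be parsed and the cron commands that ran shortly before each silence listed side by side. The following is a rough sketch under stated assumptions: Python 3.6+ (possibly on another machine, against a copy of the log), the classic syslog line layout shown above, and a year supplied by hand because these lines carry none. ASSUMED_YEAR and all function names are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Assumption: syslog lines omit the year, so one has to be supplied.
ASSUMED_YEAR = 2011

# Matches lines such as:
#   Aug  8 11:26:01 linhost01 CRON[11432]: (root) CMD (/etc/webmin/...)
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:\[]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)
# Matches the message part of a cron entry: "(user) CMD (command)".
CRON_RE = re.compile(r"^\((?P<user>[^)]+)\) CMD \((?P<cmd>.*)\)$")


def parse_syslog(lines, year=ASSUMED_YEAR):
    """Return (timestamp, tag, message) tuples; unparseable lines are skipped."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        entries.append((ts, m["tag"], m["msg"]))
    return entries


def find_gaps(entries, threshold=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the log went silent too long."""
    gaps = []
    for (prev_ts, _, _), (next_ts, _, _) in zip(entries, entries[1:]):
        if next_ts - prev_ts > threshold:
            gaps.append((prev_ts, next_ts))
    return gaps


def cron_commands_before(entries, moment, window=timedelta(minutes=10)):
    """List (timestamp, user, command) for cron jobs in the window ending at moment."""
    found = []
    for ts, tag, msg in entries:
        if tag != "CRON" or not (moment - window <= ts <= moment):
            continue
        m = CRON_RE.match(msg)
        if m:
            found.append((ts, m["user"], m["cmd"].strip()))
    return found
```

Because the jobs in this excerpt fire at least every five minutes, any silence longer than the ten-minute default threshold is a reasonable candidate for a hang. Passing the first timestamp of each gap to cron_commands_before shows which jobs preceded each one, which is one way to check whether the same job keeps showing up.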

If anyone can help shed any light or even give me anything else I can post here that might be helpful I would be very appreciative. I'm all for jumping in to learn by doing, but I'm starting to reach the end of my rope on this one.

Happy to post any specific log info or information that might be helpful in offering any suggestions.

Best Answer

For the sake of completeness I figured I'd just wrap this up. After well over a year of this behavior the server finally died. Only when trying (and failing) to rebuild the RAID did we find that not one, but two hard drives were bad.

The definitive cause of this server's problems is still not known, but (with my still somewhat limited understanding of Linux) my suspicion is that these two drives had been failing for some time, and that trying to use them was intermittently causing the server to crash / reboot.

Our final solution was to rebuild the server from scratch using virtually the exact same configuration but with all new hardware. The only significant configuration change we made was using ext4 instead of xfs for the file system. The box has been up for several months now without issue.

I'm answering this question only because, for us, it seems like drive failure was the cause and replacing all of the hardware was the best fix for the problem. That said, I don't know that this answer will be too helpful to most people.
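For anyone in a similar spot who suspects storage, one low-cost check is to scan the kernel log for lines that often accompany disk trouble. This is only a sketch, assuming Python 3 and a readable copy of a log such as kern.log. The patterns are illustrative guesses, not an exhaustive or authoritative list, and a hardware RAID controller may hide per-drive errors from the OS entirely, in which case a clean result proves nothing.

```python
import re

# Assumption: illustrative substrings that often accompany disk or
# filesystem trouble in kernel logs. Extend or replace them as needed.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"medium error",
    r"XFS.*error",
]
SUSPECT_RE = re.compile(
    "|".join(f"(?:{p})" for p in SUSPECT_PATTERNS), re.IGNORECASE
)


def suspect_lines(lines):
    """Yield (line_number, text) for every line matching a suspect pattern."""
    for number, line in enumerate(lines, start=1):
        if SUSPECT_RE.search(line):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return list(suspect_lines(handle))
```

For example, scan_file("/var/log/kern.log") returns the matching lines with their line numbers. The controller's own management tooling, where available, is likely a better source of per-drive health than anything the kernel log shows.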
