I'm not saying this is definitely what's happening, but based on my own experience as a CentOS admin, it's most likely runaway Apache/PHP processes taking down the server. I've seen this numerous times on CentOS 5. It's frustrating because there's usually no trace of what happened in the log files. The machine just grinds to a halt because physical memory and swap get sucked up by Apache/PHP processes. You would think Linux memory management or some daemon would jump in and say "hey, stop" but it doesn't. It'll let Apache grind your system to a halt.
Having said that, to see what's happening you'll need something that can monitor and log resource usage. I like to use a program called atop. Atop is a lot like the top program but it also takes a snapshot of resource usage at defined intervals. It's pretty simple to install.
wget http://www.atcomputing.nl/Tools/atop/packages/atop-1.23.tar.gz
tar -zxvf atop-1.23.tar.gz
cd atop-1.23 && make install
Open /etc/atop/atop.daily with a text editor and change INTERVAL=600 to INTERVAL=60
Run the command /etc/atop/atop.daily from a command prompt to start it. Wait a few minutes and run atop -r /var/log/atop/atop_20091118 (with the correct date, of course).
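If you don't want to type the date by hand, a small helper can build the log path for you. This is only a sketch: atop_log_path is a name I made up, and it assumes the /var/log/atop/atop_YYYYMMDD naming shown above.

```shell
# Hypothetical helper: print the atop log path for a date (YYYYMMDD).
# With no argument it falls back to today's date.
atop_log_path() {
    day=${1:-$(date +%Y%m%d)}
    echo "/var/log/atop/atop_${day}"
}

# Show the replay command for the example date used above.
echo "atop -r $(atop_log_path 20091118)"
```

To replay today's log, run atop -r "$(atop_log_path)" once atop has been collecting for a few minutes.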
Hit the t key to go forward in time and T to go back. Next time your server crashes, do this and check the MEM free and SWP free lines. If you have memory problems these will be in red. Also look for numerous httpd lines under CMD. If Apache/PHP is your problem there'll be a bunch of them.
If this is the case, I recommend looking at your MaxClients setting in httpd.conf. If set too high, Apache will gladly eat all of your memory, causing your machine to crash. Apache/PHP can easily eat 40-50MB per process. If you multiply 40MB x MaxClients you'll get a rough idea of how much memory Apache can potentially use. MaxClients usually defaults to 150 on CentOS, so Apache can potentially use 6GB of memory by default. This doesn't include memory your system needs for itself and other processes to run. Try setting it to a more realistic value based on the amount of memory you have, like 40 if you have 2GB of memory, and see if that helps. Also, if you have KeepAlive On, set KeepAliveTimeout to a low number like 2 or 3.
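To put numbers on the MaxClients estimate, here's a rough sketch of the arithmetic. The function names are mine, and the 400MB I hold back for the OS and other processes is just an assumption, so adjust it for your box. The ps call assumes the Linux procps version of ps.

```shell
# Hypothetical helper: average resident size of running httpd
# processes in MB (ps reports RSS in KB). Prints 0 if none are running.
avg_httpd_rss_mb() {
    ps -C httpd -o rss= |
        awk '{ sum += $1; n++ } END { if (n > 0) printf "%d\n", sum / n / 1024; else print 0 }'
}

# Hypothetical helper: how many processes of PER_PROC_MB fit in
# TOTAL_MB once RESERVED_MB is set aside for everything else.
# Usage: estimate_max_clients TOTAL_MB RESERVED_MB PER_PROC_MB
estimate_max_clients() {
    total_mb=$1
    reserved_mb=$2
    per_proc_mb=$3
    echo $(( (total_mb - reserved_mb) / per_proc_mb ))
}

# Worst case with the default: 150 processes at 40MB each.
echo "$(( 150 * 40 ))MB"

# 2GB box, 400MB reserved (assumed), 40MB per process.
estimate_max_clients 2048 400 40
```

The first line printed is the roughly 6GB worst case, and the second comes out close to the 40 suggested above. If you'd rather measure than guess the per-process size, check avg_httpd_rss_mb while the site is under normal load and pass that in as the third argument.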
In my opinion CentOS's Apache/PHP compilation is a real pos that should never have seen the light of day. It's buggy and crash prone. If you run a serious site, I highly recommend compiling your own version of Apache/PHP or even using one of the newer high performance webservers like lighttpd or nginx with FastCGI PHP.
You should run ps aux to see if any of the shutdown scripts are hung waiting for a process to finish. A hung script should look something like this:
/etc/rc6.d/K##procname
You can try manually issuing a kill command for that hung script. It would be strange, though, since the scripts have a timeout after which a -KILL signal is forced on any leftover process.
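A quick way to pick those entries out of the ps output is to filter for the rc6.d kill-script path. This is a sketch, and find_shutdown_scripts is a made-up name.

```shell
# Hypothetical helper: keep only lines that mention an rc6.d kill script.
# The escaped dot also keeps grep from matching its own command line.
find_shutdown_scripts() {
    grep '/etc/rc6\.d/K'
}

# List any shutdown scripts that are still running; the PID is in column 2.
ps aux | find_shutdown_scripts || echo "no rc6.d kill scripts running"
```

Once you have the PID from the second column, pass it to kill.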
Also, what's the uptime of the server/box? I've experienced an issue in the past where a box with an uptime of over a year refused to shut down. In that case, I killed each process manually, ran sync several times to flush all data to disk, and forced a reboot (power cycle).
Try checking the BMC log to see if there has been a hardware error that caused the reboot (log locations and their interpretation are probably best asked of the HW vendor).
Does the server have a fence device? Any chance it has been fenced?
If you have a smart PDU, there might be logs for power outages in there. If the server is hosted with a managed server farm, I'd ask the NOC team about outages as well.