CentOS – What can cause ALL services on a server to go down while it still responds to ping, and how to figure it out?

centos server-crashes service

Twice within a few days, my server has gone down completely: http, ssh, ftp, dns, smtp, basically ALL services stop responding, as if the server had been turned off. Yet it still responds to ping, which is what baffles me most.

I do have some php scripts that cause a huge load (cpu and memory) on the server in short bursts, used by a small group of users, but the server usually survives these bursts perfectly well, and when it goes down it never coincides with such peaks in usage (I'm not saying it can't be related, but it doesn't happen just after those).

I'm not asking you to magically be able to tell me the ultimate cause of these crashes, my question is: is there a single process whose death may cause all these services to go down simultaneously? The funny thing is that all network services go down, except ping.
If the server had 100% of the CPU eaten up by some process, it wouldn't respond to ping either. If apache crashed because of (for example) a broken php script, that would affect http only, not ssh, dns, etc.

My OS is CentOS 5.6.

Most importantly, after hard-rebooting the server, what system logs should I look at? /var/log/messages doesn't reveal anything suspicious.

Best Answer

(tl;dr still responding to ping is an expected behaviour, check your memory usage)

ICMP echo requests (i.e. ping) are handled by the in-kernel networking stack, with no other dependency.

The kernel is known as being "memory resident", which means it will always be kept in RAM, and can't be swapped to disk like a regular application can.

This means that when you run out of physical memory, applications are swapped to disk, but the kernel remains where it is. When both physical memory and swap are full (and the system can no longer manage your programs), the machine will fall over. However, because a) the kernel is still in memory and b) it can respond to ping requests without the help of anything else, the system will keep responding to ping despite everything else being dead.

As for your problem, I'd strongly suspect memory issues. Install "sysstat" and use the "sar" command to see a log of memory, cpu, load, io, etc. I would expect that around the time of each crash you'd see both physical memory and swap at 100% used.
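
If you want a quick spot check alongside sar, the same two numbers (physical and swap usage) can be computed from /proc/meminfo. The following is only a minimal sketch, assuming a Python 3 interpreter is available; the function names and the SAMPLE text are hypothetical and made up for illustration, and buffers and page cache are counted as reclaimable, which is a simplification.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def usage_percent(info):
    """Return (ram_used_pct, swap_used_pct).

    Buffers and page cache are treated as reclaimable, so they do not
    count as "used" here (a simplifying assumption).
    """
    reclaimable = info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)
    ram_pct = 100.0 * (info["MemTotal"] - reclaimable) / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    swap_pct = 100.0 * swap_used / swap_total if swap_total else 0.0
    return ram_pct, swap_pct


# Hypothetical sample in /proc/meminfo format; the numbers are invented.
SAMPLE = """MemTotal:      1000000 kB
MemFree:         50000 kB
Buffers:         20000 kB
Cached:         130000 kB
SwapTotal:      500000 kB
SwapFree:       100000 kB
"""

ram, swap = usage_percent(parse_meminfo(SAMPLE))
print("RAM used: %.1f%%  swap used: %.1f%%" % (ram, swap))
```

On the real machine you would pass the contents of /proc/meminfo instead of SAMPLE, for example from a cron job that appends the result to a file, so that the last entries before a crash survive the reboot.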

I would also look at dmesg or /var/log/messages for any sign of the OOM-killer (out-of-memory killer) being invoked. This is the kernel's emergency mechanism, which starts killing processes when memory is exhausted. Its effectiveness depends largely on which processes are being killed. A single process eating up the memory will be killed efficiently and its memory freed; however, an apache-based website will spawn replacement processes as soon as a child process is killed.
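
Searching the log for OOM-killer activity can be scripted. This is a rough sketch, again assuming Python 3; the exact wording of the kernel messages varies between kernel versions, so the pattern below is an assumption that matches the common phrases "invoked oom-killer" and "Out of memory", and the sample log lines are invented for illustration.

```python
import re

# Assumed phrases; the kernel's exact wording differs between versions.
OOM_PATTERN = re.compile(r"invoked oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Hypothetical log excerpt; host name, PIDs and timestamps are made up.
SAMPLE_LOG = [
    "Jan  1 03:12:01 host kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0\n",
    "Jan  1 03:12:01 host kernel: Out of memory: Killed process 1234 (httpd).\n",
    "Jan  1 03:12:05 host sshd[999]: Accepted password for user from 10.0.0.1\n",
]

for hit in find_oom_lines(SAMPLE_LOG):
    print(hit)
```

Against a real log, open /var/log/messages and pass the file object to find_oom_lines, since it accepts any iterable of lines. An empty result does not rule out memory exhaustion: a machine that locks up hard may never get the chance to write the message to disk.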
