Linux OOM-Killer – How to Diagnose Causes of OOM-Killer Killing Processes

centos, linux, rhel5

I have a small virtual private server running CentOS and www/mail/db, which has recently had a couple of incidents where the web server and ssh became unresponsive.

Looking at the logs, I saw that oom-killer had killed these processes, possibly due to running out of memory and swap.

Can anyone give me some pointers on how to diagnose what may have caused the most recent incident? Is the first process killed likely to be the culprit? Where else should I be looking?

Best Answer

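No, the algorithm is not that simplistic. The OOM killer picks its victim based on a per-process "badness" score, so the first process killed is not necessarily the one that used up the memory. You can find more information at: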

http://linux-mm.org/OOM_Killer
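
To get a feel for how the kernel currently ranks processes, you can read /proc/<pid>/oom_score, which kernels from 2.6.11 onwards expose (so it should be there on CentOS/RHEL 5). The following is only a rough sketch: the function name top_oom_scores and its optional directory argument are my own invention for illustration, not part of any standard tool.

```shell
# List the ten processes the kernel currently rates as the most likely
# OOM-killer victims (a higher oom_score means more likely to be killed).
# The optional first argument (default /proc) is a hypothetical convenience
# so the function can be pointed at another directory tree.
top_oom_scores() {
    procdir=${1:-/proc}
    for dir in "$procdir"/[0-9]*; do
        # Processes can exit while we iterate, so skip anything unreadable.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        # cmdline is NUL-separated; turn it into a short one-line string.
        name=$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline" | cut -c1-60)
        printf '%s\t%s\t%s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n 10
}
```

Calling top_oom_scores with no arguments prints score, PID and command line, highest score first. The scores change as memory usage changes, so a listing taken after a reboot will not necessarily match what the kernel saw at the time of the incident.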

If you want to track memory usage, I'd recommend running a command like:

ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head

It will give you a list of the processes that are using the most memory (and probably causing the OOM situation). Remove the | head if you'd prefer to check all the processes.

Put this in your crontab so that it runs every 5 minutes and appends its output to a file. Keep at least a couple of days of history, so you can check afterwards what was happening around the time of an incident.
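
A minimal sketch of such a cron job might look like this. The script name, the log path and the function name log_mem_snapshot are all assumptions on my part, and I've used pcpu for the CPU percentage column; adjust them to your setup.

```shell
#!/bin/sh
# Hypothetical helper script: append a timestamped list of the biggest
# memory users to a log file. Intended to be run from cron, e.g.:
#   */5 * * * * /usr/local/bin/mem-snapshot.sh /var/log/mem-snapshot.log
# (both paths above are examples, not existing files)

log_mem_snapshot() {
    logfile=$1
    {
        # Timestamp, so each snapshot can be matched against the
        # oom-killer entries in the system log.
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        # Header line plus the 20 largest processes by size, then rss.
        ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head -n 21
        echo
    } >> "$logfile"
}

# Only take a snapshot when a log file path is given as the first argument.
if [ -n "${1:-}" ]; then
    log_mem_snapshot "$1"
fi
```

At one snapshot every 5 minutes the file grows steadily, so you will probably want to rotate it (logrotate can do that) and keep only the last few days.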

For critical services like ssh, I'd recommend using monit to restart them automatically in such a situation. It might save you from losing access to the machine if you don't have a remote console for it.

Best of luck,
João Miguel Neves