Linux – How to know the cause of an OOM error on Linux

linux memory

I have a testing web server.

One time the server was so unresponsive that I had to restart it.

Viewing the logs, I could see the server was out of memory and the OOM killer had killed mysqld.

But from reading some docs about the OOM killer, I know that mysqld wasn't necessarily the cause of the out-of-memory situation (though it may have been).

So, using only the log files, can I tell which process(es) caused the OOM condition?

Best Answer

How do you define the "cause" of the OOM situation? Is it the process using the most memory? Perhaps you have a DB that always takes 3GB of memory to run and thus uses the most memory on the machine. Is it the "cause" of the problem? Probably not.

Ultimately the cause of the problem is "An unexpected situation which may or may not have been the fault of the sysadmin."
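Whatever definition you pick, the kernel log is the place to start: it records which task triggered the OOM killer and which task was killed. A minimal sketch for pulling those lines out of a log follows. The function name `oom_events` is made up for illustration, and the patterns assume the usual kernel wording ("invoked oom-killer", "Out of memory", "Killed process"), which varies between kernel versions.

```shell
#!/bin/sh
# Sketch: print OOM-killer related lines from a kernel log read on stdin.
# The patterns are assumptions based on common kernel message wording;
# adjust them if your kernel phrases these messages differently.
oom_events() {
    grep -i -E 'invoked oom-killer|out of memory|killed process'
}

# Example usage (the log path varies by distribution):
#   oom_events < /var/log/syslog
#   dmesg | oom_events
```

Note that the task named in "invoked oom-killer" is only the one whose allocation could not be satisfied, which is not necessarily the one that used up the memory.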

Sometimes you can know; for instance, if you had process accounting set up (+1 to @JamesHannah) and you saw 3000 httpd or sshd processes (and that was unusual), you could probably blame that daemon.
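As a rough illustration of that kind of check, here is a sketch that counts processes per command name. It is not process accounting itself; it just tallies whatever list of names you feed it, so it works on live `ps` output or on a saved listing. The function name `count_by_name` is hypothetical.

```shell
#!/bin/sh
# Sketch: count how many times each command name appears on stdin,
# most numerous first. Expects one command name per line.
count_by_name() {
    sort | uniq -c | sort -rn
}

# Example usage on a live system:
#   ps -e -o comm= | count_by_name | head
# With process accounting installed, something like this may work too,
# assuming the command name is the first column of lastcomm output:
#   lastcomm | awk '{print $1}' | count_by_name | head
```

A daemon that normally shows a handful of processes and suddenly shows thousands is a good suspect.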

With that in mind, I present comments from The Source:

/*
 * oom_badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

"So the ideal candidate for liquidation is a recently started, non privileged process which together with its children uses lots of memory, has been nice'd, and does no raw I/O. Something like a nohup'd parallel kernel build (which is not a bad choice since all results are saved to disk and very little work is lost when a 'make' is terminated)."

Comment block and quote shamelessly stolen from http://linux-mm.org/OOM_Killer
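You can see the outcome of this scoring on a running system, because the kernel exposes each process's current score in /proc/<pid>/oom_score. Below is a minimal sketch that lists processes by that score, highest first. The function name `oom_candidates` and its optional "alternative proc root" argument are my own inventions (the argument is there only to make the function easy to try against a fake directory), and it assumes /proc/<pid>/comm is available for the process name.

```shell
#!/bin/sh
# Sketch: list "score pid name" for every process, highest score first.
# The highest-scoring process is the kernel's current favourite to kill.
# $1 is a hypothetical testing aid: an alternative proc root (default /proc).
oom_candidates() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # A process may exit while we are reading; skip it if so.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name='?'
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn
}

# Example usage:
#   oom_candidates | head
```

This only tells you who would be killed right now, not who was to blame at the time of an earlier OOM, so it is most useful when run while the machine is under pressure.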

Related Solutions

Linux OOM-Killer – How to Diagnose Causes of OOM-Killer Killing Processes

No, the algorithm is not that simplistic. You can find more information in:

http://linux-mm.org/OOM_Killer

If you want to track memory usage, I'd recommend running a command like:

ps -e -o pid,user,cpu,size,rss,cmd --sort -size,-rss | head

It will give you a list of the processes that are using the most memory (and probably causing the OOM situation). Remove the | head if you'd prefer to check all the processes.

Put this in your crontab, run it every 5 minutes, and save the output to a file. Keep at least a couple of days of history so you can check what happened later.
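One way to wire that up is sketched here, assuming a procps-style `ps`. The function name `mem_snapshot`, the log path /var/log/mem-snapshots.log, and the script path in the crontab comment are all hypothetical; pick whatever suits your system.

```shell
#!/bin/sh
# Sketch: append a timestamped snapshot of the top memory users to a file.
# $1 is the output file; the default path is a hypothetical choice.
mem_snapshot() {
    out=${1:-/var/log/mem-snapshots.log}
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        ps -e -o pid,user,cpu,size,rss,cmd --sort -size,-rss | head
    } >> "$out"
}

# Example crontab entry, assuming a script at the hypothetical path
# /usr/local/bin/mem-snapshot that defines and calls mem_snapshot:
#   */5 * * * * /usr/local/bin/mem-snapshot
```

The file grows forever as written, so trimming it to a few days of history (with logrotate or similar) is left to you.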

For critical services like ssh, I'd recommend using monit to restart them automatically in such a situation. It might save you from losing access to the machine if you don't have a remote console to it.

Best of luck,
João Miguel Neves

Linux – Forensic Analysis of the OOM-Killer

I'm new to ServerFault and just saw this post. It seems to have resurfaced near the front of the queue even though it is old. Let's put this scary one to bed maybe?

First of all, I have an interest in this topic as I am optimizing systems with limited RAM to run many user processes in a secure way.

It is my opinion that the error messages in this log are referring to OpenVZ Linux containers.

A "ve" is a virtual environment and also known as a container in OpenVZ. Each container is given an ID and the number you are seeing is that ID. More on this here:

https://openvz.org/Container

The term "free" refers to free memory in bytes at that moment in time. You can see the free memory gradually increasing as processes are killed.

The term "gen" I am a little unsure of. I believe this refers to generation. That is, it starts out at 1 and increases by one for every generation of a process in a container. So, for your system, it seems there were 24K+ processes executed since boot. Please correct me if I'm wrong. That should be easy to test.

As to why it killed processes, that's because of your OOM killer configuration. It's trying to bring the free memory back to the expected amount (which looks to be 128 KB). Oracle has a good write-up of how to configure this to something you might like better:

http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
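As a small illustration of that kind of tuning, here is a sketch using the generic Linux per-process knob /proc/<pid>/oom_score_adj, which takes values from -1000 (never kill) to 1000 (kill first). This assumes your kernel provides oom_score_adj (2.6.36 and later; older kernels use oom_adj instead), and I have not verified how it behaves inside an OpenVZ container. The function name `set_oom_score_adj` and its third argument are hypothetical.

```shell
#!/bin/sh
# Sketch: make a process a less (negative) or more (positive) likely
# target for the OOM killer. Lowering the value normally requires root.
# $1 = pid, $2 = adjustment, $3 = hypothetical testing aid: proc root.
set_oom_score_adj() {
    pid=$1
    adj=$2
    proc_root=${3:-/proc}
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 1
    fi
    printf '%s\n' "$adj" > "$proc_root/$pid/oom_score_adj"
}

# Example usage (as root), for a hypothetical PID 1234:
#   set_oom_score_adj 1234 -500
```

The setting is per process and is lost when the process restarts, so it usually belongs in the service's start script rather than being applied by hand.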

Additionally, if you'd like to see the memory configuration for each of your containers, check this out:

https://openvz.org/Setting_UBC_parameters
