Linux – How to know the cause of an OOM error on Linux

linux memory

I have a testing web server.

One time the server was so unresponsive that I had to restart it.

Viewing the logs, I could see the server was out of memory and the OOM killer had killed mysqld.

But from reading some docs about the OOM killer, I know that mysqld wasn't necessarily (though it may have been) the cause of the out-of-memory situation.

So, using only the log files, can I tell which process(es) caused the OOM condition?

Best Answer

How do you define the "cause" of the OOM situation? Is it the process using the most memory? Perhaps you have a DB that always takes 3GB of memory to run and thus uses the most memory on the machine. Is it the "cause" of the problem? Probably not.

Ultimately the cause of the problem is "An unexpected situation which may or may not have been the fault of the sysadmin."

Sometimes you can know. For instance, if you had process accounting set up (+1 to @JamesHannah) and you saw 3000 httpd or sshd processes (and that was unusual), you could probably blame that daemon.
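
If process accounting isn't available, a periodic snapshot of process counts by command name can surface the same kind of anomaly, provided it was being logged before the machine ran out of memory. Here is a minimal sketch in Python, assuming a Linux /proc that exposes a per-process comm file. The function name count_processes_by_name and its proc_root parameter are made up for illustration and are not part of any existing tool.

```python
import os
from collections import Counter


def count_processes_by_name(proc_root="/proc"):
    """Return a Counter mapping command name -> number of live processes.

    proc_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real /proc.
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                counts[f.read().strip()] += 1
        except OSError:
            continue  # process exited (or entry is unreadable); skip it
    return counts


if __name__ == "__main__":
    # Print the ten most common command names with their process counts.
    for name, n in count_processes_by_name().most_common(10):
        print(f"{n:6d}  {name}")
```

Run from cron and appended to a log, output like this would give you a baseline to compare against after an OOM event. It only tells you how many processes there were, not how much memory each one used.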

With that in mind, I present comments from The Source:

/*
 * oom_badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @p: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

"So the ideal candidate for liquidation is a recently started, non privileged process which together with its children uses lots of memory, has been nice'd, and does no raw I/O. Something like a nohup'd parallel kernel build (which is not a bad choice since all results are saved to disk and very little work is lost when a 'make' is terminated)."
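
On a running system you don't have to guess which task the kernel would currently pick: each process exposes its current badness value in /proc/<pid>/oom_score, and a higher value means a more likely victim. The sketch below ranks processes by that value. It assumes a Linux /proc; the function name rank_by_oom_score and its parameters are illustrative. Note that the exact heuristic differs between kernel versions, so the scores may not match the older comment quoted above.

```python
import os


def rank_by_oom_score(proc_root="/proc", top=10):
    """Return up to `top` (score, pid, name) tuples, highest score first.

    proc_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real /proc.
    """
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process vanished or file was unreadable; skip it
        rows.append((score, int(entry), name))
    rows.sort(reverse=True)  # highest score first; ties broken by PID
    return rows[:top]


if __name__ == "__main__":
    for score, pid, name in rank_by_oom_score():
        print(f"{score:6d}  {pid:7d}  {name}")
```

This shows who the likely victim is at that moment, not who caused the shortage, which is the distinction the answer draws.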

Comment block and quote shamelessly stolen from http://linux-mm.org/OOM_Killer