Linux – Forensic Analysis of the OOM-Killer

Tags: forensics, linux, oom, ubuntu

Ubuntu's Out-Of-Memory (OOM) Killer wreaked havoc on my server, quietly assassinating my applications: sendmail, apache and others.

I've managed to learn what the OOM Killer is and about its "badness" rules. My machine is small and my applications are even smaller; typically only half of my physical memory is in use, and the swap-space is barely touched, so the kill spree came as a surprise. I am trying to work out the culprit, but I don't know how to read the OOM-Killer logs.

Can anyone please point me to a tutorial on how to read the data in the logs (what are ve, free and gen?), or help me parse these logs?

Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 1, exc 2326 0 goal 2326 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 1, exc 2326 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 2, exc 122 0 goal 383 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 2, exc 383 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 2
Apr 20 20:03:27 EL135 kernel: OOM killed process watchdog (pid=14490, ve=13516) exited, free=43104 gen=24501.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=4457, ve=13516) exited, free=43104 gen=24502.
Apr 20 20:03:27 EL135 kernel: OOM killed process ntpd (pid=10816, ve=13516) exited, free=43104 gen=24503.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=27401, ve=13516) exited, free=43104 gen=24504.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29009, ve=13516) exited, free=43104 gen=24505.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=10557, ve=13516) exited, free=49552 gen=24506.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=24983, ve=13516) exited, free=53117 gen=24507.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=29129, ve=13516) exited, free=68493 gen=24508.
Apr 20 20:03:27 EL135 kernel: OOM killed process sendmail-mta (pid=941, ve=13516) exited, free=68803 gen=24509.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=12418, ve=13516) exited, free=69330 gen=24510.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=22953, ve=13516) exited, free=72275 gen=24511.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=6624, ve=13516) exited, free=76398 gen=24512.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=23317, ve=13516) exited, free=94285 gen=24513.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29030, ve=13516) exited, free=95339 gen=24514.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=20583, ve=13516) exited, free=101663 gen=24515.
Apr 20 20:03:28 EL135 kernel: OOM killed process logger (pid=12894, ve=13516) exited, free=101694 gen=24516.
Apr 20 20:03:28 EL135 kernel: OOM killed process bash (pid=21119, ve=13516) exited, free=101849 gen=24517.
Apr 20 20:03:28 EL135 kernel: OOM killed process atd (pid=991, ve=13516) exited, free=101880 gen=24518.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=14649, ve=13516) exited, free=102748 gen=24519.
Apr 20 20:03:28 EL135 kernel: OOM killed process grep (pid=21375, ve=13516) exited, free=132167 gen=24520.
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 4, exc 4215 0 goal 4826 0...
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 1
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 4, exc 4826 0 red 189481 331
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 2
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 5, exc 3564 0 goal 3564 0...
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 1
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 5, exc 3564 0 red 189481 331
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 2
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 6, exc 8071 0 goal 8071 0...
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 1
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 6, exc 8071 0 red 189481 331
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 2

Watchdog is a watchdog task that was idle; nothing in the logs suggests it had done anything for days. Its job is to restart one of the applications if it dies, so it is a little ironic that it was the first to be killed.

Tail was monitoring a few log files. It is unlikely to have been consuming memory madly.

The apache web-server only serves pages to a couple of developers (the proverbial little old lady who only drives it to church on Sundays), and they were in bed asleep and hadn't visited a page on the site for a few weeks. The only traffic it might have had is from port-scanners; all the content is password-protected and not linked from anywhere, so no spiders are interested.

Python is running two separate custom applications. Nothing in the logs suggests they weren't humming along as normal. One of them is a relatively recent implementation, which makes it suspect #1. It doesn't have any data-structures of any significance and normally uses only about 8% of the total physical RAM. It hasn't misbehaved since.

The grep is suspect #2, and the one I want to be guilty, because it was a once-off command. The command (which piped the output of a grep -r to another grep) had been started at least 30 minutes earlier, and the fact it was still running is suspicious. However, I wouldn't have thought grep would ever use a significant amount of memory. It took a while for the OOM killer to get to it, which suggests it wasn't going mad, but the OOM killer stopped once it was killed, suggesting it may have been a memory-hog that finally satisfied the OOM killer's blood-lust.

Best Answer

I'm new to ServerFault and just saw this post; it seems to have resurfaced near the front of the queue even though it is old. Let's see if we can put this scary one to bed.

First of all, I have an interest in this topic as I am optimizing systems with limited RAM to run many user processes in a secure way.

It is my opinion that the error messages in this log refer to OpenVZ Linux containers.

A "ve" is a virtual environment and also known as a container in OpenVZ. Each container is given an ID and the number you are seeing is that ID. More on this here:

https://openvz.org/Container

The term "free" refers to free memory in bytes at that moment in time. You can see the free memory gradually increasing as processes are killed.

The term "gen" I am a little unsure of. I believe this refers to generation. That is, it starts out at 1 and increases by one for every generation of a process in a container. So, for your system, it seems there were 24K+ processes executed since boot. Please correct me if I'm wrong. That should be easy to test.

As to why it killed processes: that is down to your OOM killer configuration. It is trying to bring free memory back up to the expected amount (which looks to be 128 KB). Oracle has a good write-up on how to configure this to something you might like better:

http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
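
Separately from any global tuning, you can also exempt an individual critical process (your watchdog, say) from the OOM killer. The sketch below assumes a reasonably modern mainline kernel, where the knob is /proc/<pid>/oom_score_adj; the 2.6-era OpenVZ kernels your log appears to come from expose the older /proc/<pid>/oom_adj (range -17..15, with -17 meaning "never kill") instead. Either way you need root (or CAP_SYS_RESOURCE) to lower the value.

    import os

    def protect_from_oom(pid, adj=-1000):
        """Lower a process's OOM score adjustment.

        -1000 tells the kernel never to OOM-kill this process. Requires root
        (or CAP_SYS_RESOURCE). Older kernels use /proc/<pid>/oom_adj with a
        -17..15 range (-17 = never kill) instead of oom_score_adj.
        """
        with open(f"/proc/{pid}/oom_score_adj", "w") as fh:
            fh.write(str(adj))

    if __name__ == "__main__":
        # Example: protect the current process (e.g. run this inside the watchdog).
        protect_from_oom(os.getpid())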

Additionally, if you'd like to see the memory configuration for each of your containers, check this out:

https://openvz.org/Setting_UBC_parameters
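
Inside a container (or on the host), the live counters show up in /proc/user_beancounters, and the failcnt column is the quickest way to see which limit was actually being hit when the OOM killer went to work. A rough sketch, assuming the usual "resource held maxheld barrier limit failcnt" column layout; adjust if your kernel prints something different:

    def failed_beancounters(path="/proc/user_beancounters"):
        """Return {resource: failcnt} for every UBC resource with a non-zero failcnt."""
        failures = {}
        with open(path) as fh:
            for line in fh:
                fields = line.split()
                # Skip the "Version:" line and the column-header line.
                if not fields or fields[0].startswith("Version") or fields[0] == "uid":
                    continue
                # The first row of each container block is prefixed with "<ctid>:".
                if fields[0].endswith(":"):
                    fields = fields[1:]
                # Expected layout: resource held maxheld barrier limit failcnt
                if len(fields) == 6 and fields[-1].isdigit():
                    resource, failcnt = fields[0], int(fields[-1])
                    if failcnt > 0:
                        failures[resource] = failcnt
        return failures

    if __name__ == "__main__":
        for resource, count in sorted(failed_beancounters().items()):
            print(f"{resource}: {count} allocation failures")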