Linux – Avoid a crash when a process allocates too much memory

linux, memory, memory usage

Similar to this question, we have a computing server with 96GB of RAM that is used to run large jobs in parallel.

Occasionally, the total amount of physical RAM is exceeded, which causes the server to become unresponsive, forcing a reboot. To me, this is not acceptable behavior, so I'm looking for ways to fix this.

I know one way would be to set limits using "ulimit -v". However, I'd like to avoid going down that route if possible, as I may occasionally have a single very large process (as opposed to many small ones), so setting a useful threshold is going to be difficult.
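For reference, a job can also apply the same kind of cap to itself from inside the program. The following is a minimal sketch, assuming Linux and the POSIX setrlimit() call with RLIMIT_AS (the limit that "ulimit -v" sets, expressed here in bytes rather than KiB). The helper names limit_address_space and can_map are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>

// Hypothetical helper: cap this process's virtual address space at `bytes`.
// Only the soft limit is lowered, so the hard limit stays untouched.
bool limit_address_space(rlim_t bytes) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_AS, &lim) != 0) return false;
    // The soft limit may not exceed the hard limit.
    if (lim.rlim_max != RLIM_INFINITY && bytes > lim.rlim_max) bytes = lim.rlim_max;
    lim.rlim_cur = bytes;
    return setrlimit(RLIMIT_AS, &lim) == 0;
}

// Hypothetical helper: report whether an anonymous mapping of `n` bytes
// can still be created under the current limit.
bool can_map(std::size_t n) {
    void* p = mmap(nullptr, n, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return false;
    munmap(p, n);
    return true;
}
```

With a cap of 1 GiB, for example, a request for 2 GiB should fail immediately instead of pushing the machine into swap. The threshold problem described above remains: the cap is per process, not per machine.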

I suspect the problem may come from the fact that the system has 20GB of swap: instead of killing the offending process(es), the system will allocate memory on disk which will make it unresponsive. Is reducing the amount of swap a good idea?

Any insight or experiences with a similar problem highly appreciated!

EDIT

I made a few experiments using the following leaking C++ program:

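#include <vector>
#include <unistd.h>  // sleep()

using namespace std;

// Deliberately leaks memory: each iteration allocates a vector that is never freed.
int main(int argc, char* argv[])
{
        while (true) {
                // 50,000,000 zero-initialized doubles, about 400 MB per iteration
                vector<double>* a = new vector<double>(50000000);
                sleep(1);
        }
}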

I ran it a first time with a 256MB swap file. The system completely hung for about 5 minutes, then came back to life. In the logs, I saw that the OOM killer had successfully killed my leaky program.

I ran it a second time with no swap. This time, the machine didn't come back to life for at least ten minutes, at which point I rebooted it. This came as a surprise to me, as I expected the OOM killer to fire earlier on a machine with no swap.

What I don't understand is the following: why does Linux wait until the system is completely hung to do something about the offending process? Is it too much to expect of an OS to not be completely killed by one badly coded process?

Best Answer

If you want your server to stay responsive, you need to do your best to avoid swapping. However, reducing the amount of swap or disabling it will not solve your problem.

You need to either control your jobs' memory usage or install more memory chips in the server.

You can try cgroups (control groups) to limit your processes' CPU and memory usage.
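As a rough illustration of the cgroups approach, here is a minimal sketch, assuming cgroup v2 (where a group's memory cap lives in its memory.max file and members are added by writing a PID to cgroup.procs). It also assumes the group directory already exists, has the memory controller enabled, and is writable by the caller, which normally requires root or a delegated subtree. The function names and any directory you pass in are hypothetical. On older cgroup v1 systems the file names differ.

```cpp
#include <fstream>
#include <string>
#include <unistd.h>

// Write `value` followed by a newline to `path`; true if the write succeeded.
bool write_value(const std::string& path, const std::string& value) {
    std::ofstream out(path);
    out << value << '\n';
    out.flush();
    return out.good();
}

// Hypothetical helper: set the memory cap of an existing cgroup v2 directory
// and move process `pid` into it. Assumes cgroup v2 file names.
bool confine_to_cgroup(const std::string& cgroup_dir,
                       unsigned long long max_bytes, pid_t pid) {
    return write_value(cgroup_dir + "/memory.max", std::to_string(max_bytes))
        && write_value(cgroup_dir + "/cgroup.procs", std::to_string(pid));
}
```

A job launcher could call confine_to_cgroup(dir, limit, getpid()) just before starting the heavy work. Because the limit applies to the group as a whole, one very large process and many small ones are treated the same way, which sidesteps the per-process threshold problem of "ulimit -v".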