Linux – Automatically suspend/hibernate a process when it takes too much memory

linux, memory usage, virtual-memory

I'm running a small Debian compute cluster on 8-core PCs with 16GB of RAM. I run batches of about 1k tasks (each batch is expected to take about a month in total to finish). A single task is single-threaded (so I can run several of them in parallel on each PC), does not consume much I/O (it loads several megabytes of data on start and dumps several megabytes on exit; it does not otherwise communicate with the outside world), its run time is unknown (from a few minutes to about a week), and its memory consumption is unknown (ranging from several megabytes to ~8GB; usage may grow slowly or quickly). I'd like to run as many such tasks as possible in parallel on a single PC, but I want to avoid excessive swapping.

So I got an idea: I could monitor the memory usage of these tasks and suspend (kill -SIGSTOP) or hibernate (using a tool like CryoPID) tasks that consume too much memory, and resume them later. By memory usage I mean the number of "active virtual pages", i.e. the number of allocated, non-shared memory pages that have actually been touched (these tasks may allocate memory without using it).
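For example, a crude watcher along these lines (the 4 GiB threshold is illustrative, and I'm approximating "touched, non-shared pages" with the Private_Dirty lines of /proc/<pid>/smaps):

    #!/bin/sh
    # Rough sketch: sum the Private_Dirty lines of /proc/<pid>/smaps as an
    # approximation of touched, non-shared pages, and SIGSTOP the task if
    # the total crosses an illustrative 4 GiB threshold.
    PID="$1"
    LIMIT_KB=$((4 * 1024 * 1024))   # 4 GiB, in KiB as smaps reports

    used_kb=$(awk '/^Private_Dirty:/ { sum += $2 } END { print sum + 0 }' \
        "/proc/$PID/smaps")

    if [ "$used_kb" -gt "$LIMIT_KB" ]; then
        kill -SIGSTOP "$PID"    # kill -SIGCONT "$PID" resumes it later
    fi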

I started looking for tools to do that. I know that I can use ulimit or run a task inside a memory-limited cgroup, but, if I understand them correctly, these solutions kill the process instead of suspending it. I want to avoid killing tasks, because I would then need to start them from scratch later, and that means wasted time. Also, neither can actually measure the number of active virtual pages.
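For instance, both of these enforce a limit, but neither suspends the task (limits and paths are illustrative, and ./task stands in for one of my programs):

    # ulimit: allocations past the cap just fail, and most programs
    # then abort:
    ( ulimit -v $((4 * 1024 * 1024)); ./task )   # 4 GiB address-space cap

    # memory cgroup (v1, mounted at /sys/fs/cgroup/memory): the OOM
    # killer terminates the task once the limit can't be reclaimed:
    mkdir /sys/fs/cgroup/memory/demo
    echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/demo/tasks
    ./task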

I could use real virtual machines, but they seem to carry significant overhead in this case: each has its own kernel, memory allocations, and so on, which would decrease the available memory, and I'd have to run 8 of them. As far as I know, they'd add computational overhead too.

I imagine that a tool implementing such behavior would hook into page-fault notifications and decide, on each page fault, whether it is time to suspend the process. But I don't know of any tool that works this way either.

Are there other choices?

Best Answer

What you are referring to is process checkpointing. There is some work in recent kernels to offer this (in conjunction with the freezer cgroup), but it's not ready yet.

Unfortunately, this is actually very difficult to achieve well, because certain shared resources go stale after being unavailable for a fixed period of time (TCP springs to mind, although this may also apply to applications that use a wall clock, or to shared memory that changes state during a process's offline period).

As for stopping the process when it reaches a certain memory utilization, there's a hack I can think of that will do this.

  • Create a cgroup that contains the freezer and memory subsystems.
  • Place your task(s) inside the cgroup.
  • Attach a process to cgroup.event_control and set a memory threshold that you do not want to exceed (this is somewhat explained in the kernel documentation).
  • When the threshold is exceeded, freeze the cgroup. The kernel should eventually evict the frozen pages to swap (provided your cgroup has enough). A rough sketch follows this list.
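
A rough sketch of the steps above, assuming cgroup v1, root privileges, and made-up paths; it also cheats by polling memory.usage_in_bytes instead of wiring an eventfd into cgroup.event_control, which needs a small helper program:

    # Mount a v1 hierarchy with both controllers and create a group.
    mkdir -p /cg
    mount -t cgroup -o freezer,memory batch /cg
    mkdir /cg/task1
    echo "$TASK_PID" > /cg/task1/tasks   # $TASK_PID: the task to manage

    # Poll usage and freeze the group past an illustrative 6 GiB.
    THRESHOLD=$((6 * 1024 * 1024 * 1024))
    while sleep 5; do
        usage=$(cat /cg/task1/memory.usage_in_bytes)
        if [ "$usage" -gt "$THRESHOLD" ]; then
            echo FROZEN > /cg/task1/freezer.state
            break
        fi
    done

    # Later, to resume the task(s) in the group:
    echo THAWED > /cg/task1/freezer.state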

Note the "freeze" cgroup will not evict pages to a media persistent location, but it will swap the pages out when enough time has passed and the pages are needed for something else.

Even if this works (and it's pretty hacky if it does), you need to consider whether it is really doing anything to solve your problem.

  • How do you know it wouldn't be better to let a process that uses a lot of memory run at full speed, finish its memory-intensive phase quickly, and relinquish the memory?
  • If you try to wake processes up fairly by round-robining them, you could argue you're doing a worse job than the CPU scheduler already does for you.
  • If some processes are more important than others (and should stay awake longer or finish sooner), it's probably better to allocate them more CPU time than to keep other processes completely frozen.
  • Whilst it would be slow, you could add a lot of swap (so you can never overcommit) and then greatly reduce the interactivity of the scheduler to limit aggressive page evictions. This is done via sched_min_granularity_ns; see the example after this list.
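
For that last point, something along these lines (the 10 ms value is only an example; on newer kernels this tunable moved out of /proc/sys into debugfs):

    # Raise the minimum scheduling granularity to 10 ms so tasks run in
    # longer bursts before being preempted.
    sysctl -w kernel.sched_min_granularity_ns=10000000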

Unfortunately, the best solution would be the ability to checkpoint your tasks. It's a shame that most of the implementations are just not concrete enough yet.

Alternatively, you could wait a couple of years for proper checkpoint/restore to be available in the kernel!