Long page allocation stalls on Linux – why does this happen?

linux, memory, oom

I have a problem (which I can reliably reproduce) on a bunch of Linux hosts, where the system becomes completely unresponsive after a process aggressively consumes memory. I see things like this in the kernel log:

2017-09-14 19:53:51.252365 kernel: hyperkube: page allocation stalls for 62933ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
2017-09-14 19:53:51.252402 kernel: hyperkube cpuset=kube-proxy mems_allowed=0
2017-09-14 19:53:51.252440 kernel: CPU: 1 PID: 1438 Comm: hyperkube Not tainted 4.11.9-coreos #1
2017-09-14 19:53:51.252478 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
2017-09-14 19:53:51.252512 kernel: Call Trace:
2017-09-14 19:53:51.252591 kernel:  dump_stack+0x63/0x90
2017-09-14 19:53:51.252628 kernel:  warn_alloc+0x11c/0x1b0
2017-09-14 19:53:51.252682 kernel:  __alloc_pages_slowpath+0x811/0xe50
2017-09-14 19:53:51.252720 kernel:  ? alloc_pages_current+0x8c/0x110
2017-09-14 19:53:51.258910 kernel:  __alloc_pages_nodemask+0x21b/0x230
2017-09-14 19:53:51.258951 kernel:  alloc_pages_current+0x8c/0x110
2017-09-14 19:53:51.259009 kernel:  __page_cache_alloc+0xae/0xc0
2017-09-14 19:53:51.259041 kernel:  filemap_fault+0x338/0x630
2017-09-14 19:53:51.268298 kernel:  ? filemap_map_pages+0x19d/0x390
2017-09-14 19:53:51.268360 kernel:  ext4_filemap_fault+0x31/0x50 [ext4]
2017-09-14 19:53:51.268397 kernel:  __do_fault+0x1e/0xc0
2017-09-14 19:53:51.268436 kernel:  __handle_mm_fault+0xb06/0x1090
2017-09-14 19:53:51.268471 kernel:  handle_mm_fault+0xd1/0x240
2017-09-14 19:53:51.268504 kernel:  __do_page_fault+0x222/0x4b0
2017-09-14 19:53:51.268539 kernel:  do_page_fault+0x22/0x30
2017-09-14 19:53:51.268572 kernel:  page_fault+0x28/0x30
2017-09-14 19:53:51.268605 kernel: RIP: 0033:0x45d561
2017-09-14 19:53:51.268666 kernel: RSP: 002b:00007f64d3ef2de8 EFLAGS: 00010246
2017-09-14 19:53:51.268717 kernel: RAX: 0000000000000000 RBX: 000000000000007c RCX: 000000000045d561
2017-09-14 19:53:51.268757 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2017-09-14 19:53:51.277186 kernel: RBP: 00007f64d3ef2df8 R08: 00007f64d3ef2de8 R09: 0000000000000000
2017-09-14 19:53:51.277239 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
2017-09-14 19:53:51.277283 kernel: R13: 0000000000000034 R14: 0000000000000000 R15: 00000000000000f3
2017-09-14 19:53:51.277322 kernel: Mem-Info:
2017-09-14 19:53:51.277355 kernel: active_anon:903273 inactive_anon:164 isolated_anon:0
                                    active_file:166 inactive_file:754 isolated_file:0
                                    unevictable:0 dirty:0 writeback:0 unstable:0
                                    slab_reclaimable:8251 slab_unreclaimable:17340
                                    mapped:591 shmem:2354 pagetables:4389 bounce:0
                                    free:14896 free_pcp:73 free_cma:0
2017-09-14 19:53:51.277393 kernel: Node 0 active_anon:3613092kB inactive_anon:656kB active_file:864kB inactive_file:2744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2364kB dirty:0kB writeback:0kB shmem:9416kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 606208kB writeback_tmp:0kB unstable:0kB pages_scanned:246 all_unreclaimable? no
2017-09-14 19:53:51.288390 kernel: Node 0 DMA free:15052kB min:184kB low:228kB high:272kB active_anon:764kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:84kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
2017-09-14 19:53:51.288448 kernel: lowmem_reserve[]: 0 3717 3717 3717
2017-09-14 19:53:51.288483 kernel: Node 0 DMA32 free:44532kB min:44868kB low:56084kB high:67300kB active_anon:3612328kB inactive_anon:656kB active_file:912kB inactive_file:2516kB unevictable:0kB writepending:0kB present:3915776kB managed:3841148kB mlocked:0kB slab_reclaimable:33004kB slab_unreclaimable:69276kB kernel_stack:10096kB pagetables:17556kB bounce:0kB free_pcp:412kB local_pcp:156kB free_cma:0kB
2017-09-14 19:53:51.288520 kernel: lowmem_reserve[]: 0 0 0 0
2017-09-14 19:53:51.288553 kernel: Node 0 DMA: 5*4kB (UM) 1*8kB (M) 3*16kB (UM) 2*32kB (UM) 1*64kB (M) 2*128kB (UM) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 3*4096kB (ME) = 15052kB
2017-09-14 19:53:51.288609 kernel: Node 0 DMA32: 537*4kB (UMEH) 360*8kB (UMEH) 397*16kB (UMEH) 238*32kB (UMEH) 141*64kB (UMEH) 61*128kB (E) 22*256kB (E) 4*512kB (ME) 1*1024kB (M) 0*2048kB 0*4096kB = 44532kB
2017-09-14 19:53:51.288735 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
2017-09-14 19:53:51.288784 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
2017-09-14 19:53:51.294569 kernel: 3234 total pagecache pages
2017-09-14 19:53:51.294619 kernel: 0 pages in swap cache
2017-09-14 19:53:51.294670 kernel: Swap cache stats: add 0, delete 0, find 0/0
2017-09-14 19:53:51.294747 kernel: Free swap  = 0kB
2017-09-14 19:53:51.294781 kernel: Total swap = 0kB
2017-09-14 19:53:51.294825 kernel: 982941 pages RAM
2017-09-14 19:53:51.300569 kernel: 0 pages HighMem/MovableOnly
2017-09-14 19:53:51.300616 kernel: 18679 pages reserved
2017-09-14 19:53:51.300673 kernel: 0 pages hwpoisoned

As you can see, the system stalled for more than 60 seconds on a single order-0 allocation. Only after around 10 minutes of the system being completely unusable does the OOM killer step in and kill the greedy process.
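
For context, a memory hog along these lines is enough to trigger it on these hosts (a minimal, illustrative sketch in Python, not my actual workload):

    # Illustrative only: grow anonymous memory and touch every page so the
    # kernel actually has to back it. On a swapless host this drives the
    # machine into the allocation stalls shown above. Run it in a throwaway
    # VM, not on anything you care about.
    chunks = []
    try:
        while True:
            buf = bytearray(64 * 1024 * 1024)   # 64 MiB of anonymous memory
            for i in range(0, len(buf), 4096):  # touch each page so it is really resident
                buf[i] = 1
            chunks.append(buf)                  # keep a reference so nothing is freed
    except MemoryError:
        # With overcommit enabled this branch is rarely reached; the OOM
        # killer usually terminates the process first.
        pass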

I'd really love it if someone could help me understand:

  • Why does it take the OOM killer so long to act?
  • Why do these allocations take so long? If there is no memory available, why don't they simply fail?

Best Answer

Relying on the OOM killer means you have already lost. The OOM killer is only a last-chance effort to prevent a kernel panic.

Collecting performance data will show you the symptoms leading up to the unresponsiveness. In particular, if the load average is much greater than the number of CPUs, too many runnable tasks are piling up, probably because so much system time is being spent scanning memory pages, and that is what produces the poor response times.
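
As a quick check (a sketch using the standard library, nothing specific to your setup), compare the load average against the CPU count:

    import os

    # The 1/5/15 minute load averages count runnable tasks plus tasks in
    # uninterruptible sleep. A sustained load far above the CPU count means
    # work is queueing up faster than it can be serviced.
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count()

    print(f"load averages {load1:.2f} {load5:.2f} {load15:.2f} across {cpus} CPUs")
    if load1 > 2 * cpus:
        print("runnable tasks are piling up well beyond CPU capacity")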

As for why determining that there is no memory left is not trivial, have a look at Documentation/sysctl/vm.txt, specifically overcommit_memory. The kernel goes to considerable lengths to pretend there is more memory available than there is, because actual physical use is typically much smaller than what is allocated. Keeping up that pretence takes CPU time, and the cost gets far worse in pathologically low-memory conditions.
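
If you want to see which policy a host is running and how far the kernel has committed itself, the knob and the accounting counters are all exposed under /proc. A minimal sketch (the file and field names are standard; what you do with the numbers is up to you):

    # vm.overcommit_memory: 0 = heuristic overcommit (the default),
    # 1 = always allow, 2 = strict accounting against CommitLimit.
    with open("/proc/sys/vm/overcommit_memory") as f:
        policy = int(f.read())

    # /proc/meminfo reports the commit limit and how much is already committed.
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])  # values are in kB

    print(f"vm.overcommit_memory = {policy}")
    print(f"CommitLimit  = {meminfo['CommitLimit']} kB")
    print(f"Committed_AS = {meminfo['Committed_AS']} kB")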

I see this is a Xen HVM guest, probably on EC2. In the cloud it is easy to spin up a larger instance if you want to throw resources at the problem.

If you suspect that this much allocated memory is too much for the application, have its developers look at its memory profile, for example with Valgrind, or by examining a kdump captured when the kernel panicked on OOM.