Cannot allocate available memory (even half!) on AWS EC2 instance

amazon-ec2, memory

I have two slightly different AWS EC2 instances of the same type with a large amount of memory (c4.8xlarge with 60 GB of RAM). One of them is simply a copy launched from a backup image (AMI), and the issue cannot be reproduced on it.

I stopped all of the services except the system ones, so most of the memory is free:

> free -m
              total        used        free      shared  buff/cache   available
Mem:          60382         201       59545           9         635       59695
Swap:             0           0           0

I cannot allocate even half of the available memory using the stress utility:

> sudo stress --vm 1 --vm-keep --vm-bytes 30G
stress: info: [40005] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [40006] (494) hogvm malloc failed: Cannot allocate memory
...

And here is the output of memtester:

> sudo memtester 60000
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 60000MB (62914560000 bytes)
got  29811MB (31259688960 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok
  ...

There are no ulimit memory restrictions enabled (see the check after the outputs below). I have the same issue on copies of that server, but everything is fine on the server restored from the older image:

> stress --vm 1 --vm-keep --vm-bytes 58G
stress: info: [14516] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd


> sudo memtester 59000
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 59000MB (61865984000 bytes)
got  59000MB (61865984000 bytes), trying mlock ...locked.
...
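
For reference, this is roughly how I ruled out per-process limits on both machines (just a sketch; an unrestricted shell reports "unlimited" for both values):

> ulimit -v    # max virtual memory size, in kB
unlimited
> ulimit -m    # max resident set size, in kB
unlimited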

What can I do to figure out the issue?

Best Answer

It looks like somebody set the vm.overcommit_memory value to 2 in the new image.
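
A quick way to confirm is to read the sysctl on both machines; if that is indeed the cause, the affected instance should report 2 and the healthy one 0:

> sysctl vm.overcommit_memory
vm.overcommit_memory = 2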

https://www.kernel.org/doc/Documentation/vm/overcommit-accounting:

2   -   Don't overcommit. The total address space commit
        for the system is not permitted to exceed swap + a
        configurable amount (default is 50%) of physical RAM.
        Depending on the amount you use, in most situations
        this means a process will not be killed while accessing
        pages but will receive errors on memory allocation as
        appropriate.
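
You can see the resulting ceiling in /proc/meminfo: CommitLimit is swap + overcommit_ratio% of RAM, and new allocations start failing once Committed_AS would exceed it. With no swap, 60382 MB of RAM and the default ratio of 50%, that limit works out to roughly 30 GB, which is exactly the wall that stress and memtester hit above:

> grep -i commit /proc/meminfo    # shows CommitLimit and Committed_AS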

To fix the issue, switch vm.overcommit_memory back to 0 (the default heuristic overcommit), raise vm.overcommit_ratio, or add about 30 GB of swap.
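
A minimal sketch of the first option (the file name under /etc/sysctl.d/ is just an example):

> sudo sysctl -w vm.overcommit_memory=0    # apply immediately
> echo 'vm.overcommit_memory = 0' | sudo tee /etc/sysctl.d/99-overcommit.conf    # persist across reboots
> sudo sysctl --system    # reload all sysctl configuration files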

I don't really know a general way to track down such weird problems, but I'd probably do the following things:

  • Read all kernel docs related to memory management.
  • Compare the vm.* sysctl parameters on both servers (a sketch follows this list).
  • Inspect the dmesg messages for hardware/system errors.
  • Build the kernel with debug information, attach a debugger, set a breakpoint somewhere near the mmap syscall, and see what's going on.
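
A sketch of the second point, assuming the two hosts are reachable as good-host and bad-host (placeholder names):

> ssh good-host sysctl -a 2>/dev/null | grep '^vm\.' | sort > vm-good.txt
> ssh bad-host  sysctl -a 2>/dev/null | grep '^vm\.' | sort > vm-bad.txt
> diff vm-good.txt vm-bad.txt    # if the diagnosis above is right, vm.overcommit_memory shows up here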