Ubuntu – Proxmox kills VM when RAM usage grows

memoryproxmoxUbuntuzfs

I don't know if this stackexchange site is the most suited to my question, please flag this for mods to move the question elsewhere if it isn't.

Context & issue

I installed Proxmox on a physical machine (home server), chose ZFS mirror during the installation and created a big Ubuntu server VM taking nearly all resources.

Whenever I make heavy disk I/O in the VM (writing or reading large amounts of data), the RAM usage grows and after it reaches 100%, the VM crashes. When reading the output of htop, it's the orange portion of the RAM that grows, which I understand is the buffer/cache used by kernel and should be recoverable.

Setup

My current setup is the following:

  • 24GB RAM
  • 2 x 2TB HDD
  • Proxmox installed on both drives, ZFS mirror chosen during the install.
  • Ubuntu 20.04 on a VM on Proxmox, taking 1.5TB of space and limited to 20GB of RAM. It's using LVM and ext4, the VirtIO SCSI controller and mostly default values from Proxmox.

Hard disk specs of the Proxmox VM

Research

I didn't really find anything like this with my favorite search engine but maybe I just didn't use the right keywords 🤔

I read that poor disk performance on ZFS might be linked to an inappropriate ashift value. Mine seems to be 12 for disks with a sector size of 4096 bytes, which is ok from what I understand (2^12 = 4096), see code block below.

I tried reducing the RAM allocated to the VM to 16GB in the event that ZFS needed more to handle all this data flow but it just made it crash faster.

The zpool seems healthy as far as I can tell (see code block below).

Initially the Ubuntu install came with an 8GB swap file which I disabled, didn't help.

I'm pretty much out of ideas now. Feel free to point for obvious mistakes or oversight!

I guess my questions are: why does it crash ? and why does the RAM usage grow?

gaugendre@ubuntu:~$ free -h  # the "buff/cache" grows
              total        used        free      shared  buff/cache   available
Mem:           19Gi       1.4Gi       4.7Gi       119Mi        13Gi        17Gi
Swap:            0B          0B          0B

root@pm:~# zpool get all | grep ashift
rpool  ashift                         12                             local

root@pm:~# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EFRX-68E
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D4515220-E091-40B3-B6CC-2FE98C69A5F2

Device       Start        End    Sectors  Size Type
/dev/sda1       34       2047       2014 1007K BIOS boot
/dev/sda2     2048    1050623    1048576  512M EFI System
/dev/sda3  1050624 3907029134 3905978511  1.8T Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EFRX-68E
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1A09CBBB-F15C-47BD-A5E6-BC616AA46EB6

Device       Start        End    Sectors  Size Type
/dev/sdb1       34       2047       2014 1007K BIOS boot
/dev/sdb2     2048    1050623    1048576  512M EFI System
/dev/sdb3  1050624 3907029134 3905978511  1.8T Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.


Disk /dev/zd0: 1.5 TiB, 1610612736000 bytes, 3145728000 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: gpt
Disk identifier: 8F028008-A5B1-4D8A-BF51-825BCC73949F

Device       Start        End    Sectors  Size Type
/dev/zd0p1    2048       4095       2048    1M BIOS boot
/dev/zd0p2    4096    2101247    2097152    1G Linux filesystem
/dev/zd0p3 2101248 3145725951 3143624704  1.5T Linux filesystem

root@pm:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  1.81T   157G  1.66T        -         -     0%     8%  1.00x    ONLINE  -

root@pm:~# zpool status -v
  pool: rpool
 state: ONLINE
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    rpool                                               ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M7KVECCZ-part3  ONLINE       0     0     0
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3FASL3F-part3  ONLINE       0     0     0

errors: No known data errors

htop RAM usage growing
htop screenshot

Proxmox OOM kill logs

Feb 02 15:51:39 pm kernel: pve-firewall invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Feb 02 15:51:39 pm kernel: CPU: 1 PID: 1496 Comm: pve-firewall Tainted: P           O      5.4.73-1-pve #1
Feb 02 15:51:39 pm kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M Pro4-F, BIOS P2.20 07/27/2020
Feb 02 15:51:39 pm kernel: Call Trace:
Feb 02 15:51:39 pm kernel:  dump_stack+0x6d/0x9a
Feb 02 15:51:39 pm kernel:  dump_header+0x4f/0x1e1
Feb 02 15:51:39 pm kernel:  oom_kill_process.cold.33+0xb/0x10
Feb 02 15:51:39 pm kernel:  out_of_memory+0x1ad/0x490
Feb 02 15:51:39 pm kernel:  __alloc_pages_slowpath+0xd40/0xe30
Feb 02 15:51:39 pm kernel:  ? __wake_up_common_lock+0x8c/0xc0
Feb 02 15:51:39 pm kernel:  __alloc_pages_nodemask+0x2df/0x330
Feb 02 15:51:39 pm kernel:  alloc_pages_current+0x81/0xe0
Feb 02 15:51:39 pm kernel:  __page_cache_alloc+0x6a/0xa0
Feb 02 15:51:39 pm kernel:  pagecache_get_page+0xbe/0x2e0
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x34/0x70
Feb 02 15:51:39 pm kernel:  filemap_fault+0x887/0xa70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x34/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x40/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x34/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x40/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x34/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x40/0x70
Feb 02 15:51:39 pm kernel:  ? __switch_to_asm+0x34/0x70
Feb 02 15:51:39 pm kernel:  ? xas_load+0xc/0x80
Feb 02 15:51:39 pm kernel:  ? xas_find+0x17e/0x1b0
Feb 02 15:51:39 pm kernel:  ? filemap_map_pages+0x28d/0x3b0
Feb 02 15:51:39 pm kernel:  __do_fault+0x3c/0x130
Feb 02 15:51:39 pm kernel:  __handle_mm_fault+0xe75/0x12a0
Feb 02 15:51:39 pm kernel:  handle_mm_fault+0xc9/0x1f0
Feb 02 15:51:39 pm kernel:  __do_page_fault+0x233/0x4c0
Feb 02 15:51:39 pm kernel:  do_page_fault+0x2c/0xe0
Feb 02 15:51:39 pm kernel:  page_fault+0x34/0x40
Feb 02 15:51:39 pm kernel: RIP: 0033:0x55e4e4e79b18
Feb 02 15:51:39 pm kernel: Code: Bad RIP value.
Feb 02 15:51:39 pm kernel: RSP: 002b:00007ffdaad052a0 EFLAGS: 00010206
Feb 02 15:51:39 pm kernel: RAX: 0000000000000000 RBX: 000055e4e6577260 RCX: 00007f073b0286f4
Feb 02 15:51:39 pm kernel: RDX: 0000000000000000 RSI: 00007ffdaad05260 RDI: 00007ffdaad05260
Feb 02 15:51:39 pm kernel: RBP: 000055e4e657ca78 R08: 0000000000000e02 R09: 0000000000000602
Feb 02 15:51:39 pm kernel: R10: 000000000517e000 R11: 0000000000000246 R12: 000055e4e89ef7b0
Feb 02 15:51:39 pm kernel: R13: 000055e4e657ca70 R14: 0000000000000000 R15: 0000000000000000
Feb 02 15:51:39 pm kernel: Mem-Info:
Feb 02 15:51:39 pm kernel: active_anon:5184091 inactive_anon:16558 isolated_anon:0
 active_file:84 inactive_file:0 isolated_file:0
 unevictable:1330 dirty:0 writeback:10 unstable:0
 slab_reclaimable:13066 slab_unreclaimable:144860
 mapped:14964 shmem:17028 pagetables:11872 bounce:0
 free:40335 free_pcp:762 free_cma:0
Feb 02 15:51:39 pm kernel: Node 0 active_anon:20736364kB inactive_anon:66232kB active_file:628kB inactive_file:0kB unevictable:5320kB isolated(anon):0kB isolated(file):0kB mapped:59856kB dirty:0kB writeback:40kB shmem:68112kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 7663616kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Feb 02 15:51:39 pm kernel: Node 0 DMA free:15884kB min:44kB low:56kB high:68kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 02 15:51:39 pm kernel: lowmem_reserve[]: 0 1357 21897 21897 21897
Feb 02 15:51:39 pm kernel: Node 0 DMA32 free:86332kB min:4184kB low:5572kB high:6960kB active_anon:1333188kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1497008kB managed:1431132kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
Feb 02 15:51:39 pm kernel: lowmem_reserve[]: 0 0 20539 20539 20539
Feb 02 15:51:39 pm kernel: Node 0 Normal free:59628kB min:63348kB low:84380kB high:105412kB active_anon:19403176kB inactive_anon:66232kB active_file:0kB inactive_file:400kB unevictable:5320kB writepending:40kB present:21482752kB managed:21040620kB mlocked:5320kB kernel_stack:6304kB pagetables:47480kB bounce:0kB free_pcp:2980kB local_pcp:28kB free_cma:0kB
Feb 02 15:51:39 pm kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 02 15:51:39 pm kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Feb 02 15:51:39 pm kernel: Node 0 DMA32: 3*4kB (UM) 0*8kB 0*16kB 6*32kB (UM) 8*64kB (UM) 70*128kB (UM) 60*256kB (U) 42*512kB (UM) 39*1024kB (U) 0*2048kB 0*4096kB = 86476kB
Feb 02 15:51:39 pm kernel: Node 0 Normal: 5350*4kB (UME) 460*8kB (UME) 177*16kB (UME) 718*32kB (UME) 112*64kB (UM) 9*128kB (U) 4*256kB (UM) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 60744kB
Feb 02 15:51:39 pm kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Feb 02 15:51:39 pm kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 02 15:51:39 pm kernel: 18053 total pagecache pages
Feb 02 15:51:39 pm kernel: 0 pages in swap cache
Feb 02 15:51:39 pm kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 02 15:51:39 pm kernel: Free swap  = 0kB
Feb 02 15:51:39 pm kernel: Total swap = 0kB
Feb 02 15:51:39 pm kernel: 5748939 pages RAM
Feb 02 15:51:39 pm kernel: 0 pages HighMem/MovableOnly
Feb 02 15:51:39 pm kernel: 127030 pages reserved
Feb 02 15:51:39 pm kernel: 0 pages cma reserved
Feb 02 15:51:39 pm kernel: 0 pages hwpoisoned
Feb 02 15:51:39 pm kernel: Tasks state (memory values in pages):
Feb 02 15:51:39 pm kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Feb 02 15:51:39 pm kernel: [    683]     0   683    19753     9110   172032        0             0 systemd-journal
Feb 02 15:51:39 pm kernel: [    755]     0   755     5806      650    69632        0         -1000 systemd-udevd
Feb 02 15:51:39 pm kernel: [   1005]   106  1005     1705      385    57344        0             0 rpcbind
Feb 02 15:51:39 pm kernel: [   1006]   100  1006    23270      573    86016        0             0 systemd-timesyn
Feb 02 15:51:39 pm kernel: [   1020]     0  1020     4878      611    77824        0             0 systemd-logind
Feb 02 15:51:39 pm kernel: [   1021]     0  1021    93118      287    86016        0             0 lxcfs
Feb 02 15:51:39 pm kernel: [   1022]     0  1022    41689      511    81920        0             0 zed
Feb 02 15:51:39 pm kernel: [   1023]   104  1023     2264      444    57344        0          -900 dbus-daemon
Feb 02 15:51:39 pm kernel: [   1024]     0  1024    68958      296    77824        0             0 pve-lxc-syscall
Feb 02 15:51:39 pm kernel: [   1025]     0  1025     1022      326    49152        0             0 qmeventd
Feb 02 15:51:39 pm kernel: [   1026]     0  1026    56455      662    90112        0             0 rsyslogd
Feb 02 15:51:39 pm kernel: [   1029]     0  1029     3064      656    61440        0             0 smartd
Feb 02 15:51:39 pm kernel: [   1030]     0  1030      535      240    40960        0         -1000 watchdog-mux
Feb 02 15:51:39 pm kernel: [   1071]     0  1071     1681      330    49152        0             0 ksmtuned
Feb 02 15:51:39 pm kernel: [   1137]     0  1137      954      288    45056        0             0 lxc-monitord
Feb 02 15:51:39 pm kernel: [   1182]     0  1182     1722       61    49152        0             0 iscsid
Feb 02 15:51:39 pm kernel: [   1187]     0  1187     1848     1257    49152        0           -17 iscsid
Feb 02 15:51:39 pm kernel: [   1212]     0  1212     1402      350    49152        0             0 agetty
Feb 02 15:51:39 pm kernel: [   1230]     0  1230     3962      630    69632        0         -1000 sshd
Feb 02 15:51:39 pm kernel: [   1391]     0  1391   218005      697   208896        0             0 rrdcached
Feb 02 15:51:39 pm kernel: [   1461]     0  1461    10868      651    86016        0             0 master
Feb 02 15:51:39 pm kernel: [   1463]   107  1463    10968      597    86016        0             0 qmgr
Feb 02 15:51:39 pm kernel: [   1465]     0  1465   378386    12276   450560        0             0 pmxcfs
Feb 02 15:51:39 pm kernel: [   1473]     0  1473     2125      565    53248        0             0 cron
Feb 02 15:51:39 pm kernel: [   1496]     0  1496    75361    20384   307200        0             0 pve-firewall
Feb 02 15:51:39 pm kernel: [   1500]     0  1500      568      187    40960        0             0 none
Feb 02 15:51:39 pm kernel: [   1505]     0  1505    74951    19974   307200        0             0 pvestatd
Feb 02 15:51:39 pm kernel: [   1541]     0  1541    88255    29812   405504        0             0 pvedaemon
Feb 02 15:51:39 pm kernel: [   1555]     0  1555    83335    23049   364544        0             0 pve-ha-crm
Feb 02 15:51:39 pm kernel: [   1632]    33  1632    88663    30378   434176        0             0 pveproxy
Feb 02 15:51:39 pm kernel: [   1638]    33  1638    17578    12613   184320        0             0 spiceproxy
Feb 02 15:51:39 pm kernel: [   1640]     0  1640    83246    22949   368640        0             0 pve-ha-lrm
Feb 02 15:51:39 pm kernel: [   1483]     0  1483  5418704  5111805 41844736        0             0 kvm
Feb 02 15:51:39 pm kernel: [  15884]     0 15884    90381    30381   430080        0             0 pvedaemon worke
Feb 02 15:51:39 pm kernel: [  30602]     0 30602    90429    30408   434176        0             0 pvedaemon worke
Feb 02 15:51:39 pm kernel: [  10126]     0 10126    90380    30339   434176        0             0 pvedaemon worke
Feb 02 15:51:39 pm kernel: [  21058]    33 21058    17631    12591   176128        0             0 spiceproxy work
Feb 02 15:51:39 pm kernel: [  21085]     0 21085    21543      371    69632        0             0 pvefw-logger
Feb 02 15:51:39 pm kernel: [  21097]    33 21097    88702    30316   417792        0             0 pveproxy worker
Feb 02 15:51:39 pm kernel: [  21098]    33 21098    88663    30272   417792        0             0 pveproxy worker
Feb 02 15:51:39 pm kernel: [  21099]    33 21099    88702    30278   417792        0             0 pveproxy worker
Feb 02 15:51:39 pm kernel: [   7252]   107  7252    10956      616    86016        0             0 pickup
Feb 02 15:51:39 pm kernel: [  32252]     0 32252     1314      153    53248        0             0 sleep
Feb 02 15:51:39 pm kernel: [   2329]     0  2329     5806      544    65536        0             0 systemd-udevd
Feb 02 15:51:39 pm kernel: [   2330]     0  2330     5806      542    65536        0             0 systemd-udevd
Feb 02 15:51:39 pm kernel: [   2331]     0  2331     5806      542    65536        0             0 systemd-udevd
Feb 02 15:51:39 pm kernel: [   2332]     0  2332     5806      542    65536        0             0 systemd-udevd
Feb 02 15:51:39 pm kernel: [   4195]     0  4195     5806      653    65536        0             0 systemd-udevd
Feb 02 15:51:39 pm kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/qemu.slice/100.scope,task=kvm,pid=1483,uid=0
Feb 02 15:51:39 pm kernel: Out of memory: Killed process 1483 (kvm) total-vm:21674816kB, anon-rss:20445684kB, file-rss:1536kB, shmem-rss:0kB, UID:0 pgtables:40864kB oom_score_adj:0
Feb 02 15:51:39 pm kernel: oom_reaper: reaped process 1483 (kvm), now anon-rss:0kB, file-rss:68kB, shmem-rss:0kB
Feb 02 15:51:40 pm kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Feb 02 15:51:40 pm kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Feb 02 15:51:40 pm systemd[1]: 100.scope: Succeeded.
Feb 02 15:51:43 pm qmeventd[1019]: Starting cleanup for 100
Feb 02 15:51:43 pm kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Feb 02 15:51:43 pm kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Feb 02 15:51:43 pm kernel: device fwln100i0 left promiscuous mode
Feb 02 15:51:43 pm kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Feb 02 15:51:43 pm kernel: device fwpr100p0 left promiscuous mode
Feb 02 15:51:43 pm kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Feb 02 15:51:44 pm qmeventd[1019]: Finished cleanup for 100

Best Answer

My issue was likely due to a combination of several factors, leading to a memory shortage on the host and ultimately the VM getting OOM killed to retrieve memory.

Factors in play:

  • The host has 24GB of physical RAM but for some reason it only reports ~22GB. I didn't investigate that further yet.
  • 20GB of RAM were allocated to the single VM, which left only ~2GB for the host and ZFS.
  • Heavy RAM usage and disk I/O inside the VM.
  • The disk I/O inside the VM caused the ZFS ARC cache to grow without bounds.

Ultimately, the host killed the virtual machine process to retrieve RAM and prevent an outage.

I limited ZFS's ARC cache on the host to 4GB and reduced the RAM allocated to the VM to 12GB to let the host breathe.

Limit ZFS ARC cache (on the host):

$ cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_max=4294967296

Take new options into account (on the host):

# update-initramfs -u

Thanks to @MichaelHampton for leading me to this conclusion.

Related Topic