Linux – Ubuntu 14.04 KVM host – not allocating KVM guest correctly, high load

kvm-virtualization linux ubuntu windows-server-2008-r2

We are encountering an odd issue with Ubuntu 14.04 KVM hosts and VMs. We have a number of physical machines, all the same spec, but when we try to start the same type of KVM image (OpenNebula manages the VM deployment, so the VM configuration is exactly the same on each machine), over half the hosts have problems and their VMs do not boot.

The VMs (Windows 2008 R2) are allocated 16GB of RAM and 12 cores; each host has 24GB of RAM and 12 cores / 24 threads and runs only 1 VM, so there is no lack of resources. Nothing else on the host is contending for resources or using any significant memory (more than 500MB); there are just a few things like puppet, splunk etc.

If I log onto one of the hosts with a problem and look at top:

top - 05:13:21 up 2 days, 21:23,  1 user,  load average: 81.41, 74.04, 44.58
Tasks: 302 total,   1 running, 301 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.3 us,  8.4 sy,  0.0 ni, 58.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  24680184 total, 24467788 used,   212396 free,   141636 buffers
KiB Swap: 37731324 total,        0 used, 37731324 free. 11193004 cached Mem

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                          
3593 oneadmin  20   0 21.771g 0.011t   9356 D 999.0 49.8 131:19.15 qemu-system-x86

You can see KVM has only been able to allocate 11GB (RES shows 0.011t), and qemu-system-x86 is hammering the CPU (which is fine, plenty of cores free), but the VM itself hasn't fully started yet (I can't RDP, and the high CPU usage seems to be related to, or an effect of, the issue).
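For readers comparing the two top outputs, here is a minimal sketch of the unit arithmetic behind the "11GB" figure. It assumes the usual procps top behaviour, where unsuffixed memory columns are in KiB and the suffixes m, g and t scale by powers of 1024. The helper name `top_mem_to_gib` is hypothetical and not part of any tool.

```python
# Multipliers relative to KiB, assumed to be the base unit top uses
# for unsuffixed task memory columns (VIRT, RES, SHR).
_SUFFIX_TO_KIB = {"m": 1024, "g": 1024**2, "t": 1024**3, "p": 1024**4}


def top_mem_to_gib(field: str) -> float:
    """Convert a top memory field such as '0.011t', '21.771g' or '9356' to GiB."""
    field = field.strip().lower()
    if field and field[-1] in _SUFFIX_TO_KIB:
        kib = float(field[:-1]) * _SUFFIX_TO_KIB[field[-1]]
    else:
        kib = float(field)  # no suffix: value is already in KiB
    return kib / 1024**2


if __name__ == "__main__":
    # RES values copied from the two top outputs in the question.
    for label, res in (("stuck host", "0.011t"), ("healthy host", "0.015t")):
        print(f"{label}: RES {res} is about {top_mem_to_gib(res):.1f} GiB")
```

Because top rounds the t-suffixed value to three decimals, the result is only accurate to roughly 1 GiB, so it is useful for a rough comparison against the 16GB allocation rather than an exact figure.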

In comparison, if I look at an identical machine running the same VM config:

top - 05:24:58 up 3 days,  3:53,  1 user,  load average: 1.08, 1.15, 0.72
Tasks: 290 total,   1 running, 289 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.7 sy,  0.0 ni, 96.0 id,  2.6 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  24680184 total, 24375512 used,   304672 free,   136520 buffers
KiB Swap: 37731324 total,   340872 used, 37390452 free.  7446028 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                     
27824 oneadmin  20   0 21.304g 0.015t   6524 S  38.2 65.1  12:05.93 qemu-system-x86

Here you can see it's happily allocated its memory (0.015t), has booted up fine (I can RDP), and is just chugging along idle (for the most part).

What is preventing KVM in the first VM shown above from correctly being able to allocate resources?

Scanning system logs, libvirt/qemu logs, etc. I find no mention of any issues.

Update:

After checking kernel logs I came across this:

May 24 05:00:27 ubuntu kernel: [249051.039660] perf samples too long (5103 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
May 24 05:00:33 ubuntu kernel: [249057.722532] ------------[ cut here ]------------
May 24 05:00:33 ubuntu kernel: [249057.722562] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
May 24 05:00:33 ubuntu kernel: [249057.722587] invalid opcode: 0000 [#1] SMP 
May 24 05:00:33 ubuntu kernel: [249057.722605] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables ipt_REJECT xt_LOG xt_limit xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter xt_conntrack nf_conntrack ip_tables x_tables ast ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt gpio_ich intel_powerclamp coretemp kvm_intel dcdbas kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw lpc_ich joydev i7core_edac edac_core ipmi_si mac_hid lp parport hid_generic usbhid hid igb i2c_algo_bit dca ptp ahci libahci pps_core
May 24 05:00:33 ubuntu kernel: [249057.722903] CPU: 11 PID: 3601 Comm: qemu-system-x86 Not tainted 3.13.0-24-generic #47-Ubuntu
May 24 05:00:33 ubuntu kernel: [249057.722934] Hardware name: Dell                   CS24-TY               /S99                   , BIOS DS993B15 09/20/2010
May 24 05:00:33 ubuntu kernel: [249057.722972] task: ffff88062d8dc7d0 ti: ffff88062b4f2000 task.ti: ffff88062b4f2000
May 24 05:00:33 ubuntu kernel: [249057.723000] RIP: 0010:[<ffffffff81179051>]  [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
May 24 05:00:33 ubuntu kernel: [249057.723038] RSP: 0018:ffff88062b4f3a08  EFLAGS: 00010246
May 24 05:00:33 ubuntu kernel: [249057.723058] RAX: 0000000000000100 RBX: 00007f3864300000 RCX: ffff88062b4f3788
May 24 05:00:33 ubuntu kernel: [249057.723083] RDX: ffff88062d8dc7d0 RSI: 0000000000000000 RDI: 8000000340c009e6
May 24 05:00:33 ubuntu kernel: [249057.723108] RBP: ffff88062b4f3a90 R08: 0000000000000000 R09: 0000000000000019
May 24 05:00:33 ubuntu kernel: [249057.723134] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88032e317908
May 24 05:00:33 ubuntu kernel: [249057.723159] R13: ffff88032c785740 R14: ffff88032c437380 R15: 0000000000000000
May 24 05:00:33 ubuntu kernel: [249057.723185] FS:  00007f3b94ff9700(0000) GS:ffff88063fca0000(0000) knlGS:0000000000000000
May 24 05:00:33 ubuntu kernel: [249057.723213] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 24 05:00:33 ubuntu kernel: [249057.723235] CR2: fffff98001b88400 CR3: 0000000315fad000 CR4: 00000000000027e0
May 24 05:00:33 ubuntu kernel: [249057.723260] Stack:
May 24 05:00:33 ubuntu kernel: [249057.723269]  ffffffff8171da4a 0000000000000010 0000000000000202 ffff88062b4f3a38
May 24 05:00:33 ubuntu kernel: [249057.723303]  0000000000000018 ffffea00124a9900 ffff88062b4f3a90 ffffffff81176fc8
May 24 05:00:33 ubuntu kernel: [249057.723336]  ffff88032e3179b8 ffff880629b41a08 ffff880300000019 8000000400000001
May 24 05:00:33 ubuntu kernel: [249057.723368] Call Trace:
May 24 05:00:33 ubuntu kernel: [249057.723385]  [<ffffffff8171da4a>] ? _raw_spin_lock+0x3a/0x50
May 24 05:00:33 ubuntu kernel: [249057.723407]  [<ffffffff81176fc8>] ? follow_page_mask+0xf8/0x5b0
May 24 05:00:33 ubuntu kernel: [249057.723430]  [<ffffffff81179266>] __get_user_pages+0x166/0x5e0
May 24 05:00:33 ubuntu kernel: [249057.723473]  [<ffffffffa0297709>] __gfn_to_pfn_memslot+0x169/0x3c0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723504]  [<ffffffffa02979bc>] __gfn_to_pfn+0x5c/0x60 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723531]  [<ffffffffa02979fa>] gfn_to_pfn_async+0x1a/0x20 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723564]  [<ffffffffa02b9f8f>] try_async_pf+0x3f/0x1b0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723595]  [<ffffffffa02b4f6e>] ? mapping_level.isra.92+0x7e/0xa0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723628]  [<ffffffffa02ba416>] tdp_page_fault+0x106/0x1f0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723660]  [<ffffffffa02b5634>] kvm_mmu_page_fault+0x24/0x100 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723688]  [<ffffffffa0156634>] handle_ept_violation+0x94/0x160 [kvm_intel]
May 24 05:00:33 ubuntu kernel: [249057.723716]  [<ffffffffa015bb05>] vmx_handle_exit+0xb5/0x890 [kvm_intel]
May 24 05:00:33 ubuntu kernel: [249057.723743]  [<ffffffff8109d91e>] ? __vtime_account_system+0x2e/0x40
May 24 05:00:33 ubuntu kernel: [249057.723775]  [<ffffffffa02a9c05>] vcpu_enter_guest+0x795/0xcb0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723806]  [<ffffffffa029fde4>] ? kvm_check_async_pf_completion+0x14/0x110 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723842]  [<ffffffffa02ae108>] kvm_arch_vcpu_ioctl_run+0x1e8/0x460 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723872]  [<ffffffffa0298062>] kvm_vcpu_ioctl+0x2c2/0x5b0 [kvm]
May 24 05:00:33 ubuntu kernel: [249057.723898]  [<ffffffff810d9329>] ? futex_wake+0x1a9/0x1d0
May 24 05:00:33 ubuntu kernel: [249057.724805]  [<ffffffff811cc6e0>] do_vfs_ioctl+0x2e0/0x4c0
May 24 05:00:33 ubuntu kernel: [249057.725701]  [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
May 24 05:00:33 ubuntu kernel: [249057.726603]  [<ffffffff811cc941>] SyS_ioctl+0x81/0xa0
May 24 05:00:33 ubuntu kernel: [249057.727496]  [<ffffffff817266bf>] tracesys+0xe1/0xe6
May 24 05:00:33 ubuntu kernel: [249057.728375] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7 
May 24 05:00:33 ubuntu kernel: [249057.730245] RIP  [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
May 24 05:00:33 ubuntu kernel: [249057.731128]  RSP <ffff88062b4f3a08>
May 24 05:00:33 ubuntu kernel: [249057.745224] ---[ end trace 9b34ed0875c40df4 ]---

I'm wondering if there is an issue with this kernel (3.13.0-24-generic). I'm also getting odd system behaviour on the host that is having issues: if I try to run ps it hangs at a certain point and I have to kill my SSH session. I also can't reboot the system; presumably it hangs during the reboot process as well. That kernel issue is clearly breaking the rest of the system.
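Since nothing showed up in the libvirt/qemu logs and the problem only became visible in the kernel log, a small scan for "kernel BUG at" lines can help check the other affected hosts. The following is a rough sketch only: the names `KernelBug` and `find_kernel_bugs` are hypothetical, and the sample line is the one from the log above. How the log lines are obtained (for example from dmesg output or a kernel log file) is left to the caller.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class KernelBug(NamedTuple):
    source_file: str  # kernel source path reported in the BUG line
    line: int         # line number within that source file


# Matches messages of the form "kernel BUG at <path>:<line>!"
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")


def find_kernel_bugs(lines: Iterable[str]) -> Iterator[KernelBug]:
    """Yield one KernelBug for every log line that reports a kernel BUG."""
    for text in lines:
        match = BUG_RE.search(text)
        if match:
            yield KernelBug(match.group(1), int(match.group(2)))


if __name__ == "__main__":
    # Sample line taken from the kernel log in the question.
    sample = [
        "May 24 05:00:33 ubuntu kernel: [249057.722562] kernel BUG at "
        "/build/buildd/linux-3.13.0/mm/memory.c:3756!",
    ]
    for bug in find_kernel_bugs(sample):
        print(f"{bug.source_file}:{bug.line}")
```

Passing an open file object or the split lines of captured dmesg output to `find_kernel_bugs` works the same way, since it only needs an iterable of strings.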

Best Answer

Yes, there is an issue with that kernel. We had the same problem running multiple Java instances on a dual E5-2687.

We never ran out of memory, but after a few hours of load the machine would crash if we tried to ssh in or run ps. dmesg showed the same error you have:

kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!

We installed the latest testing kernel and have been stable since (3.15.0-031500rc2-generic #201404201435). The kernel and headers can be downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D and installed with dpkg.

We are currently testing; I'll post if there's a crash with this kernel.

UPDATE: We ran 3.15.0-031500rc2-generic for AMD64 and, although it rarely happens, we have had this same error occur. Granted, it was after running a dual Xeon E5-2687W at 100% utilization on all cores for about 3 days, but it still happened and should not have (Ubuntu 12 didn't, same Java app).
