QEMU+libvirt+KVM: significant performance lag in VMs

kvm-virtualization, libvirt, performance, qemu, virtual-machines

I am troubleshooting poor performance on our new VM host, a Dell PowerEdge R415 with hardware RAID.

We've got about 20 VMs running like this:

qemu-system-x86_64 \
  -enable-kvm \
  -name rc.stb.mezzanine -S \
  -machine pc-0.12,accel=kvm,usb=off \
  -m 2048 \
  -realtime mlock=off \
  -smp 2,sockets=2,cores=1,threads=1 \
  -uuid 493d519c-8bb5-2cf8-c037-1094a3c48a7a \
  -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rc.stb.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc \
  -no-shutdown \
  -boot strict=on \
  -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
  -drive file=/dev/vg-raid/lv-rc,if=none,id=drive-virtio-disk0,format=raw \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -drive file=/var/lib/libvirt/images/ubuntu-14.04-server-amd64.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw \
  -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=2 \
  -netdev tap,fd=32,id=hostnet0 \
  -device e1000,netdev=hostnet0,id=net0,mac=52:54:df:26:15:e9,bus=pci.0,addr=0x3 \
  -chardev pty,id=charserial0 \
  -device isa-serial,chardev=charserial0,id=serial0 \
  -vnc 127.0.0.1:1 \
  -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
  -device ES1370,id=sound0,bus=pci.0,addr=0x4 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

The VM host uses a completely stock Ubuntu 14.04 LTS setup with a bridge interface in front of libvirt+qemu.
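
For reference, the bridge is set up the standard Ubuntu way in /etc/network/interfaces, roughly like the sketch below; the interface names and addresses are placeholders rather than the exact values on this host:

# /etc/network/interfaces (illustrative values, not the real ones)
auto em1
iface em1 inet manual

auto br0
iface br0 inet static
    address 192.168.200.1
    netmask 255.255.255.0
    gateway 192.168.200.254
    bridge_ports em1
    bridge_stp off
    bridge_fd 0

The libvirt domains then attach their tap devices to br0.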

The problem is that the VMs experience sporadic "lag", which users perceive as serious performance issues. Sometimes it's hardly noticeable, yet at other times a single VM can struggle to respond to ping in time, as shown below (tested from the VM host itself to eliminate network-related issues).

root@vm-host-1:~# ping rc
PING rc (192.168.200.7) 56(84) bytes of data.
64 bytes from rc (192.168.200.7): icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from rc (192.168.200.7): icmp_seq=2 ttl=64 time=0.214 ms
64 bytes from rc (192.168.200.7): icmp_seq=3 ttl=64 time=0.241 ms
64 bytes from rc (192.168.200.7): icmp_seq=4 ttl=64 time=0.276 ms
64 bytes from rc (192.168.200.7): icmp_seq=5 ttl=64 time=0.249 ms
64 bytes from rc (192.168.200.7): icmp_seq=6 ttl=64 time=0.228 ms
64 bytes from rc (192.168.200.7): icmp_seq=7 ttl=64 time=0.198 ms
64 bytes from rc (192.168.200.7): icmp_seq=8 ttl=64 time=3207 ms
64 bytes from rc (192.168.200.7): icmp_seq=9 ttl=64 time=2207 ms
64 bytes from rc (192.168.200.7): icmp_seq=10 ttl=64 time=1203 ms
64 bytes from rc (192.168.200.7): icmp_seq=11 ttl=64 time=203 ms
64 bytes from rc (192.168.200.7): icmp_seq=12 ttl=64 time=0.240 ms
64 bytes from rc (192.168.200.7): icmp_seq=13 ttl=64 time=0.271 ms
64 bytes from rc (192.168.200.7): icmp_seq=14 ttl=64 time=0.279 ms
^C
--- rc.mezzanine ping statistics ---
14 packets transmitted, 14 received, 0% packet loss, time 13007ms
rtt min/avg/max/mdev = 0.198/487.488/3207.376/975.558 ms, pipe 4

Local commands run on the respective VMs are also occasionally slow: running 'ls' usually takes a split second, but occasionally it takes a second or two. Clearly something is unhealthy.
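
A rough way I've been trying to catch these stalls inside a guest (nothing here is specific to this setup) is to watch for CPU steal and to time a trivial command in a loop:

# inside a guest: the 'st' column shows CPU time stolen by the hypervisor
vmstat 1

# time a trivial command repeatedly to catch the sporadic multi-second stalls
while true; do /usr/bin/time -f "%e s" ls / > /dev/null; sleep 1; done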

I've been chasing this problem for several days. The VM host's disks are healthy, and memory usage isn't bad, as can be seen from:

virt-top 10:24:10 - x86_64 12/12CPU 3000MHz 64401MB
20 domains, 20 active, 20 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
CPU: 0.0%  Mem: 28800 MB (28800 MB by guests)

And

root@vm-host-1:~# free -m
             total       used       free     shared    buffers     cached
Mem:         64401      57458       6943          0      32229        338
-/+ buffers/cache:      24889      39511
Swap:         7628        276       7352

This server is perfectly capable of running this number of VMs, especially considering that they are not high-load VMs; they are mostly idling.

What's the procedure for troubleshooting this? I suspect that one of the VMs is misbehaving and causing ripple effects that hit all the VMs, but I have yet to determine which one it is, given how sporadically the problem occurs.

The VM host hardware itself is perfectly healthy and shows no issues when running without VMs, so I suspect a QEMU/libvirt/KVM issue, perhaps triggered by a misbehaving VM.
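
My plan for catching the culprit is to sample per-qemu-process CPU from the host during a lag spike, along these lines (pidstat comes from the sysstat package; the patterns are only illustrative):

# map qemu PIDs to VM names
for pid in $(pgrep -f qemu-system-x86_64); do
    printf "%s %s\n" "$pid" "$(tr '\0' ' ' < /proc/$pid/cmdline | grep -o -- '-name [^ ]*')"
done

# sample CPU usage of every qemu process once per second
pidstat -p $(pgrep -d, -f qemu-system-x86_64) 1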

Slabtop output:

 Active / Total Objects (% used)    : 17042162 / 17322929 (98.4%)
 Active / Total Slabs (% used)      : 365122 / 365122 (100.0%)
 Active / Total Caches (% used)     : 69 / 125 (55.2%)
 Active / Total Size (% used)       : 1616671.38K / 1677331.17K (96.4%)
 Minimum / Average / Maximum Object : 0.01K / 0.10K / 14.94K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
9326928 9219652  98%    0.10K 239152       39    956608K buffer_head
6821888 6821559  99%    0.06K 106592       64    426368K kmalloc-64
465375 369738  79%    0.05K   5475       85     21900K shared_policy_node
260689 230992  88%    0.55K   4577       57    146464K radix_tree_node
 65280  63353  97%    0.03K    510      128      2040K kmalloc-32
 46494  45631  98%    0.19K   1107       42      8856K dentry
 45936  23865  51%    0.96K   1392       33     44544K ext4_inode_cache
 44136  43827  99%    0.11K   1226       36      4904K sysfs_dir_cache
 29148  21588  74%    0.19K    694       42      5552K kmalloc-192
 22784  18513  81%    0.02K     89      256       356K kmalloc-16
 19890  19447  97%    0.04K    195      102       780K Acpi-Namespace
 19712  19320  98%    0.50K    616       32      9856K kmalloc-512
 16744  15338  91%    0.57K    299       56      9568K inode_cache
 16320  16015  98%    0.04K    160      102       640K ext4_extent_status
 15216  14690  96%    0.16K    317       48      2536K kvm_mmu_page_header
 12288  12288 100%    0.01K     24      512        96K kmalloc-8
 12160  11114  91%    0.06K    190       64       760K anon_vma
  7776   5722  73%    0.25K    244       32      1952K kmalloc-256
  7056   7056 100%    0.09K    168       42       672K kmalloc-96
  5920   4759  80%    0.12K    185       32       740K kmalloc-128
  5050   4757  94%    0.63K    101       50      3232K proc_inode_cache
  4940   4046  81%    0.30K     95       52      1520K nf_conntrack_ffffffff81cd9b00
  3852   3780  98%    0.11K    107       36       428K jbd2_journal_head
  3744   2911  77%    2.00K    234       16      7488K kmalloc-2048
  3696   3696 100%    0.07K     66       56       264K ext4_io_end
  3296   2975  90%    1.00K    103       32      3296K kmalloc-1024

Best Answer

We ran into the same issue.

It is caused by a kernel bug in Ubuntu 14.04.

Solution #1: Update the kernel to 3.13.0-33.58 (or newer)

Solution #2: Disable KSM
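
In case it helps, this is roughly what both fixes look like on a 14.04 host; the /etc/default/qemu-kvm part assumes the stock Ubuntu qemu-kvm packaging, which is what turns KSM on at boot there, so check that the file exists on your install:

# Solution 1: pull in the fixed kernel (3.13.0-33 or newer), then reboot
apt-get update && apt-get install linux-image-generic
reboot

# Solution 2: stop ksmd right now and unmerge already-shared pages
echo 2 > /sys/kernel/mm/ksm/run

# keep KSM off across reboots (assumes Ubuntu's /etc/default/qemu-kvm)
sed -i 's/^KSM_ENABLED=.*/KSM_ENABLED=0/' /etc/default/qemu-kvm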