Sql-server – KVM: top on the host shows high CPU load for a Windows 7 guest although Windows is idle

kvm-virtualization, qemu, sql server, virtual-machines, windows 7

We have a virtualization environment that currently runs 4 VMs (2 x Linux, 1 x W2K3, 1 x Win7).
On the host system (Debian Jessie), top always shows a CPU load of 30-70% (or more) for the qemu process of the Win7 guest, even though Task Manager inside the guest shows zero CPU load.

top - 11:12:08 up 6 days,  1:47,  1 user,  load average: 0,70, 0,62, 0,55
Tasks: 216 total,   2 running, 214 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5,0 us,  3,7 sy,  0,0 ni, 91,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem:  24776900 total, 21591188 used,  3185712 free,   122680 buffers
KiB Swap:  3905532 total,    60748 used,  3844784 free.   399364 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                              
11138 libvirt+  20   0 10,804g 8,243g  18536 R  70,1 34,9   2137:30 qemu-system-x86                      
12134 libvirt+  20   0 7309216 6,046g  18792 S   3,7 25,6 139:13.88 qemu-system-x86                      
12055 libvirt+  20   0 8900940 4,057g  18500 S   2,3 17,2 109:41.87 qemu-system-x86                      
12041 libvirt+  20   0 2956240 1,388g  18292 S   2,0  5,9  61:38.55 qemu-system-x86                      
 5569 root      20   0 1007924  23456  11012 S   1,0  0,1   1:16.86 libvirtd

(Screenshot: Task Manager performance view inside the guest)

Inside the guest, an MSSQL 2008 R2 Express instance is running. Trace flag -T8038 is set for it (following the Proxmox performance tweaks). The tablet device has also been removed from the configuration, and the ballooning device is disabled inside the guest (as I don't know how to disable it in the VM configuration).
The guest also runs a Pervasive SQL 8 server to serve an old Btrieve database.
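
On disabling ballooning in the VM configuration rather than inside the guest: libvirt's domain XML accepts a `none` model for the balloon device. The fragment below is a minimal sketch, assuming it replaces the existing `<memballoon model='virtio'>` element (for example via `virsh edit win7`) and that the libvirt version on the host supports it; whether this affects the CPU load described here is untested.

```xml
<!-- Sketch: replaces the <memballoon model='virtio'> element in <devices>. -->
<!-- With model='none' libvirt does not add a balloon device to the guest. -->
<memballoon model='none'/>
```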

The strange thing is that the CPU load in top drops to an adequate level (1-3%) if I completely remove all NICs from the guest. Currently the guest's NIC is one of the physical NICs (an Intel I350), passed through to it. But the behaviour is the same with virtualized NICs.
All of this was tested without any clients connected.
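
For comparison with the passed-through I350, a virtualized NIC would typically be defined along these lines. This is only an illustrative sketch: the bridge name `br0` is a hypothetical placeholder, not taken from the setup above, and the `virtio` model assumes the virtio-win network driver is installed in the guest.

```xml
<!-- Sketch: a virtualized NIC in place of the <hostdev> passthrough entry. -->
<interface type='bridge'>
  <!-- 'br0' is a hypothetical host bridge name; substitute the real one. -->
  <source bridge='br0'/>
  <!-- 'virtio' needs a guest driver; 'e1000' is an emulated alternative. -->
  <model type='virtio'/>
</interface>
```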

Current guest configuration:

<domain type='kvm'>
  <name>win7</name>
  <uuid>4b62c825-07ce-49b9-be8c-63f1f51ec28c</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.1'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>
  </features>
  <cpu mode='host-model'>
    <model fallback='allow'/>
    <topology sockets='1' cores='2' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/vg_vm/lv_win7Pro'/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdb' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes'/>
    <video>
      <model type='qxl' ram='65536' vram='65536' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x07' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>

Any tips on what could cause this and how to improve it?

Best Answer

I had a similar problem in the past, with an IRQ storm in the guest and high load on the host. You must isolate what, in the guest, is hammering the CPU. The prime candidates are the MSSQL instance and the hal.dll library.

To debug, follow these steps:

  • Stop your MSSQL instance. Does the host load decrease? If so, you have found the culprit. The point is that MSSQL uses a high timer frequency (1 ms), even when idle. On bare metal this is not an issue (the system will simply use a few more watts), but on a virtualized system it can be a concern. If possible, you should identify which timer source Windows is using and try to switch between the available ones. As a workaround, a patch exists to raise the clock timer interrupt interval to 12 ms. For more information, see here and here.
  • If the first step brings no benefit, the problem may be HAL-related. I see you are using two vCPUs; try starting the VM with a single vCPU. Does that change anything? If not, take a screenshot of Windows's hardware tab (expanding the HAL node) and report back here.
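
For the single-vCPU test in the second step, the relevant parts of the domain definition could look like the sketch below. It assumes the rest of the posted configuration stays unchanged and that the definition is edited while the VM is shut down (for example with `virsh edit win7`).

```xml
<!-- Sketch: one vCPU instead of two, for the HAL test only. -->
<vcpu placement='static'>1</vcpu>
<cpu mode='host-model'>
  <model fallback='allow'/>
  <!-- The topology has to match the vCPU count: 1 socket x 1 core x 1 thread. -->
  <topology sockets='1' cores='1' threads='1'/>
</cpu>
```

Restoring the original values (`2` vCPUs and `cores='2'`) reverts the test.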

EDIT: OK, it seems that neither MSSQL nor the HAL is the root cause of your host load. Move on to the second debug phase:

  • Stop your virtual machine and remove all USB devices from its definition. Restart the machine and check the host load: has it changed?
  • If not, please use the powertop utility to monitor the host's CPU activity. There you should see which software routine / interrupt is serviced the most. Run it for 30 seconds and report back here.
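
For the USB step, simply deleting the controller elements may not be enough, since libvirt can add a default USB controller back on its own. A minimal sketch, assuming the four `ich9-ehci1` / `ich9-uhci*` controller elements from the posted configuration are replaced by a single controller with model `none`, and that no other USB devices remain in the definition:

```xml
<!-- Sketch: replaces all four USB controller elements in <devices>. -->
<!-- model='none' tells libvirt not to create any USB controller for the guest. -->
<controller type='usb' index='0' model='none'/>
```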