Linux – How to investigate an unresponsive KVM guest

kvm-virtualization, linux, Ubuntu

What steps can I take to investigate a KVM guest that freezes about once every two weeks? By "freezes", I mean there is no response when I try to connect with "ssh" or "virsh console". The host is Ubuntu (natty, 11.04), using libvirt to manage its guests, and the guest is Ubuntu (natty, 11.04), both server editions with no window manager installed.

If I force the guest to reset, it works fine for another week. There are no recent or relevant messages in the guest syslog (to indicate a kernel panic, etc.). For all I know, it could be that the virtual network and tty are breaking and stopping me from talking to the guest. The host runs three other, nearly identical guests that have been stable all year. If the guest itself is crashing, shouldn't there be some indication in syslog?

The disk is an LVM logical volume configured with virtio:

% cat /etc/libvirt/qemu/vm-et.xml

    <domain type='kvm'>
      <name>vm-et</name>
      <uuid>8df572f1-e1dc-275a-4b9f-b7c322e2f5d3</uuid>
      <memory>2048576</memory>
      <currentMemory>2048576</currentMemory>
      <vcpu>1</vcpu>
      <os>
        <type arch='x86_64' machine='pc-0.12'>hvm</type>
        <boot dev='hd'/>
      </os>
      <features>
        <acpi/>
      </features>
      <clock offset='utc'/>
      <on_poweroff>destroy</on_poweroff>
      <on_reboot>restart</on_reboot>
      <on_crash>destroy</on_crash>
      <devices>
        <emulator>/usr/bin/kvm</emulator>
        <!--<disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/usr/scratch/appliances/vm-et/ubuntu-kvm/tmpzwV0x3.qcow2'/>
          <target dev='hda' bus='ide'/>
          <address type='drive' controller='0' bus='0' unit='0'/>
        </disk>-->
        <controller type='ide' index='0'>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
        </controller>
        <interface type='bridge'>
          <mac address='52:54:00:5a:1f:b4'/>
          <source bridge='br0'/>
          <model type='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
        </interface>
        <input type='mouse' bus='ps2'/>
        <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'/>
        <video>
          <model type='cirrus' vram='9216' heads='1'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
        </video>
        <memballoon model='virtio'>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
        </memballoon>
        <disk type='file' device='disk'>
          <source file='/dev/vg1/lv-et'/>
          <target dev='vda' bus='virtio'/>
        </disk>

        <serial type="pty">
          <source path="/dev/pts/3"/>
          <target port="1"/>
        </serial>

      </devices>
    </domain>

Best Answer

Investigating these kinds of problems is difficult because you need to isolate individual parts of the setup and test each one, which is hard on such a complex setup, especially when reproducing the failure takes two weeks.

The first thing to try is to configure syslog in the guest to send its logs over the network to a remote syslog service (possibly the one running on the host; you would need to enable remote reception on that syslog server). This lets you catch errors that never made it into the guest's own log because of a lack of free disk space or sync issues.
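As a rough illustration of the remote logging setup, here is a minimal sketch assuming both guest and host run rsyslog with its legacy directive syntax and read drop-in files from `/etc/rsyslog.d/`. The address `192.0.2.1`, the `LOG_HOST` and `OUT_DIR` variables, and the two file names are placeholders chosen for this example, not values from the original setup. The script only writes the files to a directory for review; it does not install them.

```shell
#!/bin/sh
# Sketch: write two rsyslog drop-in files into OUT_DIR for review.
# LOG_HOST, OUT_DIR and both file names are assumptions for illustration.
LOG_HOST="${LOG_HOST:-192.0.2.1}"   # hypothetical address of the host as seen from the guest
OUT_DIR="${OUT_DIR:-.}"

# Guest side: forward every facility and priority to the host over UDP
# (a single @ means UDP in rsyslog; @@ would mean TCP).
printf '*.* @%s:514\n' "$LOG_HOST" > "$OUT_DIR/60-forward-to-host.conf"

# Host side: load the UDP input module and listen on port 514.
cat > "$OUT_DIR/40-udp-receive.conf" <<'EOF'
$ModLoad imudp
$UDPServerRun 514
EOF
```

After checking the contents, the first file would be copied to `/etc/rsyslog.d/` on the guest and the second to the same directory on the host, followed by a restart of rsyslog on both. UDP is used here because it keeps working with the least setup, at the cost of possibly losing the last few messages before a freeze.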

If that doesn't give any useful information, you can try hooking into the guest serial console (see here for details) and logging everything that appears there to a file on the host.
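One way to capture the serial console on the host is sketched below. It assumes the guest's serial device is a pty, as in the domain XML in the question, and that the host user may read it. The function name `log_console` and the log path in the usage note are made up for this example. It only captures kernel messages if the guest kernel is booted with a `console=ttyS…` parameter that points at that serial port, which the question does not confirm.

```shell
#!/bin/sh
# Sketch: copy everything read from a console device into a log file,
# prefixing each line with a UTC timestamp.
# log_console is a hypothetical helper, not part of libvirt.
#   $1 = device or file to read from (for example the guest's pty)
#   $2 = log file to append to
log_console() {
  # Serial output usually ends lines with CR LF, so drop the CR first.
  tr -d '\r' < "$1" | while IFS= read -r line; do
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
  done >> "$2"
}
```

On the host, it could be started in the background with something like `log_console "$(virsh ttyconsole vm-et)" /var/log/vm-et-console.log &`, where `virsh ttyconsole` prints the pty that libvirt allocated for the guest. The timestamps make it easier to line up the last console output with the moment the guest stopped responding.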