KVM QEMU Guest VMs randomly lose network connection

bonding, bridge, kvm-virtualization, networking, qemu

I'm setting up a server with KVM/QEMU and all-Linux guests. We're going
to use this server for web development, git, a VoIP PBX, etc.
(We were using XenServer and Windows Server 2016, but I'm a Linux fan.)
I've run into an issue where the virtual machines seem to randomly lose
their network connection, or go to sleep, or something like that.
I can't pin down what the problem is.

I've looked through a lot of forums and posts, even here on Server Fault,
but nothing quite fits what I'm trying to do. I'll attach an image of our
network setup below. We have 2 locations, with a VPN between them and
firewalls at each end. The machine in question is a Dell PowerEdge R710. I've
successfully installed Ubuntu 18.10 with KVM/QEMU on it as the host OS
(18.10 because of an issue with virt-manager not showing all network
connections in 18.04). I use virt-manager over ssh from my laptop
(Dev Computer 1) to install and monitor the VMs.

I have 6 guest VMs, all running either Ubuntu 18.04 or Debian 9
(our VoIP PBX), and they all work great except for the occasional network
hiccup. Everything, including the host itself, is connected through a
bonded bridge on the host machine: the 4 NICs are bonded together, and the
bond is used as the interface for the bridge. I'm using netplan for the
network configuration; the host's config YAML is posted below. The guest
VMs all use static IP configurations that simply set an address on the
default "ens3" interface through netplan, roughly like the sketch below
(I can post the actual files if that will help).
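For illustration only, a guest's netplan config is a minimal static assignment along these lines (the guest address shown is a placeholder, not one of our real ones):

network:
    version: 2
    renderer: networkd
    ethernets:
        ens3:
            dhcp4: false
            # placeholder guest address on the same 192.168.5.0/24 subnet
            addresses: [192.168.5.30/24]
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]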

Some interesting things I've noticed:

  1. I can always ssh into the host machine; it never seems to lose its
     connection.
  2. When one of the 6 VMs loses its network connection, I can still ssh
     into it from the host machine, though it sometimes hangs for a bit
     while the connection is reestablished.
  3. If I ssh into the offending VM from the host and ping the gateway
     (the firewall), it snaps out of it and we can connect to it again.
  4. Occasionally the guest VMs can't see each other, but if I ssh into
     whichever one can't see the other and run a ping, it usually starts
     working after a few "Destination Host Unreachable" messages.

I can provide any other command output or logs needed to diagnose this
further, and I'd really appreciate it if anyone who knows more about this
could take a look. I'm a huge Linux fan and want this to work the way I
know it can, but these random disconnects are not making this solution
look very good. Thanks to anyone who takes the time to read this!

Network Map

Host machine netplan configuration:

network:
    version: 2
    renderer: networkd
    ethernets:
        eno1:
            dhcp4: false
            dhcp6: false
        eno2:
            dhcp4: false
            dhcp6: false
        eno3:
            dhcp4: false
            dhcp6: false
        eno4:
            dhcp4: false
            dhcp6: false
    bonds:
        bond0:
            interfaces:
                - eno1
                - eno2
                - eno3
                - eno4
            addresses: [192.168.5.20/24]
            dhcp4: false
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
    bridges:
        br0:
            addresses: [192.168.5.21/24]
            dhcp4: false
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
            interfaces:
                - bond0

Best Answer

I have an almost identical configuration currently in production: Ubuntu 18.04 + KVM/QEMU on an R710, and I have not experienced this issue.

While it's possible that it comes down to the difference in Ubuntu versions (you're on 18.10) or to an actual hardware problem on your end, the only notable difference I see in your configuration is the bond, which I am not using. My bridge configuration looks like this:

    bridges:
        br0:
            dhcp4: yes
            interfaces:
                - eno1

It only uses eno1 because that's the only interface with a cable running to it. Purely for troubleshooting purposes, it may be worthwhile to try a similar configuration and see whether it resolves the issue.
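Adapted to your static addressing, a stripped-down test config might look roughly like this (no bond, a single cabled NIC, and the address only on the bridge; the NIC name and addresses are copied from your posted config and may need adjusting):

network:
    version: 2
    renderer: networkd
    ethernets:
        # single physical NIC, no addressing of its own
        eno1:
            dhcp4: false
    bridges:
        br0:
            interfaces:
                - eno1
            # all host addressing lives on the bridge
            addresses: [192.168.5.21/24]
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
            dhcp4: false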

If the bond is the issue, the thing that stands out to me as potentially flawed in your configuration is the redundant addressing on both the bond and the bridge. To my understanding, the bond shouldn't carry its own address, gateway, and nameservers once it's enslaved to the bridge; set those on either the bridge or the bond, but not both. A sketch of what that might look like is below.
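For example, keeping your bond but moving all of the addressing onto the bridge might look roughly like this. Note that I'm also setting an explicit bond mode as an assumption on my part, since I don't know how your switch ports are configured: active-backup needs no switch-side support, while 802.3ad would only be appropriate if the switch ports are set up for LACP.

    bonds:
        bond0:
            interfaces: [eno1, eno2, eno3, eno4]
            parameters:
                # assumption: no LACP configured on the switch side;
                # use 802.3ad instead if it is
                mode: active-backup
            dhcp4: false
    bridges:
        br0:
            interfaces:
                - bond0
            # the bridge carries the host's address, gateway, and DNS
            addresses: [192.168.5.21/24]
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
            dhcp4: false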

Lastly, since it appears we are on near-identical hardware, it may be worth running some sort of test on the VM host to confirm that the network card itself is not bad.

Hope this helps!