Intermittent host and VM connectivity on ESXi standalone

networking vmware-esxi

I have a standalone ESXi 5.5.0 host (build 2143827) running on a Dell R710 with 144 GB of RAM. It hosts approximately 20 VMs.

Right now, I cannot reach the host via the VMware vSphere Client or SSH; it behaves as if the server does not exist. The host comes back at seemingly random times, and I can then connect via SSH and the vSphere Client, but it drops off the network again at an unpredictable time. I can still access it through the emergency console on the physical host itself (Alt+F1).

However, all the VMs are active and working. About 10 times a day, though, all the VMs drop off the network for between 15 seconds and 5 minutes. Then they come back and everything carries on as normal.

I have done the following:

  • It was on a previous build; I updated it to build 2143827. This made no difference.
  • Ran /sbin/services.sh restart. This did not help.
  • Restarted the physical host. This made no difference.
  • From the physical console (Alt+F1) I have pinged another physical device on the network. It does not drop any packets at all.
  • From the physical console, I have pinged a virtual machine on the host. It suffers approximately 80% packet loss.
  • From a remote machine, I can ping the management IP address with 0% packet loss.
  • From a remote machine, I can ping a VM on the host and can clearly see it go off and back on the network occasionally.
  • I watched tail -f /var/log/hostd.log for a while and saw nothing untoward happening there.
  • The system is installed on an SD card. I shut the server down, cloned the card to another card with dd, then booted from the new card. Same issue.
  • Tried a different network switch. Same issue.
  • Ran the Dell Update Manager and updated all firmware to the latest version.
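To pin down when the drops happen and how long they last, the ping tests above could be logged and reduced to outage windows that can then be compared against the host logs. The following is only a rough sketch: `probe`, `collect` and `outage_windows` are hypothetical helper names, and the `ping -c 1 -W 1` flags assume a Linux machine doing the probing, not the ESXi host itself.

```python
import subprocess
import time
from datetime import datetime


def probe(host):
    """Send one echo request; True if a reply arrived.

    Assumes Linux iputils ping: -c 1 = one packet, -W 1 = 1 s timeout.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def collect(host, count, interval=1.0):
    """Probe `host` `count` times and return [(timestamp, ok), ...]."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), probe(host)))
        time.sleep(interval)
    return samples


def outage_windows(samples):
    """Group consecutive failed probes into (start, end, duration) tuples.

    `samples` must be in chronological order. A window runs from the first
    failed probe to the next successful one. If the data ends while the
    target is still down, end and duration are None.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp  # first failure opens a window
        elif ok and start is not None:
            windows.append((start, timestamp, timestamp - start))
            start = None  # first success closes it
    if start is not None:
        windows.append((start, None, None))
    return windows
```

Running `outage_windows(collect(vm_ip, 3600))` against a VM address and against the management address at the same time would show whether the two drop together, and the start times give something concrete to search for in the logs.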

I'm at a loss as to where to go from here. This server has operated flawlessly for the past 2.5 years. VMware was originally installed on a physical drive, but 6 months ago it was moved onto the SD card so we could reconfigure the physical drives.

Best Answer

I'd suggest updating the firmware of the Broadcom NICs on your Dell PowerEdge server. The fact that you're seeing external connectivity problems in addition to failed VM-specific pings points to a NIC issue.

  • Can you try another NIC device? (this host has four)
  • How many uplinks do you have from the Standard vSwitch? (you should have multiple live uplinks)
  • How reproducible is the issue?
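For the uplink question, the output of `esxcli network vswitch standard list` run on the host can be summarised per vSwitch. This is a minimal sketch that assumes the output uses indented `Name:` and `Uplinks:` fields with comma-separated vmnic names; `uplinks_per_vswitch` is a hypothetical helper, so check it against what your host actually prints.

```python
def uplinks_per_vswitch(text):
    """Map each vSwitch name to its list of uplink NICs.

    Assumes `text` is the output of `esxcli network vswitch standard list`,
    with one "Name: <vswitch>" and one "Uplinks: a, b" line per vSwitch.
    Lines without a colon (such as the unindented header) are skipped.
    """
    result = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "Name":
            current = value
            result[current] = []
        elif key == "Uplinks" and current is not None:
            # An empty value means the vSwitch has no physical uplink.
            result[current] = [u.strip() for u in value.split(",") if u.strip()]
    return result
```

Any vSwitch that comes back with a single uplink has no failover path, so a misbehaving NIC or port on that uplink would take every VM on it off the network at once.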

Regarding the SDHC boot, I really only advocate the use of SD/USB boot on ESXi servers that are members of a vSphere cluster and have shared storage. Due to the failure mode of those cards under ESXi, there's no advantage to using them to boot standalone systems. See the differences between ESXi's installable and embedded modes.