VMware automatically rebooted all hosts

vmware-esxi

Yesterday our entire infrastructure crashed because all our ESXi hosts thought it would be an amazing idea to run updates at the same time. Edit: Or at least that's what we think happened, but nobody is really sure.

Normally we never update the ESXi hosts unless we have problems with them or are informed of something that must be fixed.

Some information:

3x IBM x3650 M4 (7915D3G) configured in HA master/slave,
ESXi version 5.5.0 (Build 1331820), IMM v. 3.73

We're pretty baffled by the situation. Our support provider gave the cause above and attached log files containing lines such as these (the file is huge, so I'll stick to the critical part):

2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Starting next WaitForUpdates() call to hostd
2014-11-04T10:58:48.364Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e04c5e84] [WaitForUpdatesDone] Completed callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [WaitForUpdatesDone] Received callback
2014-11-04T10:58:48.406Z [488A1B70 verbose 'VpxaHalCnxHostagent' opID=WFU-e4a7ca00] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 3526 to 3527 (at 3526)
2014-11-04T10:58:48.406Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.extraConfig["vmware.tools.internalversion"].value'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Config changed 'config.tools.toolsVersion'
2014-11-04T10:58:48.407Z [488A1B70 verbose 'hostdvm' opID=WFU-e4a7ca00] [VpxaHalVmHostagent] 26: Runtime changed 'guest.toolsVersion'

Nobody in our department has touched these servers at this level; we normally only manage the VMs, not the hosts. How can this happen on its own?

All three servers crashed at the same time, at 10:50 am, without anyone doing anything in particular. Our support team has been unable to find any log files indicating any kind of issue, which is very weird.
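When a log file is too large to read by hand, one rough approach is to cut it down to the minutes around the reported crash and read only that slice. The sketch below assumes the timestamp layout shown in the excerpt above (e.g. `2014-11-04T10:58:48.364Z`); the function name `lines_near` and the log path in the usage note are hypothetical. Note that the trailing `Z` means the log is in UTC, while "10:50 am" is presumably local time, so the crash time has to be converted before comparing.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

# Timestamp layout seen in the pasted log lines, e.g. 2014-11-04T10:58:48.364Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def lines_near(lines: Iterable[str], center: datetime, minutes: int = 15) -> List[str]:
    """Return log lines whose leading timestamp is within +/- `minutes` of `center`.

    Lines that do not start with a parseable timestamp are skipped.
    `center` is assumed to be in the same timezone as the log (UTC for a trailing 'Z').
    """
    window = timedelta(minutes=minutes)
    matches = []
    for line in lines:
        stamp = line.split(" ", 1)[0]  # first whitespace-separated token
        try:
            when = datetime.strptime(stamp, TS_FORMAT)
        except ValueError:
            continue  # no timestamp on this line
        if abs(when - center) <= window:
            matches.append(line)
    return matches
```

For example, `lines_near(open("vpxa.log"), datetime(2014, 11, 4, 10, 50))` would return the lines logged between 10:35 and 11:05 UTC; the file name is an assumption, so point it at whichever host log your support provider sent.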

Best Answer

VMware host servers do not automatically update without a deliberate action triggered from vCenter via Update Manager. Please provide the answers to:

  • What specific build numbers of ESXi do you have?
  • What time did the systems reboot?
  • What is shown in the Events log inside of vCenter for the affected hosts? It should be very clear what happened.
  • What do the IBM out-of-band management tools/logs say?

Based on the information I see here, your servers most likely crashed, and the IBM management appears to have rebooted the systems automatically.

You need to run updates. You're likely triggering a bug with the virtual NIC adapter in your Windows guests. It should be vmxnet3 instead of Intel e1000/e1000e. Build 1331820 of ESXi is ancient and full of problems. When running vSphere in a cluster, it's extremely important to keep things updated.
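To find which guests still use the adapters the answer warns about, you need a list of each VM's virtual NIC types, which you can read from vCenter. The sketch below assumes that list has already been exported into a plain mapping of VM name to adapter type names. The mapping format and the function name `vms_needing_vmxnet3` are hypothetical; the sketch only shows the check, not how to collect the data.

```python
from typing import Dict, List

# Adapter types the answer recommends replacing with vmxnet3
LEGACY_ADAPTERS = {"e1000", "e1000e"}


def vms_needing_vmxnet3(inventory: Dict[str, List[str]]) -> List[str]:
    """Return, sorted by name, the VMs that have at least one e1000/e1000e NIC.

    `inventory` is a hypothetical mapping of VM name -> list of NIC adapter
    type names (compared case-insensitively).
    """
    flagged = []
    for vm, adapters in sorted(inventory.items()):
        if any(a.lower() in LEGACY_ADAPTERS for a in adapters):
            flagged.append(vm)
    return flagged
```

Since the answer points at Windows guests specifically, you could narrow the input to those VMs before running the check.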

See:

Why is VMware ESXi 5.5 crashing?

VMware lockup CPU spike