Should an HA failover occur in this scenario?

high-availability vmware-vsphere

I'm running vSphere 5 in an HA cluster across two hosts (vsphereA and vsphereB). The HA cluster is configured for host monitoring and datastore heartbeat monitoring, with admission control disabled. (I hope I understand correctly that datastore heartbeat monitoring prevents inadvertent, unwanted HA failovers caused by management network isolation.) Each host has a single connection to a dedicated iSCSI network and iSCSI target (no MPIO). All VMDKs for all VMs live on the iSCSI datastore.

As a test of HA, I disconnected the iSCSI connection on vsphereB and was surprised to see that the VMs running on vsphereB kept running there. The powered-off VMs showed as inaccessible, which I expected, since they weren't running and the connection from vsphereB to the iSCSI target was severed. The running VMs, however, continued to run and continued to be "owned" by vsphereB.

I expected an HA failover for those VMs, after which they would be "owned" by vsphereA, but that didn't happen. I'm at a loss to understand why. Am I misunderstanding the cases in which an HA failover should occur?

Best Answer

You seem to be confusing vMotion and HA, which are different features that do different things.

vMotion is a feature which allows virtual machines to be migrated from one physical host to another with no downtime and minimal (milliseconds) disruption in service. It is done in advance of maintenance and requires the VM and both the source and destination hosts to already be in a healthy state. HA is a feature which restarts failed virtual machines (or inaccessible virtual machines if host isolation is configured) and does result in downtime for the VM, since the entire virtual machine is powered off and restarted.

Important take-away: a vMotion is not an HA failover. An HA failover is an HA failover.

vMotions are triggered by the following things:

  1. A user initiates a vMotion
  2. DRS initiates a vMotion in response to load conditions (thresholds set by the DRS aggressiveness setting), affinity rule violations, or host updates triggered through VUM

HA failovers are triggered by the following things:

  1. A host in your HA cluster has detected that another host in the cluster has failed and is not responding to HA heartbeats using either the configured management networks or heartbeat datastores
  2. Isolation response is configured to shut down or power off VMs, and the host can no longer speak to a majority of cluster nodes, triggering a VM shutdown and subsequent HA failure detection from the remaining majority of the cluster (if there is one, which is one of the dangers of isolation response)
  3. The cluster/VM are configured for VM Monitoring through VMware Tools, the hypervisor has not received a heartbeat for a specific amount of time, and no disk or network activity has occurred for 120 seconds

Bottom line: vMotions occur because of performance events, and HA failovers happen because of availability events.
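
To make the three HA triggers concrete, here is a rough sketch of them as a decision function. This is not how vSphere implements HA; every class, field, and function name below is hypothetical, and the model is deliberately simplified (for example, it treats a host as failed only when both the network and datastore heartbeats are silent). It only shows that losing storage alone doesn't match any of the triggers listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical, simplified view of one host as the rest of the cluster sees it."""
    network_heartbeat_ok: bool      # HA heartbeats arriving over the management network
    datastore_heartbeat_ok: bool    # host still updating its heartbeat datastore
    isolated: bool = False          # host can no longer reach the other cluster nodes
    isolation_response: str = "leave_powered_on"  # or "shut_down" / "power_off"


@dataclass
class VmState:
    """Hypothetical, simplified view of one VM for VM Monitoring."""
    monitoring_enabled: bool
    tools_heartbeat_ok: bool        # VMware Tools heartbeat still being received
    seconds_since_io: int           # time since the last disk or network activity


IO_QUIET_WINDOW = 120  # seconds, the figure quoted in trigger 3 above


def ha_restart_reason(host: HostState, vm: VmState) -> Optional[str]:
    """Return which trigger would cause an HA restart, or None if none applies."""
    # Trigger 1: the host is silent on both the network and the heartbeat datastore.
    if not host.network_heartbeat_ok and not host.datastore_heartbeat_ok:
        return "host failure"
    # Trigger 2: the host is isolated and configured to stop its VMs.
    if host.isolated and host.isolation_response in ("shut_down", "power_off"):
        return "isolation response"
    # Trigger 3: VM Monitoring sees no Tools heartbeat and no recent I/O.
    if (vm.monitoring_enabled
            and not vm.tools_heartbeat_ok
            and vm.seconds_since_io >= IO_QUIET_WINDOW):
        return "vm monitoring"
    return None


# The scenario from the question: vsphereB loses iSCSI, but its management
# network is fine and the guest OS (and VMware Tools) keeps running.
vsphere_b = HostState(network_heartbeat_ok=True, datastore_heartbeat_ok=False)
running_vm = VmState(monitoring_enabled=True, tools_heartbeat_ok=True, seconds_since_io=0)
print(ha_restart_reason(vsphere_b, running_vm))  # None: no trigger matches
```

In this model the host still answers network heartbeats, it isn't isolated, and the guest is still sending Tools heartbeats, so nothing tells HA to restart the VM.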

What you've done is pull the disk out from underneath a running VM. The standard behavior of vSphere, and most hypervisors, in this instance is to leave the virtual machine alone and let it handle its own disk issues. There are several good reasons for this:

  1. Some operating systems/distros (e.g. pfSense) will work just fine if the underlying disk stops responding
  2. A few dozen VMs starting up at the same time tends to create a "thundering herd" problem -- doing this on storage that's already questionable may not end up being the best idea
  3. Like swapping, the operating system (and applications) will usually do a better job of dealing with storage issues than the hypervisor will
  4. Sometimes storage just hangs -- it's the most failure-prone component in most virtualized environments. Best to try to detect it and alert on it and let an administrator figure out what to do with it before you kick over an entire environment

On the other hand, for many workloads (databases come to mind), it's a good idea to shut down as soon as there's a chance corruption or lost transactions might occur. In a best-case scenario, though, since you can't cleanly quiesce the database without the disk, you're probably ending up in an inconsistent state anyway.

Ultimately: there are some good use cases for having HA respond to unreliable storage, but it doesn't do that today, and the behavior you're seeing is totally normal.
