Hyper-V cluster behavior when losing network connectivity

Tags: cluster, failover-cluster, hyper-v

Setup:

  1. (rather new) Hyper-V R2 cluster with 2 nodes (in failover configuration). Physical host OS: Windows Server 2008.
  2. About eight VMs (mixed: Windows Server 2008 and Linux)

Yesterday we had a power outage of about 15 minutes.

Our blades are on UPS, so the physical host machines (Windows Server 2008) never went down. Our main switches are not on UPS (yet), and we saw behavior similar to the following (as distilled from the event logs).

  1. The nodes in the cluster lost their means of communication (because the external switches went down).
  2. The cluster tried to bring down one of the nodes (the first one), presumably to start a failover.
  3. That step impacted the clustered storage where the virtual machines' VHDs are located.
  4. All VMs were abruptly terminated and were found in a failed state in Failover Cluster Manager on the host OSes. The Linux VMs were kernel panicking and looked as if their disks had been ripped out.

This whole setup is rather new to us, so we are still learning about this.

The question:

We are putting the switches on UPS soon, but we were wondering whether the above is expected behavior (it seems rather fragile), or whether there are obvious configuration improvements to handle such scenarios.

I can upload an evtx file showing exactly what was going on if that's necessary.

Best Answer

The most probable explanation for that behavior has to do with the quorum configuration. Take a look at http://technet.microsoft.com/en-us/library/cc731739.aspx.

Basically, when your network switch went down, the two nodes lost communication with each other. At that point, neither node knew what the other one was doing. If one node decided that it was going to assume ownership of all of the clustered resources (i.e., virtual machines) and boot them up, who's to say that the other one wouldn't do the same thing? You'd end up in a scenario where both nodes are trying to take total ownership of all the virtual machines, and you'd have some really nasty hard drive corruption on your hands.

The quorum configuration solves this problem by requiring that, in order for a node to keep functioning, it must be in contact with a majority of the voting members (the nodes and, optionally, a witness disk or file share). If it can't, it stops functioning as a member of the cluster.

To verify that this is the case, open Failover Cluster Manager and check the "Quorum Configuration" on the summary page for the cluster. If it's Node Majority and you have an even number of nodes, then what I described is almost certainly what happened: with two nodes, each isolated node holds only one of the two votes, which is not a majority, so both stop.
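If the hosts are Windows Server 2008 R2, you can also check this from PowerShell with the FailoverClusters module that ships with the Failover Clustering feature. A minimal sketch; "HVCluster" is a placeholder for your cluster name:

    # Load the failover clustering cmdlets (Windows Server 2008 R2 and later)
    Import-Module FailoverClusters

    # Show the current quorum model and the witness resource, if any
    Get-ClusterQuorum -Cluster "HVCluster"

    # With 2 nodes under Node Majority, an isolated node holds only 1 of 2
    # votes, so it stops participating in the cluster
    (Get-ClusterNode -Cluster "HVCluster").Count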

The solution is to set up a small disk, called the disk witness (50 MB is more than enough), and add it to the storage for your cluster (but NOT to the Cluster Shared Volumes). Then change the quorum configuration to Node and Disk Majority. With this setting, if you experienced the same failure as before, the node that owned the witness disk at the time of the failure would keep functioning (and would actually take over all of the resources from the other node), while the other node would stop. The VMs that failed over to the surviving node would still experience a brutal restart, but at least they'd be back online as quickly as possible.
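Assuming the new 50 MB LUN is already presented to both nodes, the switch could look something like this in PowerShell; "HVCluster" and "Cluster Disk 3" are placeholder names for your cluster and for whatever name the new disk resource receives:

    # Add the new LUN to the cluster's Available Storage
    # (leave it out of the Cluster Shared Volumes)
    Get-ClusterAvailableDisk -Cluster "HVCluster" | Add-ClusterDisk

    # Switch the quorum model to Node and Disk Majority, using that disk as witness
    Set-ClusterQuorum -Cluster "HVCluster" -NodeAndDiskMajority "Cluster Disk 3"

The same change can be made through the "Configure Cluster Quorum Settings" wizard in Failover Cluster Manager if you prefer the GUI.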

As you stated, the ideal scenario would be to have your switches on the UPS also. That would've prevented the failure altogether; however, you should also make sure that you're using the recommended quorum configuration for the number of nodes that you have.