SQL Server – MS SQL 2016 AlwaysOn cluster on Win2012R2 – AGs fail over if File Share Witness is down

alwayson, failover, failovercluster, sql server, windows-cluster

Our current setup includes:

  • eight (8) Windows 2012 R2 nodes in a single Failover Cluster, no shared storage, File Share Witness (on the DC)

  • MS SQL 2016 AlwaysOn with a few availability groups (AGs)

  • default 'If resource fails' policies

The Cluster Validation Report shows a few minor warnings (differences in installed updates, etc.), but overall everything seems to be fine.

Recently, a DC outage of roughly half an hour made the File Share Witness unavailable, and one of the AGs failed over. This isn't what we expected: our understanding was that quorum among all 8 nodes would persist, so no failovers should have occurred.

Having read seemingly all the available documentation on quorum, FSW, etc., I still don't have a clear understanding of why the failover happened.

The Failover Clustering event logs contain, among others, the following ambiguous entry:

FailoverClustering Event ID:1069 Resource Control Manager

Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

On the node that switched to secondary (NODE5), the System event log contains:

16.03.2017 12:39:47 Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed due to an attempt to block a required state change in that cluster resource.

16.03.2017 12:39:47 File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\DC\CLUSTER'. Please ensure that file share '\\DC\CLUSTER' exists and is accessible by the cluster.

16.03.2017 12:39:48 The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

16.03.2017 12:39:48 Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed due to an attempt to block a required state change in that cluster resource.

16.03.2017 12:39:48 File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\DC\CLUSTER'. Please ensure that file share '\\DC\CLUSTER' exists and is accessible by the cluster.

And the Failover Cluster event log contains:

Cluster resource 'File Share Witness' in clustered role 'Cluster Group' has transitioned from state Terminating to state Failed.

<...>

The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'NODE5' to node 'NODE6'.

<...>

Clustered role 'db5' is moving to cluster node 'NODE6'.

To my mind, this basically means that the failover was caused by the File Share Witness going offline. But why?

We're also wondering whether there are ways to fix this behaviour. Any clarification or advice is welcome, thanks!

Best Answer

To my mind, this basically means that the failover was caused by the File Share Witness going offline. But why?

That's not what it means. Reading through the logs that were posted, I can see that the core cluster group failed over to another node (in the hope that this would fix the connectivity issue with the witness); however, there is nothing in them regarding SQL Server. You'll need to find where in the logs SQL Server had the failure and trace it back to see why the cluster decided to initiate an automatic failover.

The fact that an automatic failover occurred means the cluster had quorum. If it didn't, the automatic failover wouldn't have happened.
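The quorum point can be checked with simple vote arithmetic. The sketch below is illustrative only: it assumes the static majority rule (more than half of all configured votes must be online) with one vote per node plus one for the witness, and it ignores dynamic quorum, which can adjust vote weights on Windows Server 2012 R2. The function names are made up for this example.

```python
def has_quorum(total_votes: int, votes_online: int) -> bool:
    """Static majority rule: strictly more than half of all votes must be online."""
    return 2 * votes_online > total_votes


def spare_votes(total_votes: int, votes_online: int) -> int:
    """How many more votes can be lost before quorum is lost (0 if already lost)."""
    needed = total_votes // 2 + 1
    return max(0, votes_online - needed)


# Assumption: 8 nodes with one vote each, plus one vote for the File Share Witness.
NODE_VOTES = 8
WITNESS_VOTES = 1
TOTAL = NODE_VOTES + WITNESS_VOTES

# Witness unreachable, all nodes up: 8 of 9 votes are online.
print(has_quorum(TOTAL, NODE_VOTES))   # True
print(spare_votes(TOTAL, NODE_VOTES))  # 3
```

Under these assumptions, losing the witness alone leaves the cluster well above the majority threshold, which is consistent with the cluster still being able to move roles between nodes.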

We're also wondering whether there are ways to fix this behaviour. Any clarification or advice is welcome, thanks!

Nothing to fix, as this isn't what's happening. Look into the log to find the reason for the automatic failover; that is why it failed over, not because it couldn't health-check the FSW.
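As a rough aid for that tracing step, here is a minimal sketch that pulls the first log line mentioning a given role together with the lines just before it. It assumes the log is available as plain text lines (for example, a text export of the cluster log); `context_before` is a hypothetical helper, and the sample data is only the excerpt quoted in the question, elisions included.

```python
from typing import List


def context_before(lines: List[str], needle: str, before: int = 3) -> List[str]:
    """Return the first line containing `needle` plus up to `before` lines preceding it."""
    for i, line in enumerate(lines):
        if needle in line:
            return lines[max(0, i - before): i + 1]
    return []


# Sample: the Failover Cluster log excerpt from the question ("<...>" marks elided entries).
sample = [
    "Cluster resource 'File Share Witness' in clustered role 'Cluster Group' "
    "has transitioned from state Terminating to state Failed.",
    "<...>",
    "The Cluster service is attempting to fail over the clustered role "
    "'Cluster Group' from node 'NODE5' to node 'NODE6'.",
    "<...>",
    "Clustered role 'db5' is moving to cluster node 'NODE6'.",
]

# Show what was logged right before the AG role 'db5' started moving.
for entry in context_before(sample, "'db5'", before=2):
    print(entry)
```

On the real log, the entries hidden behind the elisions are the ones to read: they should show which SQL Server or AG resource event preceded the move of 'db5'.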