Storage Spaces Direct – Resolving SMB Errors

failovercluster, hyper-v, smbclient, storage-spaces, windows-server-2016

So we've got this 4-node Storage Spaces Direct (S2D) cluster that has been working for more than 1.5 years without any major issue. The OS is Windows Server 2016.

  • Firewall down for all profiles
  • No antivirus installed, Windows Defender OFF
  • Active Directory delegations untouched
  • No change in the network infrastructure has been reported
  • RDMA was disabled 1 year ago, as we found out the NICs didn't fully support it
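For what it's worth, the points above can be double-checked on each node with the built-in cmdlets; this is only a sketch, nothing here is specific to our environment:

# Confirm the firewall is off for all profiles
Get-NetFirewallProfile | Select-Object Name, Enabled

# Confirm RDMA is disabled on the physical NICs
Get-NetAdapterRdma | Select-Object Name, Enabled

# Check whether Defender real-time protection is actually off
Get-MpComputerStatus | Select-Object RealTimeProtectionEnabled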

Two days ago, we noticed a lot of error messages in the cluster event log, and the backup jobs of all Hyper-V VMs hosted on the cluster failed (made with Veeam).

Investigation quickly showed there are many issues with the SMB connections.

Each of the 4 hosts:

  • can ping other resources on the network
  • can't connect to any shared folder
  • fails NTP sync (net time \\server fails, and so does w32tm /monitor)

Obviously, the File Share Witness fails as well, and some issues with domain services have been reported…
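If it helps with troubleshooting, SMB reachability can be checked from each node roughly like this (a sketch; server.domain.com stands for any file server or the host of the witness share):

# Test whether TCP 445 is reachable at all from this node
Test-NetConnection -ComputerName server.domain.com -Port 445

# List the SMB connections the node currently has open (empty while the issue is present)
Get-SmbConnection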

We tried rebooting the nodes separately, and after a reboot the SMB connections are just fine… for a few minutes or hours, and then the issue arises again.

The impact on the cluster, with the File Share Witness being offline, is that we can't reliably perform a Live Migration of the VMs between the nodes (it succeeds randomly). A Quick Migration works like a charm, though. As SMB connections are not possible, we can't move the VMs to another cluster or a standalone host.

We fear the cluster will go haywire if a node fails uncontrollably. Even though the VMs are stable, we still can't perform a backup (we could perform an export).

Have any of you heard about this issue with S2D or the Microsoft failover cluster role? It might also be unrelated to the cluster itself…

What can be done to find the root cause of this issue?
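In case others want to look at the same data, the events below come from the standard SMBClient Connectivity channel and from the cluster log; a sketch of how they can be pulled (the IDs are simply the ones from our samples):

# Pull the SMBClient connectivity errors (event IDs 30803/30804 shown below)
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-SMBClient/Connectivity'
    Id      = 30803, 30804
} -MaxEvents 50 | Format-List TimeCreated, Id, Message

# Dump the failover cluster log to a folder for review
Get-ClusterLog -Destination C:\Temp -UseLocalTime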

Here are samples of the logs found in the cluster console and in the event logs for SMBClient:

From the Cluster console:

Cluster network name resource 'Cluster Name' encountered an error
enabling the network name on this node. The reason for the failure
was: 'Unable to obtain a logon token'.

The error code was '1311'.

You may take the network name resource offline and online again to
retry.

Event with ID 30803:

Failed to establish a network connection.

Error: {Device Timeout} The specified I/O operation on %hs was not
completed before the time-out period expired.

Server name: server.domain.com

Server address: x.x.x.x:445 Connection type: Wsk

Guidance: This indicates a problem with the underlying network or
transport, such as with TCP/IP, and not with SMB. A firewall that
blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA
adapter, can also cause this issue.

Another one, ID 30804:

A network connection was disconnected.

Server name: \server.domain.com Server address: x.x.x.x:445
Connection type: Wsk

Guidance: This indicates that the client's connection to the server
was disconnected.

Frequent, unexpected disconnects when using an RDMA over Converged
Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE
requires Priority Flow Control (PFC) to be configured for every host,
switch and router on the RoCE network. Failure to properly configure
PFC will cause packet loss, frequent disconnects and poor performance.

Best Answer

I found the solution, and it was a stupid thing. The hosts had several NICs for network access to different VLANs. Some of the NICs were mapped to a Virtual Switch, and some of those were shared with the OS ('Allow management operating system to share this network adapter').

I noticed the SMB packets often used the wrong interface (DMZ), and of course the requests were denied.

The PowerShell command I used to identify the wrong route used by the SMB traffic:

Find-NetRoute -RemoteIPAddress x.x.x.x

(where x.x.x.x is a remote resource on your network)

This showed the DMZ interface instead of the LAN interface. Removing the 'Allow management operating system to share this network adapter' setting on the DMZ vSwitch solved the issue for me.
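For anyone hitting the same thing, a rough sketch of the check and the fix ('DMZ' is just an example switch name, adjust it to your own vSwitch):

# See which vSwitches are shared with the management OS
Get-VMSwitch | Select-Object Name, SwitchType, AllowManagementOS

# Remove the management OS vNIC from the DMZ vSwitch
Set-VMSwitch -Name 'DMZ' -AllowManagementOS $false

# Verify SMB traffic now leaves via the LAN interface
Find-NetRoute -RemoteIPAddress x.x.x.x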

I still don't understand how this cluster worked so well for 1.5 years with this configuration. But well, now it is solved, and the FSW and all other operations work well.

Hope this can help ;)