Sql-server – Windows Server 2008 Cluster Errors

cluster, failover, sql server, windows-cluster, windows-server-2008

We have a couple dozen Windows Server 2008 and 2008 R2 Enterprise clusters running SQL Server 2008 and 2008 R2 Enterprise/Datacenter. In the past we had many issues with random failovers and "Network is Partitioned" errors on several servers on the other side of the globe. These were mostly resolved by updating NIC drivers and uninstalling Forefront Endpoint Protection (not sure how that played into everything, but it helped).

Fast forward six months to November: we are getting constant alerts from SCOM and in the Event Log that the clusters (two in particular) are failing with "Network is Partitioned" errors several times a week, but no failure actually occurs. SQL Server stays up and running, and no interruption in service is noticed on the web front ends. The errors seem to originate on the passive node and propagate through the network (we receive the first alerts from the passive node, then the active node, then the web front end), but all nodes, network adapters, disks, applications, IPs, and websites remain functional. We cannot find the reason these errors keep popping up when nothing appears to be wrong with the cluster, the network, or anything else. Any ideas about the cause, or a direction we could investigate, would be great.

Best Answer

When you get a "Network is Partitioned" error, it means that the server currently running your cluster applications is isolated in some fashion from the other nodes. It is entirely possible (and likely) that your services will continue to run, assuming there are no other faults. The warning is telling you that if a failover were needed, it would likely fail (usually because the node has no path to hand off the disks/CSVs).

Be sure to carefully check the network topology and your cluster network settings between the servers in question. We had a nasty experience with this where the failover cluster was using multipath NICs for inter-node comms that were different from the ones SQL Server was using (i.e. separate VLANs). Because both the primary and backup cluster connections were pathed in such a way that quorum could be lost if only one switch went down, SQL Server would still show as online while the cluster showed as partitioned, meaning that if the server (or a switch) were to fail, it would bring the cluster down hard.
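
If it helps to reason about the topology on paper first, here is a minimal Python sketch of the "does one switch take out both paths?" check described above. Everything in it is an assumption for illustration: the node and switch names, the LINKS list, and the helper functions are hypothetical and are not read from any real cluster. You would fill in the links from your own cabling and VLAN documentation.

```python
from collections import defaultdict, deque

# Hypothetical topology: each link is (endpoint_a, endpoint_b).
# Names are made up; replace them with your own nodes and switches.
LINKS = [
    ("node1", "switchA"),   # primary heartbeat path
    ("node2", "switchA"),
    ("node1", "switchB"),   # "backup" path...
    ("node2", "switchC"),
    ("switchB", "switchA"),  # ...which still routes through switchA
    ("switchC", "switchA"),
]
NODES = {"node1", "node2"}
SWITCHES = {"switchA", "switchB", "switchC"}


def reachable(links, start, removed):
    """Return every endpoint reachable from start when 'removed' is down."""
    graph = defaultdict(set)
    for a, b in links:
        if removed in (a, b):
            continue  # a link touching the failed device is unusable
        graph[a].add(b)
        graph[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(links, nodes, switches):
    """List switches whose loss alone would partition the cluster nodes."""
    critical = []
    start = min(nodes)  # any node works as the starting point
    for switch in sorted(switches):
        if not nodes <= reachable(links, start, switch):
            critical.append(switch)
    return critical


if __name__ == "__main__":
    print(single_points_of_failure(LINKS, NODES, SWITCHES))
```

In this made-up layout the backup path still depends on switchA, so losing that one switch isolates the nodes from each other even though each node has two NICs. An empty result would suggest the inter-node paths are genuinely independent, at least at the level of detail captured in the link list.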