Failover cluster failed to fail over due to mysterious IP conflict

failovercluster microsoft-cluster-server windows-server-2008

I'm having a mysterious problem with my failover cluster:

Cluster name: PrintCluster01.domain.com
Members: PrintServer01.domain.com and PrintServer02.domain.com

In Failover Cluster Management, under Cluster Events, I received critical error messages 1135 and 1177:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 15/06/2011 9:07:49 PM
Event ID: 1177
Task Category: None
Level: Critical
Keywords: 
User: SYSTEM
Computer: PrintServer01.domain.com
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. 
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.


Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 15/06/2011 9:07:28 PM
Event ID: 1135
Task Category: None
Level: Critical
Keywords: 
User: SYSTEM
Computer: PrintServer01.domain.com
Description:
Cluster node 'PrintServer02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

After further investigation, I found an interesting error. This is the very first critical error message logged in the Event Viewer on PrintServer02:

Log Name: System
Source: Tcpip
Date: 15/06/2011 9:07:29 PM
Event ID: 4199
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: PrintServer02-VM.domain.com
Description:
The system detected an address conflict for IP address 192.168.127.142 with the system having network hardware address 00-50-56-AE-29-23. Network operations on this system may be disrupted as a result.

192.168.127.142 is a secondary IP of PrintServer01.
How could it conflict with an address on the PrintServer01 node? The details are below:

**From PrintServer01**
Ethernet adapter Local Area Connection* 8:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
 Physical Address. . . . . . . . . : 02-50-56-AE-29-23
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.0.0
 Default Gateway . . . . . . . . . :
 NetBIOS over Tcpip. . . . . . . . : Enabled

I have double-checked on all cluster members that all IP addresses are now unique.

However, I'm sure the IPs are static and not assigned by DHCP, as the IPCONFIG results below show:

From **PrintServer01** (the Active Node)
Windows IP Configuration

Host Name . . . . . . . . . . . . : PrintServer01
 Primary Dns Suffix . . . . . . . : domain.com
 Node Type . . . . . . . . . . . . : Hybrid
 IP Routing Enabled. . . . . . . . : No
 WINS Proxy Enabled. . . . . . . . : No
 DNS Suffix Search List. . . . . . : domain.com
 domain.com.au

Ethernet adapter Local Area Connection* 8:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
 Physical Address. . . . . . . . . : 02-50-56-AE-29-23
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 169.254.1.183(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.0.0
 Default Gateway . . . . . . . . . :
 NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Cluster Public Network:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
 Physical Address. . . . . . . . . : 00-50-56-AE-29-23
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 192.168.127.155(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 IPv4 Address. . . . . . . . . . . : 192.168.127.88(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 IPv4 Address. . . . . . . . . . . : 192.168.127.142(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 IPv4 Address. . . . . . . . . . . : 192.168.127.143(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 IPv4 Address. . . . . . . . . . . : 192.168.127.144(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 Default Gateway . . . . . . . . . : 192.168.127.254
 DNS Servers . . . . . . . . . . . : 192.168.127.10
 192.168.127.11
 Primary WINS Server . . . . . . . : 192.168.127.10
 Secondary WINS Server . . . . . . : 192.168.127.11
 NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Cluster Private Network:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
 Physical Address. . . . . . . . . : 00-50-56-AE-43-EC
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 10.184.2.2(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 Default Gateway . . . . . . . . . :
 NetBIOS over Tcpip. . . . . . . . : Disabled


From **PrintServer02**
Windows IP Configuration

Host Name . . . . . . . . . . . . : PrintServer02
 Primary Dns Suffix . . . . . . . : domain.com
 Node Type . . . . . . . . . . . . : Hybrid
 IP Routing Enabled. . . . . . . . : No
 WINS Proxy Enabled. . . . . . . . : No
 DNS Suffix Search List. . . . . . : domain.com
 domain.com.au

Ethernet adapter Local Area Connection* 8:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Microsoft Failover Cluster Virtual Adapter
 Physical Address. . . . . . . . . : 02-50-56-AE-5F-E5
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 169.254.2.86(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.0.0
 Default Gateway . . . . . . . . . :
 NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Cluster Public Network:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection
 Physical Address. . . . . . . . . : 00-50-56-AE-79-FA
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 192.168.127.172(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 IPv4 Address. . . . . . . . . . . : 192.168.127.119(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 Default Gateway . . . . . . . . . : 192.168.127.254
 DNS Servers . . . . . . . . . . . : 192.168.127.10
 192.168.127.11
 Primary WINS Server . . . . . . . : 192.168.127.11
 Secondary WINS Server . . . . . . : 192.168.127.10
 NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Cluster Private Network:

Connection-specific DNS Suffix . :
 Description . . . . . . . . . . . : Intel® PRO/1000 MT Network Connection #2
 Physical Address. . . . . . . . . : 00-50-56-AE-77-8D
 DHCP Enabled. . . . . . . . . . . : No
 Autoconfiguration Enabled . . . . : Yes
 IPv4 Address. . . . . . . . . . . : 10.184.2.3(Preferred)
 Subnet Mask . . . . . . . . . . . : 255.255.255.0
 Default Gateway . . . . . . . . . :
 NetBIOS over Tcpip. . . . . . . . : Disabled

Any help would be greatly appreciated.

Thanks,
AWT

Best Answer

The IP Address conflict error occurs when more than one node in a cluster attempts to bring a resource group (and its associated IP(s)) online at the same time.
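
To see which machine the 4199 event is pointing at, the hardware address in its description can be matched against the ipconfig output from the question. The following is only a minimal sketch: the event text and the physical addresses are copied from the question, while the function names (`parse_conflict`, `identify_owner`) and the lookup table are made up for illustration and are not part of any Windows tooling.

```python
import re

# Description text of Tcpip event 4199 as logged on PrintServer02 (from the question).
EVENT_4199 = (
    "The system detected an address conflict for IP address 192.168.127.142 "
    "with the system having network hardware address 00-50-56-AE-29-23. "
    "Network operations on this system may be disrupted as a result."
)

# Physical addresses of the real NICs, copied from the ipconfig output in the question.
ADAPTERS = {
    "00-50-56-AE-29-23": ("PrintServer01", "Cluster Public Network"),
    "00-50-56-AE-43-EC": ("PrintServer01", "Cluster Private Network"),
    "00-50-56-AE-79-FA": ("PrintServer02", "Cluster Public Network"),
    "00-50-56-AE-77-8D": ("PrintServer02", "Cluster Private Network"),
}

# Hypothetical pattern for the wording of this particular event description.
PATTERN = re.compile(
    r"IP address (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) .*?"
    r"hardware address (?P<mac>(?:[0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2})"
)


def parse_conflict(description):
    """Return (ip, mac) from an event 4199 description, or None if it does not match."""
    match = PATTERN.search(description)
    if match is None:
        return None
    return match.group("ip"), match.group("mac").upper()


def identify_owner(description, adapters=ADAPTERS):
    """Map the conflicting hardware address to a known node and adapter."""
    parsed = parse_conflict(description)
    if parsed is None:
        return None
    ip, mac = parsed
    node, adapter = adapters.get(mac, ("unknown", "unknown"))
    return {"ip": ip, "mac": mac, "node": node, "adapter": adapter}


if __name__ == "__main__":
    print(identify_owner(EVENT_4199))
```

With the data from the question, the lookup resolves to the Cluster Public Network adapter on PrintServer01, which fits the picture of PrintServer02 trying to bring online an address that PrintServer01 was still holding.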

This can happen if the cluster nodes momentarily lose contact with each other. Each node assumes the other has failed; as a result, the 'passive' node brings all resource groups online while they are in fact still online on the 'active' node.
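
The sequence above can be illustrated with a toy model. This is a sketch of the scenario only, not of how the Cluster service is actually implemented: the `Node` class, the `on_heartbeat_timeout` method and the assumption that 192.168.127.142 is the only address in the resource group are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption for illustration: the resource group carries just this one address.
GROUP_IPS = ["192.168.127.142"]


@dataclass
class Node:
    name: str
    online_ips: set = field(default_factory=set)
    peer_reachable: bool = True

    def on_heartbeat_timeout(self, group_ips):
        """Naive takeover: if the peer looks dead, bring the group online here."""
        if not self.peer_reachable:
            self.online_ips |= set(group_ips)


def conflicting_ips(nodes):
    """Return {ip: [node names]} for every address online on more than one node."""
    owners_by_ip = {}
    for node in nodes:
        for ip in node.online_ips:
            owners_by_ip.setdefault(ip, []).append(node.name)
    return {ip: names for ip, names in owners_by_ip.items() if len(names) > 1}


if __name__ == "__main__":
    active = Node("PrintServer01", online_ips=set(GROUP_IPS))
    passive = Node("PrintServer02")

    # Both nodes briefly stop hearing each other.
    active.peer_reachable = False
    passive.peer_reachable = False
    for node in (active, passive):
        node.on_heartbeat_timeout(GROUP_IPS)

    print(conflicting_ips([active, passive]))
```

In this model the address ends up online on both nodes as soon as each side believes the other is gone, which is the duplicate-address condition that the Tcpip event reports.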

I have seen this problem in our VMware environment when one of the ESX(i) hosts is overloaded. Sometimes even an HBA bus rescan is enough: the MSCS nodes very briefly lose contact and this mess occurs.
