Troubleshooting a network interruption

networking

My company is almost entirely a Windows shop: Microsoft firewall, all Windows servers, and so on. The hardware is mostly Cisco or Cisco-like.
For about three weeks, we've experienced "random" network interruptions. They're brief, but they interrupt the workflow*. They don't happen at a consistent time, and we don't know what changed to make them start. We've asked Optimum Lightpath, and they say their systems are running fine, so the problem seems to be in-house.

How would you troubleshoot this / set up logs to parse / properly set up Wireshark's filters (I know, I know – RTFM …) / sacrifice a goat?


* Workflow interruption: any kind of work that requires access to a server (such as loading a webpage or querying a database) is interrupted.
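One low-tech way to pin down when the interruptions happen is to probe a few servers once a second, log each result with a timestamp, and then collapse the failures into outage windows that can be lined up against switch and server logs. The sketch below covers only that second step. The `(timestamp, ok)` sample format, the `outage_windows` name, and the `min_failures` threshold are assumptions made for illustration; nothing here comes from the original setup.

```python
from datetime import datetime, timedelta


def outage_windows(samples, min_failures=2):
    """Group consecutive failed probes into (start, end, count) windows.

    samples: iterable of (timestamp, ok) tuples, sorted by timestamp.
    min_failures: ignore runs shorter than this (single lost probes
    are usually noise rather than an outage).
    """
    windows = []
    run = []  # timestamps of the current streak of failures
    for ts, ok in samples:
        if not ok:
            run.append(ts)
            continue
        if len(run) >= min_failures:
            windows.append((run[0], run[-1], len(run)))
        run = []
    # A streak that is still open at the end of the log also counts.
    if len(run) >= min_failures:
        windows.append((run[0], run[-1], len(run)))
    return windows


if __name__ == "__main__":
    # Hypothetical one-second probe log: failures from second 5 to second 9.
    start = datetime(2010, 2, 9, 18, 9, 40)
    log = [(start + timedelta(seconds=i), not 5 <= i <= 9) for i in range(30)]
    for first, last, count in outage_windows(log):
        print(f"{first} .. {last} ({count} failed probes)")
```

If the windows from several probe targets start at the same second, that points at shared infrastructure (a switch, a trunk, a loop) rather than at any single server.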

Copying here dashmir's message since it contains most of the relevant information:

Hopefully I have resolved the issue today (it may have been due to a bad switch), but here is the background. We are a multi-specialty physician practice. We have 3 buildings connected using dark fiber and 22 remote sites. Half are connected using E-Lines; the other half use site-to-site VPN.

The interruption is brief, about 10-15 seconds, but enough to disrupt workflow and cause chaos. Doctors who are on the EMR are mostly screaming. Then everything goes back to normal.

Connection is lost between all switches, servers, applications etc.

We have Exchange in a CCR cluster. The firewall is likewise fault tolerant and does load balancing. These are some of the errors on our Exchange servers and on our firewall.

Event ID 1135: Cluster Service Startup
Updated: November 25, 2009
Applies To: Windows Server 2008 R2

The Cluster service is the essential software component that controls all aspects of failover cluster operation and manages the cluster configuration database. If the Cluster service fails to start on a failover cluster node, the node cannot function as part of the cluster.
Event Details
Product: Windows Operating System
ID: 1135
Source: Microsoft-Windows-FailoverClustering
Version: 6.1
Symbolic Name: EVENT_NODE_DOWN
Message: Cluster node '%1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Resolve

Check network hardware and configuration

If you do not currently have Event Viewer open, see "Opening Event Viewer and viewing events related to failover clustering." After reviewing event messages, choose actions that apply to your situation:
* Run the Validate a Configuration Wizard, selecting only the network and inventory tests. For more information, see "Using the Validate a Configuration Wizard to review the network configuration."
* Check the system event log for hardware or software errors related to the network adapters on this node.
* Check the network adapter, cables, and network configuration for the networks that connect the nodes.
* Check hubs, switches, or bridges in the networks that connect the nodes.

To perform the following procedures, you must be a member of the local Administrators group on each clustered server, and the account you use must be a domain account, or you must have been delegated the equivalent authority.

Using the Validate a Configuration Wizard to review the network configuration

To use the Validate a Configuration Wizard to review the network configuration:
1. To open the failover cluster snap-in, click Start, click Administrative Tools, and then click Failover Cluster Management. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue.
2. In the Failover Cluster Management snap-in, in the console tree, make sure Failover Cluster Management is selected. Then under Management, click Validate a Configuration.
3. Follow the instructions in the wizard to specify the cluster you want to test.
4. On the Testing Options page, select Run only tests I select.
5. On the Test Selection page, clear all check boxes except those for the Network tests.
6. Follow the instructions in the wizard to run the tests.
7. On the Summary page, click View Report.

Opening Event Viewer and viewing events related to failover clustering

To open Event Viewer and view events related to failover clustering:
1. If Server Manager is not already open, click Start, click Administrative Tools, and then click Server Manager. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue.
2. In the console tree, expand Diagnostics, expand Event Viewer, expand Windows Logs, and then click System.
3. To filter the events so that only events with a Source of FailoverClustering are shown, in the Actions pane, click Filter Current Log. On the Filter tab, in the Event sources box, select FailoverClustering. Select other options as appropriate, and then click OK.
4. To sort the displayed events by date and time, in the center pane, click the Date and Time column heading.

Verify

To perform this procedure, you must be a member of the local Administrators group on each clustered server, and the account you use must be a domain account, or you must have been delegated the equivalent authority.

Verifying that the Cluster service is started on all the nodes in a failover cluster

To verify that the Cluster service is started on all the nodes in a failover cluster:
1. To open the failover cluster snap-in, click Start, click Administrative Tools, and then click Failover Cluster Management. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue.
2. In the Failover Cluster Management snap-in, if the cluster you want to manage is not displayed, in the console tree, right-click Failover Cluster Management, click Manage a Cluster, and then select or specify the cluster that you want.
3. If the console tree is collapsed, expand the tree under the cluster you want to manage, and then click Nodes.
4. View the status for each node. If a node is Up, the Cluster service is started on that node.

Another way to check whether the Cluster service is started is to run a command on a node in the cluster.

Using a command to check whether the Cluster service is started on a node

To use a command to check whether the Cluster service is started on a node:
1. On the node that you are checking, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.
2. Type: CLUSTER NODE /STATUS

If the node status is Up, the Cluster service is started on that node.

These are some errors I am seeing on our firewall.

Event Type: Warning
Event Source: WLBS
Event Category: None
Event ID: 18
Date: 2/9/2010
Time: 6:09:47 PM
User: N/A
Computer: HAWKEYE
Description: NLB Cluster 172.16.2.35 : Duplicate cluster subnets detected. The network may have been inadvertently partitioned.

The following Windows NT Load Balancing Service (WLBS) Event 18 appears in Event Viewer: Duplicate cluster subnets detected. The network may have been inadvertently partitioned. WLBS Cluster appears to function normally.

CAUSE

This event is generated on the remerging of a cluster that has been split into more than one cluster. This event can be caused by:
* Pulling the net tap on a server, which will cause the server to converge with itself and two clusters will form.
* Severing a trunk between two switches if the cluster is deployed across them.
* A malfunctioning switch or a switch flooded by network congestion.

RESOLUTION

During the time that the cluster was partitioned, the members of the cluster converged into two or more separate clusters. This event is an informational message that reports the network had been partitioned and the WLBS hosts now have correctly converged in just one cluster. This event is benign but if it is logged repeatedly there may be an issue with the underlying network or the network infrastructure may be insufficient for the volume of traffic.
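Since the event only matters when it is logged repeatedly, it helps to count how often it appears. Below is a minimal sketch that pulls the `Date:` / `Time:` stamps out of event text pasted in the same form as the warning above and tallies them per day. It assumes US-style month/day/year dates and a 12-hour clock, as in that sample; the function names `event_times` and `events_per_day` are made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches "Date: 2/9/2010 Time: 6:09:47 PM", tolerating line breaks between fields.
STAMP = re.compile(
    r"Date:\s*(\d{1,2}/\d{1,2}/\d{4})\s+Time:\s*(\d{1,2}:\d{2}:\d{2}\s+[AP]M)"
)


def event_times(text):
    """Return a datetime for every Date:/Time: pair found in the text."""
    times = []
    for date_part, time_part in STAMP.findall(text):
        clean_time = " ".join(time_part.split())  # collapse wrapped whitespace
        # Assumption: month/day/year ordering, 12-hour clock with AM/PM.
        stamp = datetime.strptime(
            f"{date_part} {clean_time}", "%m/%d/%Y %I:%M:%S %p"
        )
        times.append(stamp)
    return times


def events_per_day(times):
    """Count events per calendar day."""
    return Counter(t.date() for t in times)


if __name__ == "__main__":
    sample = "Event ID: 18 Date: 2/9/2010 Time: 6:09:47 PM User: N/A"
    for day, count in sorted(events_per_day(event_times(sample)).items()):
        print(day, count)
```

Comparing these timestamps with the times users report interruptions shows whether the NLB partitions and the outages are the same events.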

Best Answer

OK, so after a week of dissecting my network I have come to a conclusion.

The work was tedious but it had to be done. I ended up going to each of the sites, unplugging everything, and re-attaching all the switches one at a time.

I found another loop between the buildings, and 2 switches with the same IP address. Now everything works fine.
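For anyone hunting the duplicate-address half of this problem without unplugging everything, here is a rough sketch of the check. It assumes you have already collected (IP, MAC) pairs from ARP tables across the sites into a list; that input format and the `duplicate_ips` name are hypothetical, and this does nothing to find loops.

```python
from collections import defaultdict


def duplicate_ips(arp_entries):
    """Return {ip: [macs]} for every IP claimed by more than one MAC.

    arp_entries: iterable of (ip, mac) string pairs, e.g. gathered from
    ARP tables on several hosts or switches.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_entries:
        # Normalise so "00-1A-..." and "00:1a:..." count as the same MAC.
        macs_by_ip[ip].add(mac.lower().replace("-", ":"))
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Made-up entries for illustration only.
    entries = [
        ("10.0.0.2", "00-1A-2B-3C-4D-5E"),
        ("10.0.0.2", "00:1a:2b:3c:4d:5e"),
        ("10.0.0.3", "00:1a:2b:00:00:01"),
        ("10.0.0.3", "00:1a:2b:00:00:02"),
    ]
    for ip, macs in duplicate_ips(entries).items():
        print(ip, "is claimed by", ", ".join(macs))
```

An IP that legitimately moves between MACs (a failover pair, for example) will also show up here, so treat the output as a list of things to look at rather than a verdict.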

Thanks