Troubleshooting intermittent network failures and slowdown

networking windows-server-2008-r2

I put up a diagram of our network and equipment here: http://imgur.com/bp7l0

Symptoms

  • Twice in 3 weeks, we have experienced intermittent network failures. These usually manifest as a timeout on a web page, or sometimes as missing site content (for example, stylesheets don't load). The problem has occurred on all floors in our building. Normally a forced refresh of the page fixes it.
  • Tracert to the affected site has worked every time I've tried it, even while I was consistently getting page load errors on every second or third new URL. Sometimes the second hop fails, although this may simply mean that the device at that IP address blocks ICMP.
  • Some users have experienced slow network performance.
  • Meanwhile, overall network usage appears to be normal, well below the limit of the 10 MB pipe.
  • Running a speed test at speedtest.net gives normal results: a little below the limit, as expected since I am not the sole user on the network.
  • Once, when I was out and received an emergency call, I suggested that our IT staff restart either the router or the firewall. They restarted the firewall, which apparently cleared up the problem for a few weeks.

Overview of the network
See diagram here: http://imgur.com/bp7l0.

We have two network connections, a primary and a failover connection. Both are plugged directly into the firewall. From the firewall to our primary switch, the connection is copper, cat5e. The port is configured for 100 megabit full duplex. Some users are plugged directly into this switch via an IDF. Users on other floors have a separate switch, which is connected to the primary switch via fiber, and go from there to an IDF.

During the window when I was able to observe the firewall, the failover connection did not appear to have been engaged. The secondary connection kicks in when a bandwidth threshold (10 MB) is reached. It is also used if the primary connection dies completely.

Troubleshooting already performed

  • Connected to the managed switch and looked at the statistics for the port with the copper link. Everything seems normal, but I'm not entirely sure what to look for. I looked for drops and collisions; both were low on this particular port. Without an external logging server, I'm not sure what time range the counters cover.
  • Watched statistics on the firewall for a while, including bandwidth utilization and error reports. There was no unusual flood of connections.

My question

What should I investigate next, and what steps should I take? Any guesses as to what type of problem I'm encountering here: cable, switch, firewall, or ISP? What are some tools that can help me test the various components involved? The problem is tough because it is intermittent. I think I can use SNMP to collect data from the switch and the firewall over a longer time period, but that would be a big project with a lot for me to learn. Are there any configuration changes worth making, such as a timeout that I could easily adjust globally?

Any help would be much appreciated. Thanks!

Best Answer

Without getting into a lot of very specific guidance, which I'm sure others will offer:

  1. Don't make any changes without knowing that the specific component being changed is the cause of the problem and that the change will resolve it. Making random changes in the hope that something will work is analogous to driving a car blindfolded. You may fix the problem, but it will only be due to sheer luck, and you'll never know what the real cause was.

  2. You've already hit on something: the firewall. If rebooting the firewall resolved the issue last time, then that's where I'd start. Take a look, if possible, at whatever counters are available on the firewall, such as CPU and memory usage, traffic loads, dropped packets, etc. Put a network sniffer on the inside and outside of the firewall and run some tests from a client machine. Do you see packets dropped on the inside? On the outside? What does the timing of the traffic look like as it enters and exits the firewall? Is there a marked delay?

From there I'd move to the switch or the router and perform the same tests, depending on the results of testing the firewall.
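
Because the problem is intermittent, it can help to leave a repeatable test running from a client machine while you watch the firewall. The following is only a rough sketch, not a replacement for a sniffer: it times DNS resolution and the TCP connect separately and summarizes the failures. The function names (probe_tcp, summarize, run_probe) and the default port, timeout, and interval are illustrative assumptions, not anything taken from your setup.

```python
import socket
import time
from statistics import median


def probe_tcp(host, port=80, timeout=3.0):
    """Time DNS resolution and the TCP connect separately for one attempt.

    Returns a dict with 'dns_ms' and 'connect_ms' (None if that stage did
    not complete) and 'error' (None on success, else the exception name).
    """
    result = {"dns_ms": None, "connect_ms": None, "error": None}
    start = time.monotonic()
    try:
        # Resolve the name first so a DNS problem is not mistaken for a
        # connection problem.
        family, socktype, proto, _, addr = socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM)[0]
        result["dns_ms"] = (time.monotonic() - start) * 1000.0

        start = time.monotonic()
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(addr)
        result["connect_ms"] = (time.monotonic() - start) * 1000.0
    except OSError as exc:
        # socket.gaierror and socket.timeout are both OSError subclasses.
        result["error"] = type(exc).__name__
    return result


def summarize(samples):
    """Reduce a list of probe results to a failure rate and connect timings."""
    total = len(samples)
    failures = sum(1 for s in samples if s["error"] is not None)
    connect_times = [s["connect_ms"] for s in samples
                     if s["connect_ms"] is not None]
    return {
        "attempts": total,
        "failures": failures,
        "failure_rate": failures / total if total else 0.0,
        "median_connect_ms": median(connect_times) if connect_times else None,
        "max_connect_ms": max(connect_times) if connect_times else None,
    }


def run_probe(host, port=80, attempts=20, interval=1.0):
    """Probe the host repeatedly, pausing between attempts, then summarize."""
    samples = []
    for _ in range(attempts):
        samples.append(probe_tcp(host, port))
        time.sleep(interval)
    return summarize(samples)
```

For example, calling run_probe("www.example.com", attempts=60, interval=5.0) from a client on the LAN, and again from a machine connected outside the firewall if you can arrange one, gives two sets of numbers to compare. If failures or long connect times show up only on the inside, that points toward the firewall or something behind it; if they show up on both sides, look further upstream.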
