How to trace reason for internet connection dropout

networking

Scenario: small business with around 40 users behind a Watchguard XTM 3.0 firewall and 20Mb leased line internet connection.

Problem: Users experience occasional dropouts in internet connectivity. These are particularly annoying during VoIP calls (e.g. Skype), because the connection breaks. Browsing internet sites is also affected when the dropouts occur. The dropouts are frequent enough to be a business problem, though most of the time everything is fine.

Comments: We think the issue is at our end, since calls to the same Skype recipients from elsewhere (e.g. home broadband) seem to work fine. The problem has also persisted through an upgrade from ADSL to a leased line. However, we would like to know definitively whether the problem is on the LAN or the WAN. The switches are currently unmanaged but will shortly be replaced with new managed switches. As far as we can tell, the dropouts occur for users anywhere on the LAN.

Any ideas on how to trace the cause of the dropouts? Is there a way to test continuity of the connection from within the XTM? It is easy to see that there are no long dropouts, but how can we test for short dropouts (short, but long enough to break a Skype call)?

The more likely reason is something on the LAN – how do we narrow this down without leaving people disconnected for long periods?

Tim

Best Answer

Finding the source of this kind of problem can be extremely frustrating, especially if the dropouts are rare. However, this is how I approach intermittent network problems:

  1. Map the network to the best of your ability
  2. Identify potentially problematic systems
  3. Create a (preferably automated) monitoring solution to identify where the problem is located
  4. Handle the problem.

Steps 1 and 2 should be relatively straightforward. A drawing on a whiteboard showing the complete path and the systems involved is helpful. For step 3 I tend to use Nagios or another long-term monitoring solution. There are many Nagios plugins that may be useful, and you can configure it to monitor many properties of the systems at a very high resolution from your NOC. The monitoring has two purposes. One is to gather information for later debugging; the other is to alert you when problems occur, which makes it easier to correlate them with their sources. For intermittent connectivity issues, I make sure to configure routing monitoring and connectivity tests for every system along the path.
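As a rough illustration of the "connectivity tests for every system along the path" idea, here is a minimal Python sketch, not a replacement for Nagios. It pings each hop once per second and logs the nearest hop that failed to answer, which hints at whether a short dropout starts on the LAN side or the WAN side. The hop names and addresses in `HOPS` are placeholders I made up, and the `ping` flags assume Linux (`-c` count, `-W` timeout in seconds); other systems use different flags. Some routers also deprioritise ICMP, so an isolated lost reply is not proof of a fault.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical hops, ordered from nearest to farthest.
# Replace these placeholder names and addresses with your own path.
HOPS = [
    ("firewall-lan", "192.168.1.1"),
    ("isp-gateway", "203.0.113.1"),
    ("internet-host", "8.8.8.8"),
]


def ping_once(address, timeout_s=1):
    """Return True if one ICMP echo is answered (assumes Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_failing_hop(results):
    """Given [(name, ok), ...] ordered near to far, return the first failed name, or None."""
    for name, ok in results:
        if not ok:
            return name
    return None


def monitor(hops=HOPS, interval_s=1.0, probe=ping_once, rounds=None, log=print):
    """Probe every hop each round and log only the rounds where something failed.

    rounds=None means run until interrupted; probe and log are injectable
    so the loop can be exercised without touching the network.
    """
    done = 0
    while rounds is None or done < rounds:
        results = [(name, probe(address)) for name, address in hops]
        failed = first_failing_hop(results)
        if failed is not None:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} first failing hop: {failed} (all results: {results})")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Calling `monitor()` from a wired machine on the LAN and leaving it running gives a timestamped list of dropouts. If the firewall's LAN address is the first hop to fail, the problem is probably inside the LAN; if only the farther hops fail, look at the firewall's WAN side or the provider.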

Once I have found a solution to the problem I deploy it, and leave the monitoring in place until I am confident that the problem has been resolved.

By the way, unmanaged equipment has no place in a production network, as you have probably figured out by now. Debugging problems in a LAN without access to at least SNMP on the switches is a huge headache. And if you are unlucky, a single patch cable between two network ports somewhere in the network is enough to make your network crash and burn...
