How to diagnose severe network problems in a small network

microsoft-ftmg-2010 network-monitoring networking

We have a fairly small network with managed and unmanaged switches (Netgear GS748T, Linksys SLM2024, DGS-1008D, DES-1008D, DES-1026G, SRW224G4), about 8-10 Hyper-V hosts with multiple virtual machines, a few VMware hosts, about 100 local users, and another 100 VPN users (not all connected at the same time). We recently introduced Forefront TMG as the central point of our network and made big changes to the VLANs: we went from a single 192.168.1.X network to 5-10 VLANs that split the network into test machines, critical servers, iSCSI, heartbeat (Hyper-V cluster), trusted users, untrusted users, etc. Most, if not all, network cards use teaming, aggregation, and trunking.

For the last few weeks, if not months, the network has been unstable, with iSCSI problems at night while backups run. Yesterday the network went down during the day and was unavailable for 2 hours. During that time switches hanged 2 times and required hard resets, and overall the network was not working correctly. After 2 hours everything went back to fairly normal, but it seems likely to happen again at any time.

The switches don't offer much in the way of monitoring, and neither do the backup iSCSI drives. Some of the errors logged by TMG:

Forefront TMG disconnected a non-TCP connection from 172.16.10.5 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG disconnected a non-TCP connection from 172.16.10.12 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

The number of concurrent TCP connections from the source IP address 178.215.xxx.xxx exceeded the configured limit. As a result, Forefront TMG will not allow the creation of new TCP connections from this source IP. This IP address probably belongs to an attacker or an infected host. See product documentation for more info about Forefront TMG flood mitigation.

The number of denied connections from the source IP address 77.1xxx.xxx exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

Forefront TMG disconnected a non-TCP connection from 172.16.10.10 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG disconnected a non-TCP connection from 172.16.10.16 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

The number of denied connections from the source IP address 195.ZZZ exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

The number of denied connections from the source IP address 85.ZZZ exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

Forefront TMG disconnected a non-TCP connection from 172.16.231.12 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG was unable to decompress a response body from stooq.pl because the response was compressed by a method which is not supported by Forefront TMG. This happens when a Web server is configured to supply responses compressed by a method that is not supported by Forefront TMG regardless of the type of compression requested.

If you want Forefront TMG to block such responses, configure the policy rule's HTTP policy to block the Content-Encoding header in responses. Otherwise, such responses will be forwarded without decompression to the client and can be cached.
You can cancel or reduce the frequency of the alert generated by this event in Forefront TMG Management.

The connectivity verifier "Farm: Sharepoint.xxx.pl – Farm" reported an error when trying to connect to 14cms.xxx.xx.
Reason: The request has timed out.

The connectivity verifier "DHCP1" reported an error when trying to connect to DHCP1.xxx.xx.
Reason: The request has timed out.

We have already experimented with TMG and set higher connection limits for our AD/DNS servers, since we had seen these messages before, but it now seems to be happening all over the network.
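To get a rough picture of which hosts trip the limits most often, the alert text can be tallied per source address. This is only a minimal sketch: it assumes the alerts have been copied out as plain text, one message per line, and the function name `tally_alerts` and the category labels are made up for illustration. The patterns match only the wording of the alerts quoted above, and masked addresses are counted exactly as they appear.

```python
import re
from collections import Counter

# Patterns follow the wording of the TMG alerts quoted in the question.
# The category labels are illustrative, not official TMG alert names.
PATTERNS = {
    "non-TCP connection limit": re.compile(
        r"disconnected a non-TCP connection from (\S+) because"
    ),
    "concurrent TCP limit": re.compile(
        r"concurrent TCP connections from the source IP address (\S+) exceeded"
    ),
    "denied connections limit": re.compile(
        r"denied connections from the source IP address (\S+) exceeded"
    ),
}


def tally_alerts(lines):
    """Count alerts per (category, source address) pair."""
    counts = Counter()
    for line in lines:
        for category, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(category, match.group(1))] += 1
                break  # each line is counted under one category only
    return counts


# A few of the messages from above, shortened after the matched part.
sample = [
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.5 because the connection limit for this IP address was exceeded.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.12 because the connection limit for this IP address was exceeded.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.5 because the connection limit for this IP address was exceeded.",
    "The number of denied connections from the source IP address 85.ZZZ exceeded the configured limit.",
    'The connectivity verifier "DHCP1" reported an error when trying to connect to DHCP1.xxx.xx.',
]

# Print the most frequent offenders first.
for (category, source), count in tally_alerts(sample).most_common():
    print(f"{count:3d}  {source:<18} {category}")
```

Internal addresses that keep appearing under the non-TCP limit would be the first candidates for either a custom limit or a closer look at what they are sending.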

Best Answer

"During that time switches hanged 2 times and required hard resets"

I'm not trying to be elitist here, but Linksys/D-Link/Netgear isn't even mid-size grade hardware. iSCSI and virtualization require a very stable and fast network to perform properly.

I strongly suggest you buy better networking gear (Cisco, HP, etc.).