When monitoring interface errors, what percentage of traffic should you set your 'critical' threshold to according to best practices and does it depend on the interface type (T1, Ethernet etc)? It would be a huge bonus if you can explain the justification for the particular percentage. I've found a few thread comments on various sites that mention 1%, but with no real justification.
Ethernet – Monitoring best practice for thresholding errors on an interface
best practices, ethernet, monitoring
Related Solutions
The simple answer is to make the CAM timer equal or slightly longer than the corresponding interface ARP timer, but there are at least three different options to select from...
Option 1: Lower all interface ARP Timers
This option works best if you have a decent-sized layer2 switched network, a reasonable number of ARP entries and few routed interfaces. This method is also preferable if you like to see PC MAC entries age out of the topology quickly.
- On all IOS ethernet interfaces facing an ethernet switch:
  arp timeout 240
- On all IOS ethernet interfaces facing an ethernet switch:
  hold-queue 200 in
  hold-queue 200 out
  to avoid dropping ARP packets during periodic ARP refreshes (these limits could be higher or lower, depending on how many ARP refreshes you think you'll need to handle at once). If you are adjusting Selective Packet Discard values, then you should follow the guidelines in the paper I linked.
This forces Cisco IOS to refresh the ARP table within four minutes, if it hasn't happened otherwise for a given ARP entry. The obvious disadvantage is that this doesn't scale well if you have lots of ARP entries... the limits vary by platform. I have used this with a few hundred ARPs per router on Catalyst 4500 / 6500 (the Layer3 SVIs) without any issues.
Option 2: Increase the switch CAM Timers
This option works best if you have a large number of ARP entries (i.e. thousands, such as an intense VMware environment could see).
- On all IOS switches:
  mac address-table aging-time 14400
  or, for any VLAN that is of concern:
  mac address-table aging-time 14400 vlan <vlan-id>
This change adjusts timers that most people assume are fixed at 300 seconds (on Cisco IOS), so be sure to include this in continuity docs. The side-effect of this is that CAM table entries linger for 4 hours after the PC is removed (which can be either good or bad, depending on your PoV). If 4 hours is too long, see the next option...
Option 3: Change both the interface ARP timers, and the switch CAM Timers
This option avoids hideously-long CAM timers in Option 2 at the expense of more configuration. You can choose whether you need 900 seconds, 1800 seconds, or whatever... just make sure your CAM and ARP timers match; thus, you will need to configure both Option 1 and Option 2 in your topologies.
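The rule behind all three options (CAM timer equal to or slightly longer than the ARP timer) can be sketched as a small check. This is only an illustration: the function name and the 60-second allowance for "slightly longer" are assumptions, not anything defined by Cisco IOS.

```python
def timers_consistent(arp_timeout_s: int, cam_aging_s: int, max_slack_s: int = 60) -> bool:
    """Return True when the CAM aging time is equal to, or at most
    max_slack_s seconds longer than, the ARP timeout.

    max_slack_s is a hypothetical allowance for "slightly longer".
    """
    return 0 <= cam_aging_s - arp_timeout_s <= max_slack_s


# Option 1: ARP lowered to 240 s, CAM left at 300 s.
print(timers_consistent(arp_timeout_s=240, cam_aging_s=300))
# Option 2: CAM raised to 14400 s to match a 14400 s ARP timeout.
print(timers_consistent(arp_timeout_s=14400, cam_aging_s=14400))
# Mismatch: 14400 s ARP timeout with a 300 s CAM timer.
print(timers_consistent(arp_timeout_s=14400, cam_aging_s=300))
```

For Option 3 you would pass whichever matching pair you chose, for example 1800 and 1800.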
You may consider the following general guidelines when implementing VLANs:
- Grouping devices by traffic patterns – Devices that communicate extensively between each other are good candidates to be grouped into a common VLAN.
- Grouping devices for security – It is often a good practice to put servers and key infrastructure in their own VLAN, isolating them from the general broadcast traffic and enabling greater protection.
- Grouping devices by traffic types – As discussed in this How To, VoIP quality is improved by isolating VoIP devices to their own VLAN. Other traffic types may also warrant their own VLAN. Traffic types include network management traffic, IP multicast traffic such as video, file and print services, email, Internet browsing, database access, shared network applications, and traffic generated by peer-to-peer applications.
- Grouping devices geographically – In a network with limited trunking, it may be beneficial to combine the devices in each location into their own VLAN.
In your case (if I got it right), it looks like they are grouping VLANs by "Department" (teachers, labs, classrooms...).
Here you may find some useful information regarding network implementation in an educational environment (Chapter 3): http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Education/SchoolsSRA_DG/SchoolsSRA-DG.pdf
Best Answer
The Ethernet standard officially allows a bit error rate (BER) of 10^-12, while in practice hardware achieves a much better BER than the standard demands.
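To put that BER into the same units as an interface error threshold, here is a rough sketch that converts a bit error rate into the probability that a single frame is corrupted. It assumes independent bit errors and a 1500-byte frame; both are simplifications, and the function name is only illustrative.

```python
import math


def frame_error_probability(ber: float, frame_bytes: int = 1500) -> float:
    """Probability that at least one bit in a frame is corrupted,
    assuming independent bit errors: 1 - (1 - ber) ** bits.

    log1p/expm1 keep precision when ber is tiny.
    """
    bits = frame_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


p = frame_error_probability(1e-12)
print(f"frame error probability: {p:.3e}")
print(f"as a percentage of frames: {100 * p:.3e} %")
```

Under these assumptions a link that only just meets 10^-12 corrupts on the order of one frame in 10^8, which is far below percentage-level thresholds such as 0.02% or 1%.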
You should also be able to search for 'SQA' (Service Quality Assurance) or 'SLA' (Service Level Agreement) documents. Some companies publish them, so you can check what your competitors are offering and offer something at that level.
Our SQA tells customers that 0.02% is a minor fault (we will fix it if a ticket is opened). I think that is quite a large packet loss for a fibre connection, but the same SQA also covers DSL, so we didn't want to be too aggressive with it. So far this has been sufficient for customers, but we are prepared to reduce the number if it is hurting sales.
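A threshold like this can be applied to counter deltas between two polls. The sketch below assumes you already have the change in error and packet counters over one polling interval; the function names are hypothetical. The 0.02% minor level is the SQA figure above, and the 1% critical level is only the unjustified number from the question, used here as a placeholder.

```python
def error_rate_percent(errors_delta: int, packets_delta: int) -> float:
    """Errors as a percentage of packets seen during one polling interval."""
    if packets_delta <= 0:
        return 0.0  # no traffic in the interval, so no meaningful rate
    return 100.0 * errors_delta / packets_delta


def classify(rate_percent: float, minor: float = 0.02, critical: float = 1.0) -> str:
    """Map an error percentage onto a severity label.

    minor and critical are placeholder thresholds, not a standard.
    """
    if rate_percent >= critical:
        return "critical"
    if rate_percent >= minor:
        return "minor"
    return "ok"


rate = error_rate_percent(errors_delta=3, packets_delta=10_000)
print(f"{rate:.3f}% -> {classify(rate)}")
```

Working from deltas rather than lifetime counters matters, because a burst of errors last month would otherwise keep the ratio elevated indefinitely.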
Several tools that are easy to find online let you check how much packet loss hurts TCP, which can be useful information when deciding what loss is acceptable for your application or product.
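As a rough illustration of what such tools compute, the widely used Mathis approximation bounds steady-state TCP throughput by (MSS / RTT) * C / sqrt(p), where p is the loss probability and C is about sqrt(3/2). The sketch below is not taken from any particular tool; the function name, the 1460-byte MSS and the 50 ms RTT are assumptions chosen for the example.

```python
import math


def mathis_throughput_bps(loss: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    """Approximate upper bound on a single TCP flow's throughput in bit/s,
    using the Mathis approximation with C = sqrt(3/2)."""
    if loss <= 0:
        return math.inf  # the approximation gives no bound without loss
    c = math.sqrt(1.5)
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss)


# Compare the 0.02% SQA figure with the 1% figure from the question.
for loss in (0.0002, 0.01):
    mbps = mathis_throughput_bps(loss, rtt_s=0.05) / 1e6
    print(f"loss {loss:.2%}: about {mbps:.1f} Mbit/s per flow")
```

The bound scales with 1/sqrt(p), so going from 0.02% to 1% loss cuts the achievable per-flow throughput by roughly a factor of seven at the same RTT.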