Packet Loss – CRC Errors and ICMP Packet Loss Correlation

ethernet, icmp, ip, packet-loss, ping

I have a question about using ping to detect physical problems on a link.

The assumption is as follows: I have a fiber or copper cable with a lot of noise due to bad hardware (a bad cable or a bad transceiver, for example), so this cable or fiber will statistically corrupt some percentage X of Ethernet frames, producing CRC errors. Is that correct?

So, could you confirm this point or tell me if I'm wrong: with a large ping (a 65,000-byte packet, for example), one ping generates roughly 65000 / 1480 ≈ 44 frames as IP fragments (each fragment carries at most MTU − 20 bytes of payload). Since losing any one fragment normally loses the entire IP packet, the probability of ICMP packet loss should be higher with a large ping than with a small ping?
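To make the arithmetic concrete, here is a minimal sketch (assuming each Ethernet frame is corrupted, and therefore dropped, independently with probability p; the function name and numbers are illustrative):

```python
# Probability that a fragmented ICMP echo is lost, assuming each Ethernet
# frame is corrupted (and therefore dropped) independently with probability p.
def echo_loss_probability(p: float, payload_bytes: int, mtu: int = 1500) -> float:
    ip_header = 20                                       # IPv4 header per fragment
    fragments = -(-payload_bytes // (mtu - ip_header))   # ceiling division
    return 1 - (1 - p) ** fragments                      # echo lost if any fragment is lost

# Example: a per-frame CRC error rate of 0.1%
for size in (56, 1472, 65000):
    print(size, round(echo_loss_probability(0.001, size), 4))
# A 65,000-byte ping (about 44 fragments) is roughly 44x more likely to be
# reported as lost than a single-frame ping, for small p.
```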

So the overall question is: even though ping and ICMP are layer-3 tools, can a large ping make it easier to detect a physical problem on a link?

Best Answer

Sending a good ten minutes of 0-interval, MTU-sized, DF pings with contents 0x0000, and a second test with contents 0xffff, is an excellent way to apply some stress to simple transmission technologies. Lost packets -- or overly delayed packets after the first few -- are a clear indication that further investigation is required. It's also a good moment to check that the reported round-trip time is reasonable (it's very easy for a transmission provider to provision a circuit which crosses the country and back rather than crossing the city).
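As a rough illustration of driving that test from a Linux host (assuming iputils ping, which supports `-f`, `-s`, `-M do`, `-p` and `-w`; flood mode needs root, and the target address here is a placeholder):

```python
# Run the ten-minute stress test described above, once per payload pattern.
import subprocess

TARGET = "192.0.2.1"   # placeholder address for the far end of the link

def stress_ping(pattern: str) -> None:
    subprocess.run([
        "ping", TARGET,
        "-f",            # flood: effectively 0-interval sending
        "-s", "1472",    # 1472-byte payload -> 1500-byte IP packet on a 1500-MTU link
        "-M", "do",      # set DF so nothing silently fragments en route
        "-p", pattern,   # fill the payload with this byte pattern
        "-w", "600",     # stop after ten minutes
        "-q",            # summary only: sent/received/loss and RTT min/avg/max
    ], check=False)

stress_ping("0000")   # all-zeros payload
stress_ping("ffff")   # all-ones payload
```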

Ping is great for finding faults. However, ping alone isn't a great acceptance test for being sure there are no faults. The rest of this answer explains why.

As part of the ping test you should connect to each of the network elements on the path (hosts, switches, routers) and record the interface traffic and error counters before the start and after the end of the test. Rising error counters of any type require further investigation. Don't ignore small rises in error counters: even a low rate of loss will devastate TCP performance.
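For a Linux end host, a before/after snapshot might look like the sketch below (assuming the standard sysfs statistics files; switches and routers keep their own equivalent counters in their CLIs):

```python
# Snapshot a Linux interface's error counters before and after the test and
# report any that rose while it ran.
from pathlib import Path

COUNTERS = ["rx_errors", "rx_crc_errors", "rx_dropped",
            "tx_errors", "tx_dropped", "collisions"]

def snapshot(iface: str) -> dict:
    stats = Path("/sys/class/net") / iface / "statistics"
    return {c: int((stats / c).read_text()) for c in COUNTERS}

before = snapshot("eth0")
# ... run the ping test here ...
after = snapshot("eth0")

for counter in COUNTERS:
    delta = after[counter] - before[counter]
    if delta:
        print(f"{counter} rose by {delta} during the test -- investigate")
```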

This still isn't enough to say that the link is acceptable. Take 1000Base-LX, Ethernet over single-mode fiber. It's possible that the light level at the receiver is below the specification for that transceiver model, but our particular transceiver happens to be an above-average sample, so all appears well. Then that transceiver fails and we replace it with a below-average-but-within-specification sample: the link cannot be returned to service even though we have fixed the fault. So as part of acceptance testing we need to check that light levels are within specification at both ends, and that there is a viable power budget at the worst-case extremes of both the transmitting and receiving transceivers' performance. (To make this easy, manufacturers give their SFPs nominal ranges for which they have done the power budget calculations, such as 10 km for 1000Base-LX/LH; but for any link longer than that you should do your own power budget: five minutes of arithmetic can save hundreds of dollars by letting you safely buy a lower-power SFP.) Many SFPs support digital optical monitoring ("DOM"), which lets you read the receive light level from the device's command line.
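The five minutes of arithmetic is roughly this; the numbers below are purely illustrative and should come from the SFP datasheets and the fibre plant records:

```python
# Worst-case power budget check for an optical link (illustrative figures).
tx_power_min_dbm   = -9.5    # worst-case transmit power for the chosen SFP
rx_sensitivity_dbm = -21.0   # worst-case receive sensitivity for the far SFP

link_km        = 15
fibre_loss_db  = link_km * 0.35   # ~0.35 dB/km for 1310 nm single-mode
connector_loss = 2 * 0.5          # patch-panel connectors, 0.5 dB each
splice_loss    = 6 * 0.1          # fusion splices, 0.1 dB each
ageing_margin  = 3.0              # safety margin for ageing and repairs

budget = tx_power_min_dbm - rx_sensitivity_dbm
loss   = fibre_loss_db + connector_loss + splice_loss + ageing_margin
print(f"budget {budget:.1f} dB, worst-case loss {loss:.1f} dB, "
      f"margin {budget - loss:.1f} dB")   # the margin must stay positive
```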

More complex transmission technologies have forward error correction. The link appears to work despite a high raw transmission error rate, but if errors become more frequent or more sustained the FEC is overwhelmed and the link passes rubbish. So for these links we are very interested in the counts of corrected errors. Interpreting those FEC counters requires understanding the physical transmission, as we're now low enough in the "stack" that we can no longer pretend the medium is naturally free of errors. But even on these systems a simple ping test applies enough stress to give initial results.

Finally, be aware that PCs are a cheap but imperfect test platform, so sometimes packet drops are caused by the end systems rather than the transmission. This can be a simple IP-layer issue (such as an MTU inconsistent with the subnet, always a possibility on backbone links that should be running with an MTU greater than 9000) or a host performance issue (particularly above 10 Gbps). The cost of "real" Ethernet test platforms is extraordinarily high because you're paying for those issues to have been fully sorted out in hardware or clever software (e.g., running within the NIC).
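One quick sanity check for the MTU case is a DF-ping sweep across a few frame sizes, sketched below (again assuming iputils ping and an illustrative target address):

```python
# Confirm which packet sizes actually survive the path end-to-end with DF set;
# useful when a backbone is supposed to carry jumbo frames but one hop is
# quietly configured smaller.
import subprocess

TARGET = "192.0.2.1"   # placeholder far-end address

def df_ping_ok(payload: int) -> bool:
    """One DF ping with the given payload size; True if a reply came back."""
    result = subprocess.run(
        ["ping", TARGET, "-c", "1", "-W", "2", "-M", "do", "-s", str(payload)],
        capture_output=True)
    return result.returncode == 0

for payload in (1472, 4472, 8972):        # 1500 / 4500 / 9000-byte IP packets
    print(payload + 28, "byte packets:", "pass" if df_ping_ok(payload) else "fail")
```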
