Packet Loss – Acceptable Levels for Public Internet Traffic

packet-loss routing tcp troubleshooting

The context: an Internet facing web site, streaming near real-time market prices. Paying customers can be anywhere on the Internet.

The issue: Customers can experience total packet loss for 10s, 30s and more. This brings the TCP connection delivering the prices to a halt, thus rendering the website useless. All connectivity from the data center to the peering points has been sized correctly, and has capacity to spare.

It has been asserted that this is commonplace on the public Internet. My gut says no, and all of my experience as a consumer is to the contrary, but I have no hard facts to draw upon.

Presuming that packet loss is caused by congestion, when a route becomes congested surely the router holding that route will choose an alternate, and furthermore will do so more rapidly than 10s?

Finally, is there any feedback available when a route congests?

M.

Best Answer

The issue: Customers can experience total packet loss for 10s, 30s and more. This brings the TCP connection delivering the prices to a halt, thus rendering the website useless. All connectivity from the data center to the peering points has been sized correctly, and has capacity to spare.

It has been asserted that this is commonplace on the public Internet. My gut says no, and all of my experience as a consumer is to the contrary, but I have no hard facts to draw upon.

Your gut instinct is correct. Speaking as a guy who has built networks for the last 20 years, you should never blindly accept significant ongoing packet loss on a wired network, and do not let any equipment vendor or ISP tell you otherwise. Regular 10 to 30 second outages are unacceptable; if you have a choice, let your upstream know that you will start considering your options unless they take the problem seriously. The sad reality is that bandwidth isn't free, ISPs' profit margins are usually quite low, and some providers will do everything they can to save money even if it means unhappy customers (crazy as that may sound).

Assuming I'm sending basic TCP traffic, the most packet loss I will accept for a public internet service (where you expect some loss from occasional over-subscription) is 0.10% over a 24 hour period, and no more than 0.50% over any 5 minute period (see Note 1).

I tolerate no packet loss (0.0%) inside our wired corporate network (see Note 2). Unless the packet drops are expected, corporate network engineers should upgrade their circuits (or LAN) if they start seeing packet loss.
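
To make those two thresholds concrete, here is a minimal sketch of how they could be checked against a day of probe results. The input format (one `(sent, lost)` pair per minute) and the function names are assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from collections import deque

DAILY_LIMIT = 0.0010    # 0.10% over a 24 hour period
WINDOW_LIMIT = 0.0050   # 0.50% over any 5 minute period
WINDOW_MINUTES = 5


def loss_ratio(sent, lost):
    """Fraction of probes lost; 0.0 if nothing was sent."""
    return lost / sent if sent else 0.0


def check_loss(samples):
    """samples: list of (sent, lost) probe counts, one entry per minute (hypothetical format)."""
    total_sent = sum(s for s, _ in samples)
    total_lost = sum(l for _, l in samples)
    daily = loss_ratio(total_sent, total_lost)

    # Slide a 5-sample window over the day and keep the worst loss ratio seen.
    worst = 0.0
    window = deque(maxlen=WINDOW_MINUTES)
    for sample in samples:
        window.append(sample)
        if len(window) == WINDOW_MINUTES:
            w_sent = sum(s for s, _ in window)
            w_lost = sum(l for _, l in window)
            worst = max(worst, loss_ratio(w_sent, w_lost))

    return {
        "daily_loss": daily,
        "worst_window_loss": worst,
        "daily_ok": daily <= DAILY_LIMIT,
        "window_ok": worst <= WINDOW_LIMIT,
    }


if __name__ == "__main__":
    # One probe per second for 24 hours, with a single 30 second total outage.
    day = [(60, 0)] * 1440
    day[100] = (60, 30)
    print(check_loss(day))
```

With one probe per second, a single 30 second blackout stays under the 24 hour limit but blows well past the 5 minute limit, which is why the short-window number is the one that catches outages like the ones described in the question.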

Presuming that packet loss is caused by congestion, when a route becomes congested surely the router holding that route will choose an alternate, and furthermore will do so more rapidly than 10s?

By default, routing protocols like OSPF or BGP won't re-route based on congestion unless the congestion makes them drop their routing protocol packets long enough to declare the neighbor down (on the order of 40 seconds to 120 seconds with default timers); the tricky issue here is that routing protocol packets are prioritized higher than other traffic in many vendors' equipment. That means the circuit has to be really congested for that whole interval before the routing adjacency drops and traffic is re-routed.
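
As a rough illustration of why a 10 to 30 second burst of congestion does not trigger a re-route, here is a toy model of a hello/dead-timer check. It assumes a 10 second hello interval and a 40 second dead interval (common OSPF defaults on broadcast networks) and that the neighbor was last heard at time 0; the function name and the simplified timing are assumptions for illustration, not any vendor's implementation.

```python
def seconds_until_neighbor_down(hello_delivered, hello_interval=10, dead_interval=40):
    """Return the time (in seconds) at which the neighbor is declared down, or None.

    hello_delivered: iterable of booleans, one per hello interval, where
    False means that hello was lost to congestion (hypothetical input).
    """
    last_heard = 0  # assume a hello was received at time 0
    for i, delivered in enumerate(hello_delivered, start=1):
        now = i * hello_interval
        if delivered:
            last_heard = now
        elif now - last_heard >= dead_interval:
            # Nothing heard for a full dead interval: the adjacency drops here.
            return now
    return None


if __name__ == "__main__":
    # 30 seconds of total loss, then hellos get through again: no re-route.
    print(seconds_until_neighbor_down([False, False, False, True, True]))
    # 40 seconds of total loss: the neighbor is finally declared down.
    print(seconds_until_neighbor_down([False, False, False, False]))
```

In this model a single hello that squeezes through resets the clock, and prioritized routing protocol packets are the ones most likely to squeeze through.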

Finally, is there any feedback available when a route congests?

There are technical tools for managing which upstream provider you choose for a given route (assuming you have multiple upstreams); consider Cisco's PfR (Performance Routing) as just one possible way to work around this problem.


Note 1: While I consider 0.5% packet loss over 5 minutes enough to notice, I wouldn't make a big deal out of it. This threshold is just a number where you can concede that there might start to be a noticeable TCP performance problem; 0.5% loss on its own probably isn't going to get your attention. However, if the packet loss is both regular and bad enough to notice (as in your case), insist that your provider fix the problem (as long as the loss is in their network).
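
For a feel of what "noticeable" can mean, one commonly used rough estimate is the Mathis et al. approximation for steady-state TCP throughput under random loss: rate ≈ (MSS / RTT) × (C / sqrt(p)), with C ≈ 1.22. The sketch below is only that approximation; the 1460 byte MSS and 50 ms RTT are assumed values for illustration, and real connections (and modern congestion control algorithms) can behave quite differently.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_ratio):
    """Approximate upper bound on TCP throughput in bits/s (Mathis et al. model)."""
    c = math.sqrt(3 / 2)  # constant from the model, roughly 1.22
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_ratio))


if __name__ == "__main__":
    # Assumed path: 1460 byte MSS, 50 ms round-trip time.
    for loss in (0.001, 0.005):
        mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
        print(f"loss {loss:.2%}: about {mbps:.1f} Mbit/s per connection")
```

Under this model, throughput falls with the square root of the loss rate, so going from 0.1% to 0.5% loss costs a single connection a bit more than half of its achievable rate on the same path.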

Note 2: In my experience, some switches simply cannot operate with zero packet drops when multiple users are on the switch. At this point you have choices: A) Forklift all the existing switches (if the packet drop problem is big enough), B) Wait for a patch from the vendor to fix the issue, or C) Accept the packet drops as a "feature" of the switch. By way of example, I have faced this choice with the Cisco 3850's shared-buffer architecture.