Cisco – Lowest Possible Failover Latency

cisco, failover, layer2, layer3

Let’s say I have two networking devices (either L2 switches, L3 switches, or routers), and the two devices have two ethernet connections between them: port 1 to port 1, and port 2 to port 2. If one of the connections is broken, I want the traffic to fail over with the lowest possible latency (less than 1ms) to the other connection. What method of failover would result in the lowest possible latency failover? A layer 2 protocol like RSTP, a layer 3 protocol like OSPF with hello timers tuned waaaay down, or something else?

Best Answer

Failover time is going to be dictated by two chief factors:

1.) Time to detection - How quickly the connected systems can determine a link has failed.

2.) Time to convergence - Once the failure is detected, how quickly traffic can be redirected to a working path.

Honestly, the first item (failure detection) is harder to get right and ends up adding the most time to the process. Loss of physical link is clearly an easy indicator, but it's important to remember that this may only be seen in one direction (e.g. one strand of a fiber pair fails but not the other) and that any kind of intervening repeater may actually mask the behavior.

To address this, the best mechanisms tend to use some kind of lightweight echo mechanism - constant OAM traffic of some sort moving back and forth between connected peers to validate that data is actually passing. Synchronous communication links tend to incorporate this into the basic protocol pretty effectively (SONET being a great example), but we've had to graft technologies on to accommodate higher-level protocols - either in the form of protocol hellos/keep-alives (as found in routing protocols) or, more recently, with lightweight protocols like bidirectional forwarding detection (BFD), which was mentioned in a recent question. The idea with BFD is a common keepalive mechanism (again - very lightweight) operating at a high frequency (usually in the low hundreds of milliseconds), often with hardware assist.
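As a rough illustration, a BFD session with a 50 ms transmit/receive interval and a multiplier of 3 declares the link dead after three missed packets, i.e. ~150 ms. A hypothetical IOS-style sketch (interface name, addressing, and OSPF process number are assumptions, and the exact syntax varies by platform) might look like:

```
! Sketch only: BFD sending control packets every 50 ms; three
! consecutive missed packets (~150 ms) tear the session down.
interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.252
 bfd interval 50 min_rx 50 multiplier 3
!
router ospf 1
 ! Register OSPF as a BFD client so a BFD session failure
 ! immediately drops the OSPF adjacency on that interface.
 bfd all-interfaces
```

The point is that the routing protocol keeps its own (slow) timers but inherits BFD's fast detection.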

The second part (reconvergence) has a lot of other issues associated with it, but its difficulty tends to be directly proportional to the width of the network. For example - reconverging connectivity between two switches or routers with a pair of redundant links is trivial. Finding an alternate path on a complex international network with thousands of network devices? A whole other ballgame. This, incidentally, is where SONET's gold standard of 50ms comes from: APS calls for each node to have an alternate path already hot and ready to receive failed traffic.

So - to answer the question... The best possible case is one where a link fails quickly and completely (i.e. someone snips a cable). This delivers immediate results to both connected devices and, in practice, you're not going to see a whole lot of difference between removing one of a pair of equal-cost routes from an L3 forwarding table versus updating the hash tables in an L2 port channel.
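For the two-links-between-two-devices case in the question, the L2 version of "remove the failed path" is just bundling both links into a port channel, so a detected failure only rehashes flows onto the surviving member. A hypothetical IOS-style sketch (interface names assumed):

```
! Sketch only: bundle the two back-to-back links with LACP so the
! hardware rehashes traffic onto the surviving link on failure.
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode active
!
interface Port-channel1
 switchport mode trunk
```

On link-down, the rehash itself is effectively instantaneous; as noted above, detection is the part that costs time.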

That said - if you're running an L2 port channel without a protocol to detect a link failure and one link happens to go unidirectional, then you might well hit a situation where some portion of traffic is silently dropped on an indefinite basis (i.e. no recovery). If you're relying on LACP or UDLD to pick up this condition, then it may take single-digit or tens of seconds to detect (depending on how the protocols are configured). A stock configuration of OSPF is going to take 40 seconds (a dead interval of four consecutive missed 10-second hellos) to mark a link as failed. A vanilla BGP connection on some implementations could easily be 3+ minutes (a 180-second hold timer). If you add BFD to any of these protocols (LACP / OSPF / BGP), then detection time could be as quick as ~150-200 milliseconds, but in actual practice is probably more like 300ms in the real world.
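If BFD isn't available, the per-protocol timers above can themselves be tuned down. The following are hypothetical IOS-style sketches (interface names assumed; support varies by platform) of two common knobs:

```
! Sketch only: OSPF "fast hello" - sub-second hellos with the dead
! interval pinned to 1 second (here, four 250 ms hellos).
interface GigabitEthernet0/1
 ip ospf dead-interval minimal hello-multiplier 4
!
! Sketch only: LACP fast rate - request LACPDUs every 1 s instead of
! every 30 s, cutting worst-case detection from ~90 s to ~3 s.
interface GigabitEthernet0/2
 lacp rate fast
```

Either gets you into the single-digit-second (or just under one second) range, which is still an order of magnitude slower than BFD.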

So is 1ms consistently possible under all conditions on common hardware? Probably not, unless you've got hardware capable of reliably sending, receiving and processing OAM traffic at double-digit nanosecond precision (and there is a whole rat's nest of issues keeping such mechanisms stable). The real question tends to be figuring out the convergence speed that makes the most sense given the protocols running over the link. For standard Ethernet and IP, getting into the < 250-300 ms range (from actual failure to full recovery) under any circumstances - with low double-digit milliseconds under common circumstances - has proven more than sufficient.
