I am only directly aware of one issue with BFD, which is CPU demand. I am currently investigating an issue with a Cisco 7301: when it is pushing more traffic during our peak hours, compared to the rest of the day, BFD sometimes times out and routing fails over to the next link.
It seems that under high traffic volumes the router's CPU usage rises (which isn't unusual), but at about 40-50% CPU the BFD packets aren't getting enough resources to be processed in time.
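To confirm that it is CPU starvation rather than genuine link loss, a few IOS show commands make a useful first check (no configuration change needed; the exact output varies by platform and release):

    show processes cpu sorted
    show processes cpu history
    show bfd neighbors details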
However, I have found the following information, which suggests additional issues with BFD (from this NANOG presentation; there is more in the presentation, it's a good one, give it a read!):
What are the caveats?
- Two main ones:
  - BFD can have high resource demands depending on your scale.
  - BFD is not visible to Layer 2 bundling protocols (Ethernet LAGs or POS bundles).
BFD Resource Demands
- The number of BFD sessions on each linecard or router can impact how well BFD scales for you.
  - Each unique platform has its own limits.
  - Bundled interfaces supporting min tx/rx of 250ms or 2 seconds have been seen.
  - In some cases, BFD instances on a router may need to be operated on the route-processor depending on the implementation (non-adjacency based BFD sessions).
- Test your platform first before deploying BFD. Attempt to put load on the RP or LC CPU with your configured settings. This can be done by:
  - Executing CPU-heavy commands
  - Flooding packets that TTL-expire on the destination
BFD Resource Demands (cont’d)
- What values are safe to try?
  - Based upon speaking to several operators, 300ms with a multiplier of 3 (900ms detection) appears to be a safe value that works on most equipment fairly well.
  - This is a significant improvement over some of the alternatives.
BFD and L2 link-bundling
- BFD is unaware of underlying L2 link bundle members.
  - A 4x10GigE L2 bundle (802.3ad) would appear as a single L3 adjacency. BFD packets would be transmitted on a single member link, rather than out all 4 links.
  - A failure of the link with BFD on it would result in the entire L3 adjacency failing.
  - However, in some scenarios the failed member link may result in only a single BFD packet being dropped. Subsequent packets may route over working member links.
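For reference, here is a minimal sketch of what the 300ms / multiplier 3 timers from the presentation look like on an IOS interface running OSPF, plus the rough JunOS equivalent (the interface names and the choice of OSPF are just examples, not from the presentation):

    ! IOS: 300ms tx/rx, multiplier 3 => roughly 900ms detection time
    interface GigabitEthernet0/1
     bfd interval 300 min_rx 300 multiplier 3
     ip ospf bfd

    # JunOS equivalent, configured under the IGP
    set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 bfd-liveness-detection minimum-interval 300
    set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 bfd-liveness-detection multiplier 3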
The difference between aggregate labels and normal labels is that a normal label points directly to L2 rewrite details (an outgoing interface and L2 address). This means a packet with a normal label will be label-switched by the egress PE node straight out of the interface, without doing an IP lookup.
Conversely, an aggregate label can potentially represent many different egress options, so no L2 rewrite information is associated with the label itself. This means the egress PE node must perform an IP lookup on the packet to determine the appropriate L2 rewrite information.
Typical reasons why you might have an aggregate label instead of a normal label are:
- Need to perform neighbor discovery (IPv4 ARP, IPv6 ND)
- Need to perform an ACL lookup (egress ACL on the customer interface)
- Running the whole VRF under a single label (table-label)
Some of these restrictions (particularly the second) do not apply to all platforms.
Traceroute is affected in an MPLS VPN environment because the transit P node, when generating the TTL-exceeded message, does not know how to return it (it has no routing table entry for the sender). So a transit P node will send the TTL-exceeded message with the original label stack all the way to the egress PE node, in the hope that the egress PE node knows how to return the TTL-exceeded message to the sender.
This feature is automatically enabled in Cisco IOS but requires 'icmp-tunneling' to be configured in Juniper JunOS.
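In JunOS that is a one-liner under the MPLS stanza (assuming MPLS is already enabled on the box):

    set protocols mpls icmp-tunneling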
Based on this, I would suspect that perhaps your CE devices are not accepting packets whose source address is from a P node link network, and as they do not accept the ICMP message, it never makes it back to the sender.
A possible way to test this theory would be to enable a per-VRF label:
- IOS: mpls label mode all-vrfs protocol bgp-vpnv4 per-vrf
- JunOS: set routing-instances FOO vrf-table-label
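If you want to confirm that the label allocation changed, something along these lines should show a single aggregate/table label for the VRF (commands vary slightly by platform; 'FOO' is the example VRF name from above):

    IOS:   show mpls forwarding-table vrf FOO
    JunOS: show route table mpls.0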
Generally speaking I do not recommend propagating TTL, especially in a VPN environment; at least in our case customers get confused and anxious about it. They wonder why foreign addresses are showing up inside their VPN.
Another thing which confuses people, causing them to open a support ticket, is when they run a traceroute from, say, the UK to the US and see >100ms of latency between two core routers in the UK, not realizing that the whole path shows the same latency all the way to the west coast of the US, because the TTL-exceeded replies all take a detour from there.
This issue is mostly unfixable by design; however, in IOS you can set at most how many labels to pop ('mpls ip ttl-expiration pop N') when generating the TTL-exceeded message. This gives you a somewhat decent approximation if INET == 1 label and VPN == >1 label, so you can configure it so that VPN traffic is tunnelled and INET traffic gets returned directly without the egress PE node detour. But as I said, this is just an approximation of the desired functionality, as features like in-transit repairs may cause your label stack not always to be the same size for the same service.
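If you do decide to turn TTL propagation off, the knobs look roughly like this (a sketch only; check the behaviour of the 'forwarded' keyword on your release, it limits the change to forwarded/customer traffic while router-originated traceroutes still see the core):
- IOS: no mpls ip propagate-ttl forwarded (optionally combined with 'mpls ip ttl-expiration pop 1' as described above)
- JunOS: set protocols mpls no-propagate-ttl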
BGP-LU is used where you need to join multiple networks together (e.g. networks running distinct IGPs) while still being able to provide a transport label between any two nodes.
A couple of use cases that come to mind:
- Large cellular backhaul network - may have tens of thousands of base stations. It wouldn't be feasible to have these all participate in a single (say) OSPF area and distribute link-nets and loopbacks. Using BGP-LU you could break it up into regional IGPs, but still establish end-to-end LSPs across regions to a centralised head-end.
- Merging two existing networks together - you're a large ISP running OSPF as your IGP, and you acquire another large ISP running IS-IS as its IGP. With BGP-LU you can create LSPs across these two networks without having to perform some unholy route redistribution between two link-state protocols.
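To make this concrete, here is a minimal sketch of enabling BGP labeled-unicast (RFC 3107) towards one neighbor on each side; the addresses, AS number and group name are made-up examples:

    ! IOS: advertise the loopback with a label to the BGP neighbor
    router bgp 65000
     neighbor 192.0.2.1 remote-as 65000
     neighbor 192.0.2.1 update-source Loopback0
     address-family ipv4
      neighbor 192.0.2.1 activate
      neighbor 192.0.2.1 send-label
      network 198.51.100.1 mask 255.255.255.255

    # JunOS: enable the labeled-unicast family on the BGP group
    set protocols bgp group BORDER family inet labeled-unicast
    set protocols bgp group BORDER neighbor 192.0.2.1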