First, I'm adding this question and answering it myself because I could not find this behavior documented anywhere. Hopefully it will help someone.
Problem:
We use auto bandwidth to handle the bandwidth subscriptions for our LSPs. The LSPs are equal cost and appear in our forwarding/routing tables appropriately as available next hops for each destination.
However, for a single destination, the 4 equal-cost LSPs are not load balancing equally (or even close to equally). We understand that JUNOS uses a per-flow load-balancing algorithm despite the "per-packet" statement in the policy that enables load balancing. But that does not explain the major difference between the subscriptions of the LSPs (this subscription imbalance happens multiple times per day; it is not a one-off occurrence), like so:
jhead@R1> show route protocol rsvp 1.1.1.1 detail
1.1.1.1/32 (2 entries, 1 announced)
State: <FlashAll>
*RSVP Preference: 7/1
Next hop: 192.168.1.1 via xe-0/0/0.0 weight 0x1 balance 35%, selected
Label-switched-path LSP1
Next hop: 192.168.1.2 via xe-1/0/0.0 weight 0x1 balance 35%
Label-switched-path LSP2
Next hop: 192.168.1.3 via xe-0/0/1.0 weight 0x1 balance 26%
Label-switched-path LSP3
Next hop: 192.168.1.4 via xe-0/0/0.0 weight 0x1 balance 5%
Label-switched-path LSP4
R1-R4 are MX480s and CORE-R1-R4 are MX960s.
Below are graphs comparing the RSVP subscription and the utilization of each LSP. Red is subscription, green is utilization. You can see that the utilization follows the reservation almost exactly throughout the day. We would expect the subscriptions of LSPs toward the same destination to be very close to each other.
Topology:
R1-R4 are the ingress routers for all of the LSPs; each has either 2 or 4 LSPs toward each core router.
Configuration:
The LSP configuration below is for a single destination from R1, just as an example. All LSPs are configured exactly the same way (again, with either 2 or 4 per destination).
[edit protocols mpls]
statistics {
file mpls-stats;
interval 300;
auto-bandwidth;
}
traffic-engineering bgp;
label-switched-path LSP1 {
to 1.1.1.1;
optimize-timer 300;
auto-bandwidth {
adjust-interval 7200;
adjust-threshold 10;
minimum-bandwidth 100m;
maximum-bandwidth 4g;
adjust-threshold-overflow-limit 2;
adjust-threshold-underflow-limit 4;
}
primary primary-loose;
}
label-switched-path LSP2 {
to 1.1.1.1;
optimize-timer 300;
auto-bandwidth {
adjust-interval 7200;
adjust-threshold 10;
minimum-bandwidth 100m;
maximum-bandwidth 4g;
adjust-threshold-overflow-limit 2;
adjust-threshold-underflow-limit 4;
}
primary primary-loose;
}
label-switched-path LSP3 {
to 1.1.1.1;
optimize-timer 300;
auto-bandwidth {
adjust-interval 7200;
adjust-threshold 10;
minimum-bandwidth 100m;
maximum-bandwidth 4g;
adjust-threshold-overflow-limit 2;
adjust-threshold-underflow-limit 4;
}
primary primary-loose;
}
label-switched-path LSP4 {
to 1.1.1.1;
optimize-timer 300;
auto-bandwidth {
adjust-interval 7200;
adjust-threshold 10;
minimum-bandwidth 100m;
maximum-bandwidth 4g;
adjust-threshold-overflow-limit 2;
adjust-threshold-underflow-limit 4;
}
primary primary-loose;
}
[edit protocols rsvp]
load-balance bandwidth;
interface xe-0/0/0.0 {
bandwidth 9g;
}
interface xe-0/0/1.0 {
bandwidth 9g;
}
interface xe-1/0/0.0 {
bandwidth 9g;
}
[edit routing-options forwarding-table]
export load-balance;
Best Answer
The problem is the "load-balance bandwidth" statement under [edit protocols rsvp].
If you look at the Juniper documentation for Unequal Cost Load Balancing RSVP LSPs, it states:
This implies that, regardless of that feature being configured, no unequal-cost load balancing will happen if you do not statically set a bandwidth value on an individual LSP (the "bandwidth" statement under that label-switched-path).
However, auto-bandwidth does in fact count as setting a bandwidth value, despite it not being present in the configuration.
When auto-bandwidth is enabled, RPD begins monitoring bandwidth consumption. It assigns bandwidth values based on utilization, and the "load-balance bandwidth" statement in RSVP then immediately tries to keep the traffic ratios in line with those subscriptions (35, 35, 26, and 5 percent respectively). The problem is that this never gives auto-bandwidth the chance to even out, because the goal of "load-balance bandwidth" is to keep the traffic as close to those ratios as possible. That behavior makes sense when the ratios are deliberately set to something like 10, 30, 20, 40.
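As a rough sketch of the arithmetic (not Juniper's implementation), the balance percentages behave like each LSP's bandwidth divided by the total across the equal-cost LSPs. The Python below assumes that proportional rule. The function name and the reservation values are hypothetical; the values were picked only because they round to the percentages in the output from the question.

```python
def balance_shares(bandwidths_bps):
    """Return each LSP's share of traffic (in percent), assuming traffic is
    split in proportion to the LSP's reserved bandwidth."""
    total = sum(bandwidths_bps.values())
    if total == 0:
        # No bandwidth set anywhere: assume a plain equal split.
        n = len(bandwidths_bps)
        return {name: 100.0 / n for name in bandwidths_bps}
    return {name: 100.0 * bw / total for name, bw in bandwidths_bps.items()}


# Hypothetical auto-bandwidth reservations, in bits per second.
reservations = {"LSP1": 700e6, "LSP2": 700e6, "LSP3": 520e6, "LSP4": 100e6}
shares = balance_shares(reservations)
rounded = {name: round(pct) for name, pct in shares.items()}
# rounded is {"LSP1": 35, "LSP2": 35, "LSP3": 26, "LSP4": 5}
```

Note that the rounded values add up to 101, just like the percentages in the route output, which is what you would expect if each share is rounded independently.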
It is essentially a race condition between "load-balance bandwidth" and "auto-bandwidth": each one keeps reinforcing the ratios the other produced.
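Here is a toy model of that loop, under two assumptions: (1) with "load-balance bandwidth", traffic is split in proportion to the current reservations, and (2) each auto-bandwidth adjustment sets the reservation to the traffic that was just measured. It ignores adjust-threshold, the minimum/maximum bounds, and per-flow hashing, and every name and number in it is hypothetical.

```python
def simulate(reservations, total_traffic, adjustments, load_balance_bandwidth):
    """Run a number of auto-bandwidth adjustments and return the final
    reservations. Reservations are assumed to be non-zero."""
    res = dict(reservations)
    for _ in range(adjustments):
        if load_balance_bandwidth:
            # Traffic is steered to match the existing reservation ratios.
            total_res = sum(res.values())
            traffic = {n: total_traffic * r / total_res for n, r in res.items()}
        else:
            # Without the statement, assume an even split across the LSPs.
            traffic = {n: total_traffic / len(res) for n in res}
        # Auto-bandwidth: the new reservation follows the measured traffic.
        res = traffic
    return res


start = {"LSP1": 700.0, "LSP2": 700.0, "LSP3": 520.0, "LSP4": 100.0}  # Mbps
stuck = simulate(start, 2020.0, 10, load_balance_bandwidth=True)
even = simulate(start, 2020.0, 10, load_balance_bandwidth=False)
```

In this model, stuck ends up with the same uneven reservations it started with no matter how many adjustments run, while even settles at an equal share per LSP after the first adjustment. Real routers will not be this clean, but it shows why the skewed ratios never corrected themselves.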
After removing:
[edit protocols rsvp] load-balance bandwidth
Traffic adjusted (with a slight hiccup, seen below):
NOTE: This is an example from a different router that was affected by the same issue.
Since you remove the ability to load-balance (for RSVP), the PFE will reprogram to only a single path until an auto-bandwidth adjustment occurs automatically, or until you force one with the operational-mode command "request mpls lsp adjust-autobandwidth".
Below are the bandwidth adjustments for 2 LSPs with the same symptoms. The configuration change and the adjustments happened mid-day Friday, and you can see the difference in subscriptions almost immediately.