BGP – Can someone explain the BGP router graceful restart flow?

bgp

Can someone please explain the BGP graceful restart flow: which messages are exchanged between a BGP router and its peer before the restart, and which are exchanged after the session is re-established?

Best Answer

Consider the following topology R1---R2---R3.

The lines R1---R2 and R2---R3 represent both physical connections and EBGP sessions.

We are going to look at BGP UPDATEs that flow from R1 to R3, and therefore at data plane traffic that flows in the opposite direction, from R3 to R1.

In this example R2 will be restarting.

R1 ----> R2

OPEN
  Graceful restart capability => "I support graceful restart"
    Restart flags
      R = 0 => "I have not restarted"
      Restart Time = 30 => "If I restart in the future, I expect to be done in 30 seconds"
    AFI SAFI = IPv4 Unicast => "I support GR for IPv4-Unicast"
      AFI SAFI Flags
        F = 0 => "This is only relevant after a restart, so 0 for now"
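As a rough illustration of how the fields above map onto bytes, here is a minimal Python sketch that encodes the Graceful Restart capability following the RFC 4724 layout (capability code 64, a 16-bit field holding the restart flags and the 12-bit Restart Time, then one 4-byte entry per AFI/SAFI). The function name `encode_gr_capability` and the tuple format are made up for this example and are not taken from any BGP implementation.

```python
import struct

GR_CAPABILITY_CODE = 64   # Graceful Restart capability code (RFC 4724)
AFI_IPV4 = 1
SAFI_UNICAST = 1


def encode_gr_capability(restarted, restart_time, families):
    """Encode the Graceful Restart capability as code, length, value.

    families is a list of (afi, safi, forwarding_preserved) tuples.
    This helper is hypothetical and only illustrates the field layout.
    """
    if not 0 <= restart_time < 4096:
        raise ValueError("Restart Time is a 12-bit field")
    # 4 bits of restart flags (R is the most significant bit) + 12-bit Restart Time.
    value = struct.pack("!H", (0x8000 if restarted else 0) | restart_time)
    for afi, safi, forwarding_preserved in families:
        # F is the most significant bit of the per-family flags octet.
        value += struct.pack("!HBB", afi, safi, 0x80 if forwarding_preserved else 0)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value


# The capability from the OPEN above: R = 0, Restart Time = 30, IPv4 unicast, F = 0.
initial_open = encode_gr_capability(False, 30, [(AFI_IPV4, SAFI_UNICAST, False)])
print(initial_open.hex())
```

In a real OPEN message this capability would be wrapped inside the Capabilities optional parameter; that framing is left out here to keep the sketch short.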

R2 ----> R1

OPEN (similar to above)

R2 ----> R3

OPEN (similar to above)

R3 ----> R2

OPEN (similar to above)

Router R1 originates some prefixes 1.1.0.0/16 and 2.2.0.0/16

R1 ----> R2

UPDATE
  AFI-SAFI = IPv4-Unicast
  Prefix = 1.1.0.0/16, 2.2.0.0/16
  Attributes
    Next Hop = R1
    AS Path = 100
    etc.

Router R2 installs entries in the R2 FIB to forward traffic to 1.1.0.0/16 and 2.2.0.0/16 through router R1.

Router R2 propagates the BGP UPDATEs received from R1 to R3:

R2 ----> R3

UPDATE
  AFI-SAFI = IPv4-Unicast
  Prefix = 1.1.0.0/16, 2.2.0.0/16
  Attributes
    Next Hop = R2
    AS Path = 200 100
    etc.

Router R3 installs entries in the R3 FIB to forward traffic to 1.1.0.0/16 and 2.2.0.0/16 through router R2.

*** The control plane on router R2 goes down for some reason (crash, upgrade) ***

The forwarding plane on router R2:

Keeps forwarding packets using the routes that are currently installed in the FIB: traffic to 1.1.0.0/16 and 2.2.0.0/16 continues to be forwarded to router R1.
Marks the routes in the FIB as stale (the spec says that the routes in the RIB get marked as stale, but this is difficult to implement since the control plane was just blown away).

Router R3 notices that the BGP session to R2 goes down (e.g. because of a BFD timeout, a BGP KEEPALIVE (hold timer) timeout, or a link-down event).

Normally, router R3 would remove the BGP routes received from R2 (namely 1.1.0.0/16 and 2.2.0.0/16) from its RIB and from its FIB.

However, because router R2 advertised that it supports graceful restart, router R3 will

Keep the routes from R2 in its RIB
Mark the routes from R2 in its RIB as "stale"
Keep the routes from R2 in its FIB
Continue forwarding traffic for those routes in the FIB
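The difference between the normal behaviour and the graceful restart behaviour listed above can be sketched in a few lines of Python. This is a toy model, not how any router stores its RIB: the `Route` and `HelperRib` names, and the single table standing in for both RIB and FIB, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str
    peer: str
    stale: bool = False


class HelperRib:
    """Hypothetical route table of a router whose peer may restart gracefully."""

    def __init__(self):
        self.routes = {}  # (peer, prefix) -> Route

    def learn(self, peer, prefix):
        # A fresh UPDATE installs the route or clears an existing stale mark.
        self.routes[(peer, prefix)] = Route(prefix, peer)

    def on_session_down(self, peer, peer_supports_gr):
        for key in [k for k in self.routes if k[0] == peer]:
            if peer_supports_gr:
                self.routes[key].stale = True   # keep the route, keep forwarding
            else:
                del self.routes[key]            # normal BGP behaviour

    def forwarding_prefixes(self):
        # Stale routes stay installed, so they are still used for forwarding.
        return sorted(route.prefix for route in self.routes.values())


r3 = HelperRib()
r3.learn("R2", "1.1.0.0/16")
r3.learn("R2", "2.2.0.0/16")
r3.on_session_down("R2", peer_supports_gr=True)
print(r3.forwarding_prefixes(), [route.stale for route in r3.routes.values()])
```

Calling `on_session_down` with `peer_supports_gr=False` on the same table would instead leave it empty, which is the "normal" case described earlier.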

*** Topology change ***

While router R2 is down, something changes in the topology of the network.

Let's say something changes that causes router R1 to stop advertising prefix 1.1.0.0/16

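Router R1 will remove route 1.1.0.0/16 from its RIB and FIB, and it withdraws the route from all of its neighbors (BGP has no separate WITHDRAW message; a withdrawal is an UPDATE that lists the prefix in its Withdrawn Routes field).

However, R1 cannot send this withdrawal to R2 because the R1-R2 BGP session is down at this point.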

We will recover from this "getting out of sync" problem later on.

*** The control plane on router R2 comes back up ***

We assume it took less than 30 seconds (the value of the Restart Time field in the OPEN messages that router R2 sent at the beginning) for the control plane to come back up and the sessions to be re-established. If it had taken longer, router R3 would have given up and flushed the routes learned from R2 from its RIB and FIB.
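The give-up decision on R3 boils down to comparing elapsed time against the advertised Restart Time. A minimal sketch, assuming times are plain seconds and using a made-up function name:

```python
def stale_routes_must_be_flushed(session_down_at, now, restart_time, session_reestablished):
    """Return True when the helper should give up on the restarting peer.

    Times are in seconds; restart_time is the value the peer advertised
    in its earlier OPEN (30 in this example). Hypothetical helper.
    """
    if session_reestablished:
        return False
    return now - session_down_at >= restart_time


print(stale_routes_must_be_flushed(100.0, 120.0, 30, False))  # 20 s elapsed
print(stale_routes_must_be_flushed(100.0, 131.0, 30, False))  # 31 s elapsed
```

A real implementation would normally arm a timer when the session drops rather than poll, but the condition it enforces is the same.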

The BGP sessions come back up:

R1 ----> R2

OPEN
  Graceful restart capability => "I support graceful restart"
    Restart flags
      R = 0 => "I have not restarted"
      Restart Time = 30 => "If I restart in the future, I expect to be done in 30 seconds"
    AFI SAFI = IPv4 Unicast => "I support GR for IPv4-Unicast"
      AFI SAFI Flags
        F = 0 => "This is only relevant after a restart, so 0 for now"

R2 ----> R1

OPEN
  Graceful restart capability => "I support graceful restart"
    Restart flags
      R = 1 => "*** I DID RESTART ***"
      Restart Time = 30 => "If I restart in the future, I expect to be done in 30 seconds"
    AFI SAFI = IPv4 Unicast => "I support GR for IPv4-Unicast"
      AFI SAFI Flags
        F = 1 => "*** I DID PRESERVE FORWARDING STATE IN THE FIB ***"
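To make the R and F bits concrete, the following sketch decodes the value part of a Graceful Restart capability using the RFC 4724 layout. The input bytes are hand-built to match the OPEN shown above, and `decode_gr_capability` is an illustrative name, not an API of any BGP library.

```python
import struct


def decode_gr_capability(value):
    """Decode the value part of a Graceful Restart capability (RFC 4724 layout)."""
    (flags_and_time,) = struct.unpack_from("!H", value, 0)
    decoded = {
        "restarted": bool(flags_and_time & 0x8000),   # R bit
        "restart_time": flags_and_time & 0x0FFF,      # seconds
        "families": [],
    }
    # Each address family entry is 4 bytes: AFI (2), SAFI (1), flags (1).
    for offset in range(2, len(value), 4):
        afi, safi, flags = struct.unpack_from("!HBB", value, offset)
        decoded["families"].append(
            {"afi": afi, "safi": safi, "forwarding_preserved": bool(flags & 0x80)}
        )
    return decoded


# Value bytes R2 would send after its restart in this example:
# R = 1, Restart Time = 30, IPv4 unicast (AFI 1, SAFI 1), F = 1.
after_restart = decode_gr_capability(bytes.fromhex("801e00010180"))
print(after_restart)
```

A receiver that sees `restarted` set and `forwarding_preserved` set for an address family knows it may keep using the stale routes for that family while it waits for the End-of-RIB marker.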

R3 ----> R2

Similar to R1->R2 OPEN

R2 ----> R3

Similar to R2->R1 OPEN

*** Resynchronization ***

Router R2 knows it restarted, so it is NOT going to send any UPDATEs until it has received all UPDATEs from its neighbors (R1 and R3), selected the best routes, and updated its RIB and FIB.

Router R1 knows that its neighbor R2 restarted, so it is going to re-send all routes to R2 followed by an End-of-RIB marker. Once R1 in turn receives the End-of-RIB marker from R2, it flushes any routes received from R2 that are still stale from its RIB/FIB (there are not any in this example).

Similarly, router R3 knows that its neighbor R2 restarted, so it is going to re-send all routes to R2 (there are not any in this example), followed by an End-of-RIB marker. Once R3 receives the End-of-RIB marker from R2, it flushes any routes received from R2 that are still stale from its RIB/FIB.

So, let's walk through this in detail:

Router R1 originates prefix 2.2.0.0/16 (but not 1.1.0.0/16 anymore due to the topology change mentioned above):

R1 ----> R2

UPDATE
  AFI-SAFI = IPv4-Unicast
  Prefix = 2.2.0.0/16
  Attributes
    Next Hop = R1
    AS Path = 100
    etc.

Router R2 installs 2.2.0.0/16 in its RIB and in its FIB.

Router R2 already had an entry for 2.2.0.0/16 in its FIB which was marked stale. This stale marking is now removed; it is fresh again.

Router R2 does not receive a 1.1.0.0/16 UPDATE from R1. Hence, R2 does not have an entry for 1.1.0.0/16 in its RIB. But R2 does still have an entry for 1.1.0.0/16 in its FIB, which is, and remains, marked stale.

Router R1 has finished sending all routes to R2, so it sends an end-of-rib marker to R2:

R1 ----> R2

UPDATE
  AFI-SAFI = IPv4-Unicast
  End-of-RIB marker
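For IPv4 unicast, the End-of-RIB marker is simply an UPDATE message that carries no withdrawn routes, no path attributes and no NLRI, which makes it 23 bytes long. The sketch below builds and recognizes that message; the function names are invented for this example, and other address families (which use an empty MP_UNREACH_NLRI attribute instead) are not covered.

```python
import struct

BGP_MARKER = b"\xff" * 16
BGP_TYPE_UPDATE = 2


def ipv4_unicast_end_of_rib():
    """Build the End-of-RIB marker for IPv4 unicast: an empty UPDATE."""
    body = struct.pack("!HH", 0, 0)  # withdrawn routes length, path attribute length
    # The BGP header is 19 bytes: 16-byte marker, 2-byte length, 1-byte type.
    header = BGP_MARKER + struct.pack("!HB", 19 + len(body), BGP_TYPE_UPDATE)
    return header + body


def is_ipv4_unicast_end_of_rib(message):
    """True if a complete BGP message is an UPDATE with nothing in it."""
    if len(message) != 23 or message[:16] != BGP_MARKER:
        return False
    length, msg_type = struct.unpack_from("!HB", message, 16)
    return length == 23 and msg_type == BGP_TYPE_UPDATE and message[19:] == b"\x00" * 4


eor = ipv4_unicast_end_of_rib()
print(len(eor), is_ipv4_unicast_end_of_rib(eor))
```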

At this point, router R2 has received an end-of-rib marker from R1, but not yet from R3. So, it does not yet take any action (it needs to have received an end-of-rib marker from all neighbors).

Now, let's look at router R3.

In this example router R3 does not have any prefixes to send to R2, so it immediately sends an End-of-RIB marker:

R3 ----> R2

UPDATE
  AFI-SAFI = IPv4-Unicast
  End-of-RIB marker

At this point, router R2 has received End-of-RIB markers from all of its neighbors (R1 and R3), so it will take the following actions:

R2 will run the best route selection process for every destination prefix in its RIB (in this example only 2.2.0.0/16)
R2 will install the selected best route for every prefix in the RIB into the FIB (only 2.2.0.0/16)
R2 will flush any remaining stale routes from the FIB (in this case 1.1.0.0/16)
R2 will start sending UPDATEs to advertise (propagate) the routes in its RIB to the neighbors:

Router R2 propagates the BGP UPDATEs received from R1 to R3:

R2 ----> R3

UPDATE
  AFI-SAFI = IPv4-Unicast
  Prefix = 2.2.0.0/16
  Attributes
    Next Hop = R2
    AS Path = 200 100
    etc.

At this point router R2 has finished sending all routes to R3, so it sends an End-of-RIB marker to R3:

R2 ----> R3

UPDATE
  AFI-SAFI = IPv4-Unicast
  End-of-RIB marker

Note that router R2 does not have routes to send to R1 (specifically it does not send the route for 2.2.0.0/16 back to R1 because of the AS-path loop). So, R2 immediately sends an End-of-RIB marker to R1 as well:

R2 ----> R1

UPDATE
  AFI-SAFI = IPv4-Unicast
  End-of-RIB marker

When router R3 receives the End-of-RIB marker from R2, it flushes all routes learned from R2 that are still stale (in this case 1.1.0.0/16) from both its RIB and FIB.

Router R1 does the same when it receives the End-of-RIB marker from R2, but in this example there is nothing to flush since R2 did not advertise any routes to R1.
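Seen from R2, the resynchronization above is a small state machine: collect UPDATEs, wait until every neighbor has sent its End-of-RIB marker, then flush what is still stale and start advertising. The Python sketch below models only that ordering. The `RestartingSpeaker` class is hypothetical, and it deliberately skips best-route selection by assuming a single route per prefix.

```python
class RestartingSpeaker:
    """Hypothetical model of R2 after its control plane comes back."""

    def __init__(self, peers, preserved_fib):
        self.waiting_for = set(peers)          # peers that still owe an End-of-RIB
        self.rib = {}                          # prefix -> peer it was learned from
        # Every FIB entry that survived the restart starts out stale.
        self.fib = {prefix: {"next_hop": nh, "stale": True}
                    for prefix, nh in preserved_fib.items()}
        self.advertising = False

    def on_update(self, peer, prefix):
        # Simplification: one route per prefix, so no best-route selection.
        self.rib[prefix] = peer
        self.fib[prefix] = {"next_hop": peer, "stale": False}

    def on_end_of_rib(self, peer):
        self.waiting_for.discard(peer)
        if self.waiting_for:
            return                             # keep deferring
        # All peers are done: drop what nobody re-advertised, then advertise.
        self.fib = {p: e for p, e in self.fib.items() if not e["stale"]}
        self.advertising = True


r2 = RestartingSpeaker(["R1", "R3"], {"1.1.0.0/16": "R1", "2.2.0.0/16": "R1"})
r2.on_update("R1", "2.2.0.0/16")
r2.on_end_of_rib("R1")
print(r2.advertising, sorted(r2.fib))   # still waiting for R3
r2.on_end_of_rib("R3")
print(r2.advertising, sorted(r2.fib))   # stale 1.1.0.0/16 is gone
```

Between the two End-of-RIB markers the stale entry for 1.1.0.0/16 is still in the FIB, which is why traffic for a withdrawn prefix can keep flowing for a short while during a graceful restart.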

Related Solutions

BGP Router ID – How to Choose When Using IPv6 Only

The RFC "Autonomous-System-Wide Unique BGP Identifier for BGP-4" (RFC 6286) answers your question. Since 2011, the BGP Identifier only needs to be unique within your AS.

Routing – After TCP is established which BGP peer will send open message first

Which peer will send the OPEN message first?

Normally, the speaker that opens the socket sends the first OPEN message. But it actually doesn't matter (ref the DelayOpen timer), because BGP also provides a way to delay the OPEN message so the opposite peer can send first:

    Option 1:     DelayOpen

    Description: The DelayOpen optional session attribute allows
                 implementations to be configured to delay sending
                 an OPEN message for a specific time period
                 (DelayOpenTime).  The delay allows the remote BGP
                 Peer time to send the first OPEN message.

         Value:       TRUE or FALSE

In the event that both speakers open duplicate TCP sessions and send OPEN messages on each socket simultaneously, the BGP Identifier is used to resolve which socket should be closed. See RFC 4271, Section 6.8:

6.8. BGP Connection Collision Detection

If a pair of BGP speakers try to establish a BGP connection with each other 
simultaneously, then two parallel connections well be formed. If the source IP address 
used by one of these connections is the same as the destination IP address used by the 
other, and the destination IP address used by the first connection is the same as the 
source IP address used by the other, connection collision has occurred. In the event 
of connection collision, one of the connections MUST be closed.

Based on the value of the BGP Identifier, a convention is established for detecting 
which BGP connection is to be preserved when a collision occurs. The convention is to 
compare the BGP Identifiers of the peers involved in the collision and to retain only 
the connection initiated by the BGP speaker with the higher-valued BGP Identifier.
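The convention quoted above can be sketched as a comparison of the two identifiers as unsigned 32-bit integers. This is only an illustration: `connection_to_keep` is a made-up name, identifiers are passed as dotted-quad strings, and the case of equal identifiers is deliberately left out.

```python
import ipaddress


def connection_to_keep(local_id, remote_id):
    """Decide which of two colliding connections survives.

    Identifiers are dotted-quad strings and are compared as unsigned
    32-bit integers. Returns "local" to keep the connection this speaker
    initiated, or "remote" to keep the one the peer initiated.
    """
    local = int(ipaddress.IPv4Address(local_id))
    remote = int(ipaddress.IPv4Address(remote_id))
    if local == remote:
        # Simplification for this sketch: equal identifiers are not handled.
        raise ValueError("identifiers are equal; tie-breaking is out of scope")
    return "local" if local > remote else "remote"


print(connection_to_keep("10.0.0.2", "10.0.0.1"))
print(connection_to_keep("10.0.0.2", "192.0.2.1"))
```

Because both speakers run the same comparison on the same pair of identifiers, they independently agree on which connection to close.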

Is there any good BGP Peer fsm diagram?

Wikipedia has this simplified BGP FSM.
