VPC – VPC Loop Avoidance Scenario

Consider the attached topology:

[Topology diagram "loopvpc": Sw1 and Sw2 are each dual-homed to NxOS1 and NxOS2; a vPC peer link connects NxOS1 and NxOS2.]

There is a vPC domain, with vPC 20 for Sw2 and vPC 10 for Sw1.
A vPC peer link exists between NxOS1 and NxOS2.

Suppose Sw2 sends broadcast traffic. After hashing, this traffic will be sent out on one of the port-channel member links going upstream; assume it lands on the link to NxOS1.
The broadcast traffic is now received on NxOS1.

NxOS1 floods the broadcast out all ports except Port 20 (the port it arrived on). Hence, NxOS1 will send the traffic egress via Port 10 and the peer link.

Now NxOS2 receives the traffic over the peer link. It knows the status of Port 10 and Port 20 on NxOS1 (via the CFS protocol).
So NxOS2 knows that Sw1 may already have received the traffic from NxOS1 via Port 10 of NxOS1.

NxOS2 also knows (via the CFS protocol) that the broadcast traffic would have come to NxOS1 (on Port 20) from Sw2. (I believe that when the port channel is established, the system MAC of Sw2 is known to both NxOS1 and NxOS2.)

Hence, NxOS2 DOES NOT send the broadcast traffic any further (egress, downstream) on Port 20 and Port 10 of NxOS2.
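
(For reference, I've been looking at the vPC and port-channel state with the usual NX-OS show commands on both peers while thinking this through:)

    show vpc
    show port-channel summary
    show mac address-table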

Q1: Please validate whether my understanding above is correct.

Q2: Here, CFSoE is playing a big role (for MAC address synchronisation). What would happen if CFSoE stops working (assume the peer link goes down)? Even then, how would a loop occur in the above scenario?

Best Answer

The VPC peer link has a few special roles - it carries the CFSoE traffic (mentioned below) that allows the peers to sync (enabled automatically when feature vpc is turned on) and it carries traffic for orphan ports. It doesn't behave completely like a normal port channel and, as a result, it's best practice to keep non-VPC VLANs on a different physical link. One significant difference is, as mentioned, that broadcast traffic is sent over the peer link, but traffic arriving on the peer link normally would not be sent out any non-orphan member link. Orphans still need to receive broadcasts. This design best practice guide is for the 7K but more-or-less equally applies to the 5K, 9K or 3K.
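
As a rough sketch of how those pieces hang together (domain ID, addresses and interface numbers here are made up, not taken from your diagram), the peer link, the keepalive and the vPC member port-channels would look something like this on each peer:

    feature vpc
    feature lacp

    vpc domain 100
      ! keepalive should run over infrastructure independent of the peer link
      peer-keepalive destination 10.0.0.2 source 10.0.0.1 vrf management

    ! the peer link itself - carries CFSoE and orphan-port traffic
    interface port-channel1
      switchport
      switchport mode trunk
      vpc peer-link

    ! vPC member port-channels toward SW1 and SW2
    interface port-channel10
      switchport
      switchport mode trunk
      vpc 10

    interface port-channel20
      switchport
      switchport mode trunk
      vpc 20

Note that CFSoE itself needs no explicit configuration - it starts running over the peer link once feature vpc is enabled and the peer link is up.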

CFS is basically the mechanism by which state is synchronized between VPC members. It's the way in which parameters are exchanged to assure configuration consistency, the place where STP configuration is normalized, where multicast/IGMP information is programmed and, if configured, how IP ARP and/or IPv6 ND information is preemptively synchronized. What is unequivocally not the case is that CFS tracks, on a per-packet basis, which traffic should and should not traverse the peer link. Its job is basically to maintain synchronization of basic configuration state, not real-time traffic control.
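
As an example of that last point (sketch only - the domain ID and port-channel number are made up), the optional ARP/ND synchronization is enabled under the vPC domain, and the consistency state that CFS maintains can be inspected with the consistency-parameters commands:

    vpc domain 100
      ip arp synchronize
      ipv6 nd synchronize

    ! verification, from exec mode
    show vpc consistency-parameters global
    show vpc consistency-parameters interface port-channel 10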

To your specific question -

The assumption in a VPC environment is that under normal operating conditions all devices have member links to both VPC peers. As such, broadcasts received on NXOS1 in your diagram would be sent out the other member links on that same switch (i.e. out Port 10 toward SW1). The broadcast is also sent over the peer link to NXOS2. NXOS2 will not send a broadcast received on the peer link to any of its member ports (as, again, it's assumed NXOS1 has already done this). If it has any orphan ports, however, it will replicate to them. As an example - imagine the link from SW1 to NXOS1 had failed. SW2 sends a broadcast that happens to hash to NXOS1. NXOS1 forwards across its peer link to NXOS2, which sees that SW1 is now operating as an orphan (learned via CFS) and will send the broadcast accordingly.
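
If you want to see which attached devices a peer currently treats as orphans (i.e. which ports would still get a copy of broadcasts arriving over the peer link), the following show commands do that - the output obviously depends on the actual link state at the time:

    show vpc orphan-ports
    show vpc brief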

To your second question - if the VPC peer link fails but the keep-alive is still up, then the VPC peer that's operating as secondary will shut down its VPC member ports (including L3 SVIs), thus forcing the downstream switches to only use links to the VPC primary. There are knobs to mediate this - to, say, allow orphans or some SVIs to stay up - but these should be approached with extreme caution, as you could easily end up in a situation where the secondary switch is attracting L3 traffic from upstream that it can't actually reach at L2.
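
The knobs in question look roughly like this (VLAN and interface numbers are hypothetical). The first keeps the listed SVIs up on the secondary if the peer link fails; the second does the opposite for an orphan port and suspends it along with the vPC member ports:

    vpc domain 100
      dual-active exclude interface-vlan 100,200

    interface Ethernet1/5
      vpc orphan-port suspend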

EDIT (to answer question in comments)

The question of what goes wrong in a split-brain situation (i.e. both P/L and keep-alive failing) depends on the surrounding topology. In the topology you've drawn, where there are literally only L2 devices, the switches are both dual-connected to both VPC peer devices, and there aren't any other links? There isn't a ton of risk of loops as long as the downstream switches continue to see the member links as correctly aggregated, since split-horizon would still be correctly maintained for broadcasts. There could be some protocol issues depending on the VPC configuration, as the downstream switches would see STP BPDUs with different sources coming from different members of a port-channel, but that behavior depends on how VPC was configured on the two switches.

Dual/independent (non-synchronized) IGMP would break some things, and you could potentially see duplicate multicast packets flooded as both VPC peers would perform the same function independently (...this is one of those important things that CFS handles over the peer link).

Now - if the topology is more involved? It can get much uglier. Add in L3 on that VPC pair and perhaps some orphan links, and all of a sudden you're going to be dropping traffic as two routers advertise reachability to subnets they can only partially see. Put a link between SW1 and SW2 and you could end up with an unstable network as they keep seeing TCNs while the two VPC peers keep sending BPDUs with conflicting information. Or imagine a back-to-back VPC where the upstream VPC pair is now hashing traffic (including broadcast/multicast) across the two VPC switches which, in turn, end up flooding the same packet across various downstream member links. Plenty of opportunity for duplicated broadcasts, improper flooding, etc.

Finally - if any of these issues causes LACP to break - or if seeing differing LACP bundle information on different member links breaks the port-channel on a downstream switch - then you've got a situation where the downstream switches are now potentially forwarding between the VPC switches. That's definitely ripe for an L2 loop.

These types of scenarios are why the best practice documents so stringently emphasize the necessity of physically redundant peer links (two links on a 5K, spread across multiple modules on a 9500 or 7K) and why running the keepalive traffic on a completely independent set of links (and infrastructure) is so crucial. As in any active/active clustering system, a split-brain is always dangerous.
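
A quick sanity check that the keepalive really is riding an independent path, and that the peers hold the roles you expect, is:

    show vpc peer-keepalive
    show vpc role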