Routing – What happens when the ARP cache overflows

Tags: arp, brocade, router, routing

In at least one implementation there is a hard limit on the capacity of the ARP table. What happens when the ARP cache is full and a packet is offered with a destination (or next-hop) that isn't cached? What happens under the hood, and what is the effect on the service quality?

For example, Brocade NetIron XMR and Brocade MLX routers have a configurable ip-arp system maximum. The default value in that case is 8192, the size of a /19 subnet. It is not clear from the documentation whether this limit is per interface or for the whole router, but for the purposes of this question, we can assume it is per interface.
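
For a sense of scale, 8192 is exactly the number of addresses in a /19, which is easy to confirm with a couple of lines of Python (the prefix used here is arbitrary):

import ipaddress

# 8192 ARP entries is one entry per address in a /19
print(ipaddress.ip_network("10.1.0.0/19").num_addresses)   # prints 8192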

Few networkers would configure a /19 subnet on an interface on purpose, but that isn't what happened. We were migrating a core router from a Cisco model to a Brocade. One of the many differences between Cisco and Brocade is that Cisco accepts static routes defined with both an outbound interface and a next-hop address, while Brocade insists on one or the other. We dropped the next-hop address and kept the interface, and at first everything seemed to be working. Only later did we learn the error of our ways and change from the interface to the next-hop address.

+----+ iface0    +----+
| R1 |-----------| R2 |---> (10.1.0.0/16 this way)
+----+.1       .2+----+
      10.0.0.0/30

Before the migration, R1 was a Cisco, and had the following route.

ip route 10.1.0.0 255.255.0.0 iface0 10.0.0.2

After the migration, R1 was a Brocade, and had the following route.

ip route 10.1.0.0 255.255.0.0 iface0

R2 is a Cisco router, and Cisco routers perform proxy ARP by default. This is the (mis-)configuration in production that set the stage for what turned out to be an ARP cache overflow.

  1. R1 receives a packet destined for the 10.1.0.0/16 network.
  2. On the basis of the static interface route, R1 ARPs for the destination on iface0.
  3. R2 recognizes that it can reach the destination, and responds to the ARP with its own MAC.
  4. R1 caches the ARP result that combines an IP in a remote network with the MAC of R2.

This happens for every distinct destination in 10.1.0.0/16. Consequently, even though the /16 is properly subnetted beyond R2, and there are only two nodes on the link between R1 and R2, R1 suffers ARP cache overflow, because it induces R2 to behave as if all ~65k addresses were directly connected.
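
To make the failure mode concrete, here is a toy model of that cache-filling loop. It only sketches the behaviour described above, not Brocade's actual ARP code; the 8192 cap and R2's MAC address are assumptions taken from this post.

import ipaddress

ARP_LIMIT = 8192                      # assumed per-interface cap (the Brocade default)
R2_MAC = "00:00:5e:00:53:02"          # illustrative MAC for R2

arp_cache = {}                        # destination IP -> MAC
no_slot = 0

for dst in ipaddress.ip_network("10.1.0.0/16").hosts():
    ip = str(dst)
    if ip in arp_cache:
        continue
    if len(arp_cache) >= ARP_LIMIT:
        no_slot += 1                  # entry cannot be installed; forwarding degrades
        continue
    arp_cache[ip] = R2_MAC            # R2's proxy ARP answers for a host it isn't

print(f"cached entries: {len(arp_cache)}, destinations with no ARP slot: {no_slot}")
# cached entries: 8192, destinations with no ARP slot: 57342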

The reason I'm asking this question is that I hope it will help me make sense of the network service trouble reports (which came days later) that eventually led us to the overflowing ARP cache. In the spirit of the StackExchange model, I have tried to distill that into what I believe is a crisp, specific question that can be answered objectively.

EDIT 1 To be clear, I am asking about part of the glue layer between data link (layer 2) and network (layer 3), not the MAC forwarding table within the data link layer. A host or router builds the former to map IP addresses to MAC addresses, while a switch builds the latter to map MAC addresses to ports.
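
Put another way, the two tables answer different questions and live in different boxes; the values below are purely illustrative:

# ARP table: built by a host or router, maps layer-3 addresses to layer-2 addresses
arp_table = {
    "10.0.0.2": "00:00:5e:00:53:02",      # next-hop IP -> MAC learned via ARP
}

# MAC forwarding table: built by a switch, maps layer-2 addresses to egress ports
mac_forwarding_table = {
    "00:00:5e:00:53:02": "ethernet 1/1",  # source MAC seen on a frame -> port it arrived on
}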

EDIT 2 While I appreciate the effort to which responders have gone to explain why some implementations are not subject to ARP cache overflow, I feel that it is important for this question to address those that are. The question is "what happens when", not "is vendor X susceptible to". I've done my part now by describing a concrete example.

EDIT 3 Another question this is not is "how do I prevent the ARP cache from overflowing?"

Best Answer

Edit 2:

As you mentioned...

ip route 10.1.0.0 255.255.0.0 iface0

...forces the Brocade to ARP for every destination in 10.1.0.0/16 as if it were directly connected to iface0, and R2's proxy ARP obligingly answers for each one.

I can't speak to Brocade's ARP cache implementation, but I would simply point out the easy solution to your problem... configure your route differently:

ip route 10.1.0.0 255.255.0.0 CiscoNextHopIP

By doing this, you prevent the Brocade from ARPing for all of 10.1.0.0/16. (Note: depending on Brocade's implementation, you may need the link between R1 and R2 to be numbered outside 10.1.0.0/16; in your diagram, 10.0.0.0/30 already is.)
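
To spell out why the next-hop form helps, here is a minimal sketch of the resolution difference, under the simplifying assumption that an interface-only route makes the router ARP for the final destination itself, while a next-hop route resolves only the next hop:

# Simplified model, not vendor code: which IP does the router ARP for?
def arp_target(dst_ip: str, route: dict) -> str:
    if route.get("next_hop"):         # ip route 10.1.0.0 255.255.0.0 10.0.0.2
        return route["next_hop"]      # one ARP entry covers the whole /16
    return dst_ip                     # ip route 10.1.0.0 255.255.0.0 iface0
                                      # -> a distinct ARP entry per destination

interface_route = {"prefix": "10.1.0.0/16", "interface": "iface0", "next_hop": None}
next_hop_route  = {"prefix": "10.1.0.0/16", "interface": None, "next_hop": "10.0.0.2"}

for dst in ("10.1.0.1", "10.1.200.9"):
    print(dst, "->", arp_target(dst, interface_route), "vs", arp_target(dst, next_hop_route))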


Original answer:

Whether there is a hard limit on the capacity of the ARP table at all depends on the implementation.

Cisco IOS CPU routers are only limited by the amount of DRAM in the router, but that is typically not going to be a limiting factor. Some switches (like Catalyst 6500) have a hard limitation on the adjacency table (which is correlated to the ARP table); Sup2T has 1 Million adjacencies.

So, what happens when the ARP cache is full and a packet is offered with a destination (or next-hop) that isn't cached?

Cisco IOS CPU routers don't run out of space in the ARP table in practice, because those ARPs are stored in DRAM. So let's assume you're talking about the Sup2T. Think of it like this: suppose you had a Cat6500 + Sup2T and configured every Vlan possible. Technically, that is:

4094 total Vlans - Vlan1002 - Vlan1003 - Vlan1004 - Vlan1005 = 4090 Vlans

Assume you make each Vlan a /24 (so that's 252 possible ARPs per Vlan), and you pack every Vlan full... that is roughly 1 million ARP entries.

4090 * 252 = 1,030,680 ARP entries

Every one of those ARPs would consume a certain amount of memory in the ARP table itself, plus the IOS adjacency table. I don't know exactly what that is, but let's say the total ARP overhead is 10 bytes per entry...

That means you have now consumed roughly 10 MB for ARP overhead; that still isn't very much space... if you were that low on memory, you would see something like %SYS-2-MALLOCFAIL.

With that many ARPs and a four-hour ARP timeout, you would have to service over 70 ARP refreshes per second on average just to keep the cache current; more likely, the maintenance on 1 million ARP entries would drain the CPU of the router (think CPUHOG messages).
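
For what it's worth, the arithmetic above checks out; a few lines of Python reproduce the numbers (the 10-byte overhead per entry is the same guess made above, not a measured figure):

usable_vlans  = 4094 - 4                 # Vlans 1002-1005 are reserved -> 4090
arps_per_vlan = 252                      # the per-/24 figure used above
total_arps    = usable_vlans * arps_per_vlan
arp_timeout_s = 4 * 3600                 # four-hour ARP timeout

print(total_arps)                        # 1,030,680 entries
print(total_arps * 10 / 1e6)             # ~10.3 MB of overhead at 10 bytes per entry
print(total_arps / arp_timeout_s)        # ~71.6 ARP refreshes per second on average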

At that point, you could start bouncing routing protocol adjacencies and seeing IPs that are simply unreachable, because the router CPU is too busy to ARP for them.