Router – Cause of high CPU load on Juniper peering router’s routing engine

juniper router

Recently the routing engine CPU utilization on two of our Juniper peering routers increased from ~10-20% average load to 80+%. I'm trying to figure out what's causing this (and how to get this high load back down).

Some info on the routers: both run the same JunOS version, both are connected to the same two peering IXP LANs, and both have a large number (several hundred) of almost identical IPv4 and IPv6 sessions. Each router has a connection to a different IP transit provider, and both are connected in the same way to the rest of our network. The routing engines' CPU load doesn't sit flat at 80+%; it drops back to normal levels for minutes to hours at a time, but these drops are infrequent.

Things I've checked:

  • no configuration changes were made around the time the increase started
  • there's no increase in non-unicast traffic directed at the control plane
  • there's no (substantial) change in the amount of traffic being forwarded (though even an increase shouldn't matter)
  • show system processes summary indicates the rpd process is causing the high CPU load
  • there are no rapidly flapping BGP peers causing a large amount of BGP changes

One possible explanation I can come up with is that one or more peers on one of the IXPs both routers are connected to are sending a large number of BGP updates. Currently I only have statistics on the number of BGP messages for my transit sessions (showing no abnormal activity), and with several hundred BGP sessions on the peering LANs it's not easy to spot the problematic session(s), short of creating graphs for every session.
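To illustrate the kind of check I have in mind instead of graphing every session: take two snapshots of a per-peer "updates received" counter a few minutes apart and rank the peers by the difference. The sketch below is only that, a sketch. How the counters are collected is not shown, and the function name `busiest_peers` and the dict-of-counters format are assumptions made for illustration.

```python
def busiest_peers(before, after, top=10):
    """Rank peers by how many updates they sent between two snapshots.

    `before` and `after` map a peer address to a cumulative counter of
    received BGP updates (hypothetical format; collection not shown).
    """
    deltas = []
    for peer, count in after.items():
        delta = count - before.get(peer, 0)
        if delta < 0:
            # Counter went backwards: assume the session was reset
            # and the counter restarted from zero.
            delta = count
        deltas.append((peer, delta))
    # Largest increase first.
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]


if __name__ == "__main__":
    # Made-up numbers, using documentation addresses.
    snapshot_1 = {"192.0.2.1": 100, "192.0.2.2": 50}
    snapshot_2 = {"192.0.2.1": 110, "192.0.2.2": 5000, "192.0.2.3": 7}
    for peer, delta in busiest_peers(snapshot_1, snapshot_2, top=3):
        print(peer, delta)
```

A peer that dominates this list across several consecutive intervals would be the first candidate to look at more closely.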

My questions are:

  • are there any other things I should check to find the cause of this
    increase in CPU load on the routing engines?
  • how can I easily find out which sessions are causing these problems
    (if my assumption is right)? Enabling BGP traceoptions generates huge
    amounts of data, but I'm not sure if it gives me any real insights.

Best Answer

There might be some helpful information for you at the Juniper Knowledge Center.

If RPD is consuming high CPU, then perform the following checks and verify the following parameters:

  • Check the interfaces: Check if any interfaces are flapping on the router. This can be verified by looking at the output of the show log messages and show interfaces ge-x/y/z extensive commands. Troubleshoot why they are flapping; if possible you can consider enabling the hold-time for link up and link down.

  • Check if there are syslog error messages related to interfaces or any FPC/PIC, by looking at the output of show log messages.

  • Check the routes: Verify the total number of routes that are learned by the router by looking at the output of show route summary. Check if it has reached the maximum limit.

  • Check the RPD tasks: Identify what is keeping the process busy. To do this, first enable task accounting with set task accounting on. Important: task accounting can itself increase CPU load, so remember to turn it off once you have collected the output you need. Then run show task accounting and look for the thread with the high CPU time:

    user@router> show task accounting
    Task                       Started    User Time  System Time  Longest Run
    Scheduler                   146051        1.085        0.090        0.000
    Memory                           1        0.000            0        0.000  <omit>
    BGP.128.0.0.4+179              268       13.975        0.087        0.328
    BGP.0.0.0.0+179      18375163 1w5d 23:16:57.823    48:52.877        0.142
    BGP RT Background              134        8.826        0.023        0.099
    

Then find out why that thread, which will be tied to a particular prefix or protocol, is consuming so much CPU.

  • You can also verify whether routes are oscillating (route churn) by looking at the output of the shell command: % rtsockmon -t

  • Check RPD memory. Sometimes high memory utilization can indirectly lead to high CPU usage.
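
With several hundred BGP sessions, the show task accounting output gets long, so it can help to sort it by CPU time offline. The following is a minimal sketch that assumes the output has the column layout shown in the sample above (task name, start count, user time, system time, longest run) and that durations look like 13.975, 48:52.877 or 1w5d 23:16:57.823. The helper names `parse_duration` and `rank_tasks` are made up for this example and are not part of any Juniper tooling.

```python
import re

# A duration may carry a "1w5d " style prefix before the clock part.
_DURATION = r"(?:\d+w)?(?:\d+d)?\s*[\d:.]+"
# One data row: task name, start count, user time, system time, longest run.
# A trailing "<...>" annotation (as in the sample) is tolerated.
_ROW = re.compile(
    r"^\s*(?P<task>.+?)\s+(?P<started>\d+)\s+"
    rf"(?P<user>{_DURATION})\s+(?P<system>{_DURATION})\s+"
    r"(?P<longest>[\d:.]+)\s*(?:<[^>]*>)?\s*$"
)


def parse_duration(text):
    """Convert '1w5d 23:16:57.823', '48:52.877' or '13.975' to seconds."""
    m = re.match(r"^(?:(\d+)w)?(?:(\d+)d)?\s*(.*)$", text.strip())
    weeks, days, clock = m.groups()
    seconds = 0.0
    for part in clock.split(":") if clock else []:
        seconds = seconds * 60 + float(part)
    return int(weeks or 0) * 604800 + int(days or 0) * 86400 + seconds


def rank_tasks(output):
    """Return (task, user + system seconds) pairs, busiest task first."""
    rows = []
    for line in output.splitlines():
        m = _ROW.match(line)
        if not m:
            continue  # header, prompt or blank line
        cpu = parse_duration(m["user"]) + parse_duration(m["system"])
        rows.append((m["task"], cpu))
    return sorted(rows, key=lambda row: row[1], reverse=True)


SAMPLE = """\
Task                       Started    User Time  System Time  Longest Run
Scheduler                   146051        1.085        0.090        0.000
Memory                           1        0.000            0        0.000  <omit>
BGP.128.0.0.4+179              268       13.975        0.087        0.328
BGP.0.0.0.0+179      18375163 1w5d 23:16:57.823    48:52.877        0.142
BGP RT Background              134        8.826        0.023        0.099
"""

if __name__ == "__main__":
    for task, cpu in rank_tasks(SAMPLE):
        print(f"{task:<24} {cpu:>14.3f} s")
```

Run against the sample, this puts BGP.0.0.0.0+179 at the top by a wide margin. On a live router, the per-peer BGP tasks near the top of the list would be the sessions to investigate first.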