Cisco IOS – Distinguishing Router Reload from sysUpTime SNMP Counter Wrap

cisco-ios snmp

If I poll, for example, the Cisco IOS interface counter ifHCInOctets over SNMP and the latest reading is lower than the previous one, then I know that one of the following happened: the device reloaded, the ifHCInOctets counter wrapped, there was an online insertion and removal (OIR) event that affected this particular interface, or the interface was deleted and recreated (possible for VLAN interfaces, Port-Channel interfaces, etc.). I would like to distinguish a router reload from all the other reasons for ifHCInOctets to start again from zero. At first snmpEngineTime (range 0 to 2147483647 seconds according to the Cisco SNMP Object Navigator) seemed to be a perfect solution, as it takes about 68 years to wrap, but it also starts from zero if the SNMP agent is restarted, i.e. stopped (no snmp-server) and started again (snmp-server community public RO). This means that one still needs to check sysUpTime, which as far as I know starts from zero only when the system is restarted, but unfortunately wraps every 497 days. As a result, the simple algorithm below does not work if sysUpTime wraps within the same polling interval in which ifHCInOctets drops:

if (( prev_ifHCInOctets > cur_ifHCInOctets )); then
  if (( prev_sysUpTime > cur_sysUpTime )); then
    echo "router reloaded"
  else
    echo "counter wrapped, OIR or interface recreated"
  fi
fi

It would be perfect if there were a "sysUpHCTime" counter, but it looks like there is not. What options do I have? I guess one possibility is simply to ignore the highly unlikely situation where both cur_ifHCInOctets (the current reading of ifHCInOctets) and cur_sysUpTime (the current reading of sysUpTime) are smaller than the previous readings because both counters wrapped within the same polling interval. However, just out of interest, what would the options be?

I guess at least one option is not to check whether prev_sysUpTime > cur_sysUpTime, but to check whether the delta between prev_sysUpTime and cur_sysUpTime is roughly equal to the script's polling interval. For example, imagine that prev_sysUpTime was 42949500 seconds and the script knows that it read this value 300 seconds ago. Now the script reads a cur_sysUpTime of 128 seconds. As a next step the script checks whether cur_sysUpTime+(42949672-prev_sysUpTime) is around 300 (for example within the range 295 to 305). Here it is 128+172=300, so the script can be almost certain that sysUpTime started from zero because of a counter wrap and not because of a device reload. The 42949672 used in this formula is the maximum value of sysUpTime in whole seconds: sysUpTime is a 32-bit TimeTicks value counted in hundredths of a second (not milliseconds), so it wraps at 2^32 = 4294967296 ticks, which is 42949672.96 seconds (about 497 days).
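The delta check described above could be sketched roughly as follows. This is only an illustration, not a tested monitoring script: the function name classify_uptime and the default 5-second tolerance are made up for the example, and it assumes the uptime readings are raw TimeTicks (hundredths of a second) and that the shell has 64-bit arithmetic. It also compares the delta when sysUpTime did not drop, because a device that was up for only a short time before reloading can show a higher sysUpTime after the reload.

```shell
# classify_uptime PREV_TICKS CUR_TICKS ELAPSED_SECONDS [TOLERANCE_SECONDS]
# Prints "no-reset", "wrap" or "reload".
# PREV_TICKS and CUR_TICKS are raw sysUpTime readings in hundredths of a second.
classify_uptime() {
    prev=$1
    cur=$2
    elapsed=$3
    tol=${4:-5}                 # illustrative default tolerance, in seconds
    modulus=4294967296          # 2^32: where the 32-bit TimeTicks value wraps

    if [ "$cur" -ge "$prev" ]; then
        delta=$(( cur - prev ))
        kind="no-reset"
    else
        # Ticks the counter would have advanced if it had simply wrapped.
        delta=$(( cur + modulus - prev ))
        kind="wrap"
    fi

    expected=$(( elapsed * 100 ))
    slack=$(( tol * 100 ))

    # If the uptime advanced by about one polling interval, the device ran
    # continuously; otherwise the counter must have been reset by a reload.
    if [ "$delta" -ge $(( expected - slack )) ] && [ "$delta" -le $(( expected + slack )) ]; then
        echo "$kind"
    else
        echo "reload"
    fi
}
```

With the numbers from the example, `classify_uptime 4294950000 12800 300` treats the drop as a wrap, because the wrap-adjusted delta is 30096 ticks against an expected 30000.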

Sorry for the long post and please let me know if anything is unclear.

Best Answer

I would approach it this way: fetch sysUpTime, then calculate the boot date/time and the predicted next wrap date/time, and write both to the log. Calculate the next poll time, and if it falls after the predicted wrap time (allowing a small margin), write a 'wrap expected' bit to the log. The next time you fetch, look at the 'wrap expected' bit and the predicted wrap time from the log. If the bit is 1 and the predicted wrap time and the newly calculated boot time are close (extremely close if your server and router both use NTP), then you know the counter wrapped and the router did not reboot. If they are not close, you know it rebooted. If the 'wrap expected' bit was not set, simply go back to the main script logic: calculate the new boot time and predicted wrap time, set the wrap prediction bit, and write it all to the log.
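A rough sketch of those calculations, assuming timestamps are Unix epoch seconds taken on the polling server and sysUpTime is read as raw TimeTicks (hundredths of a second). All four function names, the margin and the tolerance are hypothetical and only there to illustrate the idea; writing to and reading from the log is left out.

```shell
# boot_epoch NOW_EPOCH UPTIME_TICKS: estimated boot (or last wrap) time.
boot_epoch() {
    echo $(( $1 - $2 / 100 ))
}

# next_wrap_epoch NOW_EPOCH UPTIME_TICKS: when sysUpTime should reach 2^32.
next_wrap_epoch() {
    echo $(( $1 + (4294967296 - $2) / 100 ))
}

# wrap_expected NEXT_POLL_EPOCH NEXT_WRAP_EPOCH MARGIN_SECONDS
# Prints 1 if the wrap should happen before (or just around) the next poll.
wrap_expected() {
    if [ "$1" -ge $(( $2 - $3 )) ]; then echo 1; else echo 0; fi
}

# confirm_wrap EXPECTED_BIT PREDICTED_WRAP_EPOCH CURRENT_BOOT_EPOCH TOLERANCE_SECONDS
# After sysUpTime has dropped: "wrap" if a wrap was predicted and the new
# boot estimate lines up with the predicted wrap time, otherwise "reload".
confirm_wrap() {
    diff=$(( $2 - $3 ))
    if [ "$diff" -lt 0 ]; then diff=$(( 0 - diff )); fi
    if [ "$1" -eq 1 ] && [ "$diff" -le "$4" ]; then
        echo "wrap"
    else
        echo "reload"
    fi
}
```

For example, with a reading of 4294950000 ticks at epoch 1700000000, the predicted wrap is 1700000172. If the next poll at 1700000300 returns 12800 ticks, the new boot estimate is also 1700000172, so `confirm_wrap 1 1700000172 1700000172 5` reports a wrap.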

You are trying to guard against two errors: a reboot false positive, where you report a reboot because you did not anticipate the uptime wrap, AND a reboot false negative, where you assume the counter wrapped when the router actually rebooted. To avoid both you need pretty careful timing, and even then it only reduces the probability of either error (down past 1 in 1,000,000); it doesn't eliminate them.

If you want to go full tilt, you can add a second detection layer on top. For example, look at the UDP traffic counter: since you are polling via SNMP, you will be constantly incrementing it a tiny bit. Since there probably aren't many other SNMP polls taking place, it is not likely to wrap very often (if at all, compared to reboots for other reasons). So if you see sysUpTime going down AND the UDP traffic count going down, you can be more confident that you caught a reboot.
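As a sketch of that second layer: the answer does not name a specific object, so the example assumes udpInDatagrams from the UDP-MIB as the UDP counter. The function name reload_confidence and its three labels are invented for illustration.

```shell
# reload_confidence PREV_UPTIME CUR_UPTIME PREV_UDP CUR_UDP
# PREV_UDP and CUR_UDP are assumed to be udpInDatagrams readings.
# Prints:
#   "none" if sysUpTime did not go down
#   "high" if sysUpTime and the UDP counter both went down
#   "low"  if only sysUpTime went down (more likely a sysUpTime wrap)
reload_confidence() {
    prev_up=$1
    cur_up=$2
    prev_udp=$3
    cur_udp=$4

    if [ "$cur_up" -ge "$prev_up" ]; then
        echo "none"
    elif [ "$cur_udp" -lt "$prev_udp" ]; then
        echo "high"
    else
        echo "low"
    fi
}
```

This only raises or lowers confidence. The UDP counter is itself a 32-bit counter, so on a busy device it can wrap too.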