If I poll a counter such as ifHCInOctets on a Cisco IOS interface over SNMP and the latest reading is lower than the previous one, I know that one of four things happened: the device reloaded, the ifHCInOctets counter wrapped, there was an online insertion and removal (OIR) of hardware that affected this particular interface, or the interface was deleted and recreated (possible for a VLAN interface, a Port-Channel interface, etc.). I would like to distinguish a router reload from all the other reasons for ifHCInOctets to start from zero. At first snmpEngineTime (range 0 – 2147483647 according to the Cisco SNMP Object Navigator) seemed to be a perfect solution, as this counter wraps only after about 68 years, but it also starts from zero if the SNMP agent is restarted, i.e. stopped (no snmp-server) and started (snmp-server community public RO). This means that one still needs to check sysUpTime, which, as far as I know, starts from zero only when the system is restarted, but unfortunately wraps every 497 days. As a result, the simple algorithm below would not work if sysUpTime wraps within the same polling interval in which ifHCInOctets starts from zero:
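# The interface counter went backwards: find out why.
if (( prev_ifHCInOctets > cur_ifHCInOctets )); then
    # sysUpTime also went backwards: assume the device restarted.
    if (( prev_sysUpTime > cur_sysUpTime )); then
        echo "router reloaded"
    else
        echo "counter wrapped, OIR or interface recreated"
    fi
fi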
It would be perfect if there were a "sysUpHCTime" counter, but it looks like there is not. What options do I have? I guess one possibility is simply to ignore the highly unlikely situation where both cur_ifHCInOctets (the current reading of ifHCInOctets) and cur_sysUpTime (the current reading of sysUpTime) are smaller than the previous readings because both counters wrapped within the same polling interval. However, just out of interest, what would the options be? I guess one option is not to check whether prev_sysUpTime > cur_sysUpTime, but to check whether the delta between prev_sysUpTime and cur_sysUpTime, allowing for a wrap, is roughly equal to the script's check interval.
For example, imagine that prev_sysUpTime was 42949500 and the script knows that it read this value 300 seconds ago. The cur_sysUpTime now read by the script is 128. As a next step the script checks whether cur_sysUpTime + (42949672 - prev_sysUpTime) is around 300 (for example within the range 295 – 305). Here it is 128 + 172 = 300, so the script can be practically sure that sysUpTime started from zero because of a counter wrap and not because of a device reload. The 42949672 used in this formula is the maximum value of sysUpTime in whole seconds: sysUpTime is a 32-bit TimeTicks value counted in hundredths of a second (not milliseconds), so it wraps after 2^32 = 4294967296 ticks, which is 42949672.96 seconds (about 497 days).
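A minimal sketch of that delta check, under a few assumptions: all values are whole seconds (the raw TimeTicks reading divided by 100), the function name classify_uptime_drop and the default tolerance of 5 seconds are made up for illustration, and how the readings are fetched is left out.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of any SNMP tool.
# All arguments are whole seconds (raw sysUpTime TimeTicks / 100).

SYSUPTIME_MAX_SECS=42949672   # floor(2^32 / 100)

# classify_uptime_drop PREV CUR ELAPSED [TOLERANCE]
# Prints "no-reset", "wrap" or "reload".
classify_uptime_drop() {
    local prev=$1 cur=$2 elapsed=$3 tolerance=${4:-5}

    # sysUpTime did not go backwards: nothing to explain.
    if (( cur >= prev )); then
        echo "no-reset"
        return
    fi

    # Seconds that would have passed if the counter had simply wrapped.
    local wrapped_delta=$(( cur + SYSUPTIME_MAX_SECS - prev ))
    local diff=$(( wrapped_delta - elapsed ))
    if (( diff < 0 )); then
        diff=$(( -diff ))
    fi

    # A wrap accounts for the real elapsed time; a reload does not.
    if (( diff <= tolerance )); then
        echo "wrap"
    else
        echo "reload"
    fi
}

# The example above: 128 + (42949672 - 42949500) = 300
classify_uptime_drop 42949500 128 300   # prints "wrap"
```

The tolerance has to cover polling jitter and SNMP response delay, so a value that works on one network may be too tight on another.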
Sorry for the long post and please let me know if anything is unclear.
Best Answer
I would approach it this way: fetch sysUpTime, then calculate the boot date/time and the predicted next wrap date/time, and write both to the log. Calculate the next poll time, and if it falls after the predicted wrap date/time (allowing a little margin), write a 'wrap expected' bit to the log. On the next fetch, look at the 'wrap expected' bit and the predicted wrap time from the log: if the bit is 1 and the predicted wrap time and the newly calculated boot time are pretty close (extremely close if your server and router are both using NTP), then you know the counter has wrapped and the device has not rebooted. If not, you know it rebooted. If the 'wrap expected' bit wasn't set, simply go back to the main script logic: calculate the new boot time and predicted wrap time, set the wrap prediction bit, and write it all to the log.
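The bookkeeping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: times are Unix epoch seconds, uptime is sysUpTime already converted to whole seconds, the function names and the 5-second margin are invented here, and the 'log' is reduced to plain shell variables. The extra "unchanged" result is an addition of this sketch for polls where the apparent boot time did not move.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names, margin and tolerance are assumptions.

WRAP_PERIOD_SECS=42949672   # floor(2^32 / 100): seconds per sysUpTime cycle

abs_diff() {
    local d=$(( $1 - $2 ))
    if (( d < 0 )); then d=$(( -d )); fi
    echo "$d"
}

# boot_epoch NOW UPTIME -> apparent boot time
boot_epoch() { echo $(( $1 - $2 )); }

# next_wrap_epoch BOOT -> predicted time of the next sysUpTime wrap
next_wrap_epoch() { echo $(( $1 + WRAP_PERIOD_SECS )); }

# wrap_expected NEXT_WRAP NEXT_POLL [MARGIN] -> 1 if the next poll is
# at or after the predicted wrap (minus a safety margin), else 0
wrap_expected() {
    local next_wrap=$1 next_poll=$2 margin=${3:-5}
    if (( next_poll + margin >= next_wrap )); then echo 1; else echo 0; fi
}

# classify_poll PREV_BOOT PREDICTED_WRAP EXPECTED NOW UPTIME [TOLERANCE]
# Prints "unchanged", "wrapped" or "rebooted".
classify_poll() {
    local prev_boot=$1 predicted=$2 expected=$3 now=$4 uptime=$5 tol=${6:-5}
    local boot=$(( now - uptime ))

    if (( $(abs_diff "$boot" "$prev_boot") <= tol )); then
        echo "unchanged"
    elif (( expected == 1 && $(abs_diff "$boot" "$predicted") <= tol )); then
        echo "wrapped"    # new apparent boot time matches the predicted wrap
    else
        echo "rebooted"
    fi
}

# Example: uptime 42949500 s at one poll, 128 s at the poll 300 s later.
now=1700000000
boot=$(boot_epoch "$now" 42949500)
wrap=$(next_wrap_epoch "$boot")
flag=$(wrap_expected "$wrap" $(( now + 300 )))
classify_poll "$boot" "$wrap" "$flag" $(( now + 300 )) 128   # prints "wrapped"
```

Because the comparison is made against wall-clock time, the tolerance only makes sense if the polling server's clock is stable, which is why NTP matters here.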
You are trying to guard against two errors: a reboot false positive, caused by not anticipating the uptime wrap, AND a reboot false negative, caused by assuming the counter wrapped when the device actually rebooted. To do both you need pretty careful timing, and even then it only reduces the probability of a mistake (down past 1 in 1,000,000); it doesn't eliminate it.
If you want to go full tilt, you can add a second detection layer on top. For example, look at the UDP traffic counter: since you are polling via SNMP, you will constantly be incrementing it a tiny bit. Since there probably aren't many other SNMP polls taking place, it is not likely to wrap very often (if at all, compared to reboots for other reasons), so if you see sysUpTime going down AND the UDP traffic count going down, you can be more confident that you caught a reboot.
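As a rough sketch of that second layer, assuming the 'UDP traffic counter' is something like udpInDatagrams from the UDP-MIB (my guess at what is meant) and that both counters were read in the same two polls; the function name and the result labels are invented for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: combine two independent "went backwards" signals.

# reboot_confidence PREV_UPTIME CUR_UPTIME PREV_UDP CUR_UDP
# Prints "reboot-likely", "uptime-wrap-likely" or "no-reboot".
reboot_confidence() {
    local prev_up=$1 cur_up=$2 prev_udp=$3 cur_udp=$4
    local uptime_dropped=0 udp_dropped=0

    if (( cur_up < prev_up )); then uptime_dropped=1; fi
    if (( cur_udp < prev_udp )); then udp_dropped=1; fi

    if (( uptime_dropped && udp_dropped )); then
        # Both counters restarted in the same interval.
        echo "reboot-likely"
    elif (( uptime_dropped )); then
        # Only sysUpTime went down: more consistent with a wrap.
        echo "uptime-wrap-likely"
    else
        # sysUpTime kept growing, so the device stayed up.
        echo "no-reboot"
    fi
}

reboot_confidence 42949500 128 91000 12   # prints "reboot-likely"
```

This does not replace the timing check; it only adds a second, mostly independent signal, and it still misclassifies the rare interval in which both counters wrap together.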