Cisco – Why do we get high fluctuation in bandwidth measurements on a Cacti graph?

Tags: cisco, monitoring, switch, vss

We were running a redundancy test of EtherChannel and routing on our network, and we took some measurements during the intervention. Our monitoring tool for graphing is Cacti.
The equipment monitored is a 4500-X pair in VSS. Each link is on a different physical chassis.

Schema: [diagram: etherchannel 1]

Test chronology:
[t0] The link on Te1/1/14 was physically removed. Te2/1/14 remained active and Po1 stayed operational.
[t0+15] The link on Te1/1/14 was returned to service, and we verified that the port rejoined EtherChannel Po1.
[t0+20] The link on Te1/1/14 was physically removed again. Te2/1/14 remained active and Po1 stayed operational.
[t0+35] The link on Te1/1/14 was returned to service, and we verified that the port rejoined EtherChannel Po1.

During these tests we monitored the traffic on EtherChannel Po1 through Cacti (graph below) and noticed a significant change in the reported throughput when we disconnected the Te1/1/14 link (with Te2/1/14 active), whereas it stayed fairly stable during the reverse operation. We also checked the counters on interface Po1, and those remained fairly stable.

[Cacti graph of Po1 traffic]

Two 10G interfaces are bundled in an EtherChannel with LACP configured. Inside the EtherChannel there are two VLANs: one for multicast traffic and another for Internet/all other traffic.

Do you know a possible cause of this behavior?

Best Answer

To extend ytti's comment.

Your poll interval seems really small: every 10 seconds, if I'm reading it right. There are a few reasons you could get that result.

Equipment side:

  • Bad choice of counters: if you're using 32-bit counters, they can roll over roughly every 3.4 seconds on a 10G interface running at line rate
  • Counter update frequency: many larger devices only update their counters two or three times a minute, and the updates can never be relied on to be in sync. Every 30 seconds is the lowest I'd bother polling, and even then I'd always want at least two data points before triggering any alert or taking action
  • There can be a gotcha where packets sent for CPU processing (NetFlow, perhaps) are counted straight away, while those not going to the RE are counted in batches (I've seen this on Juniper MX)
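The 32-bit rollover figure above is easy to verify. A quick sketch (the function name and numbers are illustrative; it assumes octet counters and a 10 Gbit/s line rate):

```python
# Why 32-bit SNMP octet counters (e.g. ifInOctets) wrap so fast at 10 Gbit/s,
# while 64-bit ones (ifHCInOctets) effectively never do.

def rollover_seconds(counter_bits: int, line_rate_bps: float) -> float:
    """Seconds until an octet counter wraps at the given line rate."""
    max_count = 2 ** counter_bits          # counter wraps back to 0 here
    bytes_per_second = line_rate_bps / 8   # octet counters count bytes, not bits
    return max_count / bytes_per_second

print(rollover_seconds(32, 10e9))   # ~3.4 seconds at 10G line rate
print(rollover_seconds(64, 10e9))   # hundreds of years
```

With a 10-second poll interval, a 32-bit counter at line rate can wrap twice between polls, so the computed rate is essentially meaningless; this is why high-speed interfaces should be polled via the 64-bit HC counters.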

Poller side:

  • Is the poller actually polling at the configured interval, and if not, does it record its result with the actual polling time (e.g. x bits in y.z seconds) so that a sensible rate can be calculated?
  • What happens when counters reset, or when SNMP GETs aren't responded to? Different tools handle these cases in different ways