Graphite stops collecting data randomly

graphite, metrics

We have a Graphite server collecting data through collectd, statsd, JMXTrans … For the past few days, we have frequently had holes in our data. Digging through the data we still have, we can see an increase in the carbon cache size (from 50K to 4M). We don't see an increase in the number of metrics collected (metricsReceived is stable at around 300K). The number of queries has increased from 1000 to 1500 on average.
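For reference, carbon publishes these counters about itself under carbon.agents.*, so they can be pulled back out through the render API to see exactly when the cache started growing. A minimal sketch in Python; the hostname and the time window ("-2d") are placeholders:

  # Sketch: pull carbon's self-reported counters back out through the
  # render API to see when the cache started growing.
  import json
  import urllib.request
  from urllib.parse import urlencode

  GRAPHITE_URL = "http://graphite.example.com"   # placeholder
  TARGETS = ["carbon.agents.*.cache.size", "carbon.agents.*.metricsReceived"]

  for target in TARGETS:
      query = urlencode({"target": target, "from": "-2d", "format": "json"})
      with urllib.request.urlopen(f"{GRAPHITE_URL}/render?{query}") as resp:
          for series in json.load(resp):
              # datapoints are [value, timestamp] pairs; value is None for gaps
              values = [v for v, _ in series["datapoints"] if v is not None]
              if values:
                  print(series["target"], "min:", min(values), "max:", max(values))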

Strangely, cpuUsage drops slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

Strangely again, we see an increase in the number of octets read from disk, and a decrease in the number of octets written.

We have carbon configured mostly with default values:

  • MAX_CACHE_SIZE = inf
  • MAX_UPDATES_PER_SECOND = 5000
  • MAX_CREATES_PER_MINUTE = 2000

Obviously, something has changed in our system, but we don't understand what, nor how we can find the cause …

Any help?

Best Answer

This is not a bug in the Graphite stack, but rather an IO bottleneck, most probably because your storage does not have high enough IOPS. Because of this, the queue keeps building up and overflows at around 4M. At that point you lose that much queued data, which shows up later as random 'gaps' in your graphs. Your system cannot keep up with the rate at which it is receiving metrics: it keeps filling up and overflowing.
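A rough sanity check of the numbers in the question supports this. This is only a sketch: it assumes carbon's default 60-second self-reporting interval, the "achievable" figure is purely hypothetical, and in reality a single whisper update can flush several queued points belonging to the same series:

  # Back-of-envelope check, assuming metricsReceived (~300K) is reported per
  # carbon's default 60-second interval and MAX_UPDATES_PER_SECOND = 5000.
  metrics_per_interval = 300_000
  interval_seconds = 60
  max_updates_per_second = 5_000

  required = metrics_per_interval / interval_seconds
  print(required)                 # 5000.0 -- exactly at the throttle, no headroom

  # If the disk only sustains, say, 4000 whisper updates/second (hypothetical),
  # the cache grows by the shortfall:
  achievable = 4_000
  backlog_per_hour = (required - achievable) * 3600
  print(backlog_per_hour)         # 3_600_000 queued datapoints per hour

At that pace a cache of a few million datapoints builds up within hours, which is consistent with the 50K-to-4M growth described above.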

Strangely, cpuUsage drops slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

This is because your system begins swapping and the CPUs get a lot of 'idle time' due to IO wait.
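One way to check this from the box itself is to look at iowait and swap traffic directly. A minimal sketch using psutil (an assumption on my part; vmstat or iostat show the same thing):

  # Sketch: confirm IO wait and swap activity on the carbon host.
  # Assumes psutil is installed; the iowait field is Linux-specific.
  import time
  import psutil

  cpu = psutil.cpu_times_percent(interval=1)
  print("iowait %:", getattr(cpu, "iowait", "n/a"))  # high iowait => CPUs stalled on disk

  swap_before, disk_before = psutil.swap_memory(), psutil.disk_io_counters()
  time.sleep(10)
  swap_after, disk_after = psutil.swap_memory(), psutil.disk_io_counters()

  # Bytes swapped and read/written during the 10-second sample window.
  print("swapped in :", swap_after.sin - swap_before.sin)
  print("swapped out:", swap_after.sout - swap_before.sout)
  print("disk read  :", disk_after.read_bytes - disk_before.read_bytes)
  print("disk write :", disk_after.write_bytes - disk_before.write_bytes)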

To add context: I have 500 provisioned IOPS at AWS on a system that receives some 40K metrics, and the queue is stable at 50K.
