Graphite stops collecting data randomly

graphite, metrics

We have a Graphite server collecting data through collectd, statsd, JMXTrans … For the past few days, we have frequently had holes in our data. Digging through the data we still have, we can see an increase in the carbon cache size (from 50K to 4M). We don't see an increase in the number of metrics collected (metricsReceived is stable at around 300K). The number of queries has increased from 1000 to 1500 on average.
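For reference, carbon publishes these counters about itself under carbon.agents.*, so they can be pulled back out through the render API to see exactly when the cache started growing. A minimal sketch in Python; the hostname and the time window ("-2d") are placeholders:

  # Sketch: pull carbon's self-reported counters back out through the
  # render API to see when the cache started growing.
  import json
  import urllib.request
  from urllib.parse import urlencode

  GRAPHITE_URL = "http://graphite.example.com"   # placeholder
  TARGETS = ["carbon.agents.*.cache.size", "carbon.agents.*.metricsReceived"]

  for target in TARGETS:
      query = urlencode({"target": target, "from": "-2d", "format": "json"})
      with urllib.request.urlopen(f"{GRAPHITE_URL}/render?{query}") as resp:
          for series in json.load(resp):
              # datapoints are [value, timestamp] pairs; value is None for gaps
              values = [v for v, _ in series["datapoints"] if v is not None]
              if values:
                  print(series["target"], "min:", min(values), "max:", max(values))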

Strangely, cpuUsage drops slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

Strangely again, we see an increase in the number of octets read from disk, and a decrease in the number of octets written.

We have carbon configured mostly with default values:

  • MAX_CACHE_SIZE = inf
  • MAX_UPDATES_PER_SECOND = 5000
  • MAX_CREATES_PER_MINUTE = 2000

Obviously, something has changed in our system, but we don't understand what, nor how we can find the cause …

Any help?

Best Answer

This is not a bug in the Graphite stack, but rather an IO bottleneck, most probably because your storage does not have high enough IOPS. Because of this, the queue keeps building up and overflows at around 4M. At that point you lose that much queued data, which shows up later as random 'gaps' in your graphs. Your system cannot keep up with the rate at which it is receiving metrics: it keeps filling up and overflowing.
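A rough sanity check of the numbers in the question supports this. This is only a sketch: it assumes carbon's default 60-second self-reporting interval, the "achievable" figure is purely hypothetical, and in reality a single whisper update can flush several queued points belonging to the same series:

  # Back-of-envelope check, assuming metricsReceived (~300K) is reported per
  # carbon's default 60-second interval and MAX_UPDATES_PER_SECOND = 5000.
  metrics_per_interval = 300_000
  interval_seconds = 60
  max_updates_per_second = 5_000

  required = metrics_per_interval / interval_seconds
  print(required)                 # 5000.0 -- exactly at the throttle, no headroom

  # If the disk only sustains, say, 4000 whisper updates/second (hypothetical),
  # the cache grows by the shortfall:
  achievable = 4_000
  backlog_per_hour = (required - achievable) * 3600
  print(backlog_per_hour)         # 3_600_000 queued datapoints per hour

At that pace a cache of a few million datapoints builds up within hours, which is consistent with the 50K-to-4M growth described above.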

Strangely, cpuUsage drops slightly, from 100% (we have 4 CPUs) to 50%, when the cache size increases.

This is because your system begins swapping and the CPUs get a lot of 'idle time' due to IO wait.
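One way to check this from the box itself is to look at iowait and swap traffic directly. A minimal sketch using psutil (an assumption on my part; vmstat or iostat show the same thing):

  # Sketch: confirm IO wait and swap activity on the carbon host.
  # Assumes psutil is installed; the iowait field is Linux-specific.
  import time
  import psutil

  cpu = psutil.cpu_times_percent(interval=1)
  print("iowait %:", getattr(cpu, "iowait", "n/a"))  # high iowait => CPUs stalled on disk

  swap_before, disk_before = psutil.swap_memory(), psutil.disk_io_counters()
  time.sleep(10)
  swap_after, disk_after = psutil.swap_memory(), psutil.disk_io_counters()

  # Bytes swapped and read/written during the 10-second sample window.
  print("swapped in :", swap_after.sin - swap_before.sin)
  print("swapped out:", swap_after.sout - swap_before.sout)
  print("disk read  :", disk_after.read_bytes - disk_before.read_bytes)
  print("disk write :", disk_after.write_bytes - disk_before.write_bytes)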

To add context: I have 500 provisioned IOPS at AWS on a system that receives some 40K metrics, and the queue is stable at 50K.
