Graphite/Carbon cluster returning incomplete data

Tags: cluster, graphite, load balancing

I'm trying to set up a Graphite/Carbon cluster. An elastic load balancer directs traffic between the two nodes in my cluster, each of which runs one web app, one relay, and one cache.
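For reference, each node runs a carbon-cache and a carbon-relay configured roughly like this (a sketch using carbon's default ports; it's just to pin down which daemon listens where):

```
# carbon.conf on each node -- a sketch of the per-node daemons
[cache]
LINE_RECEIVER_PORT = 2003    # plaintext datapoints
PICKLE_RECEIVER_PORT = 2004  # pickled datapoints (what relays send)
CACHE_QUERY_PORT = 7002      # the web app queries unflushed points here

[relay]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = consistent-hashing
```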

In this example, I sent 1000 counts for Metric1 to the cluster.

Here's a diagram:

[Diagram of Graphite cluster]

The problem

As the diagram shows, each server holds approximately half of the actual metric count, and a query through the web app returns only that half. According to this fantastic post, this is expected behavior, because the web app returns the first result it sees. This implies (and is documented) that nodes should only ever store complete counts; in my example, one or both of the nodes should hold the full 1000.
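For completeness, the web app's side of this lives in local_settings.py; mine looks roughly like the sketch below (hostnames are placeholders). Each node's web app asks its local carbon-cache for unflushed datapoints and the other node's web app for remote data, and it returns the first answer it gets for a given metric:

```
# /opt/graphite/webapp/graphite/local_settings.py on NodeA (sketch)
CLUSTER_SERVERS = ["nodeb.example.com:80"]  # the other node's web app
CARBONLINK_HOSTS = ["127.0.0.1:7002:a"]     # the local carbon-cache
```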

So my issue appears to be improper sharding and replication of the count. In my example above, a new count arriving from the web can be directed to either NodeA or NodeB, and I had assumed counts could enter the cluster through any relay. To test that assumption, I removed the load balancer and directed all incoming counts to NodeA's relay. This worked: the full count appeared on one node, replicated to the second, and was returned correctly by the web app.
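The relay settings during this single-entry test were presumably along these lines (a sketch; hosts are placeholders). With two destinations and REPLICATION_FACTOR = 2, every datapoint is sent to both caches, which is why the full count ends up on both nodes:

```
# carbon.conf [relay] section on NodeA -- sketch of the working test
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = 127.0.0.1:2004:a, nodeb.example.com:2004:b
```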

My question

The carbon-relay appears to act as an application-level load balancer. That's fine, but I'm concerned that once inbound traffic grows large enough, a single carbon-relay acting as the load balancer will become both a bottleneck and a single point of failure. I'd much rather use an actual load balancer to distribute incoming traffic evenly across the cluster's relays, but carbon-relay doesn't seem to play nice with that, hence the problem illustrated above.

  • Why did the relay cluster split Metric1 between the two caches in the scenario above, when the load balancer distributed incoming counts across different relays?
  • Can I use an elastic load balancer in front of my Graphite/Carbon cluster? Have I misconfigured my cluster for this purpose?
  • If I can't, should I put my primary carbon-relay on its own box to function as a load balancer? (A sketch of what that might look like follows below.)
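If a dedicated relay box is the way to go, I'd expect its config to look something like this sketch (hostnames are placeholders): the ELB, or the clients themselves, would send to this box, and it would fan out to the per-node relays.

```
# carbon.conf [relay] section on a dedicated relay box (hypothetical)
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
RELAY_METHOD = consistent-hashing
DESTINATIONS = nodea.example.com:2014:a, nodeb.example.com:2014:b
```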

Best Answer

Turns out my config's DESTINATIONS actually pointed at the carbon-caches instead of the other carbon-relay, due to a typo in the port number. Fixing the config so it actually matched the diagram pictured in the question seems to have fixed the problem: data now appears in complete form on each node (after replication).
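For anyone hitting the same thing: carbon's defaults put a cache's pickle receiver on port 2004 and a relay's pickle receiver on port 2014, so the mistake looked roughly like this (hostnames are placeholders, and the exact digits of my typo are illustrative):

```
# carbon.conf [relay] DESTINATIONS on NodeA

# BROKEN -- 2004 is NodeB's carbon-cache pickle port, so NodeA's
# relay shipped datapoints straight to the remote cache:
DESTINATIONS = 127.0.0.1:2004:a, nodeb.example.com:2004:b

# FIXED -- 2014 is NodeB's carbon-relay pickle port, matching the diagram:
DESTINATIONS = 127.0.0.1:2004:a, nodeb.example.com:2014:b
```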

As a side note, however, I'm now seeing inconsistent results from the web app's render API, as detailed in this question. It may or may not be related to the configuration described above.
