Massive scalability when querying metrics stored in Graphite

Is your Graphite backend not fast enough at receiving metrics, at answering queries, or both? Split it up for a massive performance boost.

Perhaps you've started with a single Graphite instance, collecting metrics from all your other instances, and you're also using it to query metrics in various ways. Perhaps you're collecting around 1k metrics from each instance, with a sampling period of 1 minute or less. You're doing this for dozens of instances, and everything works well.

But your infrastructure grows, and now you have hundreds of instances. And maybe you're collecting more metrics from each instance - you've added a StatsD collector to the mix, and engineers start throwing their own metrics through StatsD into the Graphite backend. The Graphite instance is now struggling, and there's only so much you can do by using faster instance types.

Here's the thing: don't do that. Don't limit yourself to a single Graphite backend. Split your backend into multiple instances, and aggregate all the data with a single frontend. This configuration can easily scale to handle very large amounts of metrics.

Graphite architecture

Graphite has several components working together. Here are the ones that matter for this setup.

Whisper is the on-disk database that provides long-term storage for metrics. Carbon is the receiver for metrics. A WSGI webapp frontend provides a REST API for extracting the data out of Whisper for analysis and display.
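
To make the data flow concrete, here's a minimal sketch of sending one data point to Carbon over its plaintext protocol. It assumes Carbon's default plaintext listener on port 2003, and the hostname is a placeholder for one of your own receivers.

import socket
import time

CARBON_HOST = "carbon.example.com"   # placeholder; use your Carbon receiver
CARBON_PORT = 2003                   # Carbon's default plaintext listener

def send_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("aws.us-west-1.dev10.tomcat.tomcat5.os.cpu.idle", 87.5)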

In a single-instance scenario, the webapp frontend typically gets all its data from the local Whisper database. But a semi-hidden feature is that the webapp can instead be configured to query other webapps. If you do that, the other webapps act as simple proxies, and your aggregator webapp ends up querying multiple Whisper backends on other instances. This is the key to enabling frontend clustering for Graphite.

Frontend clustering

Create several local Graphite instances, each one running a full set of Carbon receiver + Whisper DB + frontend webapp. Split the metrics evenly across your various Carbon receivers, then create a single aggregator frontend and point it at all the local frontends.
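
If you can't easily point each group of collectors at its own receiver, one way to do the split is to put carbon-relay in front and route by metric name. Here's a rough relay-rules.conf sketch, assuming the naming convention described later in this article, RELAY_METHOD = rules in the [relay] section of carbon.conf, and the same placeholder IP addresses used below (each destination being a carbon-cache pickle listener on its default port 2004); treat it as a starting point rather than a drop-in config.

# relay-rules.conf on the relay host; requires RELAY_METHOD = rules
# in the [relay] section of carbon.conf
[us_west_dev10]
pattern = ^aws\.us-west-1\.dev10\.
destinations = 1.2.3.4:2004

[us_east_dev5]
pattern = ^aws\.us-east-1\.dev5\.
destinations = 5.6.7.8:2004

# anything that matches no pattern goes here
[default]
default = true
destinations = 9.10.11.12:2004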

To configure the aggregator frontend and make it query all local frontends, the config file you need is local_settings.py on the aggregator. Its location depends on how you've installed Graphite. If you've installed with a package, it might be somewhere in /etc. If you've compiled from source and installed into /opt/graphite, it might be in /opt/graphite/webapp/graphite or thereabouts.

In that file, the variable that contains the cluster settings is CLUSTER_SERVERS. Here's an example from an aggregator frontend using three local Graphite instances:

CLUSTER_SERVERS = ["1.2.3.4:8080", "5.6.7.8:8080", "9.10.11.12:8080"]

The example assumes your local frontends are all listening on port 8080; you're pointing the aggregator frontend at each of them by IP address. I'm guessing a hostname would work as well, but I've never tried it. Edit this variable, restart the webserver hosting your aggregator frontend, and you're ready to go.

For completeness, here's the /etc/httpd/conf.d/graphite.conf I've used to make the WSGI webapp work under Apache. This file should be fairly similar on all Graphite frontends, local or aggregator. It assumes Graphite is installed in /opt/graphite.

Listen 8080
<VirtualHost *:8080>

  ServerName graphiteXYZ.domain.com

  # enable CORS to make it work when webserver hostname is not right,
  # or when Graphite and Grafana are on different hostnames.
  # disable it otherwise
  Header set Access-Control-Allow-Origin "*"
  Header set Access-Control-Allow-Methods "GET, OPTIONS"
  Header set Access-Control-Allow-Headers "origin, authorization, accept"
  # CORS ends here

  <Directory "/opt/graphite/webapp">
    AllowOverride None
    Require all granted
  </Directory>

  WSGIDaemonProcess graphite processes=5 threads=5 display-name='%{GROUP}' \
      inactivity-timeout=120 user=graphite group=graphite
  WSGIProcessGroup graphite
  WSGIImportScript /opt/graphite/webapp/graphite.wsgi \
      process-group=graphite application-group=%{GLOBAL}
  WSGIScriptAlias / /opt/graphite/webapp/graphite.wsgi

  Alias /content/ /opt/graphite/webapp/content/
  <Location "/content/">
    SetHandler None
  </Location>

  ErrorLog /var/log/httpd/graphite-web_error.log

  # Possible values include: debug, info, notice, warn, error, crit,
  # alert, emerg.
  LogLevel warn

  CustomLog /var/log/httpd/graphite-web_access.log combined

</VirtualHost>

Be careful with the metrics namespace

In such a cluster, when you send a query to the aggregator frontend to retrieve a metric, the query is forwarded to all local Graphite instances behind it. If the same metric name is hosted on several local Graphite instances, the one that answers first wins; all other results are discarded.

For that reason, it's best to make sure the metrics namespace on one local Graphite instance does not overlap with that of any other instance. This is easily accomplished if all your metrics follow a rational naming convention, such as:

provider.region.vpc-name.instance-type.instance-id.type-of-metrics.service.metric-name

Example name for a single metric coming out of one instance:

aws.us-west-1.dev10.tomcat.tomcat5.os.cpu.idle

Then, if each VPC has its own local Graphite instance, overlaps will be automatically eliminated. The Graphite instance in us-west-1.dev10 will have metrics such as...

aws.us-west-1.dev10.tomcat.tomcat5.os.cpu.idle

...whereas the Graphite instance in us-east-1.dev5 will have metrics such as:

aws.us-east-1.dev5.tomcat.tomcat5.os.cpu.idle

Then you could do global queries such as...

*.*.*.tomcat.*.os.cpu.idle

...and you'll get the CPU idle metric for all Tomcat instances everywhere, without any overlaps.
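
If you want to check this from a script before wiring up dashboards, here's a small sketch that runs the same global query against the aggregator's render API and prints what comes back. The aggregator hostname is a placeholder, and the query assumes the naming convention above.

import json
import urllib.parse
import urllib.request

AGGREGATOR = "http://graphite-aggregator.domain.com:8080"  # placeholder hostname

params = urllib.parse.urlencode({
    "target": "*.*.*.tomcat.*.os.cpu.idle",  # fans out to all local instances
    "from": "-1h",                           # last hour of data
    "format": "json",
})

with urllib.request.urlopen("%s/render?%s" % (AGGREGATOR, params)) as resp:
    series = json.load(resp)

# one entry per matching metric, each carrying its full namespaced name
for s in series:
    print(s["target"], len(s["datapoints"]), "datapoints")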

Just make sure each local Graphite instance is not overloaded, and you can scale your Graphite infrastructure this way to handle a tremendous number of individual metrics.

Once the aggregator frontend is running correctly and can query all local Graphite instances, you can install a dashboard such as Grafana and point it at the aggregator. It will work as if it were using a single giant Graphite backend with all your metrics available at once. The Grafana query builder will discover metrics from all the local Graphite instances and use them to create live graphs.
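
If you provision Grafana from files rather than through the UI, a datasource definition along these lines points it at the aggregator; the hostname is again a placeholder, and the exact provisioning layout depends on your Grafana version.

# e.g. /etc/grafana/provisioning/datasources/graphite.yaml (path may vary)
apiVersion: 1
datasources:
  - name: Graphite aggregator
    type: graphite
    access: proxy
    url: http://graphite-aggregator.domain.com:8080   # placeholder hostname
    isDefault: true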
