Calculating days until disk is full

disk-space-utilization graphite monitoring

We use Graphite to track the history of disk utilisation over time. Our alerting system looks at the data from Graphite and alerts us when the free space falls below a certain number of blocks.

I'd like to get smarter alerts – what I really care about is "how long do I have before I have to do something about the free space?", e.g. if the trend shows that I'll run out of disk space in 7 days then raise a Warning, and if it's less than 2 days then raise an Error.

Graphite's standard dashboard interface can be pretty smart with derivatives and Holt-Winters confidence bands, but so far I haven't found a way to turn this into actionable metrics. I'm also fine with crunching the numbers in other ways (e.g. extracting the raw numbers from Graphite and running a script over them).

One complication is that the graph is not smooth – files get added and removed, but the general trend over time is for disk space usage to increase, so perhaps there is a need to look at local minima (if looking at the "disk free" metric) and draw a trend line through the troughs.
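Something along these lines is what I have in mind – a rough sketch only, where the Graphite host and the series name are made up and would need to be replaced:

    # Pull a week of "disk free" from Graphite's render API, keep only the
    # local minima (the troughs), fit a straight line through them, and
    # report how many days until that line crosses zero.
    import requests

    GRAPHITE = "http://graphite.example.com"  # placeholder host
    TARGET = "df.root.free_bytes"             # placeholder series name

    resp = requests.get(f"{GRAPHITE}/render",
                        params={"target": TARGET, "from": "-7d", "format": "json"})
    points = [(ts, val) for val, ts in resp.json()[0]["datapoints"] if val is not None]

    # Local minima: samples no higher than either neighbour.
    troughs = [points[i] for i in range(1, len(points) - 1)
               if points[i][1] <= points[i - 1][1] and points[i][1] <= points[i + 1][1]]
    if len(troughs) < 2:
        troughs = points  # not enough troughs, fall back to all samples

    # Least-squares line through the troughs: free = a*t + b.
    n = len(troughs)
    sum_t = sum(t for t, _ in troughs)
    sum_v = sum(v for _, v in troughs)
    sum_tt = sum(t * t for t, _ in troughs)
    sum_tv = sum(t * v for t, v in troughs)
    a = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)
    b = (sum_v - a * sum_t) / n

    if a >= 0:
        print("Free space is not shrinking; no ETA.")
    else:
        t_full = -b / a  # timestamp where the fitted line hits zero
        days_left = (t_full - troughs[-1][0]) / 86400
        print(f"~{days_left:.1f} days until full")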

Has anyone done this?

Best Answer

We keep a "mean time till full" or "mean time to failure" metric for this purpose, using the statistical trend and its standard deviation to add smarter (less dumb) logic on top of a simple static threshold.

Simplest Alert: Just an arbitrary threshold. Doesn't take anything about the actual disk-space usage into account.

  • Example: current% > 90%

Simple TTF: A little smarter. Take the unused percentage, subtract a buffer, and divide by the zero-protected growth rate. Not very statistically robust, but it has saved my butt a few times when my users upload their cat video corpus (true story).

  • Example: (100% - 5% - current%) / MAX(rate(current%), 0.001%)
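Plugging in some arbitrary numbers to show the shape of the calculation (the 82% usage and 0.9 %/day growth rate are made up):

    # Simple TTF in days, assuming usage is in percent and the rate is percent per day.
    def simple_ttf_days(current_pct, rate_pct_per_day, buffer_pct=5.0):
        # Remaining headroom (minus the buffer) divided by the growth rate,
        # with the rate clamped so a flat or shrinking volume can't divide by zero.
        return (100.0 - buffer_pct - current_pct) / max(rate_pct_per_day, 0.001)

    print(simple_ttf_days(82.0, 0.9))  # (100 - 5 - 82) / 0.9 ≈ 14.4 days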

Better TTF: But I wanted to avoid alerting on static read-only volumes sitting at 99% (unless they ever change), I wanted more proactive notice for noisy volumes, and I wanted to detect applications with unmanaged disk-space footprints. Oh, and the occasional negative values in the Simple TTF just bothered me.

  • Example: MAX(100% - 1% - stdev(current%) - current%, 0) / MAX(rate(current%), 0.001%)
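A rough equivalent over a window of recent samples, in the same sketch style (one sample per day is assumed purely to keep the rate arithmetic simple):

    import statistics

    def better_ttf_days(samples_pct, buffer_pct=1.0):
        # samples_pct: recent "used %" samples, oldest first, one per day (assumed).
        current = samples_pct[-1]
        stdev = statistics.pstdev(samples_pct)
        rate = (samples_pct[-1] - samples_pct[0]) / max(len(samples_pct) - 1, 1)
        headroom = max(100.0 - buffer_pct - stdev - current, 0.0)  # clamps out the negatives
        return headroom / max(rate, 0.001)

    # A noisy volume: the standard deviation eats into the headroom, so it alerts earlier.
    print(better_ttf_days([70, 74, 71, 76, 73, 78, 80]))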

I still keep a static buffer of 1%. Both the standard deviation and the consumption rate increase on abnormal usage patterns, which sometimes overcompensates. In Grafana or Alertmanager speak you'll end up with some rather expensive sub-queries, but I did get the smoother time series and the less noisy alerts I was seeking.

  • Example: clamp_min((100 - 1 - stddev_over_time(usedPct{}[12h:]) - max_over_time(usedPct{}[6h:])) / clamp_min(deriv(usedPct{}[12h:]), 0.00001), 0)

Quieter drives make for very smooth alerts.

Longer ranges tame even the noisiest public volumes.