We use Graphite to track the history of disk utilisation over time. Our alerting system looks at the data from Graphite and alerts us when the free space falls below a certain number of blocks.
I'd like to get smarter alerts – what I really care about is "how long do I have before I have to do something about the free space?". For example, if the trend shows that in 7 days I'll run out of disk space, raise a Warning; if it's less than 2 days, raise an Error.
Graphite's standard dashboard interface can be pretty smart with derivatives and Holt-Winters confidence bands, but so far I haven't found a way to convert this into actionable metrics. I'm also fine with crunching the numbers in other ways (just extracting the raw numbers from Graphite and running a script over them).
One complication is that the graph is not smooth – files get added and removed, but the general trend over time is for disk space usage to increase, so perhaps there is a need to look at local minima (if looking at the "disk free" metric) and draw a trend between the troughs.
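The trough-trend idea can be sketched in a short script run over raw samples pulled from Graphite's render API. This is a minimal illustration, assuming the data arrives as `(timestamp, free_bytes)` pairs; the function names are my own, not anything Graphite provides.

```python
# Sketch: fit a linear trend through the local minima ("troughs") of a
# disk-free series, then extrapolate to the time free space hits zero.
# Assumes samples is a list of (timestamp, free_value) pairs with at
# least three points and at least two troughs.

def troughs(samples):
    """Return the (t, v) points that are local minima of the series."""
    return [samples[i] for i in range(1, len(samples) - 1)
            if samples[i][1] <= samples[i - 1][1]
            and samples[i][1] <= samples[i + 1][1]]

def linear_fit(points):
    """Ordinary least-squares fit v = a + b*t over (t, v) points."""
    n = len(points)
    st = sum(t for t, _ in points)
    sv = sum(v for _, v in points)
    stt = sum(t * t for t, _ in points)
    stv = sum(t * v for t, v in points)
    b = (n * stv - st * sv) / (n * stt - st * st)
    a = (sv - b * st) / n
    return a, b

def time_of_exhaustion(samples):
    """Timestamp at which the trough trend line crosses zero free space."""
    a, b = linear_fit(troughs(samples))
    if b >= 0:               # free space is not shrinking: never "full"
        return float("inf")
    return -a / b            # t where a + b*t == 0
```

Subtracting the newest sample's timestamp from `time_of_exhaustion(...)` gives the remaining seconds, which maps directly onto the 7-day/2-day thresholds above.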
Has anyone done this?
Best Answer
We keep a "mean time till full" or "mean time to failure" metric for this purpose, using the statistical trend and its standard deviation to add the smarter (less dumb) logic over a simple static threshold.
Simplest Alert: Just an arbitrary threshold. Doesn't consider anything to do with the actual diskspace usage.
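For contrast, the static check is a one-liner; the threshold value here is arbitrary, as the tier's name suggests.

```python
# Simplest alert: a fixed free-blocks threshold, ignoring usage trends.
def simplest_alert(free_blocks, threshold_blocks=1_000_000):
    return free_blocks < threshold_blocks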
Simple TTF: A little smarter. Calculate the unused percentage minus a buffer and divide by the zero-protected rate. Not very statistically robust, but it has saved my butt a few times when my users upload their cat video corpus (true story).
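As arithmetic, Simple TTF is just headroom over rate. A sketch, assuming the rate is measured in percentage points per hour (the unit choice and names are mine):

```python
# Simple TTF: (unused percentage - buffer) / zero-protected rate.
# rate_pct_per_hour is the recent consumption rate; eps guards division
# by zero when the volume is idle.
def simple_ttf_hours(used_pct, rate_pct_per_hour, buffer_pct=1.0, eps=1e-5):
    unused = 100.0 - buffer_pct - used_pct
    return unused / max(rate_pct_per_hour, eps)
```

Note that once `used_pct` exceeds `100 - buffer_pct` the result goes negative, which is exactly the quirk the next tier irons out.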
Better TTF: But I wanted to avoid alerting on static read-only volumes sitting at 99% (unless they actually change), I wanted more proactive notice for noisy volumes, and I wanted to detect applications with unmanaged disk-space footprints. Oh, and the occasional negative values in the Simple TTF just bothered me.
I still keep a static buffer of 1%. Both the standard deviation and the consumption rate increase on abnormal usage patterns, which sometimes overcompensates. In Grafana or Alertmanager speak you'll end up with some rather expensive subqueries. But I did get the smoother time series and the less noisy alerts I was seeking.
clamp_min((100 - 1 - stddev_over_time(usedPct{}[12h:]) - max_over_time(usedPct{}[6h:])) / clamp_min(deriv(usedPct{}[12h:]),0.00001), 0)
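If you'd rather run this against raw numbers scripted out of Graphite (as the question suggests) than as a Prometheus subquery, the same statistics translate directly. A sketch, mirroring `stddev_over_time`, `max_over_time`, `deriv`, and the two `clamp_min`s; the window arguments are lists of `(timestamp, used_pct)` pairs and all names are my own:

```python
from statistics import pstdev

def deriv(samples):
    """Per-second least-squares slope of (t, v) pairs, like PromQL deriv()."""
    n = len(samples)
    st = sum(t for t, _ in samples)
    sv = sum(v for _, v in samples)
    stt = sum(t * t for t, _ in samples)
    stv = sum(t * v for t, v in samples)
    return (n * stv - st * sv) / (n * stt - st * st)

def better_ttf_seconds(window_12h, window_6h, buffer_pct=1.0):
    """Seconds until full, clamped at zero, per the query above."""
    headroom = (100.0 - buffer_pct
                - pstdev(v for _, v in window_12h)   # stddev_over_time 12h
                - max(v for _, v in window_6h))      # max_over_time 6h
    rate = max(deriv(window_12h), 1e-5)              # zero-protected, pct/sec
    return max(headroom / rate, 0.0)                 # outer clamp_min(..., 0)
```

A perfectly static volume gets a huge (but finite) TTF thanks to the rate floor, and the outer clamp removes the negative values the Simple TTF could emit.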
Quieter drives make for very smooth alerts.
Longer ranges tame even the noisiest public volumes.
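Feeding any of these TTF estimates back into the asker's 7-day/2-day thresholds is then a trivial comparison; a sketch using those numbers:

```python
# Map a time-till-full estimate (seconds) onto the question's alert levels.
DAY = 86400

def alert_level(ttf_seconds):
    if ttf_seconds < 2 * DAY:
        return "error"
    if ttf_seconds < 7 * DAY:
        return "warning"
    return "ok"
```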