Benchmarking – Why Discard the Lowest Time?

benchmarking, statistics

I've quite frequently seen benchmarks where the tester discarded the highest and the lowest time out of N runs.

Discarding the highest time I understand; it's probably high because some other process suddenly demanded more CPU.

But doesn't the lowest time indicate the best possible performance when the benchmark was running full tilt without interruptions?

Best Answer

The lowest timing might indeed represent the "true" timing without outside interference, or it might be a measurement error. E.g. the boosting behaviour of a CPU might speed up the first runs of a larger benchmarking suite, or a less congested network might speed up a network-dependent benchmark at certain times of day. Similarly, the highest timing might represent the true worst case, or non-representative interference. E.g. a background service might have started during the benchmark, or an SMR hard drive's cache might be flushed during an IO-bound benchmark.

Such interference indicates a flawed experimental design that fails to control for these influences, but it's not always possible or economical to design the perfect experiment. So we have to deal with the messy real-world data that we have.

Statistics like the mean (average) of a set of values are very sensitive to outliers. It is thus common to use a trimmed mean, where we remove outliers in the hope of getting closer to the "true" mean. Various methods for determining outliers exist; the simplest approach is to remove the top/bottom p%, for some value p. Another option is to use techniques like bootstrapping that let us estimate how reliable the estimate is: instead of removing the top/bottom observations, we remove random observations and repeat the calculation many times.
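As an illustration, here is a minimal sketch of both ideas in Python using NumPy and SciPy; the timing values are made up for demonstration:

```python
import numpy as np
from scipy import stats

# Hypothetical run times in seconds; one run was disturbed by background load.
timings = np.array([1.02, 0.98, 1.01, 1.05, 0.97, 1.03, 2.40, 1.00])

plain_mean = timings.mean()
trimmed = stats.trim_mean(timings, proportiontocut=0.1)  # drop top/bottom 10%

# Bootstrap: resample with replacement and look at the spread of the resulting means.
rng = np.random.default_rng(0)
boot_means = [rng.choice(timings, size=len(timings), replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={plain_mean:.3f}s  trimmed mean={trimmed:.3f}s  "
      f"95% bootstrap CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```

The trimmed mean is pulled much less by the single 2.40 s outlier than the plain mean, and the width of the bootstrap interval shows how much that outlier makes the estimate of the mean unreliable.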

However, it is not generally necessary to calculate the mean run time when benchmarking. For comparing typical behaviour, we can use measures like the median or other quantiles. Especially when measuring latencies, quantiles like the 95th percentile are often used (meaning: 95% of measurements were this fast or faster).
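A small sketch of summarizing the same hypothetical timings by quantiles instead of the mean:

```python
import numpy as np

timings = np.array([1.02, 0.98, 1.01, 1.05, 0.97, 1.03, 2.40, 1.00])  # seconds

median = np.median(timings)          # typical run
p95 = np.percentile(timings, 95)     # 95% of runs were this fast or faster

print(f"median={median:.3f}s  95th percentile={p95:.3f}s")
```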

It is also unnecessary to calculate the mean when trying to determine whether one program is significantly faster than another. Instead of parametric statistical tests, which require such parameters to be estimated from the sample, we can use non-parametric tests. E.g. the Mann–Whitney rank-sum test only considers the order of the values from the two samples, not their actual magnitudes. While this is more flexible and relies on fewer assumptions, it does lose some statistical power (you might fail to detect a significant difference even though one exists).
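A minimal sketch of such a comparison with SciPy's Mann–Whitney U implementation; the two samples below are invented run times for two hypothetical programs:

```python
import numpy as np
from scipy import stats

runs_a = np.array([1.02, 0.98, 1.01, 1.05, 0.97, 1.03, 1.00, 0.99])  # seconds
runs_b = np.array([1.10, 1.08, 1.12, 1.06, 1.11, 1.09, 1.07, 1.13])

# The test uses only the ranks of the pooled observations, not their magnitudes.
statistic, p_value = stats.mannwhitneyu(runs_a, runs_b, alternative="two-sided")

print(f"U={statistic:.1f}, p={p_value:.4f}")
if p_value < 0.05:
    print("The run-time distributions differ significantly at the 5% level.")
```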
