I’m not a great fan of micro-benchmarks. They’re handy as a complement to the larger tests, but I often find that they lead to confusion because they are interpreted as reflecting reality when nothing could be further from the truth. The fact is that micro-benchmarks are intended to magnify some particular phenomenon to allow it to be studied. You can expect that normal things like cache pressure, even memory consumption generally, will be hidden, often on purpose for a particular test or tests. Nonetheless many people, including me, feel more comfortable when there is a battery of micro-benchmarks to back up the “real” tests, because micro-benchmarks are much more specific and therefore more actionable.
However, benchmarks in general have serious stability issues. Sometimes even run-to-run stability over the same test bits is hard to achieve. I have some words of caution regarding the variability of results in micro-benchmarks, and benchmarks generally.
Even if you do your best to address the largest sources of internal variability in a benchmark by techniques like controlling GCs, adequate warm-up, synchronizing participating processes, making sensible affinity choices, and whatnot, you are still left with significant sources of what I will call external variability.
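To make those techniques concrete, here is a minimal sketch of such a harness in Python. The names and parameters are my own illustration rather than any particular framework’s: it pins the process to one CPU, runs warm-up iterations, and forces a collection before timing while keeping the collector quiet during it.

```python
import gc
import os
import statistics
import time

def run_benchmark(workload, warmup=50, iterations=200, cpu=0):
    """Time `workload` while controlling common internal-variability sources."""
    # Pin to a single CPU so the scheduler can't migrate us mid-run
    # (Linux-only API; skipped where it isn't available).
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})

    # Warm up: populate caches and trigger any lazy initialization or JIT.
    for _ in range(warmup):
        workload()

    # Collect garbage now, then keep the collector quiet while timing.
    gc.collect()
    gc.disable()
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            workload()
            samples.append(time.perf_counter() - start)
    finally:
        gc.enable()

    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean  # coefficient of variation
    return mean, cv
```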
The symptom of external variability is that you’ll get nice tight results from your benchmark (a good benchmark might have a Coefficient of Variation of well below 1%) but the observed result is much slower (or, rarely, faster) than a typical run. The underlying cause varies; some typical ones are:
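- background tasks kicking in mid-run: indexers, updaters, virus scans, scheduled maintenance
- other processes (or, on shared hardware, other tenants) competing for CPU, disk, or network
- thermal throttling and power-management state changes
- lab machines that differ from one another, or that drift from their baseline configuration over time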
These problems are often not the dominant sources of variability initially, but as the benchmark is improved and other factors are controlled for, they become dominant.
Statistically, the underlying issue is this: repeated benchmark measurements in a run are not independent; a phenomenon affecting a particular iteration is highly likely to also affect the next iteration. A closer statistical model might be independent measurements punctuated by a Poisson process that produces long-lasting disruptions. “Waiter, there is a fish in my benchmark.”
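A small simulation makes the distinction vivid. This is a sketch with made-up parameters (the arrival rate and slowdown below are purely illustrative): within a run the samples are i.i.d. noise, but disruptions arrive as a Poisson process and outlast a whole run, inflating every sample in the runs they touch.

```python
import random
import statistics

LAMBDA = 0.1     # expected disruptions per run (hypothetical rate)
SLOWDOWN = 1.3   # a disruption inflates every timing by 30% (made up)

def one_run(disrupted, iterations=200, base=1.0, noise=0.005):
    factor = SLOWDOWN if disrupted else 1.0
    return [random.gauss(base, noise) * factor for _ in range(iterations)]

random.seed(7)
for run in range(20):
    # Poisson arrivals: the run is disrupted if an event lands within it.
    disrupted = random.expovariate(LAMBDA) < 1.0
    samples = one_run(disrupted)
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean
    flag = "  <-- disrupted" if disrupted else ""
    print(f"run {run:2d}: mean={mean:.4f}  CV={cv:.2%}{flag}")
```

Every run reports a comfortably tight CV, yet the disrupted runs come out about 30% slower; no amount of within-run statistics will reveal that, only comparison across runs can.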
My experience with this sort of thing, and I expect the battle-scarred will agree, is that reducing external variability is an ongoing war. There isn’t really a cure. But it does pay not to have a hair-trigger reflex on regression warnings, and to screen for false positives in your results.
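One simple form of that screening, sketched below with thresholds I chose arbitrarily: compare medians across several runs, so that a single fishy run cannot trip the alarm, and only flag shifts that reproduce.

```python
import statistics

def is_regression(baseline_runs, candidate_runs, threshold=0.02):
    """Flag a regression only when the median of several candidate runs
    exceeds the baseline median by `threshold` (2% here, arbitrarily).
    Medians across runs keep one disrupted run from raising a false alarm."""
    base = statistics.median(baseline_runs)
    cand = statistics.median(candidate_runs)
    return (cand - base) / base > threshold

# Hypothetical per-run means: one fishy run doesn't trip the alarm...
print(is_regression([1.00, 1.01, 0.99], [1.00, 1.30, 1.01]))  # False
# ...but a shift that reproduces across runs does.
print(is_regression([1.00, 1.01, 0.99], [1.06, 1.07, 1.05]))  # True
```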