| Copyright | (c) 2021 Andrew Lelechenko |
|---|---|
| License | MIT |
| Safe Haskell | None |
| Language | Haskell2010 |
Featherlight benchmark framework (only one file!) for performance measurement with an API mimicking criterion and gauge.
How lightweight is it?

There is only one source file Test.Tasty.Bench and no external dependencies except tasty. So if you already depend on tasty for a test suite, there is nothing else to install. Compare this to criterion (10+ modules, 50+ dependencies) and gauge (40+ modules, depends on basement and vector).
How is it possible?

Our benchmarks are literally regular tasty tests, so we can leverage all the existing machinery for command-line options, resource management, structuring, listing and filtering benchmarks, running and reporting results. It also means that tasty-bench can be used in conjunction with other tasty ingredients.
Unlike criterion and gauge, we use a very simple statistical model, described below. This is arguably a questionable choice, but it works pretty well in practice. A rare developer is sufficiently well-versed in probability theory to make sense of, and use, all the numbers generated by criterion.
How to switch?
Cabal mixins allow you to taste tasty-bench instead of criterion or gauge without changing a single line of code:
```
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion)
```
This works vice versa as well: if you use tasty-bench, but at some point need a more comprehensive statistical analysis, it is easy to switch temporarily back to criterion.
How to write a benchmark?
Benchmarks are declared in a separate section of the cabal file:

```
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
```
And here is BenchFibo.hs:

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
```
Since tasty-bench provides an API compatible with criterion, one can refer to its documentation for more examples.
How to read results?
Running the example above (cabal bench or stack bench) results in the following output:

```
All
  fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
```
The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall time), its mean time was 63 nanoseconds and, assuming ideal precision of a system clock, execution time does not often diverge from the mean further than ±3.4 nanoseconds (double standard deviation, which for normal distributions corresponds to 95% probability). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.
Note that this data is not directly comparable with criterion output:

```
benchmarking fibonacci numbers/fifth
time                 62.78 ns   (61.99 ns .. 63.41 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 62.39 ns   (61.93 ns .. 62.94 ns)
std dev              1.753 ns   (1.427 ns .. 2.258 ns)
```
One might interpret the second line as saying that 95% of measurements fell into the 61.99–63.41 ns interval, but this is wrong. It states that the OLS regression of execution time (which is not exactly the mean time) most probably lies somewhere between 61.99 ns and 63.41 ns, but says nothing about individual measurements. To understand how far a typical measurement deviates, you need to add/subtract double standard deviation yourself (which gives 62.78 ns ± 3.506 ns, similar to the tasty-bench output above).
To add to the confusion, gauge in --small mode outputs not the second line of the criterion report, as one might expect, but the mean value from the penultimate line together with the standard deviation:

```
fibonacci numbers/fifth  mean 62.39 ns  ( +- 1.753 ns )
```

The interval ±1.753 ns covers only 68% of samples; double it to estimate the behavior in 95% of cases.
Statistical model
Here is the procedure used by tasty-bench to measure execution time:

1. Set \( n \leftarrow 1 \).
2. Measure execution time \( t_n \) of \( n \) iterations and execution time \( t_{2n} \) of \( 2n \) iterations.
3. Find \( t \) which minimizes the deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \).
4. If the deviation is small enough (see --stdev below), return \( t \) as the mean execution time.
5. Otherwise set \( n \leftarrow 2n \) and jump back to Step 2.
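The fitting step admits a simple closed form under a least-squares reading: minimizing the squared deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \) over \( t \) yields \( t = (t_n + 2 t_{2n}) / (5n) \). A minimal sketch of that step (hypothetical code, not tasty-bench's actual implementation; `fitIteration` is a made-up name):

```haskell
-- Hypothetical sketch of the fitting step, not tasty-bench's actual code.
-- Least-squares fit: minimize (n*t - tn)^2 + (2*n*t - t2n)^2 over t.
-- Setting the derivative to zero gives t = (tn + 2*t2n) / (5*n).
fitIteration :: Double -> Double -> Double -> Double
fitIteration n tn t2n = (tn + 2 * t2n) / (5 * n)

-- For a perfectly linear sample (tn = n, t2n = 2n) the per-iteration
-- time of 1.0 is recovered exactly, e.g. fitIteration 10 10 20 == 1.0.
```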
This is roughly similar to the linear regression approach which criterion takes, but we fit only the last two points. This allows us to simplify away all the heavy-weight statistical analysis. More importantly, earlier measurements, which are presumably shorter and noisier, do not affect the overall result. This is in contrast to criterion, which fits all measurements and is biased towards shorter runs, which contribute more data points (it employs an \( n \leftarrow 1.05n \) progression).
An alert reader could object that we measure standard deviation for samples with \( n \) and \( 2n \) iterations, but report it scaled to a single iteration. Strictly speaking, this is justified only if we assume that deviating factors are either roughly periodic (e. g., coarseness of a system clock, garbage collection) or are likely to affect several successive iterations in the same way (e. g., slow down by another concurrent process).
Obligatory disclaimer: statistics is a tricky matter; there is no one-size-fits-all approach. In the absence of a good theory, simplistic approaches are as (un)sound as obscure ones. Those who seek statistical soundness should rather collect raw data and process it themselves in R/Python. Data reported by tasty-bench is of indicative and comparative significance only.
Tip
Passing +RTS -T (via cabal bench --benchmark-options '+RTS -T' or stack bench --ba '+RTS -T') enables tasty-bench to estimate and report memory usage such as allocated and copied bytes.
Command-line options
Use --help to list command-line options.

-p, --pattern
    This is a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.

--csv
    File to write results in CSV format. If specified, suppresses console output.

-t, --timeout
    This is a standard tasty option, setting a timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before the timeout. Setting the timeout too tight (insufficient for at least three iterations of a benchmark) will result in a benchmark failure. Do not use --timeout without a reason: it forks an additional thread and thus affects the reliability of measurements.

--stdev
    Target relative standard deviation of measurements in percents (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise ones. If it takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation.
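Since benchmarks are ordinary tasty tests, standard tasty options can also be set per-subtree programmatically. A sketch attaching a timeout to one group only, using localOption and mkTimeout from Test.Tasty (mkTimeout takes microseconds):

```haskell
import Test.Tasty (localOption, mkTimeout)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- Limit only this subtree to 100 seconds (argument is in microseconds).
    localOption (mkTimeout 100000000)
  $ bgroup "long-running"
    [ bench "factorial" $ nf (\n -> product [1 .. n :: Integer]) 10000 ]
  ]
```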
Synopsis
- defaultMain :: [Benchmark] -> IO ()
- type Benchmark = TestTree
- bench :: String -> Benchmarkable -> Benchmark
- bgroup :: String -> [Benchmark] -> Benchmark
- data Benchmarkable
- nf :: NFData b => (a -> b) -> a -> Benchmarkable
- whnf :: (a -> b) -> a -> Benchmarkable
- nfIO :: NFData a => IO a -> Benchmarkable
- whnfIO :: NFData a => IO a -> Benchmarkable
- nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable
- whnfAppIO :: (a -> IO b) -> a -> Benchmarkable
- csvReporter :: Ingredient
Running Benchmark
defaultMain :: [Benchmark] -> IO () Source #
Run benchmarks and report results. This is a wrapper around tasty's defaultMain (plus csvReporter), providing an interface compatible with criterion's defaultMain and gauge's defaultMain.
bench :: String -> Benchmarkable -> Benchmark Source #
Attach a name to Benchmarkable. This is actually a synonym of tasty's singleTest, providing an interface compatible with criterion's bench and gauge's bench.
Creating Benchmarkable
data Benchmarkable Source #
Something that can be benchmarked. A drop-in replacement for criterion's Benchmarkable and gauge's Benchmarkable.

Instances

| IsTest Benchmarkable | Source # |
| Defined in Test.Tasty.Bench |
nf :: NFData b => (a -> b) -> a -> Benchmarkable Source #
nf f x measures time to compute a normal form (by means of rnf) of f x.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios (imagine benchmarking tail), especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
whnf :: (a -> b) -> a -> Benchmarkable Source #
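The difference between whnf and nf matters for lazy structures: whnf forces only the outermost constructor, so the benchmark below (a hypothetical sketch) does almost no work in its whnf variant, while the nf variant forces every element:

```haskell
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- Forces only the first (:) constructor; (+ 1) is never applied
    -- to the remaining elements, so the measured time is tiny.
    bench "whnf" $ whnf (map (+ 1)) [1 .. 100000 :: Int]
    -- Forces the entire result list via rnf: every element is computed.
  , bench "nf"   $ nf   (map (+ 1)) [1 .. 100000 :: Int]
  ]
```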
nfIO :: NFData a => IO a -> Benchmarkable Source #
nfIO x measures time to evaluate side-effects of x and compute its normal form (by means of rnf).

A pure subexpression of an effectful computation x may be evaluated only once and cached; use nfAppIO to avoid this.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
whnfIO :: NFData a => IO a -> Benchmarkable Source #
whnfIO x measures time to evaluate side-effects of x and compute its weak head normal form.

A pure subexpression of an effectful computation x may be evaluated only once and cached; use whnfAppIO to avoid this.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nfIO instead.
nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable Source #
nfAppIO f x measures time to evaluate side-effects of f x and compute its normal form (by means of rnf).

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
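The caching pitfall can be sketched as follows (hypothetical example, reusing the fibo function from the introduction): with nfIO the pure application fibo 25 is a shared thunk, evaluated on the first iteration only, whereas nfAppIO reapplies the function to its argument on every iteration:

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ -- The thunk (fibo 25) is shared between iterations: after the
    -- first run, subsequent iterations measure a cached value.
    bench "nfIO, cached"   $ nfIO (pure (fibo 25))
    -- The function is reapplied to 25 on every iteration.
  , bench "nfAppIO, fresh" $ nfAppIO (pure . fibo) 25
  ]
```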
whnfAppIO :: (a -> IO b) -> a -> Benchmarkable Source #
CSV ingredient
csvReporter :: Ingredient Source #
Add this ingredient to run benchmarks and save results in CSV format. It activates when the --csv FILE command-line option is specified.

```haskell
defaultMainWithIngredients [listingTests, csvReporter, consoleTestReporter] benchmarks
```

Remember that successful activation of an ingredient suppresses all subsequent ingredients. If you wish to produce CSV in addition to other reports, use composeReporters:

```haskell
defaultMainWithIngredients [listingTests, composeReporters csvReporter consoleTestReporter] benchmarks
```