This section describes some things that may be of interest to developers and other people interested in the internals of asv.
A benchmark suite directory has the following layout. The $-prefixed variables refer to values in the asv.conf.json file.
asv.conf.json: The configuration file. See asv.conf.json reference.
$benchmark_dir: Contains the benchmark code, created by the user. Each subdirectory needs an __init__.py.
$project/: A clone of the project being benchmarked. Information about the history is grabbed from here, but the actual building happens in the environment-specific clones described below.
$env_dir/: Contains the environments used for building and benchmarking. There is one environment in here for each specific combination of Python version and library dependency. Generally, the dependencies are only installed once, and then reused on subsequent runs of asv, but the project itself needs to be rebuilt for each commit being benchmarked.
  $ENVIRONMENT_HASH/: The directory name of each environment is the MD5 hash of the list of dependencies and the Python version. This is not very user friendly, but it keeps the filename within reasonable limits.
    asv-env-info.json: Contains information about the environment, mainly the Python version and dependencies used.
    project/: An environment-specific clone of the project repository. Each environment has its own clone so that builds can be run in parallel without fear of clobbering (particularly for projects that generate source files outside of the build/ directory). These clones are created from the main $project/ directory using the --shared option to git clone so that the repository history is stored in one place to save on disk space. The project is built in this directory with the standard distutils python setup.py build command. This means repeated builds happen in the same place and ccache is able to cache and reuse many of the build products.
    wheels/: If wheel_cache_size in asv.conf.json is set to something other than 0, this contains Wheels of the last N project builds for this environment. In this way, if a build for a particular commit has already been performed and cached, it can be restored much more quickly. Each subdirectory is a commit hash, containing one .whl file and a timestamp.
    usr/, lib/, bin/ etc.: These are the virtualenv or Conda environment directories that we install the project into and then run benchmarks from.
$results_dir/: This is the “database” of results from benchmark runs.
  benchmarks.json: Contains metadata about all of the benchmarks in the suite. It is a dictionary from benchmark names (a fully-qualified dot-separated path) to dictionaries containing information about that benchmark. Useful keys include:
    code: The Python code of the benchmark.
    params: List of lists describing parameter values of a parameterized benchmark. If the benchmark is not parameterized, an empty list. Otherwise, the n-th entry of the list is a list of the Python repr() strings for the values the n-th parameter should loop over.
    param_names: Names for parameters for a parameterized benchmark. Must be of the same length as the params list.
    version: An arbitrary string identifying the benchmark version. The default value is a hash of code, but the user can override it.
    Other keys are specific to the kind of benchmark, and correspond to Benchmark attributes.
  MACHINE/: Within the results directory is a directory for each machine. Putting results from different machines in separate directories makes the results trivial to merge, which is useful when benchmarking across different platforms or architectures.
    HASH-pythonX.X-depA-depB.json: Each JSON file within a particular machine represents a run of benchmarks for a particular project commit in a particular environment. Useful keys include:
      commit_hash: The project commit that the benchmarks were run on.
      date: A JavaScript date stamp of the date of the commit (not when the benchmarks were run).
      params: Information about the machine the benchmarks were run on.
      results: A dictionary from benchmark names to benchmark results. The item can also be directly the float/list of result values instead of a dictionary. The dictionary form can have the following keys:
        result: contains the summarized result value(s) of the benchmark. For a non-parameterized benchmark, the value is the result: float, NaN or null. For parameterized benchmarks, it is a list of such values (see params below). The values are either numbers indicating the result from a successful run, null indicating a failed benchmark, or NaN indicating a benchmark explicitly skipped by the benchmark suite.
        params: contains a copy of the parameter values of the benchmark, as described above. If the user has modified the benchmark after the benchmark was run, these may differ from the current values. The result value is a list of results. Each entry corresponds to one combination of the parameter values. The n-th entry in the list corresponds to the parameter combination list(itertools.product(*params))[n], i.e., the results appear in cartesian product order, with the last parameters varying fastest. This key is omitted if the benchmark is not parameterized.
        samples: contains the raw data samples produced. For a non-parameterized benchmark, the result is a single list of float values. For parameterized benchmarks, it is a list of such lists (see params above). The samples are in the order they were measured in. This key is omitted if there are no samples recorded.
        stats: dictionary containing results of statistical analysis. Contains keys ci_99 (confidence interval estimate for the result), q_25, q_75 (percentiles), min, max, mean, std, repeat, and number. This key is omitted if there is no statistical analysis.
      started_at: A dictionary from benchmark names to JavaScript time stamps indicating the start time of the last benchmark run.
      ended_at: A dictionary from benchmark names to JavaScript time stamps indicating the end time of the last benchmark run.
      benchmark_version: A dictionary from benchmark names to benchmark version identifier (an arbitrary string). Results whose version is not equal to the version of the benchmark are ignored. If the value is missing, no version comparisons are done (backward compatibility).
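The pairing between parameter combinations and result values described above can be sketched as follows. This is a minimal illustration, not asv's own loader: the helper names iter_param_results and classify and the sample entry are hypothetical, and the entry is built in memory rather than read from a real result file.

```python
import itertools
import math


def iter_param_results(entry):
    """Yield (parameter_combination, value) pairs for one `results` item."""
    # Short form: the item is the bare result rather than a dictionary.
    if not isinstance(entry, dict):
        entry = {"result": entry}
    result = entry["result"]
    params = entry.get("params")
    if params is None:
        # No parameter information: yield the result as a single item.
        yield (), result
        return
    # Cartesian product order, with the last parameter varying fastest.
    for combo, value in zip(itertools.product(*params), result):
        yield combo, value


def classify(value):
    """Map a result value to the three cases described in the text."""
    if value is None:
        return "failed"      # JSON null
    if isinstance(value, float) and math.isnan(value):
        return "skipped"     # NaN
    return "ok"


# Hypothetical entry: two parameters with two values each (repr() strings).
entry = {
    "params": [["10", "100"], ["'a'", "'b'"]],
    "result": [1.0e-3, 2.0e-3, None, float("nan")],
}
pairs = list(iter_param_results(entry))
for combo, value in pairs:
    print(combo, classify(value))
```

Here the third combination ("100", "'a'") is reported as failed and the fourth as skipped, because the second parameter varies fastest.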
$html_dir/: The output of asv publish, which turns the raw results in $results_dir/ into something viewable in a web browser. It is an important feature of asv that the results can be shared on a static web server, so there is no server side component, and the result data is accessed through AJAX calls from JavaScript. Most of the files at the root of $html_dir/ are completely static and are just copied verbatim from asv/www/ in the source tree.
  index.json: Contains an index into the benchmark data, describing what is available. Important keys include:
    benchmarks: A dictionary of benchmarks. At the moment, this is identical to the content in $results_dir/benchmarks.json.
    revision_to_hash: A dictionary mapping revision number to commit hash. This makes it possible to show commit tooltips in graphs and the commits involved in a regression.
    revision_to_date: A dictionary mapping revisions (including tags) to JavaScript date stamps. This allows the x-scale of a plot to be scaled by date.
    machines: Describes the machines used for testing.
    params: A dictionary of parameters against which benchmark results can be selected. Each entry is a list of valid values for that parameter.
    tags: A dictionary of git tags and their revisions, so this information can be displayed in the plot.
  graphs/: This is a nested tree of directories where each level is a parameter from the params dictionary, in asciibetical order. The web interface, given a set of parameters that are set, can easily grab the associated graph.
    BENCHMARK_NAME.json: At the leaves of this tree are the actual benchmark graphs. Each file contains a list of pairs, where each pair is of the form (timestamp, result_value). For parameterized benchmarks, result_value is a list of results, corresponding to itertools.product iteration over the parameter combinations, similarly as in the result files. For non-parameterized benchmarks, it is directly the result. Missing values (e.g. failed and skipped benchmarks) are represented by null.
For full-stack testing, we use Selenium WebDriver and its Python bindings. Additional documentation is available for the Selenium Python bindings.
The browser back-end can be selected via:
python setup.py test -a "--webdriver=PhantomJS"
py.test --webdriver=PhantomJS
The allowed values include None (default), PhantomJS, Chrome, Firefox, ChromeHeadless, FirefoxHeadless, or arbitrary Python code initializing a Selenium webdriver instance.
To use them, at least one matching browser driver needs to be installed. ChromeDriver, for example, can be installed via apt-get install chromium-chromedriver, or on Fedora via dnf install chromedriver.
For other options regarding the webdriver to use, see py.test --help.
Regression detection in ASV is based on detecting stepwise changes in the graphs. The assumptions on the data are as follows: the curves are piecewise constant plus random noise. We don’t know the scaling of the data or the amplitude of the noise, but assume the relative weight of the noise amplitude is known for each data point.
ASV measures the noise amplitude of each data point, based on a number of samples. We use this information for weighting the different data points:
\[\sigma_j = \sigma\,\mathrm{CI}_{99,j} = \sigma / w_j\]
i.e., we assume the uncertainty in each measurement point is proportional to the estimated confidence interval for each data point. The inverses of the confidence intervals are taken as the relative weights \(w_j\). If \(w_j=0\) or undefined, we replace it with the median weight, or with 1 if all are undefined. The step detection algorithm determines the absolute noise amplitude itself based on all available data, which is more robust than relying on the individual measurements.
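The weighting rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used by asv: the function name relative_weights is hypothetical, "undefined" is taken to mean a missing, NaN, or non-positive confidence interval, and the median is taken over the defined weights only.

```python
import math
import statistics


def relative_weights(ci_widths):
    """Turn per-point confidence-interval widths into relative weights."""
    weights = []
    for ci in ci_widths:
        # Assumption: missing, NaN, non-positive or infinite widths give
        # an undefined (or zero) weight that must be replaced.
        if ci is None or math.isnan(ci) or math.isinf(ci) or ci <= 0:
            weights.append(None)
        else:
            weights.append(1.0 / ci)
    defined = [w for w in weights if w is not None]
    # Median of the defined weights, or 1 if none is defined.
    fallback = statistics.median(defined) if defined else 1.0
    return [fallback if w is None else w for w in weights]


print(relative_weights([0.5, None, 0.25, 1.0]))
print(relative_weights([None, float("nan")]))
```

In the first call the missing entry receives the median of the three defined weights; in the second call every weight falls back to 1.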
Step detection is a well-studied problem. In this implementation, we mainly follow a variant of the approach outlined in [Friedrich2008] and elsewhere. This provides a fast algorithm for solving the piecewise weighted fitting problem
\[\mathop{\mathrm{argmin}}_{k,\{j\},\{\mu\}} \gamma k + \sum_{r=1}^{k}\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r| \tag{1}\]
The differences are: as we do not need exact solutions, we add additional heuristics to work around the \({\mathcal O}(n^2)\) scaling, which is too harsh for pure-Python code. For details, see asv.step_detect.solve_potts_approx. Moreover, we follow a slightly different approach on obtaining a suitable number of intervals, by selecting an optimal value for \(\gamma\), based on a variant of the information criterion problem discussed in [Yao1988].
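For small inputs, the piecewise weighted fitting problem can be solved exactly with a plain dynamic program, which may help when reading the approximate solver. The sketch below is not asv.step_detect.solve_potts_approx: it assumes the objective is \(\gamma k\) plus the weighted \(l_1\) residual of a piecewise constant fit, recomputes each interval cost from scratch, and is therefore far slower than the algorithm in [Friedrich2008]. All function names are hypothetical.

```python
def weighted_median(values, weights):
    """Return a minimizer of sum(w * |y - mu|) over mu."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for y, w in pairs:
        acc += w
        if acc >= half:
            return y
    return pairs[-1][0]


def l1_cost(values, weights):
    """Weighted l1 residual of the best constant fit on one interval."""
    mu = weighted_median(values, weights)
    return sum(w * abs(y - mu) for y, w in zip(values, weights))


def solve_potts_exact(y, w, gamma):
    """Exact dynamic program; returns (right_edges, levels, objective)."""
    n = len(y)
    best = [0.0] + [float("inf")] * n  # best[r]: optimum for y[:r]
    prev = [0] * (n + 1)               # prev[r]: start of last interval
    for r in range(1, n + 1):
        for l in range(r):
            total = best[l] + gamma + l1_cost(y[l:r], w[l:r])
            if total < best[r]:
                best[r], prev[r] = total, l
    # Walk back through the stored interval starts.
    edges, levels = [], []
    r = n
    while r > 0:
        l = prev[r]
        edges.append(r)
        levels.append(weighted_median(y[l:r], w[l:r]))
        r = l
    return edges[::-1], levels[::-1], best[n]


y = [1.0, 1.1, 0.9, 1.0, 2.0, 2.1, 1.9, 2.0]
w = [1.0] * len(y)
edges, levels, objective = solve_potts_exact(y, w, gamma=0.5)
print(edges, levels, objective)
```

With \(\gamma = 0.5\) the data above is split into two intervals at the step; a much larger \(\gamma\) would merge them into one.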
[Friedrich2008] F. Friedrich et al., “Complexity Penalized M-Estimation: Fast Computation”, Journal of Computational and Graphical Statistics 17.1, 201-224 (2008). http://dx.doi.org/10.1198/106186008X285591
[Yao1988] Y.-C. Yao, “Estimating the number of change-points via Schwarz criterion”, Statistics & Probability Letters 6, 181-189 (1988). http://dx.doi.org/10.1016/0167-7152(88)90118-6
To proceed, we need an argument by which to select a suitable \(\gamma\) in (1). Some of the literature on step detection, e.g. [Yao1988], suggests results based on Schwarz information criteria,
\[\mathrm{SC} = \frac{m}{2}\ln\sigma^2 + k\ln m = \text{min!} \tag{2}\]
where \(\sigma^2\) is the maximum likelihood variance estimator (if the noise is gaussian). For the implementation, see asv.step_detect.solve_potts_autogamma.
What follows is a handwaving plausibility argument why such an objective function makes sense, and how to end up with \(l_1\) rather than gaussians. Better approaches are probably to be found in step detection literature. If you have a better formulation, contributions/corrections are welcome!
We assume a Bayesian model:
\[P(\{y_i\}_{i=1}^m|\sigma,k,\{\mu_i\}_{i=1}^k,\{j_i\}_{i=1}^{k-1}) = N \sigma^{-m} \exp\Bigl(-\sigma^{-1}\sum_{r=1}^{k}\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r|\Bigr) \tag{3}\]
Here, \(y_i\) are the \(m\) data points at hand, \(k\) is the number of intervals, \(\mu_i\) are the values of the function at the intervals, \(j_i\) are the interval breakpoints; \(j_0=0\), \(j_k=m\), \(j_{r-1}<j_r\). The noise is assumed Laplace rather than gaussian, which results in the more robust \(l_1\) norm fitting rather than \(l_2\). The noise amplitude \(\sigma\) is not known. \(N\) is a normalization constant that depends on \(m\) but not on the other parameters.
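The optimal \(k\) comes from Bayesian reasoning: \(\hat{k} = \mathop{\mathrm{argmax}}_k P(k|\{y\})\), where
\[P(k|\{y\}) = \frac{\pi(k)}{\pi(\{y\})}\int d\sigma\,(d\mu)^k \sum_{\{j\}} P(\{y\}|\sigma,k,\{\mu\},\{j\})\,\pi(\sigma,\{\mu\},\{j\}|k) \tag{4}\]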
The prior \(\pi(\{y\})\) does not matter for \(\hat{k}\); the other priors are assumed flat. We would need to estimate the behavior of the integral in the limit \(m\to\infty\). We do not succeed in doing this rigorously here, although it might be done in the literature.
Consider first saddle-point integration over \(\{\mu\}\), expanding around the max-likelihood values \(\mu_r^*\). The max-likelihood estimates are the weighted medians of the data points in each interval. The change in the exponent when \(\mu\) is perturbed is
\[\Delta = -\sigma^{-1}\sum_{r=1}^{k}\sum_{i=j_{r-1}+1}^{j_r} w_i\bigl[|y_i-\mu^*_r-\delta\mu_r| - |y_i-\mu^*_r|\bigr]\]
Note that \(\sum_{i=j_{r-1}+1}^{j_r} w_i\mathrm{sgn}(y_i-\mu^*_r)=0\), so that the response to small variations \(\delta\mu_r\) is \(m\)-independent. For larger variations, we have
\[\Delta = -\sigma^{-1}\sum_{r=1}^{k} N_r(\delta\mu_r)\,|\delta\mu_r|\]
where \(N_r(\delta\mu)=\sum_{i} w_i s_i\) with \(s_i = \pm1\) depending on whether \(y_i\) is above or below the perturbed median. Let us assume that in a typical case, \(N_r(\delta\mu)\sim{}m_r\bar{W}_r^2\delta\mu/\sigma\) where \(\bar{W}_r = \frac{1}{m_r}\sum_i w_i\) is the average weight of the interval and \(m_r\) the number of points in the interval. This recovers a result we would have obtained in the gaussian noise case
\[\Delta \sim -\sigma^{-2}\sum_{r} \bar{W}_r^2 m_r |\delta\mu_r|^2\]
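For the gaussian case, this would not have required any questionable assumptions. After integration over \(\{\delta\mu\}\) we are left with
\[\int(\ldots) \propto \int d\sigma \sum_{\{j\}} \pi^{k/2}\sigma^{k}\,[\bar{W}_1\cdots\bar{W}_k]^{-1}[m_1\cdots m_k]^{-1/2}\, P(\{y\}|\sigma,k,\{\mu^*\},\{j\})\,\pi(\sigma,\{j\}|k)\]
We now approximate the rest of the integrals/sums with only the max-likelihood terms, and assume \(m_j^*\sim{}m/k\). Then,
\[\ln P(k|\{y\}) \simeq C_1(m) + C_2(k) - \frac{k}{2}\ln m + \ln P(\{y\}|\sigma^*,k,\{\mu^*\},\{j^*\})\]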
where we neglect terms that don’t affect asymptotics for \(m\to\infty\), and \(C\) are some constants not depending on both \(m, k\). The result is of course the Schwarz criterion for \(k\) free model parameters. We can suspect that the factor \(k/2\) should be replaced by a different number, since we have \(2k\) parameters. If also the other integrals/sums can be approximated in the same way as the \(\{\mu\}\) ones, we should obtain the missing part.
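Substituting in the max-likelihood value
\[\sigma^* = \frac{1}{m}\sum_{r=1}^{k}\sum_{i=j^*_{r-1}+1}^{j^*_r} w_i |y_i - \mu^*_r|\]
we get
\[\ln P(k|\{y\}) \sim C - \frac{k}{2}\ln m - m\ln\sum_{r=1}^{k}\sum_{i=j^*_{r-1}+1}^{j^*_r} w_i |y_i - \mu^*_r|\]
This is now similar to (2), apart from numerical prefactors. The final fitting problem then becomes
\[\mathop{\mathrm{argmin}}_{k,\{j\},\{\mu\}} r(m)\,k + \ln\sum_{r=1}^{k}\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r| \tag{12}\]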
with \(r(m) = \frac{\ln m}{2m}\). Note that it is invariant under rescaling of the weights \(w_i\mapsto{}\alpha{}w_i\), i.e., the invariance of the original problem is retained. We know that this function \(r(m)\) is not necessarily completely correct, and doing the calculation rigorously seems to require more effort than can be justified by the requirements of the application. We therefore take a pragmatic view and fudge the function to \(r(m) = \beta \frac{\ln m}{m}\), with \(\beta\) chosen so that things appear to work in practice for the problem at hand.
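To make the trade-off concrete, the following sketch evaluates an objective of the form \(r(m)\,k + \ln S\), where \(S\) is the total weighted \(l_1\) residual of the best fit with \(k\) intervals. This form is an assumption read off from the text above, and the function name bic_objective, the value \(\beta=0.5\), and the residual numbers are all hypothetical.

```python
import math


def bic_objective(total_l1_residual, k, m, beta=0.5):
    """Evaluate r(m) * k + ln(S), with r(m) = beta * ln(m) / m."""
    if total_l1_residual <= 0:
        # A perfect fit makes the logarithm diverge.
        return -math.inf
    return beta * math.log(m) / m * k + math.log(total_l1_residual)


# Hypothetical residuals of the best fits with 1, 2 and 3 intervals
# for m = 8 data points containing a single step.
m = 8
residuals = {1: 4.0, 2: 0.4, 3: 0.38}
scores = {k: bic_objective(s, k, m) for k, s in residuals.items()}
best_k = min(scores, key=scores.get)

# Rescaling all weights by alpha rescales S by alpha, which shifts every
# score by ln(alpha) and leaves the selected k unchanged.
alpha = 10.0
scaled = {k: bic_objective(alpha * s, k, m) for k, s in residuals.items()}
best_k_scaled = min(scaled, key=scaled.get)
print(best_k, best_k_scaled)
```

Going from one to two intervals reduces the residual by a factor of ten, which outweighs the penalty; the third interval barely improves the fit and is rejected.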
According to [Friedrich2008], problem (12) can be solved in \({\cal O}(n^3)\) time. This is too slow. We can, however, approach this on the basis of the easier problem (1). It produces a family of solutions \([k^*(\gamma), \{\mu^*(\gamma)\}, \{j^*(\gamma)\}]\). We now evaluate (12) restricted to the curve parameterized by \(\gamma\). In particular, \([\{\mu^*(\gamma)\}, \{j^*(\gamma)\}]\) solves (12) under the constraint \(k=k^*(\gamma)\). If \(k^*(\gamma)\) attains all values in the set \(\{1,\ldots,m\}\) when \(\gamma\) is varied, the original problem is solved completely. This is probably not a far-fetched assumption; in practice it appears that such a Bayesian information criterion provides a reasonable way of selecting a suitable \(\gamma\).
It’s possible to fit any data perfectly by choosing size-1 intervals, one per data point. For such a fit, the logarithm in (12) gives \(-\infty\), which then always minimizes SC. This artifact of the model above needs special handling.
Indeed, for \(\sigma\to0\), (3) reduces to
which in (4) gives a contribution (assuming no repeated y-values)
with \(f(\sigma)\to1\) for \(\sigma\to0\). A similar situation occurs also in other cases where perfect fitting occurs (repeated y-values). With the flat, scale-free prior \(\pi(\ldots)\propto1/\sigma\) used above, the result is undefined.
A simple fix is to give up complete scale free-ness of the results, i.e., fixing a minimal noise level \(\pi(\sigma,\{\mu\},\{j\}|k)\propto\theta(\sigma-\sigma_0)/\sigma\) with some \(\sigma_0(\{\mu\},\{j\},k)>0\). The effect in the \(\sigma\) integral is cutting off the log-divergence, so that with sufficient accuracy we can in (12) replace
Here, we fix a measurement accuracy floor with the following guess: sigma_0 = 0.1 * w0 * min(abs(diff(mu))), and sigma_0 = 0.001 * w0 * abs(mu) when there is only a single interval. Here, w0 is the median weight.
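Written out as plain Python, the guess above could look as follows. This is a direct transcription of the two formulas, not a copy of the asv source; the function name noise_floor is hypothetical, and levels stands for the fitted interval values mu.

```python
def noise_floor(levels, w0):
    """Return sigma_0 for fitted interval levels and median weight w0."""
    if len(levels) > 1:
        # Smallest step between adjacent intervals: min(abs(diff(mu))).
        min_step = min(abs(b - a) for a, b in zip(levels, levels[1:]))
        return 0.1 * w0 * min_step
    # Single interval: scale with the level itself.
    return 0.001 * w0 * abs(levels[0])


print(noise_floor([1.0, 2.0, 2.5], w0=2.0))
print(noise_floor([3.0], w0=2.0))
```

The floor scales with the smallest detected step, so it stays well below any change the fit has already resolved.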
For the purposes of regression detection, we do not report all steps the above approach provides. For details, see asv.step_detect.detect_regressions.
asv also measures the variance in the timings. This information is currently used to provide relative data weighting (see above).