Step Detection

Regression detection in ASV is based on detecting stepwise changes in the graphs. The assumptions on the data are as follows: the curves are piecewise constant plus random noise. We don’t know the scaling of the data or the amplitude of the noise, but assume the relative weight of the noise amplitude is known for each data point.

ASV measures the noise amplitude of each data point, based on a number of samples. We use this information for weighting the different data points:

\[\sigma_j = \sigma \mathrm{CI}_{99,j} = \sigma / w_j\]

i.e., we assume the uncertainty in each measurement point is proportional to the estimated confidence interval for each data point. Their inverses are taken as the relative weights \(w_j\). If \(w_j=0\) or undefined, we replace it with the median weight, or with 1 if all are undefined. The step detection algorithm determines the absolute noise amplitude itself based on all available data, which is more robust than relying on the individual measurements.
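For illustration, a minimal sketch of forming such relative weights from per-point confidence interval estimates; the function name and numpy usage below are illustrative assumptions, not asv's actual code:

    import numpy as np

    def relative_weights(ci99):
        # Relative weights w_j ~ 1 / CI_j, with fallbacks as described
        # above: zero or undefined weights are replaced by the median
        # weight, or by 1 if no valid weights exist.
        ci = np.asarray(ci99, dtype=float)
        w = np.zeros_like(ci)
        ok = np.isfinite(ci) & (ci > 0)
        w[ok] = 1.0 / ci[ok]
        if ok.any():
            w[~ok] = np.median(w[ok])
        else:
            w[:] = 1.0
        return w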

Step detection is a well-studied problem. In this implementation, we mainly follow a variant of the approach outlined in [SDM_FKLW08] and elsewhere. This provides a fast algorithm for solving the piecewise weighted fitting problem

(1)\[\mathop{\mathrm{argmin}}_{k,\{j\},\{\mu\}} \gamma k + \sum_{r=1}^k\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r|\]

The differences are: as we do not need exact solutions, we add additional heuristics to work around the \({\mathcal O}(n^2)\) scaling, which is too harsh for pure-Python code. For details, see asv.step_detect.solve_potts_approx(). Moreover, we follow a slightly different approach to obtaining a suitable number of intervals: we select an optimal value for \(\gamma\), based on a variant of the information criterion problem discussed in [SDM_Yao88].
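To make the objective in (1) concrete, the following self-contained sketch evaluates the penalized weighted \(l_1\) cost of a given candidate segmentation; the helper names are illustrative only, and the actual solver lives in asv.step_detect.solve_potts_approx():

    import numpy as np

    def weighted_median(values, weights):
        # The weighted median minimizes sum_i w_i |values_i - mu| over mu,
        # so it is the optimal level mu_r for a fixed interval.
        order = np.argsort(values)
        v, cw = values[order], np.cumsum(weights[order])
        return v[np.searchsorted(cw, 0.5 * cw[-1])]

    def potts_cost(y, w, right_edges, gamma):
        # Penalized cost of eq. (1): gamma * k plus the weighted l1
        # residual of each interval fitted with its optimal level.
        # right_edges are the interval end indices j_1 < ... < j_k = len(y),
        # so interval r covers y[j_{r-1}:j_r] in 0-based slicing.
        y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
        cost = gamma * len(right_edges)
        left = 0
        for right in right_edges:
            mu = weighted_median(y[left:right], w[left:right])
            cost += np.sum(w[left:right] * np.abs(y[left:right] - mu))
            left = right
        return cost

For example, potts_cost(y, w, [len(y)], gamma) is the cost of the single-interval fit.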

Bayesian information

To proceed, we need an argument by which to select a suitable \(\gamma\) in (1). Some of the literature on step detection, e.g. [SDM_Yao88], suggests results based on Schwarz information criteria,

(2)\[\text{SC} = \frac{m}{2} \ln \sigma^2 + k \ln m = \text{min!}\]

where \(\sigma^2\) is the maximum likelihood variance estimator (if the noise is gaussian). For the implementation, see asv.step_detect.solve_potts_autogamma().
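For orientation, the criterion (2) is cheap to evaluate once a candidate fit and its residuals are known; a hedged sketch, not asv.step_detect.solve_potts_autogamma() itself:

    import numpy as np

    def schwarz_criterion(residuals, k):
        # SC = (m/2) ln sigma^2 + k ln m, with sigma^2 the maximum
        # likelihood variance estimate of the residuals (gaussian noise).
        residuals = np.asarray(residuals, dtype=float)
        m = len(residuals)
        sigma2 = np.mean(residuals ** 2)
        return 0.5 * m * np.log(sigma2) + k * np.log(m)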

What follows is a hand-waving plausibility argument for why such an objective function makes sense, and for how we end up with \(l_1\) fitting rather than gaussians. Better approaches are probably to be found in the step detection literature. If you have a better formulation, contributions/corrections are welcome!

We assume a Bayesian model:

(3)\[P(\{y_i\}_{i=1}^m|\sigma,k,\{\mu_i\}_{i=1}^k,\{j_i\}_{i=1}^{k-1}) = N \sigma^{-m} \exp( -\sigma^{-1}\sum_{r=1}^k\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r| )\]

Here, \(y_i\) are the \(m\) data points at hand, \(k\) is the number of intervals, \(\mu_i\) are the values of the function on the intervals, and \(j_i\) are the interval breakpoints; \(j_0=0\), \(j_k=m\), \(j_{r-1}<j_r\). The noise is assumed to be Laplace rather than gaussian, which results in the more robust \(l_1\) norm fitting rather than \(l_2\). The noise amplitude \(\sigma\) is not known. \(N\) is a normalization constant that depends on \(m\) but not on the other parameters.

The optimal \(k\) comes from Bayesian reasoning: \(\hat{k} = \mathop{\mathrm{argmax}}_k P(k|\{y\})\), where

(4)\[P(k|\{y\}) = \frac{\pi(k)}{\pi(\{y\})}\int d\sigma (d\mu)^k \sum_{\{j\}} P(\{y\}|\sigma,k,\{\mu\},\{j\}) \pi(\sigma, \{\mu\},\{j\}|k)\]

The prior \(\pi(\{y\})\) does not matter for \(\hat{k}\); the other priors are assumed flat. We would need to estimate the behavior of the integral in the limit \(m\to\infty\). We do not succeed in doing this rigorously here, although it might be done in the literature.

Consider first saddle-point integration over \(\{\mu\}\), expanding around the max-likelihood values \(\mu_r^*\). The max-likelihood estimates are the weighted medians of the data points in each interval. The change in the exponent when \(\mu\) is perturbed is

(5)\[\Delta = -\sigma^{-1}\sum_{r=1}^k \sum_{i=j_{r-1}+1}^{j_r}w_i[|y_i-\mu^*_r - \delta\mu_r| - |y_i-\mu^*_r|]\]

Note that \(\sum_{i=j_{r-1}+1}^{j_r} w_i\mathrm{sgn}(y_i-\mu^*_r)=0\), so that the response to small variations \(\delta\mu_r\) is \(m\)-independent. For larger variations, we have

(6)\[\Delta = -\sigma^{-1}\sum_{r=1}^k N_r(\delta \mu_r) |\delta \mu_r|\]

where \(N_r(\delta\mu)=\sum_{i} w_i s_i\), with \(s_i = \pm1\) depending on whether \(y_i\) is above or below the perturbed median. Let us assume that in a typical case, \(N_r(\delta\mu)\sim{}m_r\bar{W}_r^2\delta\mu/\sigma\), where \(\bar{W}_r = \frac{1}{m_r}\sum_i w_i\) is the average weight of the interval and \(m_r\) the number of points in it. This recovers the result we would have obtained in the gaussian noise case

(7)\[\Delta \sim -\sigma^{-2} \sum_r \bar{W}_r^2 m_r |\delta \mu_r|^2\]

For the gaussian case, this would not have required any questionable assumptions. After integration over \(\{\delta\mu\}\) we are left with

(8)\[\int(\ldots) \propto \int d\sigma \sum_{\{j\}} (2\pi)^{k/2}\sigma^k [\bar{W}_1\cdots \bar{W}_k]^{-1} [m_1\cdots m_k]^{-1/2} P(\{y\}|\sigma,k,\{\mu_*\},\{j\}) \pi(\sigma, \{j\}|k)\]

We now approximate the rest of the integrals/sums with only the max-likelihood terms, and assume \(m_j^*\sim{}m/k\). Then,

(9)\[\begin{aligned}\ln P(k|\{y\}) &\simeq C_1(m) + C_2(k) + \frac{k}{2}\ln(2\pi) - \frac{k}{2} \ln(m/k) - k \ln \bar{W} + \ln P(\{y\}|\sigma_*,k,\{\mu_*\},\{j_*\}) \\ &\approx \tilde{C}_1(m) + \tilde{C}_2(k) - \frac{k}{2} \ln m + \ln P(\{y\}|\sigma_*,k,\{\mu_*\},\{j_*\})\end{aligned}\]

where we neglect terms that do not affect the asymptotics for \(m\to\infty\), and the \(C\) are constants, each depending on at most one of \(m\) and \(k\). The result is of course the Schwarz criterion for \(k\) free model parameters. We can suspect that the factor \(k/2\) should be replaced by a different number, since we have \(2k\) parameters. If the other integrals/sums can also be approximated in the same way as the \(\{\mu\}\) ones, we should obtain the missing part.

Substituting in the max-likelihood value

(10)\[ \sigma_* = \frac{1}{m} \sum_{r=1}^k\sum_{i=j_{r-1}^*+1}^{j_r^*} w_i|y_i - \mu_r^*|\]

we get

(11)\[\ln P(k|\{y\}) \sim C - \frac{k}{2} \ln m - m \ln\sum_{r=1}^k\sum_{i=j_{r-1}^*+1}^{j_r^*} w_i |y_i - \mu_r^*|\]

This is now similar to (2), apart from numerical prefactors. The final fitting problem then becomes

(12)\[\mathop{\mathrm{argmin}}_{k,\{j\},\{\mu\}} r(m) k + \ln\sum_{r=1}^k\sum_{i=j_{r-1}+1}^{j_r} w_i |y_i - \mu_r|\]

with \(r(m) = \frac{\ln m}{2m}\). Note that the problem is invariant under rescaling of the weights \(w_i\mapsto{}\alpha{}w_i\), i.e., the invariance of the original problem is retained. Since we know this function \(r(m)\) is not necessarily completely correct, and doing the calculation rigorously would require more effort than the requirements of the application justify, we take a pragmatic view and fudge the function to \(r(m) = \beta \frac{\ln m}{m}\), with \(\beta\) chosen so that things appear to work in practice for the problem at hand.
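Reusing the weighted_median helper from the sketch after (1), the objective (12) can be evaluated for a candidate segmentation as follows; the function name and the default \(\beta\) are illustrative assumptions:

    import numpy as np

    def information_criterion_cost(y, w, right_edges, beta=1.0):
        # Objective of eq. (12): r(m) * k + ln(total weighted l1 residual),
        # with r(m) = beta * ln(m) / m.  A perfect fit gives total == 0
        # and hence -inf; see the Overfitting section below.
        y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
        m, k = len(y), len(right_edges)
        total = 0.0
        left = 0
        for right in right_edges:
            mu = weighted_median(y[left:right], w[left:right])
            total += np.sum(w[left:right] * np.abs(y[left:right] - mu))
            left = right
        return beta * np.log(m) / m * k + np.log(total)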

According to [SDM_FKLW08], problem (12) can be solved in \({\cal O}(n^3)\) time. This is too slow, however. Instead, we can approach it via the easier problem (1), which produces a family of solutions \([k^*(\gamma), \{\mu^*(\gamma)\}, \{j^*(\gamma)\}]\). We then evaluate (12) restricted to the curve parameterized by \(\gamma\). In particular, \([\{\mu^*(\gamma)\}, \{j^*(\gamma)\}]\) solves (12) under the constraint \(k=k^*(\gamma)\). If \(k^*(\gamma)\) obtains all values in the set \(\{1,\ldots,m\}\) when \(\gamma\) is varied, the original problem is solved completely. This is probably not a far-fetched assumption; in practice such a Bayesian information criterion appears to provide a reasonable way of selecting a suitable \(\gamma\).
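The restricted minimization along the \(\gamma\)-parameterized family can then be sketched as a simple scan; here solve_potts stands for any solver of problem (1) (for instance a wrapper around asv.step_detect.solve_potts_approx(), whose exact signature is not assumed), and information_criterion_cost is the sketch above:

    def select_gamma(y, w, gammas, solve_potts, beta=1.0):
        # For each candidate gamma, solve the easier problem (1) and keep
        # the segmentation minimizing the criterion (12).
        best = None
        for gamma in gammas:
            right_edges = solve_potts(y, w, gamma)
            cost = information_criterion_cost(y, w, right_edges, beta)
            if best is None or cost < best[0]:
                best = (cost, gamma, right_edges)
        return best  # (cost, gamma, right_edges) of the best candidate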

Overfitting

It is possible to fit any data perfectly by choosing size-1 intervals, one per data point. For such a fit, the logarithm in (12) gives \(-\infty\), which then always minimizes SC. This artifact of the above model needs special handling.

Indeed, for \(\sigma\to0\), (3) reduces to

(13)\[P(\{y_i\}_{i=1}^m|\sigma,k,\{\mu_i\}_{i=1}^k,\{j_i\}_{i=1}^{k-1}) = \prod_{r=1}^k \prod_{i=j_{r-1} + 1}^{j_r} \delta(y_i - \mu_r)\]

which in (4) gives a contribution (assuming no repeated y-values)

(14)\[P(k|\{y\}) = \frac{\pi(m)}{\pi(\{y\})}\delta_{m,k}\int d\sigma \pi(\sigma, \{y\},\{i\}|m) f(\sigma) + \ldots\]

with \(f(\sigma)\to1\) for \(\sigma\to0\). A similar situation also occurs in other cases where a perfect fit is possible (repeated y-values). With the flat, scale-free prior \(\pi(\ldots)\propto1/\sigma\) used above, the result is undefined.

A simple fix is to give up complete scale-freeness of the results, i.e., to fix a minimal noise level \(\pi(\sigma,\{\mu\},\{j\}|k)\propto\theta(\sigma-\sigma_0)/\sigma\) with some \(\sigma_0(\{\mu\},\{j\},k)>0\). The effect on the \(\sigma\) integral is to cut off the log-divergence, so that with sufficient accuracy we can in (12) replace

(15)\[\ln \sigma \mapsto \ln(\sigma_0 + \sigma)\]

Here, we fix a measurement accuracy floor with the following guess: sigma_0 = 0.1 * w0 * min(abs(diff(mu))), or sigma_0 = 0.001 * w0 * abs(mu) when there is only a single interval, where w0 is the median weight.
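A hedged sketch of this accuracy-floor guess (the variable and function names are illustrative, not asv's internals):

    import numpy as np

    def sigma0_floor(mu_levels, w0):
        # Guess for the accuracy floor sigma_0 described above: a fraction
        # of the smallest step between fitted levels, or of the level itself
        # when there is only one interval; w0 is the median weight.
        mu = np.asarray(mu_levels, dtype=float)
        if len(mu) > 1:
            return 0.1 * w0 * np.min(np.abs(np.diff(mu)))
        return 0.001 * w0 * np.abs(mu[0])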

Autocorrelated noise

Practical experience shows that the noise in the benchmark results can be correlated. Often benchmarks are run for multiple commits at once, for example the new commits at a given time, and the benchmark machine does something else between the runs. Alternatively, the background load from other processes on the machine varies with time.

To give a basic model for the noise correlations, we include AR(1) Laplace noise in (3),

(16)\[P(\{y_i\}_{i=1}^m|\sigma,\rho,k,\{\mu_i\}_{i=1}^k,\{j_i\}_{i=1}^{k-1}) = N \sigma^{-m} \exp(-\sigma^{-1}\sum_{r=1}^k\sum_{i=j_{r-1}+1}^{j_r} |\epsilon_{i,r} - \rho \epsilon_{i-1,r}|)\]

where \(\epsilon_{i,r}=y_i-\mu_{r}\) with \(\epsilon_{j_{r-1},r}=y_{j_{r-1}}-\mu_{r-1}\) and \(\epsilon_{j_0,1}=0\) are the deviations from the stepwise model. The correlation measure \(\rho\) is unknown, but assumed to be constant in \((-1,1)\).

Since the parameter \(\rho\) is global, it does not change the parameter counting part of the Schwarz criterion. The maximum likelihood term however does depend on \(\rho\), so that the problem becomes:

(17)\[\mathop{\mathrm{argmin}}_{k,\rho,\{j\},\{\mu\}} r(m) k + \ln\sum_{r=1}^k\sum_{i=j_{r-1}+1}^{j_r} |\epsilon_{i,r} - \rho\epsilon_{i-1,r}|\]

To save computation time, we do not solve this optimization problem exactly. Instead, we again minimize along the \(\mu_r^*(\gamma)\), \(j_r^*(\gamma)\) curve provided by the solution to (1), and use (17) only in selecting the optimal value of the \(\gamma\) parameter.

The minimization vs. \(\rho\) can be done numerically for given \(\mu_r^*(\gamma)\), \(j_r^*(\gamma)\). This minimization step is computationally cheap compared to the piecewise fit, so including it will not significantly change the runtime of the total algorithm.
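A minimal sketch of the \(\rho\)-dependent cost and its numerical minimization for a fixed segmentation; the residuals argument is the sequence of deviations \(y_i - \mu_r\) of each point from its fitted level, and the use of scipy here is an illustrative assumption rather than asv's actual implementation:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def ar1_l1_cost(rho, residuals):
        # Cost term of eq. (17): sum_i |eps_i - rho * eps_{i-1}|, with
        # eps_0 = 0 by the convention stated above.
        eps = np.concatenate(([0.0], np.asarray(residuals, dtype=float)))
        return np.sum(np.abs(eps[1:] - rho * eps[:-1]))

    def best_rho(residuals):
        # Numerical minimization over rho in (-1, 1); this is cheap
        # compared to the piecewise fit itself.
        res = minimize_scalar(ar1_l1_cost, bounds=(-0.999, 0.999),
                              args=(residuals,), method="bounded")
        return res.x, res.fun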

Postprocessing

For the purposes of regression detection, we do not report all steps the above approach provides. For details, see asv.step_detect.detect_regressions.

Making use of measured variance

asv also measures the variance in the timings. This information is currently used to provide the relative data weighting described above.

References

[SDM_FKLW08] F. Friedrich, A. Kempe, V. Liebscher, and G. Winkler. Complexity Penalized M-Estimation: Fast Computation. Journal of Computational and Graphical Statistics, 17(1):201–224, 2008. doi:10.1198/106186008X285591.

[SDM_Yao88] Yi-Ching Yao. Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189, February 1988. doi:10.1016/0167-7152(88)90118-6.