Variance for percentile metrics
Owner | Approvers | Participants |
---|---|---|
Sasha Patotski | Ameya Chaudhari, Craig Boucher, Daniel Miller, Ulf Knoblich |
The two main types of metrics supported in Mangrove are average metrics and percentiles. Mangrove currently has a generic solution providing variance computations for average metrics. One of the supported compute fabrics (Hypercube) also supports percentile variance, but that solution cannot be extended to other fabrics. This RFC (Request For Comments) provides a general algorithm for the variance computation for percentiles.
Requirements
The new method of computing percentile metrics' variance should implement the outer CI algorithm, but it should also create all the necessary expressions computing the variance in one pass through the data.
The main challenge
The two main types of metrics supported in Mangrove are average metrics and percentiles. Average metrics are conceptually simpler: an aggregation `AVG(x)` can be replaced with a ratio `SUM(x)/SUM(1)`, with both numerator and denominator being recursive. Recursivity means that aggregating to a higher aggregation level can be done by applying the aggregation to all the intermediate levels. For example, if the aggregation levels are Page-Session-User-Flight (i.e. each session has many pages, each user has many sessions, each flight has many users), then `SUM<Flight>(x) = SUM<Flight>(SUM<User>(SUM<Session>(x)))` for a `Page`-level value `x`.
Percentiles cannot easily be expressed in terms of recursive aggregations, which complicates the computation graph. Currently this is achieved by introducing a `Histogram` object with a set of custom functions and aggregations. For a detailed description, see the segmentation documentation.

Moreover, the outer CI algorithm for percentile variance computation requires knowing the final value of the percentile at the lower level of aggregation. This means that a naive implementation would require two passes through the computation: a first pass to obtain the percentile values, and a second pass to use those values for the variance computation.
Proposal
We propose the following algorithm and its implementation.
Algorithm
Setup and assumptions
Suppose we need to compute variance for the metric `PERCENTILE(x, p)`, the `p`-th percentile of the value `x`. Recall that we only support percentiles for integer values, so `x` must be of type `int`.
We will only describe the delta-method variance calculation, as it is more complex and the standard variance is a special case of the delta method. Let's assume for convenience that the `MetricsPlan` has the following aggregation levels (identified by `Table`s of type `Aggregation`): `Page`, `Session`, `User`, `Output`, with `User` being the experiment unit table. This assumption is not limiting in any way; it simply gives convenient names to the tables. If a given `MetricsPlan` has more aggregation levels, has several output tables, or has `TableReference`s wrapping some of the aggregations, the algorithm stays pretty much the same.
Since we only describe the delta-method algorithm, let's assume `x` in `PERCENTILE(x, p)` is a column (specified by a `ColumnReference`) on the table `Page` (i.e. below the experiment unit level `User`). Without loss of generality, let's assume `x` represents the Page Load Time (PLT) of the corresponding page, just for convenience of referring to it.
General algorithm
Let us briefly summarize the variance calculation for percentile metrics as described in Deng, Knoblich, Lu. We will use the assumptions described above.

Let $X_i$ denote the `Page`-level PLT values, $N$ the total number of pages, and $K$ the number of users. Let $X_{[p]}$ be the $p$-th percentile value, and let $Y_i = I(X_i \leq X_{[p]})$ be the indicator of whether a given PLT value $X_i$ is less than or equal to the percentile value $X_{[p]}$.

Let $S_j = \sum\limits_{\text{pages for user } j } Y_i$ and $N_j = \sum\limits_{\text{pages for user } j } 1$ be the samples for the corresponding random variables $S$ (the number of pages per user with PLT at most $X_{[p]}$) and $N$ (the total number of pages per user). These are now i.i.d. Let $\mu_S = \mathrm{E}(S)$ and $\mu_N = \mathrm{E}(N)$ be the corresponding expectations. Let $\sigma^2_S = K\cdot (\mathrm{E}(S^2) - \mu^2_S)$, $\sigma^2_N = K\cdot (\mathrm{E}(N^2) - \mu^2_N)$ and $\sigma_{SN} = \mathrm{E}(S N) - \mu_S\mu_N$. Then we compute
$$
\sigma^2 = \frac{N}{K\mu_N^2}\left( \sigma^2_S - 2 \frac{\mu_S}{\mu_N}\sigma_{SN} + \frac{\mu_S^2}{\mu_N^2}\sigma^2_N \right)
$$
Using $\sigma^2$, we compute the two ranks $L,U = p\pm z_{\alpha/2}\sigma/\sqrt{N}$. Finally, we fetch the two values $X_{[L]}$ and $X_{[U]}$ at these ranks, and the expression computing the final variance is
$$
\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2
$$
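To make the computation above concrete, here is a naive two-pass reference sketch in C# on in-memory data (pages grouped by user). This is not Mangrove code: the helper names, the ranking convention, the assumption that `p` is a fraction in $(0,1)$, and the use of plain per-user sample moments (matching the `Output`-level expressions listed later) are all illustrative choices. The rest of this RFC is about obtaining the same quantities in a single pass using `Histogram` objects.

```csharp
// Naive two-pass reference sketch of the outer CI computation, NOT Mangrove code.
using System;
using System.Collections.Generic;
using System.Linq;

static class OuterCiSketch
{
    // pagesByUser[j] holds the Page-level PLT values X_i for user j; p is a fraction in (0, 1).
    public static double PercentileVariance(List<long[]> pagesByUser, double p)
    {
        long[] all = pagesByUser.SelectMany(u => u).OrderBy(x => x).ToArray();
        int N = all.Length;            // total number of pages
        int K = pagesByUser.Count;     // total number of users
        long Rank(double q) => all[Math.Min(N - 1, Math.Max(0, (int)Math.Ceiling(q * N) - 1))];

        long perc = Rank(p);           // pass 1: the percentile value X_[p]

        // Pass 2: per-user S_j (pages with PLT <= X_[p]) and N_j (page count).
        double[] S  = pagesByUser.Select(u => (double)u.Count(x => x <= perc)).ToArray();
        double[] Nj = pagesByUser.Select(u => (double)u.Length).ToArray();

        double muS = S.Average(), muN = Nj.Average();
        double sig2S = S.Select(s => s * s).Average() - muS * muS;
        double sig2N = Nj.Select(n => n * n).Average() - muN * muN;
        double sigSN = S.Zip(Nj, (s, n) => s * n).Average() - muS * muN;

        double sigma = Math.Sqrt(
            N / (K * muN * muN)
            * (sig2S - 2 * (muS / muN) * sigSN + (muS * muS) / (muN * muN) * sig2N));

        // Fetch the values at the two ranks and form the final variance.
        long xL = Rank(p - 1.96 * sigma / Math.Sqrt(N));
        long xU = Rank(p + 1.96 * sigma / Math.Sqrt(N));
        return N * Math.Pow((xU - xL) / (2 * 1.96), 2);
    }
}
```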
The main idea
As described in the segmentation documentation, we will use `Histogram`s. The algorithm adds new expressions to 3 tables: the table `Page` containing the value `x`, the experiment unit table `User`, and the output table `Output`.

On `Page`, we create a new `Histogram` object containing a single key `x` and the corresponding value `1` (since this is a histogram for a single element). Let's call it `h = new Histogram(x)`.
On `User`, we add `s = HistogramUnion(h)`, the union of the histograms for each user. In other words, this column `s` contains the histogram of all page load times for all the pages that belong to the user. The keys are the distinct values of PLT, and the values are the counts of how often each PLT was seen for the user. The random variables $S_j$ from the previous section are simply the values `GetSumOfValuesUpTo(s, perc)`, while $N_j$ are `GetSumOfValues(s)`.
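As a small illustration (assuming, purely for this sketch, that a `Histogram` behaves like a `Dictionary<long, long>` from PLT key to count), here is what the user-level column `s` holds for one hypothetical user and how $S_j$ and $N_j$ can be read off it once `perc` is known:

```csharp
// Illustration only: a user-level histogram s as a plain Dictionary<long, long>.
using System;
using System.Linq;

static class UserHistogramExample
{
    public static void Demo()
    {
        long[] pageLoadTimes = { 5, 9, 5, 10, 10 };   // Page-level values x for one user

        // s = HistogramUnion(h): PLT key -> count of pages with that PLT.
        var s = pageLoadTimes.GroupBy(x => x)
                             .ToDictionary(g => g.Key, g => g.LongCount());

        long perc = 9;  // hypothetical percentile value X_[p]
        long Sj = s.Where(kv => kv.Key <= perc).Sum(kv => kv.Value);  // S_j = 3 (pages with PLT <= 9)
        long Nj = s.Values.Sum();                                     // N_j = 5 (total pages)
        Console.WriteLine($"S_j = {Sj}, N_j = {Nj}");
    }
}
```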
On `Output`, we add three new `Histogram` objects. Let's name them `so`, `ss` and `sn`.

First, `so = HistogramUnion(s)` is simply the total union of all the user-level histograms.

Next, `ss` is a histogram whose keys are the same as the keys of `so`, namely all possible values of PLT. For each key `plt` in `ss`, the corresponding value is $\sum\limits_{\text{users}} \left(\sum\limits_{k_i \leq \mathrm{plt}} c_i \right)^2$, where $k_i$ is a key and $c_i$ is the corresponding value in that user's histogram object `s`.

Finally, `sn` is a histogram whose keys are the same as the keys of `so` and `ss`. For each key `plt` in `sn`, the corresponding value is $\sum\limits_{\text{users}} \left(\sum\limits_{\text{all keys } k_i} c_i\right) \cdot \left(\sum\limits_{k_i \leq \mathrm{plt}} c_i\right)$, where $k_i$ is a key and $c_i$ is the corresponding value in that user's histogram object `s`.
We will explain why we need such unusual objects below. First let's finish describing the variance algorithm.
We need to add several other expressions to the `Output` table:
- `perc = GetPercentile(so, p)`, the value $X_{[p]}$ of the percentile,
- `K = SUM(1)`, the total number of users,
- `N = GetSumOfValues(so)`, the total number of pages,
- `mus = GetSumOfValuesUpTo(so, perc)/K`, equal to $\mu_S$ above,
- `mun = N/K`, equal to $\mu_N$,
- `sig2n = SUM(Nj*Nj)/K - mun*mun`, the value of $\sigma^2_N$, where `Nj = GetSumOfValues(s)` is the per-user number of pages from above,
- `sig2s = GetSumOfValuesUpTo(ss, perc)/K - mus*mus`, which is $\sigma^2_S$,
- `sigsn = GetSumOfValuesUpTo(sn, perc)/K - mus*mun`, which is $\sigma_{SN}$,
- these allow us to compute `sigma` from the formula above (the expression is complicated, so we won't write it all out here; see the sketch after this list),
- `xL = GetPercentile(so, p - 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[L]}$ above,
- `xU = GetPercentile(so, p + 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[U]}$ above,
- finally, this gives us all the components needed for the variance $\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2$, or the standard deviation $\mathrm{stdev} = \sqrt{N}\left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)$.
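For reference, here is a minimal sketch (not the actual Mangrove expression tree) of how the `Output`-level scalars listed above combine into `sigma` and the final variance; the helper name and the plain-scalar signature are illustrative only:

```csharp
// Sketch only: how the Output-level scalars above combine into sigma and var.
using System;

static class PercentileVarianceCombine
{
    // mus, mun, sig2s, sig2n, sigsn, K, N, xL and xU are assumed to be the
    // Output-level values defined in the list above.
    public static (double Variance, double Stdev) Combine(
        double mus, double mun, double sig2s, double sig2n, double sigsn,
        double K, double N, double xL, double xU)
    {
        // sigma^2 = N / (K * mun^2) * (sig2s - 2*(mus/mun)*sigsn + (mus/mun)^2 * sig2n)
        double sigma = Math.Sqrt(
            N / (K * mun * mun)
            * (sig2s - 2 * (mus / mun) * sigsn + (mus * mus) / (mun * mun) * sig2n));

        // The ranks p +/- 1.96 * sigma / Math.Sqrt(N) are then passed to GetPercentile
        // to obtain xL and xU (that step happens outside of this helper).
        double variance = N * Math.Pow((xU - xL) / (2 * 1.96), 2);
        double stdev = Math.Sqrt(N) * (xU - xL) / (2 * 1.96);
        return (variance, stdev);
    }
}
```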
There are several tricks used here. The main idea is that we utilize `Histogram` objects to store additional information at each aggregation level, which allows us to defer the actual variance calculation until the very last aggregation level. The first trick is to realize that to compute $\sum S_j$, needed in the computation of $\mu_S$, we do not actually need to compute either $S_j = \sum Y_i$ or $Y_i$ (which is great, because we cannot compute them directly: we don't have the value $X_{[p]}$ of the percentile). Instead, $\sum S_j$ can be obtained from the histogram `so` directly by adding up the counts for all the PLT values that are less than or equal to `perc`, since
$$
\sum\limits_{\text{users } j} S_j = \sum\limits_{\text{users } j} \sum\limits_{\text{pages } i} I( X_i \leq X_{[p]} ) = \sum\limits_{\text{all pages } i} I( X_i \leq X_{[p]} ) = [\text{since } I \text{ is a binary function}] = \sum\limits_{\text{all pages } i \text{ with PLT} \leq X_{[p]}} 1 = \text{count of all pages with PLT} \leq X_{[p]}.
$$
Because we don't know the value `perc` of the percentile ahead of time, the histograms `ss` and `sn` contain the values of $\sum S^2_j$ and $\sum S_j\cdot N_j$ (needed for $\sigma^2_S$ and $\sigma_{SN}$) for all possible actual values of `perc`. At the last aggregation level, we are able to obtain the value of `perc` from `so`, and hence we can then "look up" the correct sum values from the histograms `ss` and `sn`.
Specifics of implementation in Mangrove
First, we need to implement the additional `Histogram` operations used above, namely the unary operation `GetSumOfValues(Histogram h)` and the binary operation `GetSumOfValuesUpTo(Histogram h, double value)`. The first one simply returns the sum of all the values in the histogram. The second sorts the histogram keys and returns the sum of the values for the keys less than or equal to `value`. Both of these implementations are fairly straightforward.
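A minimal sketch of the two operations, assuming for illustration that the histogram is exposed as an `IDictionary<long, long>` from PLT key to count; the actual `Histogram` type and Scope UDO signatures may differ:

```csharp
// Sketch of the two new operations, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class HistogramOps
{
    // Sum of all counts in the histogram (N_j when applied to a user's histogram s).
    public static long GetSumOfValues(IDictionary<long, long> h)
        => h.Values.Sum();

    // Sum of the counts for keys less than or equal to `value`
    // (S_j when applied to a user's histogram s with value = perc).
    public static long GetSumOfValuesUpTo(IDictionary<long, long> h, double value)
        => h.Where(kv => kv.Key <= value).Sum(kv => kv.Value);
}
```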
Next, we need to implement two additional unary operations: one for computing the `ss` histogram and one for computing the `sn` histogram.

For `ss`, assume the user histogram `s` has keys $k_i$ with the corresponding values $c_i$. The operation then does the following:
- sort the keys $k_i$ (let's assume $k_i < k_{i+1}$ for all $i$ to simplify the notation),
- the new value for the smallest key $k_1$ is $v_1 = c_1^2$,
- then, for each $i>1$, the new value corresponding to the key $k_i$ is $v_i = (c_1+\dots+c_i)^2 - (c_1+\dots+c_{i-1})^2 = c_i^2 + 2c_i(c_1+\dots+c_{i-1})$.
For example, say for a user $u$ the histogram `s` has the following data:
PLT | count |
---|---|
5 | 2 |
9 | 1 |
10 | 2 |
Then the result of the operation is
PLT | ss |
---|---|
5 | $2^2=4$ |
9 | $(1+2)^2-2^2=5$ |
10 | $(2+1+2)^2-(1+2)^2=16$ |
The point is that after doing the usual `HistogramUnion` operation, the result of the binary operation `GetSumOfValuesUpTo(ss, perc)` gives exactly the sum $\sum S_j^2$ needed for the variance computation (see above). Let us call this operation `SumOfSquaresDifferences`.
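A minimal sketch of `SumOfSquaresDifferences` under the same `IDictionary<long, long>` assumption as above; applied to the example histogram it reproduces the values $4, 5, 16$ from the table:

```csharp
// Sketch of SumOfSquaresDifferences, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class SsOperation
{
    public static Dictionary<long, long> SumOfSquaresDifferences(IDictionary<long, long> s)
    {
        var result = new Dictionary<long, long>();
        long running = 0;                            // c_1 + ... + c_{i-1}
        foreach (var entry in s.OrderBy(e => e.Key)) // sort the keys
        {
            long next = running + entry.Value;       // c_1 + ... + c_i
            result[entry.Key] = next * next - running * running;
            running = next;
        }
        return result;
    }
}
```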
To get the `sn` histogram, the operation is even simpler: we don't even need to sort the keys. If the user histogram `s` has keys $k_i$ with the corresponding values $c_i$, the new values are $v_i = n\cdot c_i$, where $n = \sum c_i$ is the total sum of all the values in the histogram. This value of $n$ is the same as the result of `GetSumOfValues(s)`.
For example, say for a user $u$ the histogram `s` has the following data:
PLT | count |
---|---|
5 | 2 |
9 | 1 |
10 | 2 |
Then the result of the operation is
PLT | sn |
---|---|
5 | $2\cdot 5=10$ |
9 | $1\cdot 5=5$ |
10 | $2\cdot 5=10$ |
Like for `ss`, doing the `HistogramUnion` aggregation and applying the binary operation `GetSumOfValuesUpTo(sn, perc)` produces the sum $\sum S_j N_j$ needed for the variance computation (see above). Let us call this operation `MultiplyBySumOfValues`.
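A minimal sketch of `MultiplyBySumOfValues` under the same assumption; applied to the example histogram it reproduces the values $10, 5, 10$ above:

```csharp
// Sketch of MultiplyBySumOfValues, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class SnOperation
{
    public static Dictionary<long, long> MultiplyBySumOfValues(IDictionary<long, long> s)
    {
        long n = s.Values.Sum();      // same as GetSumOfValues(s)
        return s.ToDictionary(kv => kv.Key, kv => kv.Value * n);
    }
}
```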
Because we need to utilize `Histogram`s at the variance computation step, we need to call the `ReplaceNonRecursiveAggregationTypes` transformer at that step. We will change this transformer to allow more fine-grained control over which aggregations to replace (since we only need to replace `PERCENTILE` metrics).
Simplification for non-delta-method variance
If the percentile metric `PERCENTILE(u, 75)` is at the experiment unit level (e.g. `User`), then the formulas and the implementation are simplified as follows.
From the algorithm perspective, using the notation above and noting that $N=K$ in this case, the formulas become $\sigma^2_S = N\cdot (\mathrm{E}(S^2) - \mu^2_S)$ and
$$
\sigma^2 = \frac{1}{N^2} \sigma^2_S
$$
The rest of the algorithm stays exactly the same.
From the Mangrove implementation point of view, far fewer new expressions need to be created.
On `User`, we create a new `Histogram` object containing a single key `u` and the corresponding value `1`. Let's call it `h = new Histogram(u)`.
On `Output`, we add two new `Histogram` objects. Let's name them `so` and `ss`. Similar to the delta-method case above, `so = HistogramUnion(h)` and `ss = HistogramUnion(u_ss)`.
Other expressions needed on the `Output` table in this case:
- `perc = GetPercentile(so, p)`, the value $X_{[p]}$ of the percentile,
- `N = SUM(1)`, the total number of users,
- `mus = GetSumOfValuesUpTo(so, perc)/N`, equal to $\mu_S$ above,
- `sig2s = mus - mus*mus`, which is $\sigma^2_S$,
- `sigma = Math.Sqrt(sig2s)`,
- `xL = GetPercentile(so, p - 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[L]}$ as in the case before,
- `xU = GetPercentile(so, p + 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[U]}$ as in the case before,
- finally, the variance $\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2$, or, if we care about standard deviation, $\mathrm{stdev} = \sqrt{N}\left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)$.
Note: we are only adding expressions to two tables: `User` and `Output`.
Additional remarks
Implementation of the Histogram object in Scope
Because many of the percentile-related operations and aggregations on histograms require the keys to be sorted, it might be beneficial for the base class of the `Histogram` object in Scope to inherit from a class other than `System.Collections.Generic.Dictionary`, for example `System.Collections.Generic.SortedDictionary`. This would require re-implementing the `HistogramUnion` UDO. We plan to keep using `Dictionary` for now, but we will actively investigate the performance of other solutions.
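For illustration, here is a rough sketch of what the `HistogramUnion` merge could look like over a `SortedDictionary`-backed histogram; the signature is an assumption, not the current UDO interface. Keeping the keys ordered would turn `GetSumOfValuesUpTo` into a simple prefix scan, at the cost of logarithmic-time insertions during the union.

```csharp
// Sketch only: HistogramUnion over SortedDictionary-backed histograms.
using System.Collections.Generic;

static class HistogramUnionSketch
{
    public static SortedDictionary<long, long> Union(IEnumerable<SortedDictionary<long, long>> histograms)
    {
        var result = new SortedDictionary<long, long>();
        foreach (var h in histograms)
            foreach (var kv in h)
                result[kv.Key] = result.TryGetValue(kv.Key, out var c) ? c + kv.Value : kv.Value;
        return result;
    }
}
```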
Other compute fabrics
We plan to implement this algorithm for computing percentile variances in Scope, Hypercube, and Metrics Vectorization (MV). Feasibility for Kusto and SparkSQL requires additional investigation (in particular, we do not yet support histograms in these two fabrics).
Performance issues
The histograms `so`, `ss` and `sn` are fairly large, so there is a concern about memory usage. We plan to benchmark this solution once it is implemented. A possible optimization to reduce the memory footprint is to combine the three histograms into a single object, i.e. a single dictionary with 3 values per key. We will investigate this option further in a separate document.