Variance for percentile metrics
Owner | Approvers | Participants |
---|---|---|
Sasha Patotski | Ameya Chaudhari, Craig Boucher, Daniel Miller, Ulf Knoblich |
The two main types of metrics supported in Mangrove are average metrics and percentiles. Mangrove currently has a generic solution providing variance computations for average metrics. One of the supported compute fabrics (Hypercube) also supports percentile variance, but that solution cannot be extended to other fabrics. This RFC (Request For Comments) provides a general algorithm for the variance computation for percentiles.
Requirements
The new method of computing percentile metrics' variance should implement the outer CI algorithm, but it should also create all the necessary expressions computing the variance in one pass through the data.
The main challenge
The two main types of metrics supported in Mangrove are average metrics and percentiles. Average metrics are conceptually simpler: an aggregation `AVG(x)` can be replaced with a ratio `SUM(x)/SUM(1)`, with both numerator and denominator being recursive. Recursivity means that aggregating to a higher aggregation level can be done by applying the aggregation to all the intermediate levels. For example, if the aggregation levels are Page-Session-User-Flight (i.e. each session has many pages, each user has many sessions, each flight has many users), then `SUM<Flight>(x) = SUM<Flight>(SUM<User>(SUM<Session>(x)))` for a `Page`-level value `x`.
Percentiles cannot easily be expressed in terms of recursive aggregations, which complicates the computation graph. Currently this is achieved by introducing a `Histogram` object with a set of custom functions and aggregations. For a detailed description, see the segmentation documentation.

Moreover, the outer CI algorithm for percentile variance computation requires knowing the final value of the percentile at the lower level of aggregation. This means that a naive implementation would require two passes through the computation: a first pass to obtain the percentile values, and a second pass to use those values for the variance computation.
Proposal
We propose the following algorithm and its implementation.
Algorithm
Setup and assumptions
Suppose we need to compute variance for the metric `PERCENTILE(x, p)`, the `p`-th percentile of the value `x`. Recall that we only support percentiles for integer values, so `x` must be of type `int`.
We will only describe the delta-method variance calculation, as it is more complex and the standard variance is a special case of the delta method. Let's assume for convenience that the `MetricsPlan` has the following aggregation levels (identified by `Table`s of type `Aggregation`): `Page`, `Session`, `User`, `Output`, with `User` being the experiment unit table. This assumption is not limiting in any way; it simply gives convenient names to the tables. If a given `MetricsPlan` has more aggregation levels, has several output tables, or has `TableReference`s wrapping some of the aggregations, the algorithm stays pretty much the same.
Since we only describe the delta-method algorithm, let's assume `x` in `PERCENTILE(x, p)` is a column (specified by a `ColumnReference`) on the table `Page` (i.e. below the experiment unit level `User`). Without loss of generality, let's assume `x` represents the Page Load Time (PLT) of the corresponding page, just for convenience of referring to it.
General algorithm
Let us briefly summarize the variance calculation for percentile metrics as described in Deng, Knoblich, Lu. We will use the assumptions described above.

Let $X_i$ denote the `Page`-level PLT values, $N$ the total number of pages, and $K$ the number of users. Let $X_{[p]}$ be the $p$-th percentile value, and let $Y_i = I(X_i \leq X_{[p]})$ be the indicator of whether a given PLT value $X_i$ is less than or equal to the percentile value $X_{[p]}$.

Let $S_j = \sum\limits_{\text{pages for user } j } Y_i$ and $N_j = \sum\limits_{\text{pages for user } j } 1$ be the samples for the corresponding random variables $S$ (the number of pages per user with PLT at most $X_{[p]}$) and $N$ (the total number of pages per user). These are now i.i.d. Let $\mu_S = \mathrm{E}(S)$ and $\mu_N = \mathrm{E}(N)$ be the corresponding expectations. Let $\sigma^2_S = K\cdot (\mathrm{E}(S^2) - \mu^2_S)$, $\sigma^2_N = K\cdot (\mathrm{E}(N^2) - \mu^2_N)$ and $\sigma_{SN} = \mathrm{E}(S N) - \mu_S\mu_N$. Then we compute
$$
\sigma^2 = \frac{N}{K\mu_N^2}\left( \sigma^2_S - 2 \frac{\mu_S}{\mu_N}\sigma_{SN} + \frac{\mu_S^2}{\mu_N^2}\sigma^2_N \right)
$$
Using $\sigma^2$, we compute the two ranks $L,U = p\pm z_{\alpha/2}\sigma/\sqrt{N}$. Finally, we fetch the two values $X_{[L]}$ and $X_{[U]}$ at these ranks, and the expression computing the final variance is
$$
\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2
$$
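To make the computation above concrete, here is a naive two-pass reference sketch in C# on in-memory data (pages grouped by user). This is not Mangrove code: the helper names, the ranking convention, the assumption that `p` is a fraction in $(0,1)$, and the use of plain per-user sample moments (matching the `Output`-level expressions listed later) are all illustrative choices. The rest of this RFC is about obtaining the same quantities in a single pass using `Histogram` objects.

```csharp
// Naive two-pass reference sketch of the outer CI computation, NOT Mangrove code.
using System;
using System.Collections.Generic;
using System.Linq;

static class OuterCiSketch
{
    // pagesByUser[j] holds the Page-level PLT values X_i for user j; p is a fraction in (0, 1).
    public static double PercentileVariance(List<long[]> pagesByUser, double p)
    {
        long[] all = pagesByUser.SelectMany(u => u).OrderBy(x => x).ToArray();
        int N = all.Length;            // total number of pages
        int K = pagesByUser.Count;     // total number of users
        long Rank(double q) => all[Math.Min(N - 1, Math.Max(0, (int)Math.Ceiling(q * N) - 1))];

        long perc = Rank(p);           // pass 1: the percentile value X_[p]

        // Pass 2: per-user S_j (pages with PLT <= X_[p]) and N_j (page count).
        double[] S  = pagesByUser.Select(u => (double)u.Count(x => x <= perc)).ToArray();
        double[] Nj = pagesByUser.Select(u => (double)u.Length).ToArray();

        double muS = S.Average(), muN = Nj.Average();
        double sig2S = S.Select(s => s * s).Average() - muS * muS;
        double sig2N = Nj.Select(n => n * n).Average() - muN * muN;
        double sigSN = S.Zip(Nj, (s, n) => s * n).Average() - muS * muN;

        double sigma = Math.Sqrt(
            N / (K * muN * muN)
            * (sig2S - 2 * (muS / muN) * sigSN + (muS * muS) / (muN * muN) * sig2N));

        // Fetch the values at the two ranks and form the final variance.
        long xL = Rank(p - 1.96 * sigma / Math.Sqrt(N));
        long xU = Rank(p + 1.96 * sigma / Math.Sqrt(N));
        return N * Math.Pow((xU - xL) / (2 * 1.96), 2);
    }
}
```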
The main idea
As described in the segmentation documentation, we will use `Histogram`s. The algorithm adds new expressions to 3 tables: the table `Page` containing the value `x`, the experiment unit table `User`, and the output table `Output`.

On `Page`, we create a new `Histogram` object containing a single key `x` and the corresponding value `1` (since this is a histogram for a single element). Let's call it `h = new Histogram(x)`.
On `User`, we add `s = HistogramUnion(h)`, the union of the histograms for each user. In other words, this column `s` contains the histogram of all page load times for all the pages that belong to the user. The keys are the distinct values of PLT, and the values are the counts of how often each PLT was seen for the user. The random variables $S_j$ from the previous section are simply the values `GetSumOfValuesUpTo(s, perc)`, while $N_j$ are `GetSumOfValues(s)`.
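As a small illustration (assuming, purely for this sketch, that a `Histogram` behaves like a `Dictionary<long, long>` from PLT key to count), here is what the user-level column `s` holds for one hypothetical user and how $S_j$ and $N_j$ can be read off it once `perc` is known:

```csharp
// Illustration only: a user-level histogram s as a plain Dictionary<long, long>.
using System;
using System.Linq;

static class UserHistogramExample
{
    public static void Demo()
    {
        long[] pageLoadTimes = { 5, 9, 5, 10, 10 };   // Page-level values x for one user

        // s = HistogramUnion(h): PLT key -> count of pages with that PLT.
        var s = pageLoadTimes.GroupBy(x => x)
                             .ToDictionary(g => g.Key, g => g.LongCount());

        long perc = 9;  // hypothetical percentile value X_[p]
        long Sj = s.Where(kv => kv.Key <= perc).Sum(kv => kv.Value);  // S_j = 3 (pages with PLT <= 9)
        long Nj = s.Values.Sum();                                     // N_j = 5 (total pages)
        Console.WriteLine($"S_j = {Sj}, N_j = {Nj}");
    }
}
```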
On `Output`, we add three new `Histogram` objects. Let's name them `so`, `ss` and `sn`.

First, `so = HistogramUnion(s)` is simply the total union of all the user-level histograms.

Next, `ss` is a histogram whose keys are the same as the keys of `so`, namely all possible values of PLT. For each key `plt` in `ss`, the corresponding value is $\sum\limits_{\text{users}} \left(\sum\limits_{k_i \leq \mathrm{plt}} c_i \right)^2$, where $k_i$ is a key and $c_i$ is the corresponding value in that user's histogram object `s`.

Finally, `sn` is a histogram whose keys are the same as the keys of `so` and `ss`. For each key `plt` in `sn`, the corresponding value is $\sum\limits_{\text{users}} \left(\sum\limits_{\text{all keys } k_i} c_i\right) \cdot \left(\sum\limits_{k_i \leq \mathrm{plt}} c_i\right)$, where $k_i$ is a key and $c_i$ is the corresponding value in that user's histogram object `s`.
We will explain why we need such unusual objects below. First let's finish describing the variance algorithm.
We need to add several other expressions to the `Output` table:
- `perc = GetPercentile(so, p)`, the value $X_{[p]}$ of the percentile,
- `K = SUM(1)`, the total number of users,
- `N = GetSumOfValues(so)`, the total number of pages,
- `mus = GetSumOfValuesUpTo(so, perc)/K`, equal to $\mu_S$ above,
- `mun = N/K`, equal to $\mu_N$,
- `sig2n = SUM(Nj*Nj)/K - mun*mun`, the value of $\sigma^2_N$, where `Nj = GetSumOfValues(s)` is the per-user number of pages from above,
- `sig2s = GetSumOfValuesUpTo(ss, perc)/K - mus*mus`, which is $\sigma^2_S$,
- `sigsn = GetSumOfValuesUpTo(sn, perc)/K - mus*mun`, which is $\sigma_{SN}$,
- these allow us to compute `sigma` from the formula above (the expression is complicated, so we won't write it all out here; see the sketch after this list),
- `xL = GetPercentile(so, p - 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[L]}$ above,
- `xU = GetPercentile(so, p + 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[U]}$ above,
- finally, this gives us all the components needed for the variance $\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2$, or the standard deviation $\mathrm{stdev} = \sqrt{N}\left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)$.
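For reference, here is a minimal sketch (not the actual Mangrove expression tree) of how the `Output`-level scalars listed above combine into `sigma` and the final variance; the helper name and the plain-scalar signature are illustrative only:

```csharp
// Sketch only: how the Output-level scalars above combine into sigma and var.
using System;

static class PercentileVarianceCombine
{
    // mus, mun, sig2s, sig2n, sigsn, K, N, xL and xU are assumed to be the
    // Output-level values defined in the list above.
    public static (double Variance, double Stdev) Combine(
        double mus, double mun, double sig2s, double sig2n, double sigsn,
        double K, double N, double xL, double xU)
    {
        // sigma^2 = N / (K * mun^2) * (sig2s - 2*(mus/mun)*sigsn + (mus/mun)^2 * sig2n)
        double sigma = Math.Sqrt(
            N / (K * mun * mun)
            * (sig2s - 2 * (mus / mun) * sigsn + (mus * mus) / (mun * mun) * sig2n));

        // The ranks p +/- 1.96 * sigma / Math.Sqrt(N) are then passed to GetPercentile
        // to obtain xL and xU (that step happens outside of this helper).
        double variance = N * Math.Pow((xU - xL) / (2 * 1.96), 2);
        double stdev = Math.Sqrt(N) * (xU - xL) / (2 * 1.96);
        return (variance, stdev);
    }
}
```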
There are several tricks used here. The main idea is that we utilize `Histogram` objects to store additional information at each aggregation level, which allows us to defer the actual variance calculation until the very last aggregation level. The first trick is to realize that to compute $\sum S_j$, needed in the computation of $\mu_S$, we do not actually need to compute either $S_j = \sum Y_i$ or $Y_i$ (which is great, because we cannot compute them directly: we don't have the value $X_{[p]}$ of the percentile). Instead, $\sum S_j$ can be obtained from the histogram `so` directly by adding up the counts for all the PLT values that are less than or equal to `perc`, since
$$
\sum\limits_{\text{users } j} S_j = \sum\limits_{\text{users } j} \sum\limits_{\text{pages } i} I( X_i \leq X_{[p]} ) = \sum\limits_{\text{all pages } i} I( X_i \leq X_{[p]} ) = [\text{since } I \text{ is a binary function}] = \sum\limits_{\text{all pages } i \text{ with PLT} \leq X_{[p]}} 1 = \text{count of all pages with PLT} \leq X_{[p]}.
$$
Because we don't know the value `perc` of the percentile ahead of time, the histograms `ss` and `sn` contain the values of $\sum S^2_j$ and $\sum S_j\cdot N_j$ (needed for $\sigma^2_S$ and $\sigma_{SN}$) for all possible actual values of `perc`. At the last aggregation level, we are able to obtain the value of `perc` from `so`, and hence we can then "look up" the correct sum values from the histograms `ss` and `sn`.
Specifics of implementation in Mangrove
First, we need to implement the additional `Histogram` operations used above, namely the unary operation `GetSumOfValues(Histogram h)` and the binary operation `GetSumOfValuesUpTo(Histogram h, double value)`. The first one simply returns the sum of all the values in the histogram. The second sorts the histogram keys and returns the sum of the values for the keys less than or equal to `value`. Both of these implementations are fairly straightforward.
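A minimal sketch of the two operations, assuming for illustration that the histogram is exposed as an `IDictionary<long, long>` from PLT key to count; the actual `Histogram` type and Scope UDO signatures may differ:

```csharp
// Sketch of the two new operations, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class HistogramOps
{
    // Sum of all counts in the histogram (N_j when applied to a user's histogram s).
    public static long GetSumOfValues(IDictionary<long, long> h)
        => h.Values.Sum();

    // Sum of the counts for keys less than or equal to `value`
    // (S_j when applied to a user's histogram s with value = perc).
    public static long GetSumOfValuesUpTo(IDictionary<long, long> h, double value)
        => h.Where(kv => kv.Key <= value).Sum(kv => kv.Value);
}
```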
Next, we need to implement two additional unary operations: one for computing the `ss` histogram and one for computing the `sn` histogram.

For `ss`, assume the user histogram `s` has keys $k_i$ with the corresponding values $c_i$. The operation then does the following:
- sort the keys $k_i$ (let's assume $k_i < k_{i+1}$ for all $i$ to simplify the notation),
- the new value for the smallest key $k_1$ is $v_1 = c_1^2$,
- then, for each $i>1$, the new value corresponding to the key $k_i$ is $v_i = (c_1+\dots+c_i)^2 - (c_1+\dots+c_{i-1})^2 = c_i^2 + 2c_i(c_1+\dots+c_{i-1})$.
For example, say for a user $u$ the histogram `s` has the following data:
PLT | count |
---|---|
5 | 2 |
9 | 1 |
10 | 2 |
Then the result of the operation is
PLT | ss |
---|---|
5 | $2^2=4$ |
9 | $(1+2)^2-2^2=5$ |
10 | $(2+1+2)^2-(1+2)^2=16$ |
The point is that after doing the usual `HistogramUnion` operation, the result of the binary operation `GetSumOfValuesUpTo(ss, perc)` gives exactly the sum $\sum S_j^2$ needed for the variance computation (see above). Let us call this operation `SumOfSquaresDifferences`.
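A minimal sketch of `SumOfSquaresDifferences` under the same `IDictionary<long, long>` assumption as above; applied to the example histogram it reproduces the values $4, 5, 16$ from the table:

```csharp
// Sketch of SumOfSquaresDifferences, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class SsOperation
{
    public static Dictionary<long, long> SumOfSquaresDifferences(IDictionary<long, long> s)
    {
        var result = new Dictionary<long, long>();
        long running = 0;                            // c_1 + ... + c_{i-1}
        foreach (var entry in s.OrderBy(e => e.Key)) // sort the keys
        {
            long next = running + entry.Value;       // c_1 + ... + c_i
            result[entry.Key] = next * next - running * running;
            running = next;
        }
        return result;
    }
}
```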
To get the `sn` histogram, the operation is even simpler: we don't even need to sort the keys. If the user histogram `s` has keys $k_i$ with the corresponding values $c_i$, the new values are $v_i = n\cdot c_i$, where $n = \sum c_i$ is the total sum of all the values in the histogram. This value of $n$ is the same as the result of `GetSumOfValues(s)`.
For example, say for a user $u$ the histogram `s` has the following data:
PLT | count |
---|---|
5 | 2 |
9 | 1 |
10 | 2 |
Then the result of the operation is
PLT | sn |
---|---|
5 | $2\cdot 5=10$ |
9 | $1\cdot 5=5$ |
10 | $2\cdot 5=10$ |
Like for `ss`, doing the `HistogramUnion` aggregation and applying the binary operation `GetSumOfValuesUpTo(sn, perc)` produces the sum $\sum S_j N_j$ needed for the variance computation (see above). Let us call this operation `MultiplyBySumOfValues`.
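A minimal sketch of `MultiplyBySumOfValues` under the same assumption; applied to the example histogram it reproduces the values $10, 5, 10$ above:

```csharp
// Sketch of MultiplyBySumOfValues, assuming Histogram ~ IDictionary<long, long>.
using System.Collections.Generic;
using System.Linq;

static class SnOperation
{
    public static Dictionary<long, long> MultiplyBySumOfValues(IDictionary<long, long> s)
    {
        long n = s.Values.Sum();      // same as GetSumOfValues(s)
        return s.ToDictionary(kv => kv.Key, kv => kv.Value * n);
    }
}
```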
Because we need to utilize `Histogram`s at the variance computation step, we need to call the `ReplaceNonRecursiveAggregationTypes` transformer at that step. We will change this transformer to allow more fine-grained control over which aggregations to replace (since we only need to replace `PERCENTILE` metrics).
Simplification for non-delta-method variance
If the percentile metric `PERCENTILE(u, 75)` is at the experiment unit level (e.g. `User`), then the formulas and the implementation are simplified as follows.
From the algorithm perspective, using the notation above and noting that $N=K$ in this case, the formulas become $\sigma^2_S = N\cdot (\mathrm{E}(S^2) - \mu^2_S)$ and
$$
\sigma^2 = \frac{1}{N^2} \sigma^2_S
$$
The rest of the algorithm stays exactly the same.
From the Mangrove implementation point of view, far fewer new expressions need to be created.
On `User`, we create a new `Histogram` object containing a single key `u` and the corresponding value `1`. Let's call it `h = new Histogram(u)`.
On `Output`, we add two new `Histogram` objects. Let's name them `so` and `ss`. Similar to the delta-method case above, `so = HistogramUnion(h)` and `ss = HistogramUnion(u_ss)`.
Other expressions needed on the `Output` table in this case:
- `perc = GetPercentile(so, p)`, the value $X_{[p]}$ of the percentile,
- `N = SUM(1)`, the total number of users,
- `mus = GetSumOfValuesUpTo(so, perc)/N`, equal to $\mu_S$ above,
- `sig2s = mus - mus*mus`, which is $\sigma^2_S$,
- `sigma = Math.Sqrt(sig2s)`,
- `xL = GetPercentile(so, p - 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[L]}$ as in the case before,
- `xU = GetPercentile(so, p + 1.96*sigma/Math.Sqrt(N))`, the value of $X_{[U]}$ as in the case before,
- finally, the variance $\mathrm{var} = N \left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)^2$, or, if we care about standard deviation, $\mathrm{stdev} = \sqrt{N}\left( \frac{X_{[U]}-X_{[L]}}{2\cdot 1.96} \right)$.
Note: we are only adding expressions to two tables: `User` and `Output`.
Additional remarks
Implementation of the Histogram object in Scope
Because many of the percentile-related operations and aggregations on histograms require the keys to be sorted, it might be beneficial for the base class of the `Histogram` object in Scope to inherit from a class other than `System.Collections.Generic.Dictionary`, for example `System.Collections.Generic.SortedDictionary`. This would require re-implementing the `HistogramUnion` UDO. We plan to keep using `Dictionary` for now, but we will actively investigate the performance of other solutions.
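For illustration, here is a rough sketch of what the `HistogramUnion` merge could look like over a `SortedDictionary`-backed histogram; the signature is an assumption, not the current UDO interface. Keeping the keys ordered would turn `GetSumOfValuesUpTo` into a simple prefix scan, at the cost of logarithmic-time insertions during the union.

```csharp
// Sketch only: HistogramUnion over SortedDictionary-backed histograms.
using System.Collections.Generic;

static class HistogramUnionSketch
{
    public static SortedDictionary<long, long> Union(IEnumerable<SortedDictionary<long, long>> histograms)
    {
        var result = new SortedDictionary<long, long>();
        foreach (var h in histograms)
            foreach (var kv in h)
                result[kv.Key] = result.TryGetValue(kv.Key, out var c) ? c + kv.Value : kv.Value;
        return result;
    }
}
```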
Other compute fabrics
We plan to implement this algorithm for computing percentile variances in Scope, Hypercube, and Metrics Vectorization (MV). Feasibility for Kusto and SparkSQL requires additional investigation (in particular, we do not yet support histograms in these two fabrics).
Performance issues
The histograms `so`, `ss` and `sn` are fairly large, so there is a concern about memory usage. We plan to benchmark this solution once it is implemented. A possible optimization to reduce the memory footprint is to combine the three histograms into a single object, i.e. a single dictionary with 3 values per key. We will investigate this option further in a separate document.