Add Variance
| Owner | Approvers | Participants |
|---|---|---|
| Amin Saied | Daniel Miller, Craig Boucher | Sasha Patotski |
This document addresses two points related to the AddVariance class.
- A high-level description of where expressions are added in the variance computation.
- An outline of some of the assumptions and design decisions made when implementing the AddVariance class in Mangrove.
Variance expressions architecture
Recall that there are two methods for computing variance of a given metric:
- The standard variance - in the case that the metric is defined at the experiment unit (EU) level,
- The delta method variance - in the case that the metric is defined below the EU level.
We give an example of each, describing where expressions are added.
1. Standard variance example
Consider the following simple metrics plan,
base <- user <- userOut
with a metric AVG(x) on userOut. In order to compute the variance of
AVG(x) we need to compute x**2 on the user table. The variance can
then be computed on the userOut table.
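The role of x**2 in this example can be illustrated with a minimal Python sketch. It is not Mangrove code: the function name `variance_of_mean` and the sample values are hypothetical, and the use of the unbiased (n - 1) sample variance is an assumption that the source does not state.

```python
from math import isclose
from statistics import variance


def variance_of_mean(sum_x, sum_x2, n):
    """Variance of AVG(x) from SUM(x), SUM(x**2) and the number of users."""
    mean = sum_x / n
    # Unbiased sample variance of x across users (the n - 1 is an assumption).
    sample_var = (sum_x2 - n * mean * mean) / (n - 1)
    # The variance of the mean is the sample variance divided by n.
    return sample_var / n


# Hypothetical user-level values of x, one per experiment unit.
x = [1.0, 2.0, 4.0, 7.0]
result = variance_of_mean(sum(x), sum(v ** 2 for v in x), len(x))

# Cross-check against the standard library's sample variance.
assert isclose(result, variance(x) / len(x))
```

Only SUM(x), SUM(x**2) and the row count are needed, which is why adding the single expression x**2 at the user level is enough for the output table to produce the variance.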
2. Delta method variance example
Consider the following metrics plan,
base <- session <- user <- userOut
<--------- sessionOut
In this case we need to add expressions to the session, user and
sessionOut tables as follows.
session: (x!=null) ? 1 : 0 AS y
user: SUM(x) AS num, SUM(y) AS den, COUNT(y) AS count, num**2, den**2, num*den
sessionOut: Var(num), Var(den), Cov(num, den)
From this, the variance can be computed on the sessionOut table via the delta
method.
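As a rough illustration of that last step, the sketch below applies the usual first-order delta-method approximation for a ratio of means to per-user num and den values. The function names and sample data are hypothetical, and the (n - 1) normalisation is an assumption. The source does not give the exact formula used in Mangrove.

```python
from math import isclose
from statistics import mean, variance


def covariance(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def delta_method_variance(num, den):
    """Approximate variance of mean(num) / mean(den) over n users."""
    n = len(num)
    mu_d = mean(den)
    ratio = mean(num) / mu_d
    # First-order Taylor expansion of the ratio around the two means.
    spread = variance(num) - 2 * ratio * covariance(num, den) + ratio ** 2 * variance(den)
    return spread / (n * mu_d ** 2)


# Hypothetical per-user aggregates: num = SUM(x), den = SUM(y).
num = [2.0, 3.0, 6.0]
den = [1.0, 2.0, 3.0]
result = delta_method_variance(num, den)
```

When every user has den equal to 1, the covariance and Var(den) terms vanish and the expression reduces to Var(num) / n, which is the standard variance of the mean.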
Assumptions
An important concept associated with a metrics plan is that of level. The level
of a table is the condition used in the GROUP BY that defines that table. We
make two assumptions related to the level of a table.
- Metrics at different levels live on output tables that reference tables at different levels.
- There is a bijection between levels and aggregation tables.
We spell these assumptions out below.
1. Output tables reference the appropriate levels
Recall that there are two methods for computing variance of a given metric:
- The standard variance - in the case that the metric is defined at the experiment unit (EU) level,
- The delta method variance - in the case that the metric is defined below the EU level.
To determine which method should be used to compute variance, we need to determine the level at which the metric is defined. We assume that metrics at different levels live on output tables that reference tables at different levels.
For example, consider the following table structure:
base <- session <- user <- userOut
<--------- sessionOut
Suppose we have a session-level metric that lives on the sessionOut table.
Since sessionOut aggregates session, which is below the EU (user in this
case), we know to appeal to the delta method to compute the variance for this
metric.
Note that we rely on the table dependency to make this inference, in particular,
that sessionOut aggregates session. An example of how this dependency
structure might be lost is if the output tables were combined:
base <- session <- user <- combinedOut
Note: The segmentation transformer does combine the output tables, so the AddVariance transformer must be applied before the segmentation transformer.
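The inference described in this section can be sketched in Python. This is an illustration under stated assumptions, not the AddVariance implementation: the `PARENT` map, the function names, and the simplification that each output table references its level table directly are all hypothetical.

```python
# Hypothetical dependency map: each table points to the table it references.
PARENT = {
    "session": "base",
    "user": "session",
    "userOut": "user",
    "sessionOut": "session",
}


def ancestors(table, parent):
    """Yield the tables that `table` depends on, nearest first."""
    while table in parent:
        table = parent[table]
        yield table


def variance_method(output_table, eu_table, parent):
    """Pick a variance method from the table an output table references."""
    level = parent[output_table]
    if level == eu_table:
        return "standard"
    # A table the EU table depends on is below the EU level.
    if level in ancestors(eu_table, parent):
        return "delta"
    raise ValueError(f"cannot infer the level of {output_table!r}")


# With combined output tables, every metric appears to be at the EU level.
COMBINED = {"session": "base", "user": "session", "combinedOut": "user"}
```

Here `variance_method("sessionOut", "user", PARENT)` returns "delta", whereas any metric on `combinedOut` would be classified as "standard" under this sketch, which illustrates how the dependency information is lost once the output tables are combined.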
2. Aggregations define levels
In the AddVariance class, we make the simplifying assumption that levels are
defined by aggregation tables. Concretely, we say two tables a and b are
at the same level if:
- There exists a unique path a <- t1 <- ... <- tn <- b,
- None of t1, ..., tn or b are Aggregation tables.
For example, we do not allow for the following dependency structures:
<- user1
base <- session <- userOut
<- user2
<- user1 <- userOut1
base <- session
<- user2 <- userOut2
This assumption simplifies the logic of adding variance. At several points in that logic we need to determine the output-level tables. We do this by finding the leaf tables, walking back up the tree from each leaf until we reach an aggregation table, and then finding all the descendants of that aggregation. If there are multiple aggregation table parents at the same level, however, this algorithm fails.
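A minimal Python sketch of this walk is given below. It is not the Mangrove code: the `TABLES` structure, the table kinds, and the function names are hypothetical. Skipping nested aggregations when collecting descendants is also an assumption, based on the definition of "same level" given earlier. The sketch relies on each table having a single parent, so it does not cover the disallowed structures.

```python
from collections import defaultdict

# Hypothetical metrics plan: table name -> (kind, parent table).
TABLES = {
    "base": ("base", None),
    "session": ("aggregation", "base"),
    "user": ("aggregation", "session"),
    "userOut": ("output", "user"),
    "sessionOut": ("output", "session"),
}


def children_of(tables):
    """Invert the parent links into a parent -> children map."""
    children = defaultdict(list)
    for name, (_, parent) in tables.items():
        if parent is not None:
            children[parent].append(name)
    return children


def leaf_tables(tables):
    """Tables that no other table references."""
    children = children_of(tables)
    return sorted(name for name in tables if not children.get(name))


def output_level_tables(leaf, tables):
    """Walk up from a leaf to its aggregation, then collect same-level tables."""
    children = children_of(tables)
    node = tables[leaf][1]
    while node is not None and tables[node][0] != "aggregation":
        node = tables[node][1]
    if node is None:
        raise ValueError(f"no aggregation table above {leaf!r}")
    found, stack = [], list(children[node])
    while stack:
        table = stack.pop()
        if tables[table][0] == "aggregation":
            continue  # a nested aggregation starts a new level (assumption)
        found.append(table)
        stack.extend(children[table])
    return node, sorted(found)
```

For the plan above, `leaf_tables(TABLES)` yields `sessionOut` and `userOut`, and each leaf resolves to exactly one aggregation table. With two aggregation parents at the same level, a single upward walk can no longer identify one aggregation per leaf, which is the failure case described above.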
Future Work
The unknown dependency structure outlined above can be handled. Doing so requires recognizing the right pattern in the metric definition to determine the level at which the metric is defined. The current logic for adding variance is modular enough to allow us to incorporate this in the future.