Add Variance

Owner	Approvers	Participants
Amin Saied	Daniel Miller, Craig Boucher	Sasha Patotski

This document addresses two points related to the AddVariance class.

A high-level description of where expressions are added in the variance computation.
An outline of some of the assumptions and design decisions made when implementing the AddVariance class in Mangrove.

Variance expressions architecture

Recall that there are two methods for computing variance of a given metric:

The standard variance - in the case that the metric is defined at the experiment unit (EU) level,
The delta method variance - in the case that the metric is defined below the EU level.

We give an example of each, describing where expressions are added.

1. Standard variance example

Consider the following simple metrics plan,

base <- user <- userOut

with a metric AVG(x) on userOut. In order to compute the variance of AVG(x) we need to compute x**2 on the user table. The variance can then be computed on the userOut table.
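
The role of x**2 can be sketched in Python as follows. This is an illustration under stated assumptions, not Mangrove's implementation: the function name is hypothetical, the input is assumed to hold one value of x per user, and the unbiased sample variance is an assumed choice of estimator.

```python
def standard_variance_of_mean(xs):
    """Variance of AVG(x) over experiment units, from SUM(x), SUM(x**2) and COUNT only."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two experiment units")
    sum_x = sum(xs)                   # aggregated on the output table
    sum_x2 = sum(x * x for x in xs)   # x**2 is the expression added on the user table
    mean = sum_x / n
    # Unbiased sample variance of x, written in terms of the two sums.
    sample_var = (sum_x2 - n * mean * mean) / (n - 1)
    # Variance of the mean of n independent units.
    return sample_var / n
```

Only the two sums and the count are needed, which is why adding x**2 on the user table is sufficient for the variance to be computed on userOut.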

2. Delta method variance example

Consider the following metrics plan,

base <- session <- user <- userOut
                <--------- sessionOut

with a session-level metric such as AVG(x) on sessionOut. In this case we need to add expressions to the session, user and sessionOut tables as follows.

session: (x!=null) ? 1 : 0 AS y
user: SUM(x) AS num, SUM(y) AS den, COUNT(y) AS count, num**2, den**2, num*den
sessionOut: Var(num), Var(den), Cov(num, den)

From this, the variance can be computed on the sessionOut table via the delta method.
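
A minimal Python sketch of that last step is given below. It assumes the usual first-order delta method approximation for the ratio of two means, and takes the per-user num and den values as plain lists. The function name and the use of unbiased sample moments are illustrative assumptions, not details taken from Mangrove.

```python
def delta_method_variance(nums, dens):
    """Approximate variance of SUM(num) / SUM(den) across experiment units."""
    n = len(nums)
    if n < 2 or n != len(dens):
        raise ValueError("need two equal-length lists with at least two units")
    mean_num = sum(nums) / n
    mean_den = sum(dens) / n
    if mean_den == 0:
        raise ValueError("denominator mean must be non-zero")
    # Sample moments built from the sums of num**2, den**2 and num*den.
    var_num = (sum(a * a for a in nums) - n * mean_num ** 2) / (n - 1)
    var_den = (sum(b * b for b in dens) - n * mean_den ** 2) / (n - 1)
    cov = (sum(a * b for a, b in zip(nums, dens)) - n * mean_num * mean_den) / (n - 1)
    ratio = mean_num / mean_den
    # First-order delta method for the ratio of the two means.
    return (var_num - 2 * ratio * cov + ratio ** 2 * var_den) / (n * mean_den ** 2)
```

When every user has den equal to 1, the covariance and denominator variance vanish and the result reduces to the standard variance of the mean.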

Assumptions

An important concept associated with a metrics plan is that of level. The level of a table is the condition used in the GROUP BY that defines that table. We make two assumptions related to the level of a table.

Metrics at different levels live on output tables that reference tables at different levels.
There is a bijection between levels and aggregation tables.

We spell these assumptions out below.

1. Output tables reference the appropriate levels

As described above, the standard variance applies when the metric is defined at the experiment unit (EU) level, and the delta method variance applies when the metric is defined below the EU level.

To determine which method should be used to compute variance, we need to determine on what level the metric is defined. We assume that metrics at different levels live on output tables that reference tables at different levels.

For example, consider the following table structure:

base <- session <- user <- userOut
                <--------- sessionOut

Suppose we have a session-level metric that lives on the sessionOut table. Since sessionOut aggregates session, which is below the EU (user in this case), we know to appeal to the delta method to compute the variance for this metric.

Note that we rely on the table dependency to make this inference, in particular on the fact that sessionOut aggregates session. An example of how this dependency structure might be lost is if the output tables were combined:

base <- session <- user <- combinedOut

Note: The segmentation transformer does combine the output tables, and so we require the AddVariance transformer to be applied before the segmentation transformer.
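
The inference described in this section can be sketched in Python. The representation is hypothetical: the plan is assumed to be a dictionary mapping each table to the single table it reads from, and the function name is illustrative rather than part of the AddVariance class.

```python
def variance_method(plan, output_table, eu_table):
    """Pick the variance method from the table that output_table reads from."""
    source = plan[output_table]
    if source == eu_table:
        return "standard"
    # Walk upstream from the EU table; a source found there is below the EU level.
    table = eu_table
    while table in plan:
        table = plan[table]
        if table == source:
            return "delta"
    raise ValueError(f"{output_table} does not reference the EU level or a level below it")


# The example plan from this section, with user as the experiment unit.
plan = {"session": "base", "user": "session", "userOut": "user", "sessionOut": "session"}
```

Under these assumptions, variance_method(plan, "sessionOut", "user") selects the delta method and variance_method(plan, "userOut", "user") selects the standard variance. With the combined plan, combinedOut reads from user, so every metric on it would be treated as EU-level; this is the information that is lost.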

2. Aggregations define levels

In the AddVariance class, we make the simplifying assumption that levels are defined by aggregation tables. Concretely, we say two tables a and b are at the same level if:

There exists a unique path a <- t1 <- ... <- tn <- b,
None of t1, ..., tn, or b is an Aggregation table.
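
The two conditions can be checked with a short Python sketch. The data structures are hypothetical: parents maps each table to the list of tables it reads from, aggregations is the set of Aggregation table names, and the dependency graph is assumed to be acyclic. Which tables count as Aggregation tables in the example is also an assumption.

```python
def same_level(a, b, parents, aggregations):
    """True if exactly one path a <- ... <- b exists and it crosses no Aggregation table."""
    def paths_up(table):
        # Every upstream path from `table` that ends at `a`.
        if table == a:
            return [[a]]
        return [[table] + rest
                for parent in parents.get(table, [])
                for rest in paths_up(parent)]

    paths = paths_up(b)
    if len(paths) != 1:
        return False
    # The path is [b, tn, ..., t1, a]; only `a` itself may be an Aggregation table.
    return not any(t in aggregations for t in paths[0][:-1])


parents = {"session": ["base"], "user": ["session"],
           "userOut": ["user"], "sessionOut": ["session"]}
aggregations = {"session", "user"}
```

In this example user and userOut are at the same level, as are session and sessionOut, while session and userOut are not, because the path between them passes through the aggregation table user.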

In particular, we do not allow dependency structures with more than one aggregation table at the same level, such as the following:

                <- user1
base <- session          <- userOut
                <- user2

                <- user1 <- userOut1
base <- session
                <- user2 <- userOut2

With this assumption, we are able to simplify the logic of adding variance as follows. At various places in the logic, it is necessary to determine the output level tables. We do this by finding the leaf tables, walking back up the tree until we reach an aggregation table, and then finding all the descendants of that aggregation. However, if there are multiple aggregation table parents at the same level, this algorithm fails.
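
The walk described above can be sketched in Python. The representation is hypothetical: parent maps each table to the single table it reads from, and aggregations is the set of aggregation table names. The sketch stops collecting descendants at the next aggregation table. That is an interpretation based on the definition of level given earlier, not something stated explicitly.

```python
def output_level_tables(parent, aggregations):
    """Map each aggregation table reached from a leaf to its same-level descendants."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    leaves = [t for t in parent if t not in children]

    result = {}
    for leaf in leaves:
        # Walk back up the tree until an aggregation table is reached.
        table = leaf
        while table not in aggregations:
            if table not in parent:
                raise ValueError(f"no aggregation table above {leaf}")
            table = parent[table]
        if table in result:
            continue
        # Collect descendants, without crossing into another aggregation table.
        found, stack = set(), list(children.get(table, []))
        while stack:
            t = stack.pop()
            if t in aggregations or t in found:
                continue
            found.add(t)
            stack.extend(children.get(t, []))
        result[table] = found
    return result
```

For the plan base <- session <- user <- userOut with sessionOut reading from session, and session and user assumed to be the aggregation tables, this groups userOut under user and sessionOut under session. In the second disallowed structure above, userOut1 and userOut2 would end up in separate groups even though both are user-level outputs, which illustrates one way the algorithm can fail.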

Future Work

It is possible to deal with the issue of unknown dependency structure outlined above, but doing so requires spotting the right pattern in the metric definition to determine on what level the metric is defined. The current logic for adding variance is sufficiently modular that it will allow us to incorporate this in the future.