Aggregations and their expected behaviour
Owner | Approvers | Participants |
---|---|---|
Sasha Patotski | Dan Miller, Ulf Knoblich | Jing Jin, Pawel Janowski, Aleksandr Fabijan, Widad Machmouchi, Andrew Lohr, Jonathan Litz, Carl Mitchell, Andy Drake, Garnet Vaz |
This RFC (Request For Comment) describes what Mangrove means by "aggregation levels", the constraints, assumptions, and the behaviour that Mangrove users can expect under those constraints and assumptions. Moreover, we give a precise description of how aggregations work in Mangrove, with a focus on handling null values.
Problem
Recently a number of unexpected behaviours were discovered in Foray (the current default code generation framework in the Cosmos pipeline). These behaviours were caused by more than 3 aggregation levels with some of the columns defining the levels containing null values. Because of the mismatch between the user's expectations and the computation results, Mangrove needs to be as precise as possible in communicating the expected behaviour.
Summary of the proposal
In short, we propose the following for Mangrove:
- The result of the computation should not depend on the compute fabric used.
  - Currently there are a number of known small discrepancies between platforms (Foray vs. Hypercube vs. Foray MV vs. Mangrove Hypercube vs. other Mangrove fabrics).
- Allow an arbitrary positive number of aggregation levels.
  - Mangrove Hypercube only allows 4 (although this is not a hard limit; it is easy to add more levels if needed), while other Mangrove fabrics already allow any number. Foray and Foray MV do not officially support more than 3 levels.
- Aggregation levels should be linearly ordered from the least coarse to the most coarse, with the most coarse level being the randomization unit level.
  - This is already the current assumption across all the platforms.
- Allow null values in the aggregation level IDs.
  - Different platforms currently have different behaviour.
- Make no assumptions about the uniqueness of the aggregation level IDs; for example, two different users can have sessions with the same ID, and those would still be treated as different sessions.
  - Foray currently does this; the other platforms have differences.
See below for more details on what exactly is proposed and why. The document describes the state where Mangrove wants to be, not the state it's currently at. Some supported compute fabrics are closer to what's described below than others. The team is working on estimating the necessary work to be done.
Proposed Solution
Precise definition of a level
When creating a metric set, the users are required to provide an ordered list of column names from the schema. These columns are referred to as "aggregation levels", or simply "levels". The order of the levels is from "lowest" level to "highest" level, with the highest level being defined by the column containing the randomization unit IDs.
Note: We do assume that the highest level is the column with the randomization unit IDs. However, we do not assume anything about the lowest level; in particular we do not assume that it contains unique data row identifiers. In fact, we do not assume that a unique row identifier exists in the data. As a corollary, we do not make any assumptions or checks about whether the data has duplicated rows.
The specified level order means the order of "coarseness", with the lowest levels being the least coarse (i.e. the most granular) and the higher levels being more coarse (i.e. less granular). These assumptions are not about the underlying data, but rather about the data's meaning. For the precise assumptions on the underlying data see the section about the data assumptions below. This all will hopefully become clear in Examples 1, 2 and 3 below.
Note: We do not have any limitations on the supported number of aggregation levels, except that there must be at least one (containing the randomization unit ID).
Any metric defined in this metric set must be at one of the levels. The denominator of a metric contains the combinations of values in all the levels of the metric plus all the levels above that level. In particular, the denominator of a metric at the lowest level contains the combinations of values in all the specified levels, while a metric at the highest (i.e. randomization unit) level only has values from the highest level column in the metric's denominator. Let us illustrate this in an example.
Example 1
Let's say the user defines three aggregation levels: `Page` ← `Session` ← `User`. Then a `Page`-level metric is computed out of the unique values of the combination `Page` + `Session` + `User`. So, average PLT per page would be adding up the PLT values for each unique value of the `Page` + `Session` + `User` combination and dividing by the number of these values.
If a metric is at the `Session` level, then the metric would be computed out of unique combinations of `Session` + `User`. Finally, `User`-level metrics are computed per unique value of the `User` column.
Note: a combination of `Page` + `Session` + `User` is not the same as converting these values to `string` and concatenating them. We want to distinguish between a null string and an empty string, so we want to keep the combination `"p"` + `null` + `"u"` separate from `"p"` + `""` + `"u"`. But if we did string concatenation, both would simply give the string `"pu"`, since concatenating a null string is the same as concatenating an empty string. See below for more discussion on how Mangrove will treat null values.
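This distinction can be sketched in a few lines of Python, using `None` to stand in for a null string (the values here are illustrative, not from any real schema):

```python
# Sketch: why combinations must be tuples, not string concatenation.
# None stands in for a null string.

row_a = ("p", None, "u")  # Page="p", Session=null, User="u"
row_b = ("p", "", "u")    # Page="p", Session="",   User="u"

# As tuples, the two combinations stay distinct:
assert row_a != row_b

# But naive concatenation (treating null like an empty string,
# as SQL-style concatenation does) collapses both into "pu":
def concat(parts):
    return "".join(p or "" for p in parts)

assert concat(row_a) == concat(row_b) == "pu"
```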
Note: In particular, this means that the values in the `Page` column don't have to be globally unique to be counted as unique: the same value in the `Page` column for two different `Session` or `User` values would be counted as a different value of `Page` when computing metrics. For example, `Page` can simply enumerate pages for a user, i.e. user `ABC`'s pages `1,2,3,4` are different from user `XYZ`'s pages `1,2,3,4`.
Note: flights are not aggregation levels for Mangrove. Flights are treated as segments. So if an analysis request specifies the flight column as a segment (and for regular AB scorecards this would indeed be the case), the values from the flight column (and other segment values) would also be grouped by. For example, using the notation above, if it were a regular AB scorecard with flights logged in a column called `Flight`, for a `Page`-level metric we would really be grouping by `Page` + `Session` + `User` + `Flight`. If we had other segments, there would be more combinations to group by. Because this RFC is only concerned with aggregation levels, we will not mention segments further.
Semantic correctness of aggregation levels
It is the user's responsibility to ensure that the order of the aggregation levels is semantically correct, i.e. an entity at a level is fully contained in a single parent entity at higher levels. Mangrove will do the aggregations for any set of columns specified by the user as levels, for the order the user specified. However, Mangrove cannot check the meaning of the data in these columns.
In terms of the example above, each `Page` belongs to a single `Session` and each `Session` belongs to a single `User`. Let's consider some concrete examples. In both examples below, the key point is that the aggregations are actually done over the combinations of values in the columns defining the aggregation levels (from the current level up).
Example 2: linear but incorrect order
In Example 1 the order specified assumes that a `User` can have several `Session`s and a `Session` can have multiple `Page`s, but it is not meaningful for a `Page` to belong to several `Session`s or `User`s. If the order were switched to, say, `Session` ← `Page` ← `User`, the calculations would still succeed. However, `Session`-level metrics would be computed out of `Session` + `Page` + `User` combinations, making them effectively `Page`-level metrics.
Example 3: columns with no clear semantic order
For some signals there is no well-defined order. For example, `Session` and `Date` are not necessarily nested: a `Session` can cross the `Date` boundary, and each `Date` can have several `Session`s. Changing the level order of these columns would drastically change what is being computed. For example, if the order specified is `Session` ← `Date` ← `User`, then any metric about `Session`s would be split across `Date`s as well. In other words, a `Session` that crosses the `Date` boundary would be treated as two separate `Session`s. If the order is `Date` ← `Session` ← `User`, then `Session`-level metrics would ignore the `Date`, so sessions won't be split into separate `Session`s if they cross the `Date` boundary. However, in that case any `Date`-level metric would not actually be about the `Date`s, but about `Session`s within those `Date`s (for a given user).
To emphasize: even if some columns have no clear semantic order, a user can still use them as aggregation levels. However, in that case the result may not be what the user expects, since Mangrove will still go through the levels in the specified linear order. Mangrove cannot inspect the underlying data to check that the level order is "meaningful", i.e. going from less to more coarse. The computation will succeed, and Mangrove will perform the aggregations in exactly the order specified by the user. It is up to the user to ensure the level order is what they intended. Even if the columns have no clear semantic order (as in Example 3 above), the user must linearly order them one way or another.
To give an analogy with a compiler, the compiler can check basic correctness of the syntax and can guarantee that the code will be executed exactly as written, but it is up to the programmer to make sure the program does the right thing and does not have bugs.
Note: it could be possible for a user to create Blitz extensions that would use Mangrove tools to do some "data sanity checks". However, these extensions are not part of the scope for this RFC, and we do not currently have plans to support such things in production.
Assumptions about the underlying data
Data type
We require the columns specifying the levels to be of type `string`, which could be `null`. Moreover, we will make the same assumption for segments as well. In the context of segments and aggregation levels, null string values will not behave any differently from other string values, be it the empty string `""` or any other string. There will be differences when operations are applied to null values vs. empty strings; see the Null values section below.
Why allow null values as IDs in aggregation levels?
There are real-life scenarios where having `null` as an aggregation level ID is very natural. For example, if the LiveShare extension for VSCode is running experiments, it is natural to have the following aggregation levels: `EventID` ← `ConversationID` ← `SessionID` ← `UserID`. Here `ConversationID` is logged only for users who actually had conversations using the LiveShare feature; otherwise that value is `null`. Despite having null values in the `ConversationID` field, it is a meaningful aggregation level sequence: a user can have several sessions, each session can have several conversations in it, and each conversation is comprised of events. There are similar examples in other products as well, e.g. Office, Exchange, etc.
Why only string aggregation levels?
There are several reasons, both internal from the implementation point of view and external from the user point of view.
From the user point of view, this is not a huge overhead in terms of compute time and data engineering. At the same time, some data types do not have a canonical string representation, and we want to give the user control over how exactly the data is being converted to string.
Internally, this would give a unified data type for all dimensions (= segments + aggregation levels), and would allow us to manipulate null values in those columns without creating any value collisions (e.g. if we want to replace all null values with a certain fixed value and then filter them out).
Uniqueness
There are no uniqueness assumptions about the underlying data in the columns defining the aggregation levels. In particular, we do not require the IDs defining the levels to be globally unique.
In terms of Example 1 above with the `Page` ← `Session` ← `User` level order, it's OK to have the same `Page` ID of `1` for both user `ABC` and user `XYZ`. These `Page` IDs would be treated as different.
Null values
To summarize the sections below, we need to have a fixed unified way to handle null values regardless of the compute fabric targeted by the user. When fixing the behaviour, we want to align with how U-SQL handles null values, unless we have a strong reason to deviate from that behaviour.
Is null value equal to null value?
In different languages, null values are treated differently when doing arithmetic and logical operations on them. For example, in C# (and hence Scope) `null == null` is true, while in Kusto it is actually false, and `null != null` is false in Kusto as well. Because Mangrove is fabric-agnostic, it needs to fix one behaviour and guarantee that the behaviour will not change between fabrics. To achieve that, Mangrove has two native operations, `IsNull` and `IsNotNull`, for null comparisons. We highly recommend using these operations instead of `== null` and `!= null`. The Mangrove compiler will try to convert the latter into the former, but it cannot guarantee the right behaviour in all cases.
For example, in Scope one can define two custom functions (externs) `f1`, `f2` that could both produce null values. Then Mangrove has no way to modify the expression `f1 == f2` to ensure that whenever null values are compared it is done via `IsNull` or `IsNotNull`, since Mangrove has no way to understand the internals of these two functions.
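The intended, fabric-independent semantics of `IsNull` / `IsNotNull` can be sketched in Python, with `None` standing in for null (the helper names mirror the RFC's operations; this is an illustration, not Mangrove's implementation):

```python
# Sketch of the fabric-independent IsNull / IsNotNull semantics.
# Python's None stands in for null.

def is_null(x):
    return x is None

def is_not_null(x):
    return x is not None

# Deterministic regardless of how a fabric defines null == null:
assert is_null(None) and not is_null("")
assert is_not_null("") and not is_not_null(None)

# By contrast, fabrics disagree on raw comparison:
#   C#/Scope: null == null -> true
#   Kusto:    null == null -> false, and null != null -> false too
```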
Arithmetic and aggregations on null values
Before going into how Mangrove handles null values in aggregation levels, we need to fix the behaviour of the arithmetic and aggregation of null values.
Let's consider the following data as an illustration. The important part is that `NullColumn` has only null values, while `Column` has both null and non-null values.
NullColumn | Column |
---|---|
null | 0 |
null | 0 |
null | null |
null | 1 |
null | 1 |
The results of all the supported aggregations on these two columns will be as follows:
Aggregation | Result on NullColumn | Result on Column | Comments |
---|---|---|---|
Count | 0 | 4 | Null values ignored, but if all are null then 0 |
Sum | 0 | 0+0+1+1 = 2 | Null values ignored, but if all are null then 0 |
Min | null | 0 | Null values ignored, but if all are null then null |
Max | null | 1 | Null values ignored, but if all are null then null |
DCount | 0 | 2 | Null values ignored, but if all are null then 0 |
Avg | null | (0+0+1+1)/4 = 0.5 | Null values ignored, but if all are null then null |
Percentile(-,0.75) | null | 1 | Null values ignored, but if all are null then null |
Note: such behaviour is consistent with most SQL-like languages, e.g. SparkSQL, Kusto, Scope.
Note: to emphasize, the aggregations listed above are MDL aggregations, not aggregations from any specific fabric. In particular, `Count` is a unary aggregation, meaning there is no parameterless `Count()` like there is in Scope; it is always `Count(expression)`.
Corollary: dummy aggregations matter
The behaviours described above imply the following. If the user defines a "dummy" aggregation level, then the choice of "dummy" aggregations between levels will matter. By a "dummy" aggregation we mean an aggregation where there is a 1-to-1 correspondence between the values in the data on the previous level and that level. For example, if `RowID` is a column of row IDs, then defining `RowID` to be the first aggregation level would make it a "dummy" one: there is no real aggregation happening on that level, since in the "group by" statement no values would actually be grouped.
Such "dummy" aggregations are very common on our platform because Foray had this as a requirement: there, metrics needed to have a dummy aggregation. For example, one could not define `Avg(BaseLevelColumn)`; one had to first add a "dummy" aggregation on the `RowID` column, i.e. `Avg(Max<RowID>(BaseLevelColumn))` or `Avg(Sum<RowID>(BaseLevelColumn))`. Mangrove no longer requires such dummy aggregations, but it still needs to deal with that legacy structure.
In Foray, it did not matter which "dummy" operation was used; it would be ignored either way. In Mangrove, the choice of dummy aggregation will matter. For example, if `Column` has null values, then `Max<RowID>(Column)` and `Sum<RowID>(Column)` would give different results: `Max<RowID>` would keep the null values, while `Sum<RowID>` would convert nulls into zeros.
We are working on the migration plan to avoid any possible issues caused by this discrepancy between the two code generation frameworks.
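The `Max<RowID>` vs. `Sum<RowID>` discrepancy follows directly from the aggregation table above, as this Python sketch shows (with `None` for null and each `RowID` group holding exactly one value, as a "dummy" level implies; the IDs are made up):

```python
# Sketch: with a "dummy" 1-to-1 level on RowID, the choice of dummy
# aggregation changes how nulls propagate. None stands in for null.

def sum_group(xs):
    vals = [x for x in xs if x is not None]
    return sum(vals)  # all-null group -> 0

def max_group(xs):
    vals = [x for x in xs if x is not None]
    return max(vals) if vals else None  # all-null group -> null

# Each RowID group contains exactly one value (the "dummy" level):
column_by_rowid = {"r1": [0], "r2": [None], "r3": [1]}

max_results = [max_group(v) for v in column_by_rowid.values()]
sum_results = [sum_group(v) for v in column_by_rowid.values()]

assert max_results == [0, None, 1]  # Max<RowID> keeps the null
assert sum_results == [0, 0, 1]     # Sum<RowID> turns the null into 0
```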
Levels with null values
When grouping by a column containing null values, the null value is treated like a regular value, and would not be filtered away.
For example, if the column `Page` from the example above contains null values, then in the expression `Avg<Page>(Revenue)` the null values will not be filtered out, because these values are being grouped by; while in the expression `Count<Session>(Page)` the null values of `Page` will be ignored, because those values are being aggregated (see the previous section for the exact behaviour of `Count`).
In particular, when defining metrics at a level with possible null values in the corresponding column, the metrics could potentially have "outliers", since all the rows with a null value would be grouped together. Using the example above, if `Page` had null values, then all of them within the same `Session` + `User` combination would be treated as the same `Page`.
For example, if the user needs a metric to ignore the null values in `Page` when computing the metric `Avg<Page>(Revenue)`, they would need to filter them out explicitly: `Avg<Page>(Page != null ? Revenue : null)`.
Note: for columns of type `string` we will distinguish between null values and empty string values.
Note: the proposed behaviour means that the current implementation of QCNT would include null values. This could easily be changed by applying the alternative definition, i.e. instead of `Sum(Max<Level>(1))` do `Count(Max<Level>(Level))`.
Why such behaviour for null values?
Recursive aggregations are recursive
An aggregation `Agg` is called recursive if for any two levels `A,B` with `A >= B` in the level hierarchy and any column `x`, the following identity holds: `Agg<A>(Agg<B>(x)) = Agg<A>(x)`.
One key observation is that we want the aggregations of recursive type, e.g. `Sum`, `Max`, `Min`, to actually be recursive aggregations according to the definition above. In terms of Example 1 above, this means, for instance, that for any expression `x` we want the identity `Sum<User>(Sum<Session>(x)) = Sum<User>(x)` to hold. If `Sum<Session>` were to sum only over non-null values of `Session`, the identity above would be broken: rows with a null session value would be filtered out from the sum on the left, but would not be filtered out on the right whenever the `User` value is non-null.
The assumption of recursiveness of the aggregations is one of the key assumptions about the structure of computations in Mangrove. This is the main reason we decided to treat null values as described above.
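The recursion argument can be checked with a short Python sketch over `(Session, User, x)` rows, one of which has a null session ID (the data is made up; `None` stands in for null):

```python
# Sketch: keeping null group keys makes Sum recursive, i.e.
# Sum<User>(Sum<Session>(x)) == Sum<User>(x), even with null Session IDs.
from collections import defaultdict

rows = [("s1", "u1", 1), ("s1", "u1", 2), (None, "u1", 4)]

def sum_user_of_sum_session(rows, keep_null_keys):
    by_session = defaultdict(int)          # inner Sum<Session>
    for s, u, x in rows:
        if keep_null_keys or s is not None:
            by_session[(s, u)] += x
    by_user = defaultdict(int)             # outer Sum<User>
    for (s, u), x in by_session.items():
        by_user[u] += x
    return dict(by_user)

direct = {"u1": sum(x for _, _, x in rows)}  # Sum<User>(x) directly

# Null session kept as a group key: the identity holds.
assert sum_user_of_sum_session(rows, keep_null_keys=True) == direct

# Null session filtered out in the inner sum: the identity breaks.
assert sum_user_of_sum_session(rows, keep_null_keys=False) == {"u1": 3}
```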
Similar to SQL behaviour
Such behaviour of distinguishing null values between grouping by them and doing arithmetic on them is very common among SQL-like languages, so it was a natural choice.
Non-linearly ordered levels
Mangrove has no plans to support non-linear order structures for aggregation levels. Currently, there is no way for a user to define anything but a linear sequence of levels when creating a metric set, and this is not going to change: Mangrove will accept a linear order of levels, and trying to define anything but a linear order would result in compilation failure (in MDL compiler, before it even gets to Mangrove).
To use the notation from Example 3 above, `Session` and `Date` are not meaningfully nested: a `Session` can spread across different `Date`s and each `Date` can have many `Session`s in it. What we are saying here is that there is no mechanism for the user to communicate this uncertainty to Mangrove: Mangrove expects the user to put the levels into an ordered list, and Mangrove will do the aggregations using that order.
Diamonds
For example, one might want to have a "diamond"-shaped structure of levels, with levels `A,B,C,D` related as `A` ← `B` ← `D` and `A` ← `C` ← `D`. For example, one might want 4 aggregation levels `Page`, `Session`, `Day` and `User`, with no relation between `Session` and `Day`: a session can span many days, and a day can have many sessions.
We will not support such a scenario, for several reasons; resolving them would require a significant re-architecture of the Mangrove object model:
- It becomes unclear how to choose columns to group by at the unrelated nodes. Let's say we are computing a `Page`-level metric `PageCrashRate = Avg(Max<Page>(Crash))`. In the select statement for the `Page` table, it is unclear which aggregands to pick. Picking all 4 would force the aggregation to be linear, since grouping would be done by `Page` + `Session` + `Day` + `User`, which means that by the time we are computing the `Max<Page>(Crash)` part we have already combined `Session` and `Day`.
- It would put restrictions on which metrics could be computed for which segments. In the example above, we would not be able to compute `Date`-level metrics for `Session`-level segments, or vice versa.
For this scenario, we recommend that users create two separate metric sets, with aggregation level structures `A` ← `B` ← `D` and `A` ← `C` ← `D`.
Trees
The issue discussed in the Diamonds section about metrics at incomparable levels and aggregands in the parent tables would not arise if the aggregation levels formed a tree (since there would be no shared parent table). For example, if we had levels `A,B,C` with dependencies `A` ← `C` and `B` ← `C`, then a metric at level `A` would need to have `A,C` as aggregands; at level `B` --- `B` and `C`; and at level `C` --- only `C` as aggregand. This avoids the main issue of the Diamonds section, since here for every metric there is a unique path through the aggregation levels. However, this scenario would still cause the issue of not all metrics being available for all segments. This might be a desired scenario in some cases, but for now we decided not to consider such structures.
Note: this does not mean that Mangrove will not support joining additional data at arbitrary levels, e.g. side stream triggering. Here we only talk about the structure of the aggregation levels.