Aggregations and their expected behaviour
Owner | Approvers | Participants |
---|---|---|
Sasha Patotski | Dan Miller, Ulf Knoblich | Jing Jin, Pawel Janowski, Aleksandr Fabijan, Widad Machmouchi, Andrew Lohr, Jonathan Litz, Carl Mitchell, Andy Drake, Garnet Vaz |
This RFC (Request For Comment) describes what Mangrove means by "aggregation levels", the constraints, assumptions, and the behaviour that Mangrove users can expect under those constraints and assumptions. Moreover, we give a precise description of how aggregations work in Mangrove, with a focus on handling null values.
Problem
Recently a number of unexpected behaviours were discovered in Foray (the current default code generation framework in the Cosmos pipeline). These behaviours were caused by more than 3 aggregation levels with some of the columns defining the levels containing null values. Because of the mismatch between the user's expectations and the computation results, Mangrove needs to be as precise as possible in communicating the expected behaviour.
Summary of the proposal
In short, we propose the following for Mangrove:
- The result of the computation should not depend on the compute fabric used.
  - Currently there are a number of known small discrepancies between platforms (Foray vs. Hypercube vs. Foray MV vs. Mangrove Hypercube vs. other Mangrove fabrics).
- Allow an arbitrary positive number of aggregation levels.
  - Mangrove Hypercube only allows 4 (although this is not a hard limit; it is easy to add more levels if needed), while other Mangrove fabrics already allow any number. Foray and Foray MV do not officially support more than 3 levels.
- Aggregation levels should be linearly ordered from the least coarse to the most coarse, with the most coarse level being the randomization unit level.
  - This is already the current assumption across all the platforms.
- Allow null values in the aggregation level IDs.
  - Different platforms currently have different behaviour.
- Make no assumptions about the uniqueness of the aggregation level IDs; for example, two different users can have sessions with the same ID, and those would still be treated as different sessions.
  - Foray currently does this; the other platforms have differences.
See below for more details on what exactly is proposed and why. The document describes the state where Mangrove wants to be, not the state it's currently at. Some supported compute fabrics are closer to what's described below than others. The team is working on estimating the necessary work to be done.
Proposed Solution
Precise definition of a level
When creating a metric set, the users are required to provide an ordered list of column names from the schema. These columns are referred to as "aggregation levels", or simply "levels". The order of the levels is from "lowest" level to "highest" level, with the highest level being defined by the column containing the randomization unit IDs.
Note: We do assume that the highest level is the column with the randomization unit IDs. However, we do not assume anything about the lowest level; in particular we do not assume that it contains unique data row identifiers. In fact, we do not assume that a unique row identifier exists in the data. As a corollary, we do not make any assumptions or checks about whether the data has duplicated rows.
The specified level order means the order of "coarseness", with the lowest levels being the least coarse (i.e. the most granular) and the higher levels being more coarse (i.e. less granular). These assumptions are not about the underlying data, but rather about the data's meaning. For the precise assumptions on the underlying data see the section about the data assumptions below. This all will hopefully become clear in Examples 1, 2 and 3 below.
Note: We do not have any limitations on the supported number of aggregation levels, except that there must be at least one (containing the randomization unit ID).
Any metric defined in this metric set must be at one of the levels. The denominator of a metric contains the combinations of values in all the levels of the metric plus all the levels above that level. In particular, the denominator of a metric at the lowest level contains the combinations of values in all the specified levels, while a metric at the highest (i.e. randomization unit) level only has values from the highest level column in the metric's denominator. Let us illustrate this in an example.
Example 1
Let's say the user defines three aggregation levels: `Page` ← `Session` ← `User`. Then a `Page`-level metric is computed out of the unique values of the combination `Page` + `Session` + `User`. So, average PLT per page would be adding up the PLT values for each unique value of the `Page` + `Session` + `User` combination and dividing by the number of these values.
If a metric is at the `Session` level, then the metric would be computed out of unique combinations of `Session` + `User`. Finally, `User`-level metrics are computed per unique value of the `User` column.
Note: a combination of `Page` + `Session` + `User` is not the same as converting these values to `string` and concatenating them. We want to distinguish between a null string and an empty string, so we want to keep the combination `"p"` + `null` + `"u"` separate from `"p"` + `""` + `"u"`. But if we did string concatenation, both would simply give the string `"pu"`, since concatenating a null string is the same as concatenating an empty string. See below for more discussion on how Mangrove will treat null values.
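This distinction can be sketched in a few lines of Python, using `None` to stand in for a null string (the values here are illustrative, not from any real schema):

```python
# Sketch: why combinations must be tuples, not string concatenation.
# None stands in for a null string.

row_a = ("p", None, "u")  # Page="p", Session=null, User="u"
row_b = ("p", "", "u")    # Page="p", Session="",   User="u"

# As tuples, the two combinations stay distinct:
assert row_a != row_b

# But naive concatenation (treating null like an empty string,
# as SQL-style concatenation does) collapses both into "pu":
def concat(parts):
    return "".join(p or "" for p in parts)

assert concat(row_a) == concat(row_b) == "pu"
```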
Note: In particular, this means that the values in the `Page` column don't have to be globally unique to be counted as unique: the same value in the `Page` column for two different `Session` or `User` values would be counted as a different value of `Page` when computing metrics. For example, `Page` can simply enumerate pages for a user, i.e. user `ABC`'s pages `1,2,3,4` are different from user `XYZ`'s pages `1,2,3,4`.
Note: flights are not aggregation levels for Mangrove. Flights are treated as segments. So if an analysis request specifies the flight column as a segment (and for regular AB scorecards this would indeed be the case), the values from the flight column (and other segment values) would also be grouped by. For example, using the notation above, if it were a regular AB scorecard with flights logged in a column called `Flight`, for a `Page`-level metric we would really be grouping by `Page` + `Session` + `User` + `Flight`. If we had other segments, there would be more combinations to group by. Because this RFC is only concerned with aggregation levels, we will not mention segments further.
Semantic correctness of aggregation levels
It is the user's responsibility to ensure that the order of the aggregation levels is semantically correct, i.e. an entity at a level is fully contained in a single parent entity at higher levels. Mangrove will do the aggregations for any set of columns specified by the user as levels, for the order the user specified. However, Mangrove cannot check the meaning of the data in these columns.
In terms of the example above, each `Page` belongs to a single `Session` and each `Session` belongs to a single `User`. Let's consider some concrete examples. In both examples below, the key point is that the aggregations are actually done over the combinations of values in the columns defining the aggregation levels (from the current level up).
Example 2: linear but incorrect order
In Example 1 the order specified assumes that a `User` can have several `Session`s and a `Session` can have multiple `Page`s, but it is not meaningful for a `Page` to belong to several `Session`s or `User`s. If the order were switched to, say, `Session` ← `Page` ← `User`, the calculations would still succeed. However, `Session`-level metrics would be computed out of `Session` + `Page` + `User` combinations, making them effectively `Page`-level metrics.
Example 3: columns with no clear semantic order
For some signals there is no well-defined order. For example, `Session` and `Date` are not necessarily nested: a `Session` can cross the `Date` boundary, and each `Date` can have several `Session`s. Changing the level order of these columns would drastically change what is being computed. For example, if the order specified is `Session` ← `Date` ← `User`, then any metric about `Session`s would be split across `Date`s as well. In other words, a `Session` that crosses the `Date` boundary would be treated as two separate `Session`s. If the order is `Date` ← `Session` ← `User`, then `Session`-level metrics would ignore the `Date`, so sessions won't be split into separate `Session`s if they cross the `Date` boundary. However, in that case any `Date`-level metric would not actually be about the `Date`s, but about `Session`s within those `Date`s (for a given user).
To emphasize: even if some columns have no clear semantic order, a user can still use them as aggregation levels. However, in that case the result may not be what the user expects, since Mangrove will still go through the levels in the specified linear order. Mangrove cannot inspect the underlying data to check that the level order is "meaningful", i.e. going from less to more coarse. The computation will succeed, and Mangrove will perform the aggregations in exactly the order specified by the user. It is up to the user to ensure the level order is what they intended. Even if the columns have no clear semantic order (as in Example 3 above), the user must linearly order them one way or another.
To give an analogy with a compiler, the compiler can check basic correctness of the syntax and can guarantee that the code will be executed exactly as written, but it is up to the programmer to make sure the program does the right thing and does not have bugs.
Note: it could be possible for a user to create Blitz extensions that would use Mangrove tools to do some "data sanity checks". However, these extensions are not part of the scope for this RFC, and we do not currently have plans to support such things in production.
Assumptions about the underlying data
Data type
We require the columns specifying the levels to be of type `string`, which could be `null`. Moreover, we will make the same assumption for segments as well. In the context of segments and aggregation levels, null string values will not behave any differently from other string values, be it the empty string `""` or any other string. There will be differences when operations are applied to null values vs. empty strings; see the Null values section below.
Why allow null values as IDs in aggregation levels?
There are real-life scenarios where having `null` as an aggregation level ID is very natural. For example, if the LiveShare extension for VSCode is running experiments, it is natural to have the following aggregation levels: `EventID` ← `ConversationID` ← `SessionID` ← `UserID`. Here `ConversationID` is logged only for users who actually had conversations using the LiveShare feature; otherwise that value is `null`. Despite having null values in the `ConversationID` field, it is a meaningful aggregation level sequence: a user can have several sessions, each session can have several conversations in it, and each conversation is comprised of events. There are similar examples in other products as well, e.g. Office, Exchange, etc.
Why only string aggregation levels?
There are several reasons, both internal from the implementation point of view and external from the user point of view.
From the user point of view, this is not a huge overhead in terms of compute time and data engineering. At the same time, some data types do not have a canonical string representation, and we want to give the user control over how exactly the data is being converted to string.
Internally, this would give a unified data type for all dimensions (= segments + aggregation levels), and would allow us to manipulate null values in those columns without creating any value collisions (e.g. if we want to replace all null values with a certain fixed value and then filter them out).
Uniqueness
There are no uniqueness assumptions about the underlying data in the columns defining the aggregation levels. In particular, we do not require the IDs defining the levels to be globally unique.
In terms of Example 1 above with the `Page` ← `Session` ← `User` level order, it's OK to have the same `Page` ID of `1` for both user `ABC` and user `XYZ`. These `Page` IDs would be treated as different.
Null values
To summarize the sections below, we need to have a fixed unified way to handle null values regardless of the compute fabric targeted by the user. When fixing the behaviour, we want to align with how U-SQL handles null values, unless we have a strong reason to deviate from that behaviour.
Is null value equal to null value?
In different languages, null values are treated differently when doing arithmetic and logical operations on them. For example, in C# (and hence Scope) `null == null` is true, while in Kusto it is actually false, and `null != null` is false in Kusto as well. Because Mangrove is fabric-agnostic, it needs to fix one behaviour and guarantee that the behaviour will not change between fabrics. To achieve that, Mangrove has two native operations, `IsNull` and `IsNotNull`, for null comparisons. We highly recommend using these operations instead of `== null` and `!= null`. The Mangrove compiler will try to convert the latter into the former, but it cannot guarantee the right behaviour in all cases.
For example, in Scope one can define two custom functions (externs) `f1`, `f2` that could both produce null values. Then Mangrove has no way to modify the expression `f1 == f2` to ensure that whenever null values are compared it is done via `IsNull` or `IsNotNull`, since Mangrove has no way to understand the internals of these two functions.
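The intended, fabric-independent semantics of `IsNull` / `IsNotNull` can be sketched in Python, with `None` standing in for null (the helper names mirror the RFC's operations; this is an illustration, not Mangrove's implementation):

```python
# Sketch of the fabric-independent IsNull / IsNotNull semantics.
# Python's None stands in for null.

def is_null(x):
    return x is None

def is_not_null(x):
    return x is not None

# Deterministic regardless of how a fabric defines null == null:
assert is_null(None) and not is_null("")
assert is_not_null("") and not is_not_null(None)

# By contrast, fabrics disagree on raw comparison:
#   C#/Scope: null == null -> true
#   Kusto:    null == null -> false, and null != null -> false too
```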
Arithmetic and aggregations on null values
Before going into how Mangrove handles null values in aggregation levels, we need to fix the behaviour of the arithmetic and aggregation of null values.
Let's consider the following data as an illustration. The important part is that `NullColumn` has only null values, while `Column` has both null and non-null values.
NullColumn | Column |
---|---|
null | 0 |
null | 0 |
null | null |
null | 1 |
null | 1 |
The results of all the supported aggregations on these two columns will be as follows:
Aggregation | Result on NullColumn | Result on Column | Comments |
---|---|---|---|
Count | 0 | 4 | Null values ignored, but if all are null then 0 |
Sum | 0 | 0+0+1+1 = 2 | Null values ignored, but if all are null then 0 |
Min | null | 0 | Null values ignored, but if all are null then null |
Max | null | 1 | Null values ignored, but if all are null then null |
DCount | 0 | 2 | Null values ignored, but if all are null then 0 |
Avg | null | (0+0+1+1)/4 = 0.5 | Null values ignored, but if all are null then null |
Percentile(-,0.75) | null | 1 | Null values ignored, but if all are null then null |
Note: such behaviour is consistent with most SQL-like languages, e.g. SparkSQL, Kusto, Scope.
Note: to emphasize, the aggregations listed above are MDL aggregations, not aggregations from any specific fabric. In particular, `Count` is a unary aggregation, meaning there is no parameterless `Count()` like there is in Scope; it is always `Count(expression)`.
Corollary: dummy aggregations matter
The behaviours described above imply the following. If the user defines a "dummy" aggregation level, then the choice of "dummy" aggregations between levels will matter. By a "dummy" aggregation we mean an aggregation where there is a 1-to-1 correspondence between the values in the data on the previous level and that level. For example, if `RowID` is a column of row IDs, then defining `RowID` to be the first aggregation level would make it a "dummy" one: there is no real aggregation happening on that level, since in the "group by" statement no values would actually be grouped.
Such "dummy" aggregations are very common on our platform because Foray had this as a requirement: there, metrics needed to have a dummy aggregation. For example, one could not define `Avg(BaseLevelColumn)`; one had to first add a "dummy" aggregation on the `RowID` column, i.e. `Avg(Max<RowID>(BaseLevelColumn))` or `Avg(Sum<RowID>(BaseLevelColumn))`. Mangrove no longer requires such dummy aggregations, but it still needs to deal with that legacy structure.
In Foray, it did not matter which "dummy" operation was used; it would be ignored either way. In Mangrove, the choice of dummy aggregation will matter. For example, if `Column` has null values, then `Max<RowID>(Column)` and `Sum<RowID>(Column)` would give different results: `Max<RowID>` would keep the null values, while `Sum<RowID>` would convert nulls into zeros.
We are working on the migration plan to avoid any possible issues caused by this discrepancy between the two code generation frameworks.
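The `Max<RowID>` vs. `Sum<RowID>` discrepancy follows directly from the aggregation table above, as this Python sketch shows (with `None` for null and each `RowID` group holding exactly one value, as a "dummy" level implies; the IDs are made up):

```python
# Sketch: with a "dummy" 1-to-1 level on RowID, the choice of dummy
# aggregation changes how nulls propagate. None stands in for null.

def sum_group(xs):
    vals = [x for x in xs if x is not None]
    return sum(vals)  # all-null group -> 0

def max_group(xs):
    vals = [x for x in xs if x is not None]
    return max(vals) if vals else None  # all-null group -> null

# Each RowID group contains exactly one value (the "dummy" level):
column_by_rowid = {"r1": [0], "r2": [None], "r3": [1]}

max_results = [max_group(v) for v in column_by_rowid.values()]
sum_results = [sum_group(v) for v in column_by_rowid.values()]

assert max_results == [0, None, 1]  # Max<RowID> keeps the null
assert sum_results == [0, 0, 1]     # Sum<RowID> turns the null into 0
```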
Levels with null values
When grouping by a column containing null values, the null value is treated like a regular value, and would not be filtered away.
For example, if the column `Page` from the example above contains null values, then in the expression `Avg<Page>(Revenue)` the null values will not be filtered out, because these values are being grouped by; while in the expression `Count<Session>(Page)` the null values of `Page` will be ignored, because those values are being aggregated (see the previous section for the exact behaviour of `Count`).
In particular, when defining metrics at a level with possible null values in the corresponding column, the metrics could potentially have "outliers", since all the rows with a null value would be grouped together. Using the example above, if `Page` had null values, then all of them within the same `Session` + `User` combination would be treated as the same `Page`.
For example, if the user needs a metric to ignore the null values in `Page` when computing the metric `Avg<Page>(Revenue)`, they would need to filter them out explicitly: `Avg<Page>(Page != null ? Revenue : null)`.
Note: for columns of type `string` we will distinguish between null values and empty string values.
Note: the proposed behaviour means that the current implementation of QCNT would include null values. This could easily be changed by applying the alternative definition, i.e. instead of `Sum(Max<Level>(1))` do `Count(Max<Level>(Level))`.
Why such behaviour for null values?
Recursive aggregations are recursive
An aggregation `Agg` is called recursive if for any two levels `A,B` with `A >= B` in the level hierarchy and any column `x`, the following identity holds: `Agg<A>(Agg<B>(x)) = Agg<A>(x)`.
One key observation is that we want the aggregations of recursive type, e.g. `Sum`, `Max`, `Min`, to actually be recursive aggregations according to the definition above. In terms of Example 1 above, this means, for instance, that for any expression `x` we want the identity `Sum<User>(Sum<Session>(x)) = Sum<User>(x)` to hold. If `Sum<Session>` were to sum only over non-null values of `Session`, the identity above would be broken: rows with a null session value would be filtered out from the sum on the left, but would not be filtered out on the right whenever the `User` value is non-null.
The assumption of recursiveness of the aggregations is one of the key assumptions about the structure of computations in Mangrove. This is the main reason we decided to treat null values as described above.
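The recursion argument can be checked with a short Python sketch over `(Session, User, x)` rows, one of which has a null session ID (the data is made up; `None` stands in for null):

```python
# Sketch: keeping null group keys makes Sum recursive, i.e.
# Sum<User>(Sum<Session>(x)) == Sum<User>(x), even with null Session IDs.
from collections import defaultdict

rows = [("s1", "u1", 1), ("s1", "u1", 2), (None, "u1", 4)]

def sum_user_of_sum_session(rows, keep_null_keys):
    by_session = defaultdict(int)          # inner Sum<Session>
    for s, u, x in rows:
        if keep_null_keys or s is not None:
            by_session[(s, u)] += x
    by_user = defaultdict(int)             # outer Sum<User>
    for (s, u), x in by_session.items():
        by_user[u] += x
    return dict(by_user)

direct = {"u1": sum(x for _, _, x in rows)}  # Sum<User>(x) directly

# Null session kept as a group key: the identity holds.
assert sum_user_of_sum_session(rows, keep_null_keys=True) == direct

# Null session filtered out in the inner sum: the identity breaks.
assert sum_user_of_sum_session(rows, keep_null_keys=False) == {"u1": 3}
```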
Similar to SQL behaviour
Such behaviour of distinguishing null values between grouping by them and doing arithmetic on them is very common among SQL-like languages, so it was a natural choice.
Non-linearly ordered levels
Mangrove has no plans to support non-linear order structures for aggregation levels. Currently, there is no way for a user to define anything but a linear sequence of levels when creating a metric set, and this is not going to change: Mangrove will accept a linear order of levels, and trying to define anything but a linear order would result in compilation failure (in MDL compiler, before it even gets to Mangrove).
To use the notation from Example 3 above, `Session` and `Date` are not meaningfully nested: a `Session` can spread across different `Date`s and each `Date` can have many `Session`s in it. What we are saying here is that there is no mechanism for the user to communicate this uncertainty to Mangrove: Mangrove expects the user to put the levels into an ordered list, and Mangrove will do the aggregations using that order.
Diamonds
For example, one might want to have a "diamond"-shaped structure of levels, with levels `A,B,C,D` related as `A` ← `B` ← `D` and `A` ← `C` ← `D`. For example, one might want 4 aggregation levels `Page`, `Session`, `Day` and `User`, with no relation between `Session` and `Day`: a session can span many days, and a day can have many sessions.
We will not support such a scenario, for several reasons; resolving them would require a significant re-architecture of the Mangrove object model:
- It becomes unclear how to choose columns to group by at the unrelated nodes. Let's say we are computing a `Page`-level metric `PageCrashRate = Avg(Max<Page>(Crash))`. In the select statement for the `Page` table, it is unclear which aggregands to pick. Picking all 4 would force the aggregation to be linear, since grouping would be done by `Page` + `Session` + `Day` + `User`, which means that by the time we are computing the `Max<Page>(Crash)` part we have already combined `Session` and `Day`.
- It would put restrictions on which metrics could be computed for which segments. In the example above, we would not be able to compute `Date`-level metrics for `Session`-level segments, or vice versa.
For this scenario, we recommend that users create two separate metric sets, with aggregation level structures `A` ← `B` ← `D` and `A` ← `C` ← `D`.
Trees
The issue discussed in the Diamonds section about metrics at incomparable levels and aggregands in the parent tables would not arise if the aggregation levels formed a tree (since there would be no shared parent table). For example, if we had levels `A,B,C` with dependencies `A` ← `C` and `B` ← `C`, then a metric at level `A` would need to have `A,C` as aggregands; at level `B` --- `B` and `C`; and at level `C` --- only `C` as aggregand. This avoids the main issue of the Diamonds section, since here for every metric there is a unique path through the aggregation levels. However, this scenario would still cause the issue of not all metrics being available for all segments. This might be a desired scenario in some cases, but for now we decided not to consider such structures.
Note: this does not mean that Mangrove will not support joining additional data at arbitrary levels, e.g. side stream triggering. Here we only talk about the structure of the aggregation levels.