Expressions metadata

Owner	Approvers	Participants
Sasha Patotski	Craig Boucher, Guangnan Shi, Scott Wang, Luffy Chen	Johnny Chan, Manish Kumar

There are several known scenarios (see below) where the existing structure of the MetricsPlan object is not enough to describe the computation completely and reliably. This RFC (Request for Comments) describes an enrichment of the core object model that addresses some of these issues.

Problem statement

In the most general terms, the issue is as follows. Some information is conceptually part of a MetricsPlan and needs to be preserved throughout the sequence of transformations. However, such information can be run-time configuration, and therefore cannot be made part of the MetricsPlan itself.

The issue will hopefully become clear with examples.

Timestamp columns. A data source can have several columns, of different data types and formats, that contain timestamp information. Different analyses can require different columns, so the choice is runtime information; in other words, a ComputationConfig should provide it. However, the timestamp is a key column for several transformers and emitters (e.g. the Hypercube and MV emitters, or the AddTrigger transformer, which needs to order the data by time inside the windowing function). Those transformers and emitters must therefore be able to identify the correct timestamp column. Key problem: columns get renamed all the time, and there is no way to ensure that the config information stays relevant throughout the transformations.
Flight column. Essentially the same issue as above. There can be several flight columns, and which one to use is a runtime decision.
Enrichers. By design, enrichers require a specific set of ColumnReferences with a fixed set of names. Moreover, enrichers can only produce ColumnReferences with a fixed set of names. Unfortunately, in Mangrove ColumnReferences can be changed, renamed, moved, etc., so there is no guarantee that the columns required by the enricher will still be present and still have the correct names by the time the emitter is applied. There is also no clear way to enforce such "name stability".

Intrinsic properties of MetricsPlan vs. runtime information

Currently in ExP, information like the flight column or the timestamp column is an intrinsic property of the metric set, specified when the metric set is created. This RFC, however, treats these columns as part of the runtime configuration, specified per analysis. How do we reconcile the two approaches?

The answer is that we do not want to limit Mangrove's flexibility. A user who onboards to ExP will get the "standard" approach to code generation. However, we want to keep open the possibility of using the Mangrove native parser + Blitz to generate code from yaml metric sets. For such users, it would be much easier to specify the timestamp, flight, etc. columns simply by name in the config.

Proposal

Key idea

The key idea is to add "metadata" to certain expressions that would be preserved throughout the transformations. More concretely, ColumnReferences already have a Metadata field. So far, it has not been used in our code. We propose re-defining what this field is, without changing the object model.

Specifically, we suggest introducing an IExpressionMetadata interface and several metadata types implementing it. The key-value pairs in ColumnReference.Metadata would be mappings nameof(T) => SerializedObjectOfTypeT.

What it would look like

The IExpressionMetadata interface can be empty (a pure marker interface) to keep it as generic as possible.

The usage would be something like

if(expr.HasMetadata<T>(out var metadata))
{
  // do something with expr and metadata
}

where HasMetadata would be something like

public static bool HasMetadata<T>(this ColumnReference expr, out T metadata)
  where T: IExpressionMetadata
{
  // check if expr.Metadata.TryParseJson<T>() is successful;
  // on success, assign the parsed object to metadata and return true
}

The type T could be FlightMetadata, or EnricherMetadata, or TimestampMetadata, or anything else. For example, TimestampMetadata could contain the timestamp format string. FlightMetadata could contain flight names, regex for parsing the flight, etc.

In particular, EnricherMetadata could contain the list of expected column names for this enricher.

public class EnricherMetadata: IExpressionMetadata
{
  public IEnumerable<string> GroupColumns { get; }
  public IEnumerable<string> RequiredColumns { get; }
  public IEnumerable<string> EnrichedColumns { get; }
  // mapping sortKey => isAscendingSortOrder
  public IDictionary<string, bool> SortKeys { get; }
}
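
The lookup described in this section can be sketched end to end as follows. This is only an illustration under stated assumptions, not the Mangrove implementation: the ColumnReference class below is a stand-in with a string-to-string Metadata dictionary, the properties on TimestampMetadata and FlightMetadata are hypothetical, WithMetadata is a hypothetical helper, and System.Text.Json takes the place of the TryParseJson call mentioned above. Note that the key is computed with typeof(T).Name, because nameof(T) applied to a generic type parameter evaluates to the string "T".

```csharp
#nullable enable
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Text.Json;

// Marker interface from the proposal; intentionally empty.
public interface IExpressionMetadata { }

// Hypothetical metadata types; the properties are illustrative only.
public class TimestampMetadata : IExpressionMetadata
{
    public string Format { get; set; } = "";
}

public class FlightMetadata : IExpressionMetadata
{
    public string FlightRegex { get; set; } = "";
}

// Stand-in for Mangrove's ColumnReference (assumed shape, not the real class).
public class ColumnReference
{
    public string Name { get; set; } = "";
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class ExpressionMetadataExtensions
{
    // nameof(T) on a type parameter yields "T", so use the runtime type name.
    private static string KeyOf<T>() => typeof(T).Name;

    // Hypothetical helper: stores the metadata as JSON under the type name.
    public static ColumnReference WithMetadata<T>(this ColumnReference expr, T metadata)
        where T : class, IExpressionMetadata
    {
        expr.Metadata[KeyOf<T>()] = JsonSerializer.Serialize(metadata);
        return expr;
    }

    // Returns true only if an entry for T exists and parses successfully.
    public static bool HasMetadata<T>(this ColumnReference expr, [NotNullWhen(true)] out T? metadata)
        where T : class, IExpressionMetadata
    {
        metadata = null;
        if (!expr.Metadata.TryGetValue(KeyOf<T>(), out var json))
        {
            return false;
        }
        try
        {
            metadata = JsonSerializer.Deserialize<T>(json);
        }
        catch (JsonException)
        {
            return false;
        }
        return metadata != null;
    }
}
```

With these definitions, a transformer could call expr.HasMetadata<TimestampMetadata>(out var ts) and read ts.Format without knowing the current name of the column.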

A remark on enrichers

Currently enrichers are modeled as unique Extern expressions on ExternTables.

The final model for enrichers should be the following:

All the required columns must be listed as parents of the Extern expression, including the enriched columns.
The enriched columns will be of type ColumnReference, and their parents will be Literals of the corresponding type with Content being the default values listed in MDL for the enriched columns.
Each such Extern will be wrapped with another ColumnReference having the name of the enricher + the corresponding EnricherMetadata as its metadata.
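
The three rules above can be sketched with a toy expression model. Everything here is an assumption made for illustration: the Expression, Literal, Extern and ColumnReference classes are simplified stand-ins for the Mangrove types, Literal.Content is modeled as a string, EnricherMetadata is trimmed to two properties, and EnricherModel.Build is a hypothetical helper.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Simplified stand-ins for the Mangrove expression types (assumed shapes).
public abstract class Expression
{
    protected Expression(IEnumerable<Expression> parents) => Parents = parents.ToList();
    public IReadOnlyList<Expression> Parents { get; }
}

public sealed class Literal : Expression
{
    public Literal(string content) : base(Enumerable.Empty<Expression>()) => Content = content;
    public string Content { get; }
}

public sealed class Extern : Expression
{
    public Extern(IEnumerable<Expression> parents) : base(parents) { }
}

public sealed class ColumnReference : Expression
{
    public ColumnReference(string name, params Expression[] parents) : base(parents) => Name = name;
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

// Trimmed-down version of the EnricherMetadata class shown earlier.
public class EnricherMetadata
{
    public List<string> RequiredColumns { get; set; } = new List<string>();
    public List<string> EnrichedColumns { get; set; } = new List<string>();
}

public static class EnricherModel
{
    // Hypothetical helper that assembles the enricher model described above.
    public static ColumnReference Build(
        string enricherName,
        IEnumerable<ColumnReference> requiredColumns,
        IEnumerable<(string Name, string DefaultValue)> enrichedColumns)
    {
        var required = requiredColumns.ToList();

        // Enriched columns are ColumnReferences whose parents are default-value Literals.
        var enriched = enrichedColumns
            .Select(c => new ColumnReference(c.Name, new Literal(c.DefaultValue)))
            .ToList();

        // The Extern lists all required columns, including the enriched ones, as parents.
        var externExpr = new Extern(required.Cast<Expression>().Concat(enriched));

        // The Extern is wrapped in a ColumnReference named after the enricher,
        // carrying the EnricherMetadata under the metadata type name.
        var wrapper = new ColumnReference(enricherName, externExpr);
        wrapper.Metadata[nameof(EnricherMetadata)] = JsonSerializer.Serialize(new EnricherMetadata
        {
            RequiredColumns = required.Select(c => c.Name).ToList(),
            EnrichedColumns = enriched.Select(c => c.Name).ToList(),
        });
        return wrapper;
    }
}
```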

Benefits of the proposed approach

The proposed approach has several advantages:

No changes to the core object model: we will reuse the existing ColumnReference.Metadata.
Easy for transformers to find the required "special" columns based on the metadata type.
Having metadata as separate types provides a robust contract, as opposed to e.g. just using special constants or enums.
Can enforce rules around metadata via Validators.
Metadata can be arbitrarily rich, and hence widely applicable: no restriction on the format.

Important questions to address

Who should create the metadata?

Some metadata is run-time (e.g. "Flight" columns), some is known at metric set compile time (e.g. columns required and produced by the enrichers). Thus, it is natural for some of the metadata to be part of ColumnReferences in the MetricsPlan, while some will be defined at run-time in the Coordinator (and perhaps in Blitz if it is too ExP-specific).

However, exposing the proposed metadata machinery would add another contract between Mangrove and the rest of ExP (in particular MDL Core). It might be a better idea to make it internal, and for Blitz/Mangrove to do the enrichments based on the information from the config. This presents challenges with enrichers as described above, so it might be that such a contract is unavoidable.

When to copy metadata?

The metadata should be preserved with all Clone operations. Otherwise, no metadata "propagation" should be done. In particular, the metadata should not affect either children or parents of a "special" expression.
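
A minimal sketch of this copy rule, assuming a simplified ColumnReference with a single Parent and a string-to-string Metadata dictionary (neither is the real Mangrove shape, and DeriveChild is a hypothetical helper): Clone copies the metadata, while an expression derived from a "special" column starts with none.

```csharp
#nullable enable
using System.Collections.Generic;

// Simplified stand-in for Mangrove's ColumnReference (assumed shape).
public sealed class ColumnReference
{
    public ColumnReference(
        string name,
        ColumnReference? parent = null,
        IDictionary<string, string>? metadata = null)
    {
        Name = name;
        Parent = parent;
        // Always copy, so that two expressions never share one dictionary.
        Metadata = metadata == null
            ? new Dictionary<string, string>()
            : new Dictionary<string, string>(metadata);
    }

    public string Name { get; }
    public ColumnReference? Parent { get; }
    public IDictionary<string, string> Metadata { get; }

    // Clone preserves the metadata of the expression being cloned.
    public ColumnReference Clone() => new ColumnReference(Name, Parent, Metadata);

    // A derived expression has this one as its parent but inherits no metadata.
    public ColumnReference DeriveChild(string childName) =>
        new ColumnReference(childName, parent: this);
}
```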

Note: currently transformers have pretty much unrestricted freedom in modifying MetricsPlans. We should keep it this way, but enforce the requirements "globally" via validators and assumptions (i.e. "this emitter requires a Flight column, but it doesn't matter where along the way it was obtained").

Note: following up on the note above, transformers should be able to change the metadata as well. However, we will still use the validators to ensure that the right type of metadata is present in the emitters.
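
Such a "global" check could look like the sketch below. It assumes, purely for illustration, that the columns visible to an emitter are available as a flat sequence of ColumnReference stand-ins and that metadata is keyed by the metadata type name; MetadataValidator and MissingMetadata are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for Mangrove's ColumnReference (assumed shape).
public sealed class ColumnReference
{
    public ColumnReference(string name) => Name = name;
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class MetadataValidator
{
    // Returns the required metadata type names that no column carries.
    // Where the metadata was attached along the way is deliberately ignored.
    public static IReadOnlyList<string> MissingMetadata(
        IEnumerable<ColumnReference> planColumns,
        IEnumerable<string> requiredMetadataTypes)
    {
        var present = new HashSet<string>(planColumns.SelectMany(c => c.Metadata.Keys));
        return requiredMetadataTypes.Where(t => !present.Contains(t)).ToList();
    }
}
```

An emitter would declare the metadata types it needs (for example "FlightMetadata" and "TimestampMetadata"), and validation would fail if the returned list is not empty.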