Expressions metadata
| Owner | Approvers | Participants |
|---|---|---|
| Sasha Patotski | Craig Boucher, Guangnan Shi, Scott Wang, Luffy Chen | Johnny Chan, Manish Kumar |
There are several known scenarios (see below) where the existing structure of the
MetricsPlan object is not enough to describe the computation completely and
reliably. This RFC (Request For Comment) describes a new enrichment
of the core object model to address some of these issues.
Problem statement
In the most generic terms, the issue can be described as follows. There is
some information that conceptually is part of a MetricsPlan and that
needs to be preserved throughout the sequence of transformations. However,
such information can be a run-time configuration, and therefore cannot be
made a part of the MetricsPlan itself.
The issue will hopefully become clear with examples.
- Timestamp columns. A data source can conceptually have several columns of
  different data types and formats containing timestamp information. Different
  analyses can require different columns to be used, so the choice is runtime
  information. In other words, a ComputationConfig should provide such
  information. However, the timestamp is a key column for several transformers
  and emitters (e.g. Hypercube and MV emitters, or the AddTrigger transformer,
  which needs to order the data by time inside the windowing function). Thus,
  those transformers and emitters must be able to identify the correct
  timestamp column. Key problem: columns get renamed all the time, and there is
  no way to ensure that config information stays relevant throughout the
  transformations.
- Flight column. Essentially the same issue as above. There can be several
  flight columns, and it's a runtime decision which column to use.
- Enrichers. By their design, enrichers require a specific set of
  ColumnReferences with a fixed set of names. Moreover, enrichers can only
  produce ColumnReferences with a fixed set of names. Unfortunately, in Mangrove
  ColumnReferences can be changed, renamed, moved etc., so there is no guarantee
  that the columns required by the enricher will still be present and will
  still have the correct names by the time the emitter is applied. There is
  also no clear way to enforce such "name stability".
Intrinsic properties of MetricsPlan vs. runtime information
Currently in ExP, information like the flight column or the timestamp column is an intrinsic property of the metric set, something that is specified at metric set creation. This RFC, however, treats these columns as part of the runtime configuration, something that is specified per analysis. How to reconcile the two approaches?
The answer is that we do not want to limit Mangrove's ability to be flexible. If a user is onboarding to ExP, they will get the "standard" approach to their code generation. However, we want to keep the possibility open for a user to use Mangrove native parser + Blitz to generate code from yaml metric sets. For such users, it would be much easier to specify timestamp, flight etc. columns simply via a name in the config.
Proposal
Key idea
The key idea is to add "metadata" to certain expressions that would be preserved
throughout the transformations. More concretely, ColumnReferences already have
a Metadata field. So far, it has not been used in our code. We propose
re-defining what this field is, without changing the object model.
Specifically, we suggest introducing an IExpressionMetadata interface
and several metadata types implementing it. The key-value pairs in
ColumnReference.Metadata would be mappings nameof(T) => SerializedObjectOfTypeT.
What it would look like
The IExpressionMetadata interface can be empty, to be as generic as possible.
The usage would be something like

    if (expr.HasMetadata<T>(out var metadata))
    {
        // do something with expr and metadata
    }

where HasMetadata would be something like

    public static bool HasMetadata<T>(this ColumnReference expr, out T metadata)
        where T : IExpressionMetadata
    {
        // check if expr.Metadata.TryParseJson<T>() is successful
    }

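A minimal sketch of how HasMetadata could be implemented, assuming that ColumnReference.Metadata is a string-to-string dictionary and that metadata objects are serialized with System.Text.Json. The ColumnReference class here is a simplified stand-in, and SetMetadata and TimestampMetadata.Format are hypothetical names used only for illustration. One detail worth noting: inside a generic method, nameof(T) evaluates to the string "T", so the sketch uses typeof(T).Name as the dictionary key.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Text.Json;

// Marker interface, intentionally empty.
public interface IExpressionMetadata { }

// Hypothetical metadata type; the Format property is an assumption.
public sealed class TimestampMetadata : IExpressionMetadata
{
    public string Format { get; set; } = "";
}

// Simplified stand-in for Mangrove's ColumnReference (assumption: Metadata
// is a string-to-string dictionary).
public sealed class ColumnReference
{
    public ColumnReference(string name) { Name = name; }
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class ColumnReferenceMetadataExtensions
{
    // Stores the metadata as JSON under the key typeof(T).Name.
    public static void SetMetadata<T>(this ColumnReference expr, T metadata)
        where T : IExpressionMetadata
    {
        expr.Metadata[typeof(T).Name] = JsonSerializer.Serialize(metadata);
    }

    // Returns true only if an entry for T exists and deserializes to a non-null object.
    public static bool HasMetadata<T>(this ColumnReference expr, [NotNullWhen(true)] out T? metadata)
        where T : class, IExpressionMetadata
    {
        metadata = null;
        if (!expr.Metadata.TryGetValue(typeof(T).Name, out var json))
        {
            return false;
        }
        try
        {
            metadata = JsonSerializer.Deserialize<T>(json);
        }
        catch (JsonException)
        {
            // Malformed JSON is treated as "no metadata of this type".
            return false;
        }
        return metadata != null;
    }
}
```

Treating malformed JSON as "absent" keeps the call site simple; a stricter variant could surface the parse error instead.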
The type T could be FlightMetadata, or EnricherMetadata, or TimestampMetadata,
or anything else. For example, TimestampMetadata could contain the timestamp
format string. FlightMetadata could contain flight names, regex for parsing the flight, etc.
In particular, EnricherMetadata could contain the list of expected column names
for this enricher.

    public class EnricherMetadata : IExpressionMetadata
    {
        public IEnumerable<string> GroupColumns { get; }
        public IEnumerable<string> RequiredColumns { get; }
        public IEnumerable<string> EnrichedColumns { get; }
        // mapping sortKey => isAscendingSortOrder
        public IDictionary<string, bool> SortKeys { get; }
    }

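To illustrate how such metadata could be used, here is a small sketch that reports which input columns an enricher expects but which are no longer available under the expected names. The constructor and the MissingInputs helper are hypothetical additions for this example. The sketch also assumes that group, required, and sort-key columns are all inputs, while enriched columns are outputs and are therefore not checked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IExpressionMetadata { }

public sealed class EnricherMetadata : IExpressionMetadata
{
    // Hypothetical constructor; the RFC only lists the get-only properties.
    public EnricherMetadata(
        IEnumerable<string> groupColumns,
        IEnumerable<string> requiredColumns,
        IEnumerable<string> enrichedColumns,
        IDictionary<string, bool> sortKeys)
    {
        GroupColumns = groupColumns.ToList();
        RequiredColumns = requiredColumns.ToList();
        EnrichedColumns = enrichedColumns.ToList();
        SortKeys = new Dictionary<string, bool>(sortKeys);
    }

    public IEnumerable<string> GroupColumns { get; }
    public IEnumerable<string> RequiredColumns { get; }
    public IEnumerable<string> EnrichedColumns { get; }
    // mapping sortKey => isAscendingSortOrder
    public IDictionary<string, bool> SortKeys { get; }
}

public static class EnricherMetadataChecks
{
    // Returns the names of input columns (group, required, sort keys)
    // that are not present in availableColumns.
    public static IReadOnlyList<string> MissingInputs(
        EnricherMetadata metadata, IEnumerable<string> availableColumns)
    {
        var available = new HashSet<string>(availableColumns, StringComparer.Ordinal);
        return metadata.GroupColumns
            .Concat(metadata.RequiredColumns)
            .Concat(metadata.SortKeys.Keys)
            .Distinct()
            .Where(name => !available.Contains(name))
            .ToList();
    }
}
```

A check like this could run just before the emitter is applied, which is the point where a renamed or dropped column would otherwise go unnoticed.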
A remark on enrichers
Currently enrichers are modeled as unique Extern expressions on ExternTables.
The final model for enrichers should be the following:
- All the required columns must be listed as parents of the Extern expression, including the enriched columns.
- The enriched columns will be of type ColumnReference, and their parents will be Literals of the corresponding type with Content being the default values listed in MDL for the enriched columns.
- Each such Extern will be wrapped with another ColumnReference having the name of the enricher + the corresponding EnricherMetadata as its metadata.
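The three rules above can be sketched as a small construction routine. All types below are simplified, hypothetical stand-ins for the Mangrove object model (the real Expression, Literal, Extern, and ColumnReference classes have more members), and the Build helper, its parameters, and the string-typed Literal content are assumptions made only to show the intended shape of the expression graph.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal expression model: every expression knows its parents.
public abstract class Expression
{
    protected Expression(IEnumerable<Expression> parents) { Parents = parents.ToList(); }
    public IReadOnlyList<Expression> Parents { get; }
}

public sealed class Literal : Expression
{
    public Literal(string content) : base(Enumerable.Empty<Expression>()) { Content = content; }
    public string Content { get; }
}

public sealed class Extern : Expression
{
    public Extern(IEnumerable<Expression> parents) : base(parents) { }
}

public sealed class ColumnReference : Expression
{
    public ColumnReference(string name, IEnumerable<Expression> parents) : base(parents) { Name = name; }
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class EnricherModel
{
    // enrichedDefaults maps each enriched column name to its default value.
    public static ColumnReference Build(
        string enricherName,
        IEnumerable<ColumnReference> requiredColumns,
        IDictionary<string, string> enrichedDefaults,
        string serializedEnricherMetadata)
    {
        // Rule 1: the Extern lists the required and the enriched columns as parents.
        var externParents = new List<Expression>();
        externParents.AddRange(requiredColumns);
        foreach (var entry in enrichedDefaults)
        {
            // Rule 2: an enriched column is a ColumnReference over a Literal default.
            externParents.Add(new ColumnReference(entry.Key, new Expression[] { new Literal(entry.Value) }));
        }
        var externExpression = new Extern(externParents);

        // Rule 3: wrap the Extern in a ColumnReference named after the enricher
        // and attach the serialized EnricherMetadata.
        var wrapper = new ColumnReference(enricherName, new Expression[] { externExpression });
        wrapper.Metadata["EnricherMetadata"] = serializedEnricherMetadata;
        return wrapper;
    }
}
```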
Benefits of the proposed approach
The proposed approach will have several good qualities:
- No changes to the core object model: we will reuse the existing ColumnReference.Metadata.
- Easy for transformers to find the required "special" columns based on the metadata type.
- Having metadata as separate types provides a robust contract, as opposed to e.g. just using special constants or enums.
- Can enforce rules around metadata via Validators.
- Metadata can be arbitrarily rich, and hence widely applicable: no restriction on the format.
Important questions to address
Who should create the metadata?
Some metadata is run-time (e.g. "Flight" columns), some is known at metric set
compile time (e.g. columns required and produced by the enrichers). Thus, it is
natural for some of the metadata to be part of ColumnReferences in the
MetricsPlan, while some will be defined at run-time in the Coordinator
(and perhaps in Blitz if it is too ExP-specific).
However, exposing the proposed metadata machinery would add another contract between Mangrove and the rest of ExP (in particular MDL Core). It might be a better idea to make it internal, and for Blitz/Mangrove to do the enrichments based on the information from the config. This presents challenges with enrichers as described above, so it might be that such a contract is unavoidable.
When to copy metadata?
The metadata should be preserved with all Clone operations. Otherwise, no
metadata "propagation" should be done. In particular, the metadata should not
affect either children or parents of a "special" expression.
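The copy rule can be sketched as follows. ColumnReference here is a simplified stand-in, and DeriveChild is a hypothetical helper representing any transformation that builds a new expression on top of an existing one. The sketch assumes metadata entries are strings, so copying the dictionary entry by entry is enough to make the clone independent of the original.

```csharp
#nullable enable
using System.Collections.Generic;

public sealed class ColumnReference
{
    public ColumnReference(string name, IEnumerable<ColumnReference>? parents = null)
    {
        Name = name;
        Parents = new List<ColumnReference>(parents ?? System.Array.Empty<ColumnReference>());
    }

    public string Name { get; }
    public IReadOnlyList<ColumnReference> Parents { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();

    // Clone preserves metadata: every entry is copied into a fresh dictionary,
    // so later edits to the clone do not leak back into the original.
    public ColumnReference Clone()
    {
        var copy = new ColumnReference(Name, Parents);
        foreach (var entry in Metadata)
        {
            copy.Metadata[entry.Key] = entry.Value;
        }
        return copy;
    }

    // Building a new expression on top of this one propagates nothing:
    // the child starts with empty metadata and the parent is left untouched.
    public ColumnReference DeriveChild(string newName)
    {
        return new ColumnReference(newName, new[] { this });
    }
}
```

Under this rule, a transformer that wants the derived column to stay "special" has to move or re-attach the metadata explicitly.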
Note: currently transformers allow pretty much unrestricted freedom in
modifying the MetricsPlans. We should keep it this way, but enforce the requirements
via validators and assumptions "globally" (i.e. "this emitter requires a Flight column,
but it doesn't matter where along the way it was obtained").
Note: following up on the note above, transformers should be able to change the metadata as well. However, we will still use the validators to ensure that the right type of metadata is present in the emitters.
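As an illustration of such a "global" check, the sketch below validates the columns that reach an emitter without caring how they got their metadata. The MetadataValidator class, its signature, and the rule that exactly one column must carry each required metadata type are assumptions for this example; ColumnReference is again a simplified stand-in.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class ColumnReference
{
    public ColumnReference(string name) { Name = name; }
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } = new Dictionary<string, string>();
}

public static class MetadataValidator
{
    // Returns one error message per required metadata key that is not carried
    // by exactly one of the emitter's input columns. An empty list means valid.
    public static IReadOnlyList<string> Validate(
        IEnumerable<ColumnReference> emitterInputs,
        IEnumerable<string> requiredMetadataKeys)
    {
        var columns = emitterInputs.ToList();
        var errors = new List<string>();
        foreach (var key in requiredMetadataKeys)
        {
            var count = columns.Count(column => column.Metadata.ContainsKey(key));
            if (count != 1)
            {
                errors.Add($"Expected exactly one column with {key}, found {count}.");
            }
        }
        return errors;
    }
}
```

Because the validator only inspects the final inputs, transformers remain free to add, move, or replace metadata along the way.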