Expressions metadata
Owner | Approvers | Participants |
---|---|---|
Sasha Patotski | Craig Boucher, Guangnan Shi, Scott Wang, Luffy Chen | Johnny Chan, Manish Kumar |
There are several known scenarios, described below, where the existing structure of the MetricsPlan object is not enough to completely describe the computation in a reliable way. This RFC (Request For Comment) describes a new enrichment of the core object model to address some of these issues.
Problem statement
In the most generic terms, the issue can be described as follows. There is some information that conceptually is part of a MetricsPlan and that needs to be preserved throughout the sequence of transformations. However, such information can be a run-time configuration, and therefore cannot be made a part of the MetricsPlan itself.
The issue will hopefully become clear with examples.
- Timestamp columns. A data source can conceptually have several columns of different data types and formats containing timestamp information. Different analyses can require different columns to be used, so this is runtime information. In other words, a ComputationConfig should provide such information. However, the timestamp is a key column for several transformers and emitters (e.g. the Hypercube and MV emitters, the AddTrigger transformer since it needs to order the data by time inside the windowing function, etc.). Thus, those transformers and emitters must be able to identify the correct timestamp column. Key problem: columns get renamed all the time, and there is no way to ensure that config information stays relevant throughout the transformations.
- Flight column. Essentially the same issue as above. There can be several flight columns, and it's a runtime decision which column to use.
- Enrichers. By their design, enrichers require a specific set of ColumnReferences with a fixed set of names. Moreover, enrichers can only produce ColumnReferences with a fixed set of names. Unfortunately, in Mangrove ColumnReferences can be changed, renamed, moved, etc., so there is no guarantee that the columns required by the enricher will still be present and will still have the correct names by the time the emitter is applied. There is also no clear way to enforce such "name stability".
Intrinsic properties of MetricsPlan vs. runtime information
Currently in ExP, information like the flight column or the timestamp column is an intrinsic property of the metric set, something that is specified at metric set creation. This RFC, however, treats these columns as part of the runtime configuration, something that is specified per analysis. How to reconcile the two approaches?
The answer is that we do not want to limit Mangrove's ability to be flexible. If a user is onboarding to ExP, they will get the "standard" approach to their code generation. However, we want to keep the possibility open for a user to use Mangrove native parser + Blitz to generate code from yaml metric sets. For such users, it would be much easier to specify timestamp, flight etc. columns simply via a name in the config.
Proposal
Key idea
The key idea is to add "metadata" to certain expressions that would be preserved throughout the transformations. More concretely, ColumnReferences already have a Metadata field. So far, it has not been used in our code. We propose re-defining what this field is, without changing the object model. Specifically, we suggest introducing an IExpressionMetadata interface and several metadata types implementing it. The key-value pairs in ColumnReference.Metadata would be mappings nameof(T) => SerializedObjectOfTypeT.
What it would look like
The IExpressionMetadata interface might as well be empty, to be as generic as possible. The usage would be something like
if (expr.HasMetadata<T>(out var metadata))
{
    // do something with expr and metadata
}
where HasMetadata would be something like
public static bool HasMetadata<T>(this ColumnReference expr, out T metadata)
    where T : IExpressionMetadata
{
    // check if expr.Metadata.TryParseJson<T>() is successful
}
The type T could be FlightMetadata, or EnricherMetadata, or TimestampMetadata, or anything else. For example, TimestampMetadata could contain the timestamp format string. FlightMetadata could contain flight names, regex for parsing the flight, etc. In particular, EnricherMetadata could contain the list of expected column names for this enricher.
public class EnricherMetadata : IExpressionMetadata
{
    public IEnumerable<string> GroupColumns { get; }
    public IEnumerable<string> RequiredColumns { get; }
    public IEnumerable<string> EnrichedColumns { get; }
    // mapping sortKey => isAscendingSortOrder
    public IDictionary<string, bool> SortKeys { get; }
}
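To make the key-value mapping concrete, here is a minimal sketch of the round trip from a typed metadata object to an entry in the Metadata dictionary and back. Everything in it is an assumption made for illustration: the ColumnReference class is a reduced stand-in rather than the real Mangrove type, WithMetadata is a hypothetical helper, TimestampMetadata.Format is a hypothetical property, and System.Text.Json stands in for whatever serializer is actually used. The key is typeof(T).Name rather than nameof(T), because nameof applied to a generic type parameter yields the parameter's own name ("T").

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Marker interface from the proposal; intentionally empty.
public interface IExpressionMetadata { }

// Hypothetical metadata type. The property is settable so that
// System.Text.Json can populate it during deserialization.
public class TimestampMetadata : IExpressionMetadata
{
    public string Format { get; set; } = "";
}

// Stand-in for Mangrove's ColumnReference, reduced to what the sketch needs.
public class ColumnReference
{
    public string Name { get; }
    public IDictionary<string, string> Metadata { get; } =
        new Dictionary<string, string>();

    public ColumnReference(string name) { Name = name; }
}

public static class ExpressionMetadataExtensions
{
    // Hypothetical helper: stores the metadata as JSON under the type's name.
    public static ColumnReference WithMetadata<T>(this ColumnReference expr, T metadata)
        where T : IExpressionMetadata
    {
        expr.Metadata[typeof(T).Name] = JsonSerializer.Serialize(metadata);
        return expr;
    }

    // Returns true only if an entry for T exists and deserializes to a non-null value.
    public static bool HasMetadata<T>(this ColumnReference expr, out T metadata)
        where T : IExpressionMetadata
    {
        metadata = default;
        if (!expr.Metadata.TryGetValue(typeof(T).Name, out var json))
        {
            return false;
        }
        try
        {
            metadata = JsonSerializer.Deserialize<T>(json);
            return metadata != null;
        }
        catch (JsonException)
        {
            // Malformed content is treated the same as missing metadata.
            return false;
        }
    }
}
```

With these definitions, a column tagged via new ColumnReference("EventTime").WithMetadata(new TimestampMetadata { Format = "yyyy-MM-dd" }) can later be recognized by any transformer through HasMetadata<TimestampMetadata>(out var ts), regardless of what the column has been renamed to in between.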
A remark on enrichers
Currently enrichers are modeled as unique Extern expressions on ExternTables.
The final model for enrichers should be the following:
- All the required columns must be listed as parents of the Extern expression, including the enriched columns.
- The enriched columns will be of type ColumnReference, and their parents will be Literals of the corresponding type, with Content being the default values listed in MDL for the enriched columns.
- Each such Extern will be wrapped with another ColumnReference having the name of the enricher and the corresponding EnricherMetadata as its metadata.
Benefits of the proposed approach
The proposed approach will have several good qualities:
- No changes to the core object model: we will reuse the existing ColumnReference.Metadata.
- Easy for transformers to find the required "special" columns based on the metadata type.
- Having metadata as separate types provides a robust contract, as opposed to e.g. just using special constants or enums.
- Can enforce rules around metadata via Validators.
- Metadata can be arbitrarily rich, and hence widely applicable: no restriction on the format.
Important questions to address
Who should create the metadata?
Some metadata is run-time (e.g. "Flight" columns), some is known at metric set compile time (e.g. columns required and produced by the enrichers). Thus, it is natural for some of the metadata to be part of ColumnReferences in the MetricsPlan, while some will be defined at run-time in the Coordinator (and perhaps in Blitz if it is too ExP-specific).
However, exposing the proposed metadata machinery would add another contract between Mangrove and the rest of ExP (in particular MDL Core). It might be a better idea to make it internal, and for Blitz/Mangrove to do the enrichments based on the information from the config. This presents challenges with enrichers as described above, so it might be that such a contract is unavoidable.
When to copy metadata?
The metadata should be preserved with all Clone operations. Otherwise, no metadata "propagation" should be done. In particular, the metadata should not affect either children or parents of a "special" expression.
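The copy rule can be illustrated with a small sketch. It assumes a simplified ColumnReference stand-in (not the real Mangrove class) whose Clone method copies the node's own metadata into a fresh dictionary and leaves its parents untouched; the constructor shape and the optional newName parameter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for an expression node; not the real Mangrove type.
public class ColumnReference
{
    public string Name { get; }
    public IReadOnlyList<ColumnReference> Parents { get; }
    public IDictionary<string, string> Metadata { get; }

    public ColumnReference(
        string name,
        IEnumerable<ColumnReference> parents = null,
        IDictionary<string, string> metadata = null)
    {
        Name = name;
        Parents = (parents ?? Enumerable.Empty<ColumnReference>()).ToList();
        // Copy the entries so that no two nodes ever share one dictionary.
        Metadata = metadata == null
            ? new Dictionary<string, string>()
            : new Dictionary<string, string>(metadata);
    }

    // Clone (optionally under a new name) keeps this node's own metadata.
    // Nothing is pushed to parents, and nothing is inherited from them.
    public ColumnReference Clone(string newName = null) =>
        new ColumnReference(newName ?? Name, Parents, Metadata);
}
```

In this sketch, renaming a flight column through Clone("Flight_renamed") keeps its metadata entry, while a new expression that merely lists the flight column as a parent starts with empty metadata.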
Note: currently transformers allow pretty much unrestricted freedom in modifying the MetricsPlans. We should keep it this way, but enforce the requirements via validators and assumptions "globally" (i.e. "this emitter requires a Flight column, but it doesn't matter where along the way it was obtained").
Note: following up on the note above, transformers should be able to change the metadata as well. However, we will still use the validators to ensure that the right type of metadata is present in the emitters.
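One possible shape for such a "global" validator is sketched below. It is only an illustration under stated assumptions: ColumnReference is a reduced stand-in, MetadataValidator and FindMissing are hypothetical names, metadata kinds are identified by their dictionary keys, and "present" is interpreted as "carried by at least one expression reachable from the emitter's inputs through parent links".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduced stand-in for an expression node; not the real Mangrove type.
public class ColumnReference
{
    public string Name { get; }
    public IReadOnlyList<ColumnReference> Parents { get; }
    public IDictionary<string, string> Metadata { get; }

    public ColumnReference(
        string name,
        IEnumerable<ColumnReference> parents = null,
        IDictionary<string, string> metadata = null)
    {
        Name = name;
        Parents = (parents ?? Enumerable.Empty<ColumnReference>()).ToList();
        Metadata = metadata ?? new Dictionary<string, string>();
    }
}

public static class MetadataValidator
{
    // Returns the required metadata keys that no expression reachable
    // from the given roots carries. An empty result means validation passed.
    public static IReadOnlyList<string> FindMissing(
        IEnumerable<ColumnReference> roots,
        IEnumerable<string> requiredKeys)
    {
        var present = new HashSet<string>();
        var visited = new HashSet<ColumnReference>();
        var stack = new Stack<ColumnReference>(roots);

        // Depth-first walk over parent links; each node is visited once.
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            if (!visited.Add(node))
            {
                continue;
            }
            present.UnionWith(node.Metadata.Keys);
            foreach (var parent in node.Parents)
            {
                stack.Push(parent);
            }
        }

        return requiredKeys.Where(key => !present.Contains(key)).ToList();
    }
}
```

The check deliberately ignores which transformer attached or modified the metadata: it only reports whether each required kind can be found somewhere upstream of the emitter.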