Wildcard support

Owner	Approvers	Participants
Jong Ho Lee	Daniel Miller, Ulf Knoblich	Ameya Chaudhari

Some metrics are defined using wildcards such as __Start__ or __End__. Wildcard values do not exist in the raw data; they are scorecard-job-level parameters, so wildcards must be handled differently from columns defined in the raw data. This RFC (Request for Comments) gives an example that uses wildcards and proposes a way to handle them.

Example

The 'TruncatedOverallRevenuePerUser' metric in SLAPI is defined using the custom column 'ExperimentDuration'. The definition of 'ExperimentDuration' is:

(double)DateTime.Parse(__End__).Subtract(DateTime.Parse(__Start__)).Ticks / System.TimeSpan.TicksPerDay + 1

When the Foray scripts are generated, __Start__ and __End__ are replaced with the analysis start time and end time.
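As a rough illustration of generation-time substitution, the Python sketch below replaces the wildcards in the definition with literal values. The `render` helper and the choice to insert the values as quoted string literals are assumptions made for this example; they are not taken from the actual Foray script generator.

```python
# The C# definition from the example, held as plain text.
DEFINITION = (
    "(double)DateTime.Parse(__End__).Subtract(DateTime.Parse(__Start__)).Ticks"
    " / System.TimeSpan.TicksPerDay + 1"
)


def render(definition, wildcards):
    """Hypothetical helper: replace each wildcard with a quoted literal value."""
    for name, value in wildcards.items():
        # Assumption: values are inserted as C# string literals.
        definition = definition.replace(name, f'"{value}"')
    return definition


script_expr = render(
    DEFINITION,
    {"__Start__": "2019-07-03 12:00:00", "__End__": "2019-07-05 12:00:00"},
)
print(script_expr)
```

After rendering, the expression contains no wildcards, which is why this approach only works when the job's start and end times are known at generation time.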

Problems

The problem with wildcards is that they are not defined in the raw data. For example, if we cache 'ExperimentDuration' daily, the cached value will always be 2, whatever the length of the analysis. Values computed from wildcards should therefore not be stored in the cache; the wildcard values must be provided when the scorecard job is requested. For Foray and MV, all scorecard job information is available when the script is generated. For Hypercube, however, metric set generation and scorecard job generation are separate, and one metric set is reused multiple times, so we cannot provide wildcard values when the metric set is generated. If one DLL is later reused in MV, wildcards should likewise be replaced at job submission time.
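The caching issue can be reproduced with a small Python sketch. It assumes that a daily cache entry is evaluated with a one-day window as its start and end, and it uses a Python equivalent of the 'ExperimentDuration' formula; the function name `experiment_duration` and the timestamp format are chosen for this example only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def experiment_duration(start, end):
    """Python equivalent of the C# formula: elapsed days plus one."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta / timedelta(days=1) + 1


# Value for the whole scorecard job.
job_duration = experiment_duration("2019-07-03 12:00:00", "2019-07-05 12:00:00")

# Values that would be cached if each day were evaluated on its own
# (assumption: each daily window spans exactly one day).
daily_windows = [
    ("2019-07-03 12:00:00", "2019-07-04 12:00:00"),
    ("2019-07-04 12:00:00", "2019-07-05 12:00:00"),
]
cached = [experiment_duration(s, e) for s, e in daily_windows]

print(job_duration)  # 3.0
print(cached)        # [2.0, 2.0]
```

No combination of the cached daily values tells the job that its duration is 3, so the value has to come from the job request itself.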

Currently, wildcards are used with date-time functions, but many of those functions are C#-oriented, like the definition of 'ExperimentDuration' above, which works in ticks. Unix timestamps are more widely used in 3p, and DateTime.Parse cannot be used directly on the JVM.

Variables with four underscores (two leading and two trailing) are implicitly assumed to be wildcards, but this convention is specific to Foray.

Proposal

The Hypercube engine will be updated to take a dictionary for each scorecard job from the job configuration. For 'TruncatedOverallRevenuePerUser', Dictionary(__Start__ -> '2019-07-03 12:00:00', __End__ -> '2019-07-05 12:00:00') can be provided in the job configuration. The metric set is generated so that the __Start__ and __End__ values are read from this dictionary. For example, if 'p' is the name of the dictionary variable, p.get(__Start__) replaces __Start__ in the definition of the metric.
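A minimal Python sketch of this rewrite is shown below. It assumes wildcards can be recognized by the pattern of two underscores, letters, two underscores; the `bind_wildcards` helper and the quoting of the dictionary key are illustrative choices, not the Hypercube implementation.

```python
import re

# Assumption: a wildcard is two underscores, letters, two underscores.
WILDCARD = re.compile(r"__[A-Za-z]+__")


def bind_wildcards(definition, dict_name="p"):
    """Hypothetical helper: turn each wildcard into a dictionary lookup."""
    return WILDCARD.sub(
        lambda m: f'{dict_name}.get("{m.group(0)}")', definition
    )


generated = bind_wildcards(
    "DateTime.Parse(__End__).Subtract(DateTime.Parse(__Start__))"
)
print(generated)

# Per-job dictionaries supplied later through the job configuration.
job_a = {"__Start__": "2019-07-03 12:00:00", "__End__": "2019-07-05 12:00:00"}
job_b = {"__Start__": "2019-08-01 00:00:00", "__End__": "2019-08-08 00:00:00"}
```

Because the generated definition holds lookups rather than literal values, the same metric set can be reused by both jobs, each with its own dictionary.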

We also need to define more standard date-time operations to support different compute fabrics. For duration, we can define a getDuration(start string, end string) function and let each compute fabric supply the implementation. Alternatively, we can put a 'duration' value directly in the dictionary and redefine the metric in terms of 'duration'.
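One possible fabric-specific implementation of getDuration, based on Unix timestamps rather than ticks, might look like the Python sketch below. It assumes timestamps are 'YYYY-MM-DD HH:MM:SS' strings in UTC and that the function should return the same value as 'ExperimentDuration', including the trailing + 1; neither point is fixed by this RFC.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def to_unix_seconds(timestamp):
    """Parse 'YYYY-MM-DD HH:MM:SS' (assumed UTC) into Unix seconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def get_duration(start, end):
    """Elapsed days between start and end, plus one (assumed semantics)."""
    elapsed = to_unix_seconds(end) - to_unix_seconds(start)
    return elapsed / SECONDS_PER_DAY + 1


print(get_duration("2019-07-03 12:00:00", "2019-07-05 12:00:00"))  # 3.0
```

A metric definition that calls getDuration stays the same across fabrics; only the body of the function changes.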

I suggest defining job-level variables explicitly instead of relying on implicit wildcards. Custom job-level variables such as 'duration' can be defined in terms of other job-level variables such as 'start' and 'end'. 'duration' is then computed once per job rather than once per user.
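The sketch below illustrates explicit job-level variables in Python. The `resolve_job_variables` helper, the call counter, and the per-user revenue figures are all hypothetical and exist only to show that a derived variable is evaluated once per job.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
calls = {"duration": 0}    # counts how often 'duration' is computed


def duration_in_days(variables):
    """Derived variable: elapsed days between 'start' and 'end', plus one."""
    calls["duration"] += 1
    start = datetime.strptime(variables["start"], FMT)
    end = datetime.strptime(variables["end"], FMT)
    return (end - start) / timedelta(days=1) + 1


def resolve_job_variables(base, derived):
    """Evaluate each derived variable once, in declaration order."""
    resolved = dict(base)
    for name, compute in derived.items():
        resolved[name] = compute(resolved)
    return resolved


job_vars = resolve_job_variables(
    {"start": "2019-07-03 12:00:00", "end": "2019-07-05 12:00:00"},
    {"duration": duration_in_days},
)

# Hypothetical per-user computation that reuses the resolved value.
user_revenue = {"u1": 30.0, "u2": 6.0}
revenue_per_day = {
    user: revenue / job_vars["duration"]
    for user, revenue in user_revenue.items()
}
print(revenue_per_day, calls["duration"])
```

The per-user step only reads `job_vars["duration"]`, so the date arithmetic runs once no matter how many users the job covers.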