Turnkey Caching
Associated with Feature 10442
Owner | Approvers | Participants |
---|---|---|
Daniel Miller | Nan Ma, Jerel Frauenheim, Venky Venkateshaiah | David Lydston, Dor Naveh, Max Caughron |
Development Context
Cost savings and scaling treatment counts for our 1P customer are two of the three FY20 Semester 1 (Vibranium) goals. The turnkey cache project has been identified as a rare project that will deliver explicitly on both of these goals and, in future iterations, may speed up 3P integrations. Moreover, the project will lower the amount of time analysts need to build custom caches for other partners as those become necessary to scale their treatment counts.
The turnkey cache project emanated from Bing's offer to fund one developer-year for the Analysis & Experimentation team in exchange for the return of 10k Cosmos tokens (35k → 25k). After exploring other alternatives such as merged logs on Spark (detailed discussion below), the ExP team landed on turnkey caching and encouraging scorecard computation on Spark as the best engineering options to meet the goal of scaling up experimentation while lowering the tokens needed to compute it. The decision rested in equal parts on confidence in the turnkey cache's ability to return tokens and on the standalone value of the caches themselves (the benefits above). Subsequently, another vehicle for returning tokens has emerged with Scorecard Batching, and it is proposed that both projects be pursued. It should also be noted that the success of the turnkey project for Bing profiles will be judged in part by the degree to which other customers can benefit from its development further down the road.
What is a Turnkey Cache?
In the ExP context "cache" refers to the process of transforming and preparing the data to be more efficiently processed by the Cosmsos compute enginge by running a repeated job that usually needs to be reran muliple times into a single job that is stored and applied to all of the regular. As such the savings are increased as the customer runs more experiments. As a cache this trades expensive computation cost for much less expensive (but not free) storage costs.
Currently the ExP compute engine duplicates precompute data cooking for analysis jobs that share the same daily metrics. This is an inefficient use of compute resources and creates longer compute queues. Two customized caches have been developed by analysts to precook some of the data and lessen queue time, but these take analyst development time, have limited reuse value, and carry multiple potential failure points. The idea of a turnkey cache is to take advantage of the hard-won lessons learned in building those two custom caches, abstract how they are built, and document how to build them for other teams that would benefit from them.
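To put the tradeoff in rough terms (a back-of-envelope illustration, not part of the original analysis; N and the cost terms below are placeholder symbols):

cost without cache ≈ N × (cost of cooking raw logs per analysis)
cost with cache ≈ (one cache-build job / day) + N × (cost of consuming the cached data) + (storage)

Since consuming cached data is much cheaper than cooking raw logs, and the build and storage costs are paid once per day regardless of N, the savings grow with N, the number of analyses sharing the same configuration.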
Putting this together
Token savings
- Lessen duplicated work by preprocessing daily data and using one computation for all analysis requests sharing the same config (the more you scale, the more valuable the cache becomes)

Error detection
- Errors are caught earlier, before data is moved to compute
- The absence of duplicated jobs means errors are detected once, as opposed to each time the job is run

Standardization
- Formalizes hard-earned expertise from custom cache development
- Minimizes cache development time / less friction in building new caches
- Lowers friction in developing new metricSets
- Limits failure points associated with custom caches
- Develops a more modular cache adaptable to other customers
- Compute-fabric-agnostic splitting logic
Technical Details
This spec document gives a reasonably precise proposal for how to implement "turnkey caching", a set of short-term and long-term milestones, and a description of which kinds of scorecard jobs turnkey caching will and will not be able to support in the short and long term. Special emphasis is given to concrete, short-term deliverables, as they are needed to provide A&E with confidence that we can return the requisite number of tokens to Bing within a calendar year.
Useful context: SearchDM token usage review.
Note: the approval of this RFC only means that the reviewers believe this proposal is one which could be carried out with the given resources (1 dev-year), and that it would accomplish the stated goal (remove 10k tokens from the searchDM VC). There may be other ways of carrying out the token-savings goal. Moreover, this proposal is not an attempt to make all engineering decisions in advance, but rather to motivate the work and make sure we remember all the key milestones.
Design Considerations
Currently, A&E can compute the following set of scorecard jobs via the Hypercube (Spark) framework:
Non-filter/trigger, A/B-type, Bing scorecards using the mobile or web verticals. This includes seedfinder, but not retro-AA.
Support for retro-AA is being implemented, and should be ready within a couple weeks.
There are dozens of Bing and MSN analysis profiles running in jobs on the searchDM VC. The goal of this design is to allow the Hypercube framework to pick up the non-filter/trigger jobs for those profiles at a negligible one-time developer cost / profile, and controlled Cosmos storage and compute cost / profile.
It is unlikely that most of this work will accrue to 3P (Azure ExP), partly because our Azure ExP Data Contract explicitly requires flattened and performant data. However, the splitting logic discussed below may be reusable in the context of Spark and Kusto. Currently, the only way of using Hypercube to compute metrics on data living in a Kusto cluster is via data export to Azure blob storage. However, the Kusto team is working on a Spark connector which will support distributed data transfer. It may be advantageous to use the splitting logic we develop to split metrics computation into two stages, the first being simple extract logic in the Kusto cluster, the second being metrics computation in a Spark cluster.
Proposal
Run "vanilla" and most filter/trigger scorecard jobs via Spark for an analysis profile the "standard" way (Cosmos cache, then compute metrics in a Spark cluster), with the following changes:
Generate the cache script and metric set JAR using Blitz to "split" a MetricsPlan into "cache" and "consume" components. The splitting logic will follow the ideas currently used by MetricNet, namely push any Cosmos-specific computation (e.g. UDO extraction, custom enrichers) into the cache, and run fabric-agnostic computation (sum, average, etc.) in Spark.
- The cache script will be generated by applying the Scope-script-generation functionality Blitz already has to the "cache" component.
- The metric set JAR will be generated by applying the Spark-script-generation functionality currently being added to Mangrove to the "consume" MetricsPlan.
Schedule, submit, and monitor cache script jobs using a production-level framework. This will not be plugged into the existing Unified Cache framework. Instead, it will plug into the Foray pipeline.
- Preliminary investigations (comparing a random cache and scorecard job) yield the estimate: cost of cache == cost of 1 scorecard / day. Our current caches for Bing web and Bing mobile have sizes of ≈ 1.5 TB and 150 GB per day, respectively. Assuming an average of 0.5 TB / cache / day and the roughly 30 days of retention assumed under Concerns below, this puts the "all-up" storage footprint of a cache at 15 TB (0.5 TB / day × 30 days).
- This is a great opportunity to improve our ability to catch potential failures early. One possibility is to attempt to Scope-compile the generated cache scripts on metric set deployment and alert if compilation fails. This would allow DRIs to roll back "bad" metric set deployments early, instead of waiting for a cache job failure early the next morning.
Submit scorecard jobs via Control Tower and the MT (MagneTar) cluster as we are already doing via the Hypercube project.
- This is where token savings come in. From the perspective of this proposal, the cost of computing a job on Spark is zero, since the goal is to give back tokens from the Cosmos VC.
Note: while the contract with Bing is "1 dev-year for 10k tokens", A&E will not likely use the newly hired developer to implement this project. Instead, that developer can work on other projects, and we will make the choice of who, specifically, will work on turnkey caching later on.
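To show how these pieces are intended to fit together, here is a minimal orchestration sketch. All of the types and methods below (Splitter, ScopeCodeGen, SparkCodeGen, ForayScheduler, ControlTower) are illustrative placeholders, not actual Blitz, Mangrove, Foray, or Control Tower APIs:

```csharp
using System;

// Hypothetical end-to-end flow for one analysis profile and one day of data.
// Every type referenced here is a stand-in for the real component named in
// the comments; none of these APIs exist as written.
public static class TurnkeyCachePipeline
{
    public static void RunDay(MetricsPlan plan, DateTime day)
    {
        // 1. Split the MetricsPlan (Blitz/Mangrove): Cosmos-specific work such as
        //    UDO extraction and custom enrichers goes into the "cache" component;
        //    fabric-agnostic aggregation goes into the "consume" component.
        var (cachePlan, consumePlan) = Splitter.Split(plan);

        // 2. Generate artifacts: a Scope script for the cache component and a
        //    metric set JAR for the Spark-side consume component.
        string scopeScript  = ScopeCodeGen.GenerateScript(cachePlan, day);
        byte[] metricSetJar = SparkCodeGen.GenerateJar(consumePlan);

        // 3. Schedule and monitor the cache job on the Cosmos VC (Foray pipeline),
        //    then submit the scorecard job to the MT Spark cluster (Control Tower)
        //    against the cached output.
        string cacheOutputPath = ForayScheduler.SubmitAndAwait(scopeScript, day);
        ControlTower.SubmitScorecardJob(metricSetJar, cacheOutputPath, cluster: "MT");
    }
}
```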
Design Details
The proposal above is rather high-level. Here is a more technical overview of the proposed components. This design is for a base-level cache only (we do not plan to support session-level or user-level caching in the near future), but the more concrete designs should be reviewed by Ulf Knoblich and Andreas Dressel to ensure the approach is easily extensible to other forms of caching.
Splitting
A MetricsPlan encodes a complete description of the computations described in metric set XML. Fabric-agnostic operations like addition, multiplication, average, percentile, etc. are expressed as corresponding Expression objects. Some operations in a metric set, e.g. enrichers and functions defined in static DLLs, only make sense in the context of a Scope script. The splitting logic should find a partition of the given MetricsPlan into "cache" and "consume" components, such that the following conditions are met:
- All aggregations exist in the consume component.
- All extern expressions and tables exist in the cache component.
- The "cache" component contains all parents of expressions in it.
- The "cache" component contains all data source columns.
Under these constraints, there may be a range of possible partition techniques. Here are two natural ones, where the adjective describes the generosity of the cache.
- "Greedy": put only aggregations and their children into the consume component. More precisely, a greedy cache is the splitting satisfying #1–4 which contains the most expression nodes.
- "Generous": put only externs and their parents into the cache component. More precisely, a greedy cache is the splitting satisfying #1–4 which contains the fewest expression nodes.
I lean towards generous caches, as their schemas are more likely to be stable. Since schema changes require backfills, there is a strong benefit to reducing the frequency of such changes. The Hypercube team's current private caches are all generous.
If a metric set has complicated C# operations or enrichers above aggregation, it cannot be supported by turnkey caching.
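As a rough illustration of the constraints and the two partition styles above, here is a simplified sketch over a toy expression DAG. The Node and Kind types are stand-ins only; the real Mangrove MetricsPlan/Expression types are considerably richer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-ins for MetricsPlan expression nodes; "Inputs" are upstream dependencies.
enum Kind { DataSourceColumn, Extern, Scalar, Aggregation }

sealed class Node
{
    public string Name;
    public Kind Kind;
    public List<Node> Inputs = new List<Node>();
}

static class Splitter
{
    // Returns the set of nodes placed in the cache component; everything else is
    // the consume component. `nodes` must be topologically sorted (inputs first).
    public static HashSet<Node> Split(IReadOnlyList<Node> nodes, bool generous)
    {
        var cache = new HashSet<Node>();

        if (generous)
        {
            // Generous: cache only what must be there (externs and data source
            // columns) plus their upstream dependencies -> smallest, most stable
            // schema. Aggregations never land here, because metric sets with
            // externs or enrichers above aggregation are unsupported.
            foreach (var n in nodes)
                if (n.Kind == Kind.Extern || n.Kind == Kind.DataSourceColumn)
                    AddWithInputs(n, cache);
        }
        else
        {
            // Greedy: consume gets only aggregations and nodes downstream of them;
            // everything else goes into the cache -> largest possible cache.
            var consume = new HashSet<Node>();
            foreach (var n in nodes) // topological order makes one pass sufficient
                if (n.Kind == Kind.Aggregation || n.Inputs.Any(consume.Contains))
                    consume.Add(n);
            foreach (var n in nodes)
                if (!consume.Contains(n))
                    cache.Add(n);
        }
        return cache; // closed under upstream dependencies in both modes
    }

    static void AddWithInputs(Node n, HashSet<Node> cache)
    {
        if (!cache.Add(n)) return;
        foreach (var input in n.Inputs) AddWithInputs(input, cache);
    }
}
```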
Semantically-invariant Names
The contract between cache and consume components of a split MetricsPlan is the
shared schema. That is, the output schema of the cache must be identical with the
input schema of the consume component. In order to reduce the frequency of
backfills, the name of a column in the cache should depend only on the semantic
definition of that column, i.e. a > 1? 3: a
and a > (1)? ((3)): a
should be
given the same column name in the cache. I believe the current Hypercube approach —
scrub intermediate column names and parentheses from an expression, then hash
a string representation of that scrubbed expression — should suffice for
turnkey caching.
Note: currently the Mangrove type system does not support a notion of ordered output schema, i.e. ColumnReference nodes do not have an ordering. This work may involve giving them an explicit ordering, or using e.g. alphabetical ordering on the output schema.
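A minimal sketch of this naming scheme, assuming the normalize-then-hash approach above. The scrubbing here only strips whitespace and parentheses; the real Hypercube logic also scrubs intermediate column names:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class SemanticName
{
    // Produces a column name that depends only on the scrubbed expression text,
    // so semantically identical expressions map to the same cache column.
    public static string ForExpression(string expressionText)
    {
        // "a > 1? 3: a" and "a > (1)? ((3)): a" normalize to the same string,
        // and therefore receive the same cache column name.
        var scrubbed = new StringBuilder();
        foreach (char c in expressionText)
            if (c != '(' && c != ')' && !char.IsWhiteSpace(c))
                scrubbed.Append(c);

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(scrubbed.ToString()));
            // Use a short prefix of the hash as the column identifier.
            return "col_" + BitConverter.ToString(hash, 0, 8).Replace("-", "").ToLowerInvariant();
        }
    }
}

// SemanticName.ForExpression("a > 1? 3: a") == SemanticName.ForExpression("a > (1)? ((3)): a")
```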
Filter / trigger cache
We will also generate, once per metric set, a cache of filter / trigger conditions satisfying the following conditions:
- The condition expression does not use any additional references or resources. Resolving conflicts between filter / trigger conditions that use different versions of the same-named DLL is too hard.
- The condition expression has not previously appeared in any failed jobs.
Following a previous prototype filter / trigger cache for Hypercube, this cache should wrap all the conditions in try / catch blocks, which will make the cache resilient to all condition failures except out-of-memory errors, which occur very infrequently. The try / catch blocks would return null in case of failure, which could be used to mark conditions as unusable from the cache. This would not degrade the user experience: if a condition fails in the cache, it will also fail in any later one-off jobs that attempt to use it.
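A minimal sketch of that wrapping, assuming each cached condition compiles down to a predicate over a row; the row representation and the nullable-bool convention are illustrative, not the actual cache schema:

```csharp
using System;
using System.Collections.Generic;

static class ConditionCache
{
    // Wraps a filter / trigger condition so that any failure (other than fatal
    // errors such as out-of-memory) yields null instead of failing the cache job.
    public static Func<IReadOnlyDictionary<string, object>, bool?> Wrap(
        Func<IReadOnlyDictionary<string, object>, bool> condition)
    {
        return row =>
        {
            try
            {
                return condition(row);
            }
            catch (OutOfMemoryException)
            {
                throw; // the rare case the cache cannot absorb; let the job fail
            }
            catch (Exception)
            {
                return null; // marks this condition as unusable from the cache
            }
        };
    }
}
```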
Concerns
There are two potential concerns / risks with this approach:
DRI cost of maintaining as many as several dozen caches. Given that the existing Hypercube cache has been far more reliable than the unified cache, I don't believe the DRI cost will be high.
Token usage and storage space for the many caches. This is a much bigger risk. For profiles which don't run any more frequently than once / day, this approach will certainly not help. For profiles which do run frequently, this proposal shouldn't use any more Cosmos tokens, but it will likely use increased storage space. The STC-A team has committed to making sure A&E has access to at least 1.5 PB ≈ (100 caches) × (30 days / cache) × (500 GB / cache / day) storage capacity.
Supported Jobs
When this proposal has been implemented, Spark will be able to support all scorecard jobs satisfying the following conditions:
- The analysis type is A/B, retro-AA, or seedfinder.
- The analysis profile is in a whitelist.
- The job does not use custom (user-injected) segments.
- If the job is filter/trigger, the condition satisfies the conditions in the section on "filter / trigger cache" above.
These jobs could be run either "first-come-first-serve" for safety, or "Spark as primary" for greater token savings. The underlying caches would, like the existing Hypercube cache, only contain primitive types. If metrics used UDOs, the caches would contain the values computed from those UDOs before aggregation. There is no limitation on the VC.
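To make the eligibility rules above concrete, here is a hedged sketch of the check as a single predicate; ScorecardJob and its fields are illustrative stand-ins, not actual pipeline types:

```csharp
using System;
using System.Collections.Generic;

enum AnalysisType { AB, RetroAA, Seedfinder, Other }

sealed class ScorecardJob
{
    public AnalysisType AnalysisType;
    public string ProfileName;
    public bool UsesCustomSegments;
    public bool IsFilterTrigger;
    public bool ConditionIsCacheable; // per the "filter / trigger cache" rules above
}

static class TurnkeyEligibility
{
    // Whitelist of analysis profiles enabled for the turnkey cache.
    static readonly HashSet<string> Whitelist =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { /* enabled profiles */ };

    public static bool IsSupported(ScorecardJob job) =>
        (job.AnalysisType == AnalysisType.AB ||
         job.AnalysisType == AnalysisType.RetroAA ||
         job.AnalysisType == AnalysisType.Seedfinder)
        && Whitelist.Contains(job.ProfileName)
        && !job.UsesCustomSegments
        && (!job.IsFilterTrigger || job.ConditionIsCacheable);
}
```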
Note that the actual design proposal is not Bing-specific. For other partners having profiles which are heavily used, the same benefits may apply. However, this proposal does not make any commitments about applicability to non-Bing partners, unless they (e.g. EXO / OXO) are willing to invest a non-zero amount of developer resources.
Savings
Here is a concrete breakdown of the potential savings. Note that the word "savings" here means savings from the searchDM VC. There is still a compute cost on the MT cluster, but that does not count against the compute A&E carries out on the searchDM VC.
The number for turnkey cache is computed as follows. There are 7.4k tokens being used for non-filter/trigger, non-"Bing web/mobile" jobs. Of those jobs, nearly 80% are from profiles running over 60 jobs / month (hence at least 50% token savings). That translates to at least 2.8k tokens saved by the turnkey cache alone, excluding any savings attributed to the filter/trigger cache.
Category | Savings (token usage) |
---|---|
Retro-AA | 1.5k |
Shadow Mode discipline | 2.7k |
Spark as primary (not FCFS) | 2.6k |
Turnkey cache | 2.8k |
Filter/trigger cache | 4k |
Total | 9.6–13.6k |
Milestones
This section provides a concrete sequence of milestones, which could also be called checkpoints or deliverables, along with an estimate of the difficulty of achieving each milestone. The confidence levels are relative, i.e. milestones marked as "Low Confidence" are likely achievable, but have the most risk compared to other milestones.
Milestones 1–4 all lie in Microsoft's Vibranium semester (July–December 2019), but all milestones after 4 lie in 2020 and beyond.
[High 2019.07.01–2019.08.09] "Manual proof-of-concept". Prototype the cache script generation and job submission components of the turnkey cache. Validate that logic via manual integration with the Hypercube engine.
- Both components (code generation and job submission) may use stubbed versions of the other. I.e., the job submission developers may use a hard-coded script, and the cache script developers may manually submit the generated jobs.
- Code to generate running Scope scripts (even including enrichers) and a runnable metric set JAR is already complete, so Mangrove work will primarily be around the "splitting" logic.
- #10860
- #10963
[Medium 2019.08.12–2019.09.20] "Automated proof-of-concept". Bing and one canary customer can each get one scorecard generated from a fully automated turnkey cache pipeline. We can also present the customers with a more precise estimate of the token savings a fully ramped-up turnkey cache will provide.
- This involves integration of Mangrove (to provide code generation), the Fast Pipelines team (to submit the generated Scope scripts), and the Hypercube team (to submit the generated metric set JARs).
- The "estimate of token savings" may be an analyst-written presentation. It does not need to be automated.
- #11412
- #11413
[High 2019.09.23–2019.11.01] "Production pipeline". Bing and a canary customer each have a single new profile SLA'd which is run through the turnkey cache.
- The requirement to SLA a profile is a forcing function to make sure our DRIs believe in the robustness and alerting quality of the turnkey cache pipeline.
- This stage will involve making sure there is sufficient telemetry for useful monitoring / alerting, and fixing any bugs that surface.
- #11415
[Medium 2019.11.04–2019.12.13] "Ramp up to N turnkey caches". Bing has enough turnkey caches to save at least 5k tokens.
- This period will involve slowly ramping up the turnkey cache to a higher number of profiles. Initially, they will be added one at a time, and even at the end in batches of no more than five, to avoid excessive churn on the scorecard computation pipeline.
- This work will also involve fine-tuning the telemetry and health monitoring (latency, failure rate) of the pipeline, as well as bugfixes and data parity checks against the non-turnkey-cache pipeline.
- #11418
"Filter / trigger cache". In the next semester, design and implement a cache for filter / trigger expressions if needed.
Throughout this work, A&E will share status updates and progress reports with Bing through the existing bi-weekly sync cadence.
Considered Alternatives
Metrics Vectorization
Ameya Chaudhari has completed a production-ready implementation of the ideas initiated in the DragonGlass project (notes). It was recently validated against a large number of analysis profiles, and shows good promise for some, but not all, Bing analysis profiles.
Search Merged Log on Spark
The STC-A team is considering a "Spark version of SML", which would have a schema similar to the xSLAPI view. Is there a possibility of running metrics computation for Bing directly off the Spark SML? Sadly, not easily. The object model that nearly all SLAPI metrics depend on relies on complicated .NET Framework user-defined objects which capture deserialized JSON event and page data. Even if someone developed a first-class "C# : Spark connector" (productionized Mobius), it would almost certainly only support .NET Standard or .NET Core, meaning a complete rewrite of the object model and of all reporting and analytic code (not just metric definitions for experimentation) that depends on it. This would be a truly massive undertaking, and it is not realistic for A&E to wait the multiple years it would take to achieve this.