
Blitz ↦ Azure Exp

Owner: Daniel Miller
Approvers: Kathy Guo, Vinoth Ganesh
Participants: Craig Boucher, Guangnan Shi, Ludan Zhang, Micky Liu

The Azure Exp pipeline leverages Azure Databricks (Apache Spark) for scorecard compute, and is now serving three initial customers: Ibiza (the Azure portal), ACOM (azure.com), and Azure Playfab, with the expectation that many more are on the way. As the total number of customers, the number of analyses each customer runs, and the complexity of each analysis all increase, we need a way to balance the timely shipment of new features with pipeline safety and trustworthiness.

This RFC (Request For Comment) focuses on the adoption of new versions of Blitz (the command-line wrapper for Mangrove) in the Azure Exp pipeline. It describes the competing set of requirements the deployment process will need to satisfy, proposes a process for appropriately satisfying all those requirements, and outlines some considered alternatives and potential next steps.

Throughout this document, it is assumed that from the perspective of the Azure Exp pipeline, there is a single, versioned Mangrove artifact, which we may take to be either the Docker container wrapping Blitz, or the .NET Framework Blitz binaries. In either case, that code-generation artifact is deployed to some kind of artifact store as part of a CI/CD process.

Summary

The current state of "deploying Mangrove" is not ideal. It takes ≈ 1 month for a change to the Mangrove codebase to enter production. At the same time, the STC-A team has been delivering a stream of new features well ahead of schedule. We have an urgent need for a more consistent, robust set of tests that changes to Mangrove / pipeline code must pass.

This document proposes a 1-sprint effort, the end state of which will be:

  • It takes < 1 day for a Mangrove change to enter production.
  • Data scientists, program managers, and software engineers are all able to follow documentation to add examples and/or use cases to the standard testing data.
  • Both the Mangrove and Azure Exp pipeline codebases are using the testing data sets in their build pipelines.

Requirements

Note: these requirements are not in any order of priority (the perceptive reader will notice they are alphabetical). All of them are critical to a healthy product, but it is impossible to satisfy them all perfectly at the same time.

Agility

It should be as fast and easy as reasonably possible for new features and bugfixes to be exposed to users via the Azure Exp pipeline. This is important for multiple reasons:

  1. It is important to be able to ship hotfixes (e.g., fix for a just-discovered critical security flaw) as quickly as possible.
  2. Users want to be able to use new features as soon as possible.
    • While this is the same as #1 from an engineering perspective, it is not the same from the customer perspective. Users are typically much more willing to wait for new features than they are to wait for hotfixes.
  3. Regardless of how quickly new features are adopted in the Azure Exp pipeline, development within the Mangrove repo will continue at a rapid pace. The longer the gap between deployments, the longer the list of changes that could have caused any new bugs will be.

Another component of agility is: fail as fast as possible. This means that someone contributing to the Mangrove codebase should be able to discover bugs in their code as quickly as possible: ideally by running quick tests on their own computer, but at the latest in a pull request or daily build.

Flexibility

This testing artifact will need to be usable for a range of testing scenarios.

  1. For testing Hypercube or Spark Sql, it will need to be usable in a Linux environment.
  2. For testing native Scope and Metrics Vectorization, it will need to be usable in a Windows environment.
  3. For testing Kusto and generated C#, both of which have a strict speed requirement, it should be performant.
  4. For testing the Hypercube Engine, it needs to be usable from the JVM ecosystem (Java / Scala).
  5. It needs to be easy for other teams (Azure Exp pipeline team, A&E data science team) to understand how the expected input / output data is generated, and add new columns or rows to the testing data.

Safety

Within reason, it should only be possible to ship new features that don't cause regressions in uptime or performance. That is, any new version of Blitz adopted in the pipeline should not have significantly worse performance on any jobs, nor should it begin failing on any jobs for which it did not fail in the past.

Trustworthiness

Trustworthiness may be thought of as "safety for new features". Since customers depend on the output of the Azure Exp scorecard compute pipeline for making correct ship / no-ship decisions, they need to be able to trust that any new features (e.g., JSONPath extraction, filter / trigger conditions, custom segments) are implemented correctly. This, almost by definition, cannot be caught by testing against the customer data and metric sets which are currently in production. There has to be a way of sanity-checking the implementation of new functionality before that implementation is shipped.

Proposal

The Mangrove team will expose a versioned series of artifacts serialized using protocol buffers, along with a .NET client for the protobuf contract. The Mangrove team will dogfood these artifacts in the "generate and run script" integration tests. The pipeline team will consume that artifact in their build and release tests (a sketch of such a test follows the list below) to verify that Mangrove-generated code:

  1. Compiles.
  2. Runs successfully.
  3. Generates the expected output.
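
As a rough illustration of how a pipeline-side test could exercise these three checks, consider the following sketch. TestingArtifact, CSharpCompiler, ScorecardRunner, and ScorecardDiff are hypothetical names rather than an existing API, and xUnit is assumed as the test framework.

[Fact]
public void GeneratedCSharp_ProducesExpectedOutput()
{
  // 1. Deserialize the versioned protobuf artifact (TestingArtifact is a hypothetical client type).
  var artifact = TestingArtifact.Load("Mangrove.Testing.Data.3.pb");

  // 2. Verify that the Mangrove-generated C# compiles.
  var assembly = CSharpCompiler.Compile(artifact.GeneratedCode.CSharp);

  // 3. Run the compiled code against the sample input and compare with the expected output.
  var actual = ScorecardRunner.Execute(assembly, artifact.InputData);
  var diff = ScorecardDiff.Compare(actual, artifact.ExpectedOutput);
  Assert.True(diff.IsEmpty, diff.ToString());
}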

The artifacts will be deployed to Azure blob storage, versioned using the scheme described below. Only the Mangrove build will be given write access to the underlying blob instance; every consumer will be given only the Storage Blob Data Reader role.
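
For illustration, a consumer holding only the Storage Blob Data Reader role could fetch an artifact roughly as follows. This is a minimal sketch using the Azure.Storage.Blobs and Azure.Identity SDKs; the storage account, container, and blob names are placeholders, not the real deployment targets.

using System;
using Azure.Identity;
using Azure.Storage.Blobs;

// Placeholder account / container / blob names -- illustrative only.
var blobUri = new Uri("https://mangroveartifacts.blob.core.windows.net/testing-data/Mangrove.Testing.Data.3.pb");

// Read-only access (Storage Blob Data Reader) is sufficient for this call; no write access is needed.
var client = new BlobClient(blobUri, new DefaultAzureCredential());
client.DownloadTo("Mangrove.Testing.Data.3.pb");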

Contents

The protobuf artifact will contain an array of "mini-artifacts" (e.g., "simple testing data", "customer-representative data", "fuzz data"), each of which will contain the following items (a sketch of the corresponding contract classes follows this list):

  • Sample input data, in CSV, JSON, Parquet, TSV, and Cosmos view formats.
    • The Cosmos view will leverage the VALUES construct demonstrated in VALUES.script.
  • Expected output data, in both xCard-compatible ("long") and Sql-style ("wide") TSV formats.
  • Mapping compute fabric ↦ Blitz configuration file.
  • Hypercube configuration ("job JSON").
  • Compatible copy of the Hypercube Engine JAR.
  • Serialized MetricsPlan (.mp file).
  • Generated code in all formats (Kusto, Spark Sql, Scope, Metrics Vectorization, Java / Scala).
  • Minimum and maximum compatible versions of Blitz.
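
To make the list above concrete, the contract classes exposed by the .NET client might look roughly like the sketch below. The class and property names are illustrative assumptions, not the actual protobuf schema.

using System.Collections.Generic;

// Hypothetical shape of one "mini-artifact" -- illustrative only.
public class MiniArtifact
{
  public string Name { get; set; }                                // "simple", "fake customer", "fuzz / extreme"
  public IDictionary<string, byte[]> InputData { get; set; }      // keyed by format: csv, json, parquet, tsv, cosmos-view
  public IDictionary<string, byte[]> ExpectedOutput { get; set; } // keyed by shape: "long" (xCard) or "wide" (Sql-style)
  public IDictionary<string, string> BlitzConfigByFabric { get; set; }
  public string HypercubeJobJson { get; set; }
  public byte[] HypercubeEngineJar { get; set; }
  public byte[] MetricsPlan { get; set; }                         // serialized .mp file
  public IDictionary<string, string> GeneratedCode { get; set; }  // kusto, spark-sql, scope, vectorization, scala
  public string MinBlitzVersion { get; set; }
  public string MaxBlitzVersion { get; set; }
}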

The .NET client will also include a standard set of code for comparing the "output" of a scorecard job against some expected output. It will also contain extension methods for easily reading / writing expected input / output from the supported file types (CSV, JSON, Parquet, TSV).

At least one of the .mp files will correspond to a real metric set, hosted in metric service v3 (MSv3).

Here is some more detail about the expected initial testing data sets:

  1. "simple". Two-level "minimum viable example" with only a couple dozen rows and very simple schema. It should be very fast to run sample jobs against this data. Passing tests against this data set is a minimum requirement for any new pipeline or code-generation features.
  2. "fake customer". This will be a three-level data set with a few hundred rows containing all the reasonable behavior we expect to appear in customer data. The "shape" of this data will be reviewed by Azure Exp customer representatives, and any pipeline failures will be converted into new rows or columns in this data set.
  3. "fuzz / extreme". This will be a four-level data set which stress-tests the code-generation scorecard pipeline. It should contain all the corner cases (e.g. four aggregation levels, non-ASCII strings, malformed JSON, maximum integer values) that we can think of.

New testing data sets may be added as the need arises. However, since adding a few new rows or columns is much cheaper from a testing perspective, that will be strongly encouraged in comparison to adding new data sets.

Versioning

Two versioning schemes will be used for each artifact. The version will be exposed via a blob-naming convention.

  1. Mangrove.Testing.Data.MIN.pb
  2. Mangrove.Testing.Data.MAX.MIN.pb

The MIN in #1 represents the minimum version of Blitz that supports all the functionality used in the metric set. Any Blitz version v >= MIN will work with the testing artifact, while any version v < MIN is expected to fail in some way (during code generation, while running the generated code, or by producing untrustworthy output).

The Mangrove release will overwrite #1 whenever a new set of testing data is generated which is compatible with the same minimum Blitz version.

The MAX in #2 represents the maximum version of Blitz which is compatible with the artifact, while MIN represents the minimum version. It is expected that most of the time MAX > MIN, but they will be equal when, for example, a version of Blitz supporting new functionality is released alongside the testing data that verifies it.
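
As an illustration, a consumer pinned to a specific Blitz version could use this naming convention to resolve a compatible artifact, roughly as sketched below. ParseMinMax is a hypothetical helper that extracts the (MIN, MAX) pair from a blob name; the blob names themselves would come from listing the container.

using System;
using System.Collections.Generic;
using System.Linq;

internal static class ArtifactResolver
{
  // Pick the newest artifact whose [MIN, MAX] range covers the pinned Blitz version.
  public static string PickArtifact(IEnumerable<string> blobNames, Version blitzVersion) =>
    blobNames
      .Select(name => (name, range: ParseMinMax(name)))   // ParseMinMax is hypothetical
      .Where(x => x.range.Min <= blitzVersion && blitzVersion <= x.range.Max)
      .OrderByDescending(x => x.range.Min)                // prefer the newest compatible testing data
      .Select(x => x.name)
      .FirstOrDefault();
}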

"Bug bash" metric set

The "fake customer" data set will correspond to the MangroveBugBash metric set, hosted in Vivace. Since the testing artifact will contain a serialized metrics plan (.mp file), its field .Metadata.Version will contain the precise version of the metric set compatible with the input and expected output data. Consuming codebases may use that version to reproduce the expected output when calling production systems (e.g., metric service "v3") to obtain the metrics plan.

⚠ If a backwards-incompatible change is made to the MangroveBugBash metric set (e.g., a metric expecting a new data source column) then the Azure Exp pipeline's pre-deployment tests will fail, because their underlying codebase is set up in a way that does not allow the version of MangroveBugBash to be configured.

There is no easy way around this issue, because breaking changes to the metric set have to be made in order to test genuinely new features. To mitigate the issue, part of this work will include a wiki page on how to update the pre-deployment tests when such a breaking change is made. This does not rule out the possibility of future work to allow the pre-deployment tests to pin a specific version.

Implementation

Note: this section is included because although the Mangrove team will be the primary producer of this testing data, other consuming teams (e.g., Azure Exp pipeline team, A&E data science team) will want to be able to easily add to and modify this testing data. So the underlying implementation needs to be flexible and easy-to-use.

This will be implemented via a new project, Mangrove.Testing.Data, under the closed-source part of the repository, which Integration.Test will take a dependency on. It will follow the Grpc C# example code to generate the "client" code for (de)serializing .pb files from the underlying (generated) C# classes during the build.
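
Assuming the protobuf compiler emits a message class (called TestingData below purely for illustration), the build step can write and read the .pb files with the standard Google.Protobuf calls:

using System.IO;
using Google.Protobuf;

// testingData is an instance of the (generated) TestingData message class.
// Serialize it to the versioned .pb file during the build...
using (var output = File.Create("Mangrove.Testing.Data.3.pb"))
{
  testingData.WriteTo(output);
}

// ...and deserialize it again in a consuming test.
using (var input = File.OpenRead("Mangrove.Testing.Data.3.pb"))
{
  var artifact = TestingData.Parser.ParseFrom(input);
}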

It will run a function which deterministically generates in-memory sample data of the required "shape". This will be achieved by initializing a single Random instance with a checked-in seed. That instance will be leveraged in a "helper" base-class, allowing users to write code like the following (using the Bogus library) to generate data with values in the appropriate ranges.

class Simple
{
  public string Market { get; set; }
  public int Revenue { get; set; }
  public string UserId { get; set; }
  public long Timestamp { get; set; }
}
class SimpleBuilder : DataBuilder<Simple, long>
{
  protected override int Count => 30;
  protected override long OrderBy(Simple s) => s.Timestamp;
  protected override void Rules(Faker f, Simple s)
  {
    // Keep timestamps inside a small, fixed date range so the generated data is stable.
    s.Timestamp = new DateTimeOffset(f.Date.Between(new DateTime(2019, 12, 23), new DateTime(2019, 12, 25))).ToUnixTimeSeconds();
    // LogNormal, Inflate, and SelectUsing are assumed helper extensions, not built-in Bogus APIs.
    s.Revenue = (int)f.LogNormal(0, 20);
    s.UserId = f.Inflate(1, factor: 10);
    s.Market = Markets.SelectUsing(s.UserId);
  }
}
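
The DataBuilder<T, TKey> base class referenced above is not spelled out in this document; a minimal sketch of what it could look like, assuming Bogus and a checked-in seed constant, is:

using System;
using System.Collections.Generic;
using System.Linq;
using Bogus;

// Hypothetical sketch of the "helper" base class -- not the actual implementation.
public abstract class DataBuilder<T, TKey> where T : class, new()
{
  private const int CheckedInSeed = 20191223;   // checked-in seed => deterministic, reproducible data

  protected abstract int Count { get; }
  protected abstract TKey OrderBy(T t);
  protected abstract void Rules(Faker f, T t);

  public IReadOnlyList<T> Build()
  {
    // Bogus's global seed makes every build generate the same rows.
    Randomizer.Seed = new Random(CheckedInSeed);
    var faker = new Faker();

    var rows = new List<T>();
    for (var i = 0; i < Count; i++)
    {
      var row = new T();
      Rules(faker, row);
      rows.Add(row);
    }
    return rows.OrderBy(OrderBy).ToList();
  }
}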

It will then use System.Text.Json APIs to serialize the generated data to JSON, CsvHelper to generate CSV / TSV files, and Parquet.Net to generate Parquet files.
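
For illustration, the JSON and TSV paths might look roughly like the sketch below (SimpleBuilder refers to the example above; the Parquet path would follow the same pattern with Parquet.Net and is omitted here):

using System.Globalization;
using System.IO;
using System.Text.Json;
using CsvHelper;
using CsvHelper.Configuration;

var rows = new SimpleBuilder().Build();

// JSON via System.Text.Json.
File.WriteAllText("simple.json", JsonSerializer.Serialize(rows));

// TSV via CsvHelper (CSV uses the same code with the default "," delimiter).
var tsvConfig = new CsvConfiguration(CultureInfo.InvariantCulture) { Delimiter = "\t" };
using (var writer = new StreamWriter("simple.tsv"))
using (var csv = new CsvWriter(writer, tsvConfig))
{
  csv.WriteRecords(rows);
}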

The expected output will be generated using the C# emitter as the reference implementation. This will reduce the amount of boilerplate code needed (which would have to include yet-another-implementation of the "outer CI" method for estimating percentile variance, described in RFC #11). It will replace the initial Python scripts that Amin Saied wrote, and entirely remove the need for checked-in data.

Considered alternatives

This section outlines some of the considered alternatives to the proposed plan, along with some reasons for why they were not chosen.

Blob storage

Rather than the overhead of wrapping the binaries in a .NET Standard library, why not just deploy a zip file containing them all directly to Azure blob storage? The main reason: every consuming codebase would have to re-implement the set of file-path conventions distinguishing, e.g., between "simple" and "customer-representative" input data, output data, etc. This would lead to duplicated, error-prone code.

Customer data

Rather than "fake" customer data, why not include a checked-in copy of anonymized Ibiza and/or ACOM data? For a couple reasons:

  1. For some customers (Ibiza) there are strict compliance requirements that forbid the copying of any data outside the Azure Managed Environment.
  2. For customers which might allow this (ACOM) it would be a very heavy-weight process, requiring legal signoff every time a new copy of the data is needed.

Docker container

Publish a Docker image containing these testing artifacts to Azure Container Registry. On the positive side, all our testing logic already has infrastructure for interacting with ACR. However, this approach has multiple problems.

  1. It is not cross-platform. Windows containers cannot be hosted on Linux, and since Linux Containers on Windows (LCOW) is still in preview, it is not supported in Azure DevOps build agents. This was pointed out in RFC #8, and motivated the shift towards a dual-OS build.
    • Also see: Run Linux containers in an Azure DevOps Windows hosted build agent.
  2. It will be hard to reproduce the build locally in the Mangrove repository. For example, reproducing the Azure Exp pipeline build locally requires manually crafted PowerShell scripts.
  3. It's too heavy-weight. Spinning up / down a Docker container just to run compiled C# code in memory is way slower than it needs to be.
  4. Within the Mangrove repository, this data is used both for relatively heavy-weight testing (e.g., run Spark Sql via a single-node Spark cluster running in Docker) and also for very lightweight testing (compile + run C# in-memory). To foster developer agility, the lightweight scenarios should not have to pay the cost of pulling and spinning up a container.
  5. This approach suffers from the same reliance on consumers remembering and re-implementing path conventions.

Latest version

Why not publish a blob Mangrove.Testing.Data.latest.pb? That way consuming teams wouldn't have to make code or configuration changes to pick up new versions of the testing artifact. It is actually important that consuming teams make a code change to pick up new versions manually, because for many of them, the version of Blitz (Mangrove) they take a dependency on is hard-coded. New testing artifacts will often test functionality that isn't supported in old versions of Blitz. If consumers picked those up automatically, their tests would start failing, because they would still be using an old version of Blitz.

NuGet package

Why not share the testing artifact as a NuGet package? This would remove the overhead of Azure blob storage and protobuf. There are two ways of including binaries in a NuGet package, and neither of them is satisfactory:

  1. A NuGet package is a .zip file "under the hood", so just place the files there. This has the same drawbacks as "zip ↦ blob storage".
  2. A NuGet package where the files are stored as embedded resources in the underlying binary, along with helper methods for resolving those resources. This is not consumable from Java or Scala.

Shadow mode

There is a long-range project, currently spearheaded by Kathy Guo and Ulf Knoblich, to implement a "smarter shadow mode" which would run both old and new versions of Mangrove's code-generation functionality in parallel and automatically "diff" the output. For more details, see the initial design doc, Validation for compute engine changes. That document contains three proposed "stages" of validation for changes:

  1. Unit testing.
    • The Mangrove codebase already has a very comprehensive set of unit tests (over 2.5k when this RFC was written), which have excellent code coverage and are actively contributed to by all members of the v-team.
  2. Integration / E2E tests.
    • This proposal fleshes out stage #2 in the document above. In addition, it allows the testing infrastructure to be shared across multiple codebases.
  3. Shadow mode.
    • The proposal in this document does not compete with shadow mode. Indeed, in the long run, risky changes could be automatically put through shadow mode after passing the build and release validations which leverage the Mangrove testing data. However, this proposal accomplishes two things that shadow mode cannot.
      1. Agility: developers can get fast (within a couple minutes) feedback on their work, without heavy-duty deployment or multi-day wait times.
      2. New features: there is no way to test compute behavior on new features by comparing with the existing set of jobs.