Scope Performance Investigation

Owner	Approvers	Participants
Thomas Portet	Daniel Miller, Sasha Patotski, Ulf Knoblich	Daniel Miller, Sasha Patotski, Ulf Knoblich

This document describes the various Scope performance tests that were run to determine which areas of Mangrove could be worth optimizing.

TLDR:

could not detect a slow-down due to unused columns;

could detect a slow-down due to inlining when the inlined expression is a heavy C# function.

Workflow

Each Scope script tests one basic assumption (for instance, whether inlining slows things down) on diverse types of expressions. External parameters control the type of expression being used (basic, complicated, function, aggregation), and whether the script is written “a la Mangrove” (a.k.a. Treatment, T) or “normally” (a.k.a. Control, C), for lack of a better word.

For each script, first a single structured stream is read (a full day of UnifiedCache for Bing on desktop), then some operations are performed, and an output is written. Each script is run n = 10 times, alternating between T and C. We then do a z-test to assess whether the observed difference in total PN hours is statistically significant or not.

Assumptions being tested

We are going to test 3 assumptions: 2 about inlining, 1 about unused columns. Here below are the specifics.

Inlining_01

This script (located here) will help determine whether:

RS = SELECT
	<Expression> AS a,
	<Expression> AS b,
	<Expression> AS c,
	<Expression> AS d,
	...
FROM Data;
OUTPUT RS;

is slower than:

RS = SELECT
	<Expression> AS a,
FROM Data;
RS2 = SELECT
	a AS a,
	a AS b,
	a AS c,
	a AS d,
	...
FROM RS;
OUTPUT RS2;

In the first snippet “a la Mangrove”, <Expression> is repeated many times.

Inlining_02

This script (located here) will help determine whether:

RS = SELECT
	<Expression> + <Expression> + <Expression> + <Expression> + … AS a
FROM Data;
OUTPUT RS;

is slower than:

RS = SELECT
	<Expression> AS aux
FROM Data;
RS2 = SELECT
	aux + aux + aux + aux + … AS a
FROM RS;
OUTPUT RS2;

In the first snippet “a la Mangrove”, <Expression> is repeated many times.

UnusedColumns_01

This script (located here) will help determine whether:

RS = SELECT
	<Expression1> AS a,
	<Expression2> AS b,
	<Expression3> AS c,
	<Expression4> AS d,
	...
FROM Data;
RS2 = SELECT
	(a + 1) AS a
FROM RS;
OUTPUT RS2;

is slower than:

RS = SELECT
	<Expression1> AS a
FROM Data;
RS2 = SELECT
	(a + 1) AS a
FROM RS;
OUTPUT RS2;

The first snippet “a la Mangrove” contains unused expressions (all different from each other).

Test expressions

Each assumption is tested with 4 types of expressions: basic, complicated, function, aggregation. Here are the specific expressions.

basic

<Expression> = (AllClickCount + 1.0)

For testing the assumption on unused columns, we need different expressions, so instead of 1.0 we add 2.0, 3.0, 4.0….

complicated

<Expression> = ((((((((double)((AllClickCount))))) > 1E-37) ? (((((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 2))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 3))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 3)))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) + ((-2 * (((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 3)) / ((double)((AllClickCount)))) : 0.0))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0))))))) * ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 5))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 3))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 4)))) / ((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) + ((((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 3)) / ((double)((AllClickCount)))) : 0.0))) / ((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) * ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 8))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 4))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 4)))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))))) / ((double)((AllClickCount)))) : 0.0)))

This was adapted from a typical expression for variance calculation using Delta Method, from which functions like Abs or Pow and aggregates like SUM have been removed. For testing the assumption on unused columns, we need different expressions, so we replace AllClickCount in the above by AllClickCount * 2, AllClickCount * 3, AllClickCount * 4…

aggregation

<Expression> = SUM(AllClickCount)
GROUP BY UserId

For testing the assumption on unused columns, we need different expressions, so we also use the following aggregates: COUNT, AVG, ANY_VALUE, LAST, MAX, MIN, STDEV, and VAR. We apply them to AllClickCount, Metrics_AdClickCount, and AllClickCount * Metrics_AdClickCount.

function

<Expression> = FindPrimeNumber(dummy1)

where the function that returns (almost) the hundredth prime is defined as:

public static long FindPrimeNumber(int dummy1)
{
    int n = 100; // the n value
    int count=0;
    long a = 2;
    while(count<n)
    {
        long b = 2;
        int prime = 1;// to check if found a prime
        while(b * b <= a)
        {
            if(a % b == 0)
            {
                prime = 0;
                break;
            }
            b++;
        }
        if(prime > 0)
        {
            count++;
        }
        a++;
    }
    long res = --a;
    return (res + dummy1); // trick to make the function return something different
}

This function was taken from this stackoverflow post and was modified to take in a dummy argument, so it could be applied to various values of a column and return something different each time. Note that the correctness of the function does not matter here; what matters is the busyness of the function.

For testing the assumption on unused columns, we need different expressions, so we use a modified version taking in a second dummy parameter that we add to the result.

Results

Using a larger dataset (about 2-3 Tb) had the expected effects: the jobs are not dominated by initialization anymore, and last much longer (about 2-3 total PN hours, at least).

The p-values for the job duration statistical tests were high. The lowest p-value was 0.11, and no difference could be detected for basic, complicated and aggregation expressions.

When the function expression was a heavy calculation, a difference in job duration could be unambiguously measured, for the 2 Inlining assumptions. This is in agreement with the previous observations of job failures for these cases when using a Sleep() function. With the particular data and function expression for the current set of jobs, the observed slowdown was about 15x (i.e. jobs took about 45 total PN hours).

Another way to see the difference between the jobs for T and C is to look at the generated C# code. The code snippets below show the difference in generated code for the Inlining_01 assumption in the function expression case. One can see on the snippet just below (C) that the FindPrimeNumber() function is called once, then the result is stored in tmp_0 variable and reused to populate all columns.

public override IEnumerable<Row> Process(RowSet input, Row outputRow, string[] args)
{
  Row_8FAEED8CF84D7792 outRow = (Row_8FAEED8CF84D7792)outputRow;
  foreach(Row_AllClickCount_Integer row in input.Rows)
  {
    int col_AllClickCount = row.AllClickCount.Integer;
    bool bPicked = false;
    int exceptionIndex = 0;
    try
    {
      {
        bPicked = true;
        long tmp_0 = Helper.FindPrimeNumber(col_AllClickCount);
        exceptionIndex++;
        outRow.a.Set((tmp_0));
        exceptionIndex++;
        outRow.b.Set((tmp_0));
        exceptionIndex++;
        outRow.c.Set((tmp_0));
        exceptionIndex++;
        outRow.d.Set((tmp_0));
        exceptionIndex++;
        ...

In the following snippet for T, one can see that the function is evaluated multiple times, for populating each column.

public override IEnumerable<Row> Process(RowSet input, Row outputRow, string[] args)
{
  Row_8FAEED8CF84D7792 outRow = (Row_8FAEED8CF84D7792)outputRow;
  foreach (Row_AllClickCount_Integer row in input.Rows)
  {
    int exceptionIndex = 0;
    try
    {
      int col_AllClickCount = row.AllClickCount.Integer;

      outRow.a.Set((Helper.FindPrimeNumber(col_AllClickCount)));
      exceptionIndex++;
      outRow.b.Set((Helper.FindPrimeNumber(col_AllClickCount)));
      exceptionIndex++;
      outRow.c.Set((Helper.FindPrimeNumber(col_AllClickCount)));
      exceptionIndex++;
      outRow.d.Set((Helper.FindPrimeNumber(col_AllClickCount)));
      exceptionIndex++;
      ...

Conclusions

The above performance tests have shown that:

inlining does cause a performance hit since Scope does not “optimize away” the repeated expressions;
carrying unused columns does not affect performance.

Quantifying the performance impact in general is no easy task, since in practice it will depend on the amount of data, on the amount of computations, and on the nature of duplicated expressions. It is thus quite hard to predict the ROI of Mangrove changes to improve the performance of generated Scope scripts running in production.

All we can say with certainty is that removing unused columns does not seem worth the effort, and removing inlining should improve performance, although unclear to what extent.

Scope Performance Investigation

Contents

Workflow

Assumptions being tested

Inlining_01

Inlining_02

UnusedColumns_01

Test expressions

basic

complicated

aggregation

function

Results

Conclusions