Scope Performance Investigation
Owner | Approvers | Participants |
---|---|---|
Thomas Portet | Daniel Miller, Sasha Patotski, Ulf Knoblich | Daniel Miller, Sasha Patotski, Ulf Knoblich |
Contents
This document describes the various Scope performance tests that were run to determine which areas of Mangrove could be worth optimizing.
TLDR:- could not detect a slow-down due to unused columns;
- could detect a slow-down due to inlining when the inlined expression is a heavy C# function.
Workflow
Each Scope script tests one basic assumption (for instance, whether inlining slows things down) on diverse types of expressions. External parameters control the type of expression being used (basic, complicated, function, aggregation), and whether the script is written “a la Mangrove” (a.k.a. Treatment, T) or “normally” (a.k.a. Control, C), for lack of a better word.
For each script, first a single structured stream is read (a full day of UnifiedCache for Bing on desktop), then some operations are performed, and an output is written. Each script is run n = 10 times, alternating between T and C. We then do a z-test to assess whether the observed difference in total PN hours is statistically significant or not.
Assumptions being tested
We are going to test 3 assumptions: 2 about inlining, 1 about unused columns. Here below are the specifics.
Inlining_01
This script (located here) will help determine whether:
RS = SELECT
<Expression> AS a,
<Expression> AS b,
<Expression> AS c,
<Expression> AS d,
...
FROM Data;
OUTPUT RS;
is slower than:
RS = SELECT
<Expression> AS a,
FROM Data;
RS2 = SELECT
a AS a,
a AS b,
a AS c,
a AS d,
...
FROM RS;
OUTPUT RS2;
In the first snippet “a la Mangrove”, <Expression>
is repeated many times.
Inlining_02
This script (located here) will help determine whether:
RS = SELECT
<Expression> + <Expression> + <Expression> + <Expression> + … AS a
FROM Data;
OUTPUT RS;
is slower than:
RS = SELECT
<Expression> AS aux
FROM Data;
RS2 = SELECT
aux + aux + aux + aux + … AS a
FROM RS;
OUTPUT RS2;
In the first snippet “a la Mangrove”, <Expression>
is repeated many times.
UnusedColumns_01
This script (located here) will help determine whether:
RS = SELECT
<Expression1> AS a,
<Expression2> AS b,
<Expression3> AS c,
<Expression4> AS d,
...
FROM Data;
RS2 = SELECT
(a + 1) AS a
FROM RS;
OUTPUT RS2;
is slower than:
RS = SELECT
<Expression1> AS a
FROM Data;
RS2 = SELECT
(a + 1) AS a
FROM RS;
OUTPUT RS2;
The first snippet “a la Mangrove” contains unused expressions (all different from each other).
Test expressions
Each assumption is tested with 4 types of expressions: basic, complicated, function, aggregation. Here are the specific expressions.
basic
<Expression> = (AllClickCount + 1.0)
For testing the assumption on unused columns, we need different expressions, so instead of 1.0 we add 2.0, 3.0, 4.0….
complicated
<Expression> = ((((((((double)((AllClickCount))))) > 1E-37) ? (((((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 2))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 3))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 3)))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) + ((-2 * (((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 3)) / ((double)((AllClickCount)))) : 0.0))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0))))))) * ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 5))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 3))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 4)))) / ((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) + ((((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 3)) / ((double)((AllClickCount)))) : 0.0))) / ((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))) * ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 8))) - ((((((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0)) * ((AllClickCount + 4))) * (((((((double)((AllClickCount))))) > 1E-37) ? (1.0 / ((double)((AllClickCount)))) : 0.0))) * ((AllClickCount + 4)))) / ((((((((((double)((AllClickCount))))) > 1E-37) ? (((AllClickCount + 4)) / ((double)((AllClickCount)))) : 0.0)))))))) / ((double)((AllClickCount)))) : 0.0)))
This was adapted from a typical expression for variance calculation using Delta Method, from
which functions like Abs
or Pow
and aggregates like SUM
have been removed.
For testing the assumption on unused columns, we need different expressions, so we replace
AllClickCount
in the above by AllClickCount * 2
, AllClickCount * 3
, AllClickCount * 4
…
aggregation
<Expression> = SUM(AllClickCount)
GROUP BY UserId
For testing the assumption on unused columns, we need different expressions, so we also use the
following aggregates: COUNT
, AVG
, ANY_VALUE
, LAST
, MAX
, MIN
,
STDEV
, and VAR
. We apply them to AllClickCount
, Metrics_AdClickCount
, and AllClickCount * Metrics_AdClickCount
.
function
<Expression> = FindPrimeNumber(dummy1)
where the function that returns (almost) the hundredth prime is defined as:
public static long FindPrimeNumber(int dummy1)
{
int n = 100; // the n value
int count=0;
long a = 2;
while(count<n)
{
long b = 2;
int prime = 1;// to check if found a prime
while(b * b <= a)
{
if(a % b == 0)
{
prime = 0;
break;
}
b++;
}
if(prime > 0)
{
count++;
}
a++;
}
long res = --a;
return (res + dummy1); // trick to make the function return something different
}
This function was taken from this stackoverflow post and was modified to take in a dummy argument, so it could be applied to various values of a column and return something different each time. Note that the correctness of the function does not matter here; what matters is the busyness of the function.
For testing the assumption on unused columns, we need different expressions, so we use a modified version taking in a second dummy parameter that we add to the result.
Results
Using a larger dataset (about 2-3 Tb) had the expected effects: the jobs are not dominated by initialization anymore, and last much longer (about 2-3 total PN hours, at least).
The p-values for the job duration statistical tests were high. The lowest p-value was 0.11, and no difference could be detected for basic, complicated and aggregation expressions.
When the function expression was a heavy calculation, a difference in job duration could be unambiguously measured, for the 2 Inlining assumptions. This is in agreement with the previous observations of job failures for these cases when using a Sleep() function. With the particular data and function expression for the current set of jobs, the observed slowdown was about 15x (i.e. jobs took about 45 total PN hours).
Another way to see the difference between the jobs for T and C is to look at the generated C#
code. The code snippets below show the difference in generated code for the Inlining_01
assumption in the function expression case. One can see on the snippet just below (C) that the
FindPrimeNumber()
function is called once, then the result is stored in tmp_0
variable
and reused to populate all columns.
public override IEnumerable<Row> Process(RowSet input, Row outputRow, string[] args)
{
Row_8FAEED8CF84D7792 outRow = (Row_8FAEED8CF84D7792)outputRow;
foreach(Row_AllClickCount_Integer row in input.Rows)
{
int col_AllClickCount = row.AllClickCount.Integer;
bool bPicked = false;
int exceptionIndex = 0;
try
{
{
bPicked = true;
long tmp_0 = Helper.FindPrimeNumber(col_AllClickCount);
exceptionIndex++;
outRow.a.Set((tmp_0));
exceptionIndex++;
outRow.b.Set((tmp_0));
exceptionIndex++;
outRow.c.Set((tmp_0));
exceptionIndex++;
outRow.d.Set((tmp_0));
exceptionIndex++;
...
In the following snippet for T, one can see that the function is evaluated multiple times, for populating each column.
public override IEnumerable<Row> Process(RowSet input, Row outputRow, string[] args)
{
Row_8FAEED8CF84D7792 outRow = (Row_8FAEED8CF84D7792)outputRow;
foreach (Row_AllClickCount_Integer row in input.Rows)
{
int exceptionIndex = 0;
try
{
int col_AllClickCount = row.AllClickCount.Integer;
outRow.a.Set((Helper.FindPrimeNumber(col_AllClickCount)));
exceptionIndex++;
outRow.b.Set((Helper.FindPrimeNumber(col_AllClickCount)));
exceptionIndex++;
outRow.c.Set((Helper.FindPrimeNumber(col_AllClickCount)));
exceptionIndex++;
outRow.d.Set((Helper.FindPrimeNumber(col_AllClickCount)));
exceptionIndex++;
...
Conclusions
The above performance tests have shown that:
- inlining does cause a performance hit since Scope does not “optimize away” the repeated expressions;
- carrying unused columns does not affect performance.
Quantifying the performance impact in general is no easy task, since in practice it will depend on the amount of data, on the amount of computations, and on the nature of duplicated expressions. It is thus quite hard to predict the ROI of Mangrove changes to improve the performance of generated Scope scripts running in production.
All we can say with certainty is that removing unused columns does not seem worth the effort, and removing inlining should improve performance, although unclear to what extent.