Thursday, August 22, 2013

Outlining: a workaround for JITs and big functions

Just In Time (JIT) compilers, like JavaScript engines, receive source code and compile it at runtime. That means that end users can notice how long things take to compile, and avoiding noticeable compilation pauses is important. But JavaScript engines and other modern JITs perform optimizations whose complexity is worse than linear, for example SSA analysis and register allocation. That means that aside from total program size, function size is important as well: the time it takes to compile your program may be mostly affected by a few large functions that have lots of code and variables in them. If compilation is N^2 per function (where N is the number of AST nodes or variables or such), then having function sizes

  100, 100, 100, 100, 100, 100, 100, 100, 100, 100

is much better than

  1000

even though the total amount of code is the same (10 functions of size 100 take 10 * 100^2 or 100,000 units of time, while a single function of size 1000 takes 1,000,000).

Now, very large functions are generally rare, since most people wouldn't write them by hand. But automatic code generators can create them, as well as compilers that perform optimizations like inlining (e.g., closure compiler, LLVM as used by emscripten and mandreel, etc.).

What do JavaScript engines do with very large functions? Generally speaking, since compiling them takes very long, JITs have just not fully optimized them and left them in the interpreter or baseline JIT. This avoids the long compilation pause, but leaves the code running slower. As more JS engines add background compilation, the compilation pause is not as noticeable, but it still means a significant amount of additional time that the code executes before it is fully optimized.

This problem is not unique to JavaScript engines, of course, it is a concern for any JIT that does complex optimizations, for example the JVM. However, JavaScript engines are particularly concerned with noticeable compilation pauses because they usually run on end users' machines (as opposed to remote servers), and they run arbitrary content off the web.

What can we do about this? It could be solved in one of two general ways, either in the JavaScript engine, or by preprocessing the code ahead of time.

In the JavaScript engine, as already mentioned before, background compilation makes it feasible to compile even big and slowly-compiling functions, since the pause is not directly noticeable. But this still means a long delay before the fully-optimized version runs, and the background thread consumes battery power and perhaps competes with other threads and processes. Also, even in JavaScript engines with background compilation, there are often limits still present, just to avoid the risk of a background thread running a ridiculously long time. Finally, a JavaScript engine might be able to compile separate functions in parallel but not parallelize inside a single function. So all of these are partial solutions at best.

Another JavaScript engine option is to not compile entire functions at a time. SpiderMonkey experimented with this using chunked compilation, basically parts of a function could be compiled separately. This was done successfully in JaegerMonkey, but it is my understanding that it turned out to be quite complex and bug prone, and deemed not worth doing in the newer IonMonkey.

(Of course the best JavaScript engine solution is to just make compilation faster. But huge amounts of work have already gone into that in all modern JavaScript engines, so it is not realistic to expect sudden large improvements there.)

That leaves the other option, of preprocessing the code ahead of time. One way in emscripten for example is to tell LLVM to not perform inlining on some part of your code (if inlining is the cause of a big function), but in general there is not always a simple solution.

Another preprocessing option is to work at the JavaScript level and break up huge functions into smaller pieces. I originally thought this would be too complicated and/or generate too much overhead, but I was eventually convinced to go down this route. And a good thing too ;) because even without much tuning, I think the results are promising:


The X axis is the outlining limit, or how large a function size is acceptable before we start to break it up into smaller pieces, by taking chunks of it and moving them outside into a new function - sort of the opposite of inlining, hence outlining. The number itself is the amount of AST nodes. 0 is a special value meaning we do not do anything, then as you move to the right we do more and more breaking up, as we aim for smaller maximal function sizes.

The Y axis is time (in milliseconds), and each measurement has both compilation and run time. Compilation measures the total time needed to compile all the code by the JavaScript engine, and runtime is how long the benchmark - a SQLite testcase - takes to run. (To measure compilation time, I looked at startup time in Firefox, which is convenient because Firefox does ahead of time (AOT) compilation of the asm.js subset of JavaScript, giving an easy way to measure how long it takes to fully optimize all the code in the program. In non-AOT compilers there would be less startup time of course, but as explained before the costs of long compilation time would still be felt: either functions would not be optimized at all, or compile slowly either on the main thread or a background thread, etc. - those effects are less convenient to measure, but still there.)

The benchmark starts at 5 seconds to compile and 5 seconds to run (running creates a SQLite database in memory, generates a few thousand entries, then does some selects etc. on them). 5 seconds to compile all the code is obviously a long time, and most of it is from a single function, yy_reduce. An outlining limit of 160,000 is enough to break that function up into 2 pieces but do nothing else, and the result is over 4x faster compilation (from over 5 seconds to just over 1 second) with a runtime slowdown of only 7%. Outlining more aggressively to 80,000 speeds up compilation by 6x, but the runtime slowdown increases to 35%. At the far right of the graph the overhead becomes very painful (runtime execution is over 4x slower), so the best compromise is probably somewhere in the middle to left of the graph.

Overall there is definitely a tradeoff here, but luckily even just breaking up the very largest function can be very helpful. Polynomial compilation times can be disproportionately influenced by the very largest function, so that makes sense.

What is the additional overhead that affects runtime as we go to the right on the graph? To understand what is going on, let's detail what the outlining optimization does. As already mention, the idea is to take code in the function and move it outside. To do that, it performs three stacked optimizations:

1. Aggressive variable elimination. Emscripten normally works hard to eliminate unneeded variables, for example

var x = y*2;
f(x);

would become

f(y*2);

But there is a potential tradeoff here. While in this example we replace a variable with a small expression, if the expression is large and shows up multiple times, it can increase code size while reducing variable count. However, for outlining, we do want to remove variables at (almost) all costs, because the more local variables we have, the more variables might happen to be shared between code staying in the function and code being moved out. That means we need to ship those variables between the two functions, basically by spilling them to the stack and then reading them from there, which adds overhead (see later for more details). The first outliner pass therefore finds practically all variables that are safe to remove, and removes them, even if code size increases.

2. Code flattening. It is straightforward to split up code that is in a straight line,

if (a) f(a);
if (b) f(b);
if (c) f(c);
if (d) f(d);

Here we can pick any line at which to split. But things are harder if the code is nested

if (a) {
  f(a);
} else {
  if (b) {
    f(b);
  } else {
    if (c) {
      f(c);
    } else {
      if (d) f(d);
    }
  }
}

The second pass "flattens" code by breaking up if-else chains, for example in this case it would generate something like

var m = 1;
if (a) {
  m = 0;
  f(a);
}
if (m && b) { 
  m = 0;
  f(b);
}
if (m && c) {
  m = 0;
  f(c);
}
if (m && d) {
  f(d);
}

The ifs are no longer nested, and form a straightline list of statements which is easier to process. However, we have added overhead here, both in terms of code size and in additional work that is done.

3. Outline code. Now that we minimized the number of local variables and made the code sufficiently flat to be easy to process, we recursively search the AST for straightline lists of statements where we can outline a particular range of them. When we find a suitable one (for details of the search, see the code), we outline it: create a new function, move the code into it, and add spill/unspill code on both sides to pass over the local variables (only the ones that are necessary, of course). A further issue to handle is to "forward" control flow changes, for example if we outlined a break then we basically forward that to the calling function and so forth. Then in the original function the code has been replaced with something like this:

HEAP[x] = 0; // clear control var
HEAP[x+4] = a; // spill 2 locals
HEAP[x+8] = b;
outlinedCode(); // do the call
b = HEAP[x+8];  // read 2 locals
c = HEAP[x+12];
if (HEAP[x] == 1) {
  continue; // forwarded continue
}

We spill some variables and read back some afterwards (note that they are not necessarily the same ones - here the outlined code does not modify a, so no need to read it). We also use a location in HEAP as a control flow helper variable, to tell us if the outlined code wants us to do, in this case, a continue.

Overall, while there is additional size and overhead here, with this approach we can replace a very large amount of code with just a function call to it plus some bookkeeping. As the graph above shows, in some benchmarks this can decrease compilation time very significantly.

To try this optimization in an emscripten-using project, just add   -s OUTLINING_LIMIT=N   to your compilation flags (during conversion of bitcode to JS), where N is a number.