Monday, November 12, 2012

Emscripten Compiler Upgrades

Several major architecture improvements have landed in the last few weeks in Emscripten, here is an overview.

New Eliminator

The eliminator optimization phase was originally written by Max Shawabkeh. It basically removes unneeded variables, so

  var x = f(a);
  var y = x + g(b);

can be optimized into

  var y = f(a) + g(b);

This can greatly reduce the size of the code as well as improve performance, and was fundamental for our approach of relying on a combination of the eliminator + the closure compiler to go from LLVM's SSA representation into a register-like format: The eliminator removes large amounts of unneeded variables, and the closure compiler then reduces the number of variables further by reusing the ones that remain.

The eliminator could be slow on large functions, however, because it calculated the transitive closure of dependencies between all the variables, an expensive calculation. It also missed out on some opportunities to optimize because of some simplifying assumptions it made in its design. A final downside was it integrated poorly with the rest of our optimizations (in part due to being written in a different language, CoffeeScript).

I rewrote the eliminator entirely from scratch, in order to do a more precise analysis of which variables can be removed. I also simplified the problem slightly by only eliminating variables that have a single use - this makes it far faster, and I don't see any downside in the quality of the generated code (in fact it avoids some possible bad cases, although it took a long time to figure out what was going on in them). The new version is faster in general and far faster on the rare bad cases (100x even), and generates better-performing code to boot.

Parallel Optimizer

With all of the emscripten optimization passes now in JavaScript, I then worked on parallelizing that. We can't run one pass before the previous one finishes, but within each pass, we can work separately in each function - optimizing each function is independent of the other (we used to have some global optimization passes, but their benefit was very limited).

The parallelization is done using Python's process pool. It splits up the JavaScript into chunks and runs those in multiple JavaScript optimizer instances. The speedup can be close to linear in the number of cores. On BananaBread, the optimization passes become almost twice as fast on my dual-core laptop.

Parallel Compiler

With the optimizer parallel, there remain two phases that can be slow: The compiler (the initial code conversion from LLVM to JavaScript) and Closure Compiler. We can't do much for Closure, but in the long term it will become less and less important: we are implementing specific optimizations for the things we used to rely on it for, which leaves just minifying the code.

For the LLVM to JS compiler, I made the emscripten compiler parallel as well: It splits up the LLVM IR into 3 main parts: type data, function data, and globals. The function data part is unsurprisingly by far the largest in all cases I checked (95% or so), and it can in principle be parallelized - so I did that. Like in the optimizer, we use a Python process pool which feeds chunks of function data to multiple JavaScript compiler instances. There is some overhead due to chunking, and the type data and globals phases are not parallelized, but overall this can be a close to linear speedup.

Overall, pretty much all the speed-intensive parts of code generation and optimization in Emscripten are now parallel, and will utilize all your CPU cores automatically. That means that if you experience slow compilation speeds, you can just throw more hardware at it. I hear they are selling 16-core CPUs now ;)

New Relooper

The relooper is an optimization performed (unlike most optimizations) during initial code generation. It takes basic blocks of code and branching information between them and generates high-level JS control flow structures like loops and ifs, which makes the code run far faster. The original relooper algorithm was developed together with the implementation I wrote in the compiler. Eventually some aspects of how it works were found to be suboptimal, so specific optimizations were added to the JS optimizer ('hoistMultiples', 'loopOptimizer'), overall giving us pretty good generated code.

Meanwhile I wrote a new version of the relooper in C++. There were 2 reasons for that choice of language: First, because other projects needed something like it, and C++ was a better language for them, and second, because we had plans to evaluate writing an LLVM backend for emscripten that would also need to reloop in C++ (note: we decided against the LLVM backend in the end). The new version avoids the limitations of the first, and generates better code. In particular it has no need for additional optimizations done after the fact. It also implements some additional tweaks that are missing in the first one, like node splitting in some cases and more precise removal of loop labels when they are not needed, etc. It's also a much cleaner codebase.

I brought that new version of the relooper into Emscripten by compiling it to JS and using it in the JS compiler. This makes compilation faster both because the new relooper is faster than the previous one (not surprising as often compiled code is faster than handwritten code), and because the additional later optimizations are no longer needed, for overall about a 20% speedup on compiling BananaBread. It also generates better code, for example it can avoid a lot of unneeded nesting that the previous relooper had (which caused problems for projects like jsmess).

Note that this update makes Emscripten a 'self-hosting compiler' in a sense: one of the major optimization passes must be compiled to JS from C++, using Emscripten itself. Since this is an optimization pass, there is no chicken-and-egg problem: We bootstrap the relooper by first compiling it without optimizations, which works because we don't need to reloop there. We then use that unoptimized build of the relooper (which reloops properly, but slowly since it itself is unoptimized) in Emscripten to compile the relooper once more, generating the final fully-optimized version of the relooper, or "relooped relooper" if you will.