Tuesday, May 14, 2013

The Elusive Universal Web Bytecode

It's often said that the web needs a bytecode. For example, the very first comment in a very recent article on video codecs on the web says
A proper standardized bytecode for browsers would (most likely) allow developers a broader range of languages to choose from as well as hiding the source code from the browser/viewer (if that's good or not is subjective of course).
 And other comments continue with
Just to throw a random idea out there: LLVM bytecode. That infrastructure already exists, and you get to use the ton of languages that already have a frontend for it (and more in the future, I'm sure).
[..]
I also despise javascript as a language and wish someone would hurry up replacing it with a bytecode so we can use decent languages again.
[..]
Put a proper bytecode engine in the browser instead, and those people that love javascript for some unknowable reason could still use it, and the rest of us that use serious languages could use them too.
[..]
Honestly, .Net/Mono would probably be the best bet. It's mature, there are tons of languages targeting it, and it runs pretty much everywhere already as fast as native code
Ignoring the nonproductive JS-hating comments, basically the point is that people want to use various languages on the web, and they want those languages to run fast. Bytecode VMs have been very popular since Java in the 90's, and they show that multiple languages can run in a single VM while maintaining good performance, so asking for a bytecode for the web seems to make sense at first glance.

But already in the quotes above we see the first problem: Some people want one bytecode, others want another, for various reasons. Some people just like the languages on one VM more than another. Some bytecode VMs are proprietary or patented or tightly controlled by a single corporation, and some people don't like some of those things. So we don't actually have a candidate for a single universal bytecode for the web. What we have is a hope for an ideal bytecode - and multiple potential candidates.

Perhaps though not all of the candidates are relevant? We need to pin down the criteria for determining what is a "web bytecode". The requirements as mentioned by those requesting it include
  • Support all the languages
  • Run code at high speed
To those we can add two additional requirements that are not mentioned in the above quotations, but are often heard:
  • Be a convenient compiler target
  • Have a compact format for transfer
In addition we must add the requirements that anything that runs on the web must fulfill,
  • Be standardized
  • Be platform-independent
  • Be secure
JavaScript can already do the last three (it's already on the web, so it has to). Can it do the first four? I would say yes:
  • Support all the languages: A huge list of languages can compile into JavaScript, and that includes major ones like C, C++, Java, C#, LLVM bytecode, and so forth. There are some rough edges - often porting an app requires changing some amount of code - but nothing that can't be improved on with more work, if the relevant communities focus on it. C++ compilers into JavaScript like Emscripten and Mandreel have years of work put into them and are fairly mature (for example see the Emscripten list of demos). GWT (for Java) has likewise been used in production for many years; the situation for C# is perhaps not quite as good, but improving, and even things like Qt can be compiled into JavaScript. For C#, Qt, etc., it really just depends on the relevant community being focused on the web as one of its targets: We know how to do this stuff, and we know it can work.
  • Run code at high speed: It turns out that C++ compiled to JavaScript can run at about half the speed of native code, which in some cases outperforms Java, and is expected to get better still. Those numbers are when using the asm.js subset of JavaScript, which basically structures the compiler output into something that is easier for a JS engine to optimize. It's still JavaScript, so it runs everywhere and has full backwards compatibility, but it can be run at near-native speed already today.
  • Be a convenient compiler target: First of all, the long list of languages from before shows that many people have successfully targeted JavaScript. That's the best proof that JavaScript is a practical compiler target. Also, there are many languages that compile into either C or LLVM bytecode, and we have more than one compiler capable of compiling those to the web, and one of them is open source, so all those languages have an easy path. Finally, while compiling into a "high-level" language like JavaScript is quite convenient, there are downsides, in particular the lack of support for low-level control flow primitives like goto; however, this is addressed by reusable open source libraries like the Relooper.
  • Have a compact format for transfer: It seems intuitive that a high-level language like JavaScript cannot be compact - it's human-readable, after all. But it turns out though that JS as a compiler target is already quite small, in fact comparable to native code when both are gzipped. Also, even in the largest and most challenging examples, like Unreal Engine 3, the time spent to parse JS into an AST does not need to be high. For example, in that demo it takes just 10 seconds on my machine to both parse and fully optimize the output of over 1 million lines of C++ (remember that much of that optimization time would need to be done no matter what format the code is in, because it has to be a portable one).
So arguably JavaScript is already very close to providing what a bytecode VM is supposed to offer, as listed in the 7 requirements above. And of course this is not the first time that has been said, see here for a previous discussion from November 2010. In the 2.5 years since that link, the case for that approach has gotten significantly stronger, for example, JavaScript's performance on compiled code has improved substantially, and compilers to JavaScript can compile very large C++ applications like Unreal Engine 3, both as mentioned before. At this point the main missing pieces are, first (as already mentioned) improving language support for ones not yet fully mature, and second, a few platform limitations that affect performance, notably lack of SIMD and threads with shared state.

Can JavaScript fill the gaps of SIMD and mutable-memory threads? Time will tell, and I think these things would take significant effort, but I believe it is clear that to standardize them would be orders of magnitude simpler and more realistic than to standardize a completely new bytecode. So a bytecode has no advantage there.

Some of the motivation for a new bytecode appears to come from an elegance standpoint: "JavaScript is hackish", "asm.js is a hack", and so forth, but a new from-scratch bytecode would be (presumably) a thing of perfection. That's an understandable sentiment, but technology has plenty of such things, witness the persistence of x86, C++, and so forth (some would add imperative programming to that list). It's not only true of technology but human civilization as well, for example no natural language has the elegance of Esperanto, and our currently-standing legal and political systems are far from what a from-scratch redesign would arrive at. But large long-standing institutions are easier to improve continuously rather than to completely replace. I think it's not surprising that that's true for the web as well.

(Note that I'm not saying we shouldn't try. We should. But we shouldn't stop trying at the same time to also improve the current situation in a gradual way. My point is that the latter is more likely to succeed.)

Elegance aside, could a from-scratch VM be better than JavaScript? In some ways of course it could, like any redesign from scratch of anything. But I'm not sure that it could fundamentally be better in substantial ways. The main problem is that we just don't know how to create a perfect "one bytecode to rule them all" that is
  • Fast - runs all languages at their maximal speed
  • Portable - runs on all CPUs and OSes
  • Safe - sandboxable so it cannot be used to get control of users' machines
The elusive perfect universal bytecode would need to do all three, but it seems to me that we can only pick two.

Why is this so, when supposedly the CLR and JVM show that the trifecta is possible? The fact is that they do not, if you really take "fast" to mean what I wrote above, which is "runs all languages at their maximal speed" - that's what I mean by "perfect" in the context of the last paragraph. For example, you can run JavaScript on the JVM, but it won't come close to the speed of a modern JS VM. (There are examples of promising work like SPUR, but that was done before the leaps in JS performance that came with CrankShaft, TypeInference, IonMonkey, DFG, etc.).

The basic problem is that to run a dynamic language at full speed, you need to do the things that JavaScript engines, LuaJIT, etc. do, which includes self-modifying code (architecture-specific PICs), or even things like entire interpreters in handwritten optimized assembly. Making those things portable and safe is quite hard - when you make them portable and safe, you make them more generic pretty much by definition. But CPUs have significant-enough differences that doing generic things can lead to slower performance.

The problems don't stop there. A single "bytecode to rule them all" must make some decisions as to its basic types. LuaJIT and several JavaScript VMs represent numbers using a form of NaNboxing, which uses invalid values in doubles to store other types of values. Deciding to NaNbox (and in what way) or not NaNbox is typically a design desicion for an entire VM. NaNboxing might be well and good for JS and Lua, but it might slow down other languages. Another example is how strings are implemented: IronPython, Python on .NET, ran into issues with how Python expects strings to work as opposed to .NET.

Yet another area where decisions must be made is garbage collection. Different languages have different patterns of usage, both determined by the language itself and the culture around the language. For example, the new garbage collector planned for LuaJIT 3.0, a complete redesign from scratch, is not going to be a copying GC, but in other VMs there are copying GCs. Another concern is finalization: Some languages allow hooking into object destruction, either before or after the object is GC'd, while others disallow such things entirely. A design decision on that matter has implications for performance. So it is doubtful that a single GC could be truly optimal for all languages, in the sense of being "perfect" and letting everything run at maximal speed.

So any VM must make decisions and tradeoffs about fundamental features. There is no obvious optimal solution that is right for everything. If there were, all VMs would look the same, but they very much do not. Even relatively similar VMs like the JVM and CLR (which are similar for obvious historic reasons) have fundamental differences.

Perhaps a single VM could include all the possible basic types - both "normal" doubles and ints, and NaNboxed doubles? Both Pascal-type strings and C-type strings? Both asynchronous and synchronous APIs for everything? Of course all these things are possible, but they make things much more complicated. If you really want to squeeze every last ounce of performance out of your VM, you should keep it simple - that's what LuaJIT does, and very well. Trying to support all the things will lead to compromises, which goes against the goal of a VM that "runs all languages at their maximal speed".

(Of course there is one way to support all the things at maximal speed: Use a native platform as your VM. x86 can run Java, LuaJIT and JS all at maximal speed almost by definition. It can even be sandboxed in various ways. But it has lost the third property of being platform-independent.)

Could we perhaps just add another VM like the CLR alongside JavaScript, and get the best of both worlds that way, instead of putting everything we need in one VM? That sounds like an interesting idea at first, but it has technical difficulties and downsides, is complex, and would likely regress existing performance.

Do we actually need "maximal speed"? How about just "reasonable speed"? Definitely, we can't hold out for some perfect VM that can do it all. In the last few paragraphs I've been talking about a "perfect" bytecode VM that can run everything at maximal speed. My point is that it's important to realize that there is no such VM. But, with some compromise we definitely can have a VM that runs many things at very high speeds. Examples of such VMs are the JVM, CLR, and as mentioned before JavaScript VMs as well, since they can run one very popular dynamic language at maximal speed, and they can run statically typed code compiled from C++ about as well or even better than some bytecode VMs (with the already-discussed caveats of SIMD and shared-mutable threads).

For that reason, switching from JavaScript to another VM would not be a strictly better solution in all respects, but instead just shift us to another compromise. For example, JavaScript itself would be slower on the CLR, but C# would be faster, and I'm not sure which of the two can run C++ faster, but my bet is that both can run it at about the same speed.

So I don't think there is much to gain, technically speaking, from considering a new bytecode for the web. The only clear advantage such an approach could give is perhaps a more elegant solution, if we started from scratch and designed a new solution with less baggage. That's an appealing idea, and in general elegance often leads to better results, but as argued earlier there would likely be no significant technical advantages to elegance in this particular case - so it would be elegance for elegance's sake.

I purposefully said we don't need a new bytecode in the last paragraph. We already have JavaScript, which I have claimed is quite close to providing all the advantages that a bytecode VM could. Note that this wasn't entirely expected - not any language can in a straightforward way be transformed into a more general target for other languages. It just so happens though that JavaScript did have just enough low-level support (bitwise operations being 32-bit, for example) to make it a practical C/C++/LLVM IR compiler target, which made it worth investing in projects like the Relooper that work around some of its other limitations. Combined with the already ongoing speed competition among JavaScript engines, the result is that we now have JavaScript VMs that can run multiple languages at high speed.

In summary, we already have what practically amounts to a bytecode VM in our browsers. Work is not complete, though: While we can port many languages very well right now, support for other languages is not quite there yet. If you like a language that is not yet supported on the web, and you want it to run on the web, please contribute to the relevant open source project working on doing that (or start one if there isn't one already). There is no silver bullet here - no other bytecode VM that if only we decided on it, we would have all the languages and libraries we want on the web, "for free" - there is work that needs to be done. But in recent years we have made huge amounts of progress in this area, both in infrastructure for compiling code to JavaScript and in improvements to JavaScript VMs themselves. Let's work together to finish that for all languages.

Thursday, February 21, 2013

Heap corruption checking in Emscripten

I've seen a few weird things when porting codebases, where it just doesn't behave properly. Sometimes this is a bug in the compiler, but it can also be that the source is not fully portable. It can be hard to figure out which it is, so compilers often have automatic tools to help with that sort of thing. Emscripten has several, like SAFE_HEAP mode which checks alignment errors, reading from beyond the heap, and lets you instrument each read and write manually. Recently I added another, CORRUPTION_CHECK which performs simple heap corruption checking.

This idea came to me during a flight and probably it isn't original in any way. Basically, each malloc allocates additional memory, with a buffer zone before and after the "real" allocation that the program sees. The buffer zones are filled with canary values, and later on those are checked to see if they were modified. If they were, the program is writing to memory it has no business writing to, and something has gone very wrong. Then you just increase the frequency of the checks, maybe even adding a few manual calls as you narrow things down, until you see exactly where the corruption happens. With large enough buffer zones, this approach has a good probability of catching writes to random places in memory, and an even better chance of catching typical bugs like allocating 1024 bytes and writing 1025 values. Yesterday I used this tool successfully on a large C++ codebase, the bug turned out to be an incorrect use of std::vector.

But what I really want to talk about here is the implementation of this heap checker tool,

https://github.com/kripken/emscripten/blob/incoming/src/corruptionCheck.js

The idea is fairly simple, and the implementation is less than 100 lines of JavaScript. It replaces the normal malloc and free functions, and basically just does what I said before. I think it's kind of nice how easy it is to do stuff like this on code compiled to JavaScript.

And actually in many ways it is easier to debug C/C++ in JavaScript than C/C++ compiled to native code. For example, even if there is a heap corruption in the C/C++ codebase, we know it cannot corrupt the debugging code in pure JS. JS is a safe language, so JS objects can't be randomly overwritten from the compiled code, which just accesses the typed array heap - and some side objects, but it just can't get to the CorruptionChecker object. That's a nice guarantee to have. So we can debug the compiled code from a safe, scripted environment where it's easy to automate things by just adding some JS.

Friday, December 21, 2012

Emscripten News: LLVM 3.2, etc.

LLVM 3.2 was just released today, and as with every LLVM release emscripten is switching to it. (We only support a single LLVM version at a time; if you don't want to upgrade to LLVM 3.2 just yet, you can use older revisions of emscripten.)

LLVM 3.2 brings as usual a large amount of general improvements and bugfixes. There isn't anything in particular that will be noticeable with emscripten, except for a change in how LLVM does linking. It now requires an explicit list of symbols to keep alive - this was quite puzzling to me at first but respindola explained it, and this is a very nice change for LLVM to make, linking is more consistent there now. We do have all the necessary symbol information in emscripten, but were not passing it to LLVM, now emscripten has been modified to do so.

The result is that in some cases more unneeded code can be removed, resulting in smaller generated code, which is great (for example ammo.js is 2% smaller). However, if you do not explicitly keep a function alive (either by using EMSCRIPTEN_KEEPALIVE or __used__ in the C/C++, or adding it to EXPORTED_FUNCTIONS), then LLVM may remove it.

Another improvement is landing together with this into master, unrelated to LLVM 3.2, is better linking of .a archives. We now only use the object files that are actually required, and will not link in others. This can also reduce the size of generate code, but again, if you are not careful, needed functions may be removed, in particular because the link order of archives matters (libA.a libB.a will only link in parts of libA that were required by things before it on the commandline, not things in libB).

Finally, another change you might notice if you use emscripten is that it now has better support for systems with both Python 2 and 3 installed at the same time. ack wrote a big patch to make our usage of python much cleaner to enable that. One significant consequence is that we now look for python2 in the python script shebangs. So if you run ./emcc and do not have python2 in your path, you will get an error. Solutions are to either run python emcc or add a symlink to python2 from python.

Monday, November 12, 2012

Emscripten Compiler Upgrades

Several major architecture improvements have landed in the last few weeks in Emscripten, here is an overview.

New Eliminator

The eliminator optimization phase was originally written by Max Shawabkeh. It basically removes unneeded variables, so

  var x = f(a);
  var y = x + g(b);

can be optimized into

  var y = f(a) + g(b);

This can greatly reduce the size of the code as well as improve performance, and was fundamental for our approach of relying on a combination of the eliminator + the closure compiler to go from LLVM's SSA representation into a register-like format: The eliminator removes large amounts of unneeded variables, and the closure compiler then reduces the number of variables further by reusing the ones that remain.

The eliminator could be slow on large functions, however, because it calculated the transitive closure of dependencies between all the variables, an expensive calculation. It also missed out on some opportunities to optimize because of some simplifying assumptions it made in its design. A final downside was it integrated poorly with the rest of our optimizations (in part due to being written in a different language, CoffeeScript).

I rewrote the eliminator entirely from scratch, in order to do a more precise analysis of which variables can be removed. I also simplified the problem slightly by only eliminating variables that have a single use - this makes it far faster, and I don't see any downside in the quality of the generated code (in fact it avoids some possible bad cases, although it took a long time to figure out what was going on in them). The new version is faster in general and far faster on the rare bad cases (100x even), and generates better-performing code to boot.

Parallel Optimizer

With all of the emscripten optimization passes now in JavaScript, I then worked on parallelizing that. We can't run one pass before the previous one finishes, but within each pass, we can work separately in each function - optimizing each function is independent of the other (we used to have some global optimization passes, but their benefit was very limited).

The parallelization is done using Python's process pool. It splits up the JavaScript into chunks and runs those in multiple JavaScript optimizer instances. The speedup can be close to linear in the number of cores. On BananaBread, the optimization passes become almost twice as fast on my dual-core laptop.

Parallel Compiler

With the optimizer parallel, there remain two phases that can be slow: The compiler (the initial code conversion from LLVM to JavaScript) and Closure Compiler. We can't do much for Closure, but in the long term it will become less and less important: we are implementing specific optimizations for the things we used to rely on it for, which leaves just minifying the code.

For the LLVM to JS compiler, I made the emscripten compiler parallel as well: It splits up the LLVM IR into 3 main parts: type data, function data, and globals. The function data part is unsurprisingly by far the largest in all cases I checked (95% or so), and it can in principle be parallelized - so I did that. Like in the optimizer, we use a Python process pool which feeds chunks of function data to multiple JavaScript compiler instances. There is some overhead due to chunking, and the type data and globals phases are not parallelized, but overall this can be a close to linear speedup.

Overall, pretty much all the speed-intensive parts of code generation and optimization in Emscripten are now parallel, and will utilize all your CPU cores automatically. That means that if you experience slow compilation speeds, you can just throw more hardware at it. I hear they are selling 16-core CPUs now ;)

New Relooper

The relooper is an optimization performed (unlike most optimizations) during initial code generation. It takes basic blocks of code and branching information between them and generates high-level JS control flow structures like loops and ifs, which makes the code run far faster. The original relooper algorithm was developed together with the implementation I wrote in the compiler. Eventually some aspects of how it works were found to be suboptimal, so specific optimizations were added to the JS optimizer ('hoistMultiples', 'loopOptimizer'), overall giving us pretty good generated code.

Meanwhile I wrote a new version of the relooper in C++. There were 2 reasons for that choice of language: First, because other projects needed something like it, and C++ was a better language for them, and second, because we had plans to evaluate writing an LLVM backend for emscripten that would also need to reloop in C++ (note: we decided against the LLVM backend in the end). The new version avoids the limitations of the first, and generates better code. In particular it has no need for additional optimizations done after the fact. It also implements some additional tweaks that are missing in the first one, like node splitting in some cases and more precise removal of loop labels when they are not needed, etc. It's also a much cleaner codebase.

I brought that new version of the relooper into Emscripten by compiling it to JS and using it in the JS compiler. This makes compilation faster both because the new relooper is faster than the previous one (not surprising as often compiled code is faster than handwritten code), and because the additional later optimizations are no longer needed, for overall about a 20% speedup on compiling BananaBread. It also generates better code, for example it can avoid a lot of unneeded nesting that the previous relooper had (which caused problems for projects like jsmess).

Note that this update makes Emscripten a 'self-hosting compiler' in a sense: one of the major optimization passes must be compiled to JS from C++, using Emscripten itself. Since this is an optimization pass, there is no chicken-and-egg problem: We bootstrap the relooper by first compiling it without optimizations, which works because we don't need to reloop there. We then use that unoptimized build of the relooper (which reloops properly, but slowly since it itself is unoptimized) in Emscripten to compile the relooper once more, generating the final fully-optimized version of the relooper, or "relooped relooper" if you will.


Thursday, October 25, 2012

Emscripten News: BananaBread, Nebula3, GDC Online, Websockets, Worker API, Performance, etc

I haven't found time to blog about individual things, so here is one post that summarizes various Emscripten-related things that happened over the last few months.

BananaBread

BananaBread, a port of the Cube 2 game engine to the web, was launched and then received a few minor updates with bugfixes and some additional experimental levels. Feedback was good, and it was linked to by a Firefox release announcement and later a Chromium release announcement, in both cases to show that each browser is now capable of running first person shooter games.

Nebula 3

Cube 2 isn't the only game engine being ported to the web using Emscripten, this post by a Nebula 3 dev is worth reading, and check out the demo it links to. Nebula is a powerful game engine that like id tech engines gets open source releases now and then, and has been used in some impressive games (like this). Very cool to see it working well in JS+WebGL, especially given the dev's initial skepticism - read the blogpost! :)

GDC Online

I gave a talk together with Kevin Gadd at GDC Online, here are my slides. We talked about compiling games to HTML5, I focused on C++ and Kevin on C#, so overall we covered a lot of potential codebases that could be automatically ported to the web.

Among the demos I gave was of course BananaBread, as an example of a 3D first person shooter compiled from C++ and OpenGL to JavaScript and WebGL. Interestingly, Adobe gave a talk later that day about porting games to web browsers, which compared 4 platforms: WebGL/JS, Flash, NaCl, and Unity, and for the WebGL/JS demo they also presented BananaBread, so it ended up being shown twice ;)

Workers API

Support for worker threads is in the incoming branch, look in emscripten.h and for tests with "worker_api" in them in tests/runner.py. This API basically lets you compile code into "worker libraries" that the main thread can then call and get responses from, giving you an easy way to do message-passing style concurrency.

The API is in initial stages, feedback is welcome.

Networking

Initial support for networking using websockets has also been implemented, see tests with "websockets" in their name. Basic sockets usage works, however we have had troubles with setting up a testing websocket server with binary support, see the issue for details. Because of that this won't work on arbitrary binary data yet. If you know websockets and websocket servers and are interested in helping with this, that would be great.

Another approach we intend to work on, and where help would be welcome, is WebRTC. WebRTC could actually be easier to work with since it supports p2p connections, so it's easy to test a connection from one page to itself. It also supports UDP-style unreliable data, so we should be able to get multiplayer working in BananaBread when that is complete.

Library Bindings to JavaScript

We currently have the "bindings generator" which is used to make ammo.js and box2d.js. It works for them, but needs manual hacking and has various limitations. A more proper approach is being worked on, contributed by Chad Austin, which he called "embind". This is a more explicit, controllable approach to bindings generation, and in time it should give us big improvements in projects like ammo.js and box2d. If you use those projects and want them to improve, the best way is to help with the new embind bindings approach. We have some initial test infrastructure set up, and there are various bugs filed with the tag "embind" if you are interested.

LLVM backend

I did some experiments with an LLVM backend for Emscripten when I had free time over the last few months. The results were interesting, and I got some "hello world" stuff working, during which I learned a lot about how LLVM backends are built.

Overall this is a promising approach, and it is what pretty much all other compilers from languages like C into JS work. However, this is going to be low priority, for two main reasons. First, we simply lack the resources: There are many, many other things that are very important for me to work on in Emscripten (see other points in this blogpost for some), and we have not had luck in interesting people to collaborate on this topic so far. Second, while my investigations were mostly positive, they also found negatives in going the LLVM backend route. Some types of optimizations that make sense for JavaScript are an uncomfortable fit for LLVM's backends, which is not surprising given how different JS is from most target languages. It's possible to overcome those issues, of course, but it isn't the optimal route.

Why do pretty much all the other compilers go the LLVM backend route? I suspect it might have to do with the fact that they typically do not just compile to JS. For example, if you already have a compiler into various targets, then when you consider also compiling into JS, it is simplest to modify your existing approach to do that as well. Emscripten on the other hand is 100% focused on JS only, so that's a fundamental difference. If all you care about is targeting JS, it is not clear that an LLVM backend is the best way to go. (In fact I suspect it is not, but to be 100% sure I would need to fully implement a backend to compare it to.)

Compiler and Code Perf

To continue the previous point, there is however one aspect of an LLVM backend that is greatly beneficial - it's written in efficient C++ and will compile your code quickly. Emscripten on the other hand is written in JavaScript and has some complex optimization passes that do a lot of work on a JS AST, and these can take a long time. A fully optimized build of BananaBread, for example, takes about 3 minutes on my laptop, and while it's a big project there are bigger ones of course that would take even more.

On the one hand, this doesn't matter that much - it's done offline by the compiler. People running the generated code don't notice it. But of course, making developer's lives easier is important too.

In Emscripten the goal has always been to focus more on performance of the generated code rather than performance of the compiler itself, so we have added new optimization passes even when they were expensive in compilation time, as long as they made the generated code faster. And we rely on tools like Closure Compiler that take a long time to run but are worth it.

But compiler perf vs code perf isn't an all of nothing decision. Right now on the incoming branch there are some (not fully finished and slightly buggy, but almost ready) optimizations that improve compilation time quite a bit. And with those in place we can move towards parallel compilation in almost all of the optimization passes, so with 8 cores you might get close to 8x speedups in compilation, etc.

So the current goal is to focus on the existing compiler. It will get much faster than it is now, but it will probably never get close to the speed an LLVM backend could get, that's the tradeoff we are making in order to focus on generating faster code. An additional reason this tradeoff makes sense is that we currently have plans for several new types of optimizations to make the generated code yet faster, and it is far easier for us to work on them in the current compiler than an LLVM backend.

Record/Replay

Finally, we added a system for recording and replaying of Emscripten-compiled projects (see reproduceriter.py). With it you basically compile your project in a special mode, run it in record mode and do stuff, then you can run the project in replay mode and it will replay the exact same output you saw before.

The main use case for this is benchmarks: If you have a program that depends on user input and random things like timing or Math.random(), then it is very hard to generate a good benchmark from it because you get different code being run each time. With the record/replay facility you can basically make a reproducible execution trace.

This has been tested on BananaBread so far, and used to create BananaBench, a benchmark based on BananaBread. You can either run it in the browser or in the shell, and hopefully a real-world benchmark like this will make it easier to optimize browsers for this kind of content.

Tuesday, July 24, 2012

BananaBread (or any compiled codebase) Startup Experience

This post talks about startup experience improvements in BananaBread. BananaBread is a specific 3D game engine compiled from C++ to JavaScript, but the principles are general and could be applied to any large compiled codebase.

Starting up "nicely"

Starting up as quickly as possible is always best on every platform. This is a general issue so I won't focus on it, because on the web, there is also another important criterion, which is to start up in as asynchronous a way as possible. By asynchronous I mean to not run in a single large event on the main thread: Instead it is better for as much as possible to be done on background threads (web workers) and for what does run on the main thread to at least be broken up into small pieces.

Why is being asynchronous important? A single long-running event makes the page unresponsive - no input events are being handled and no output is being shown. This might seem not that important for startup, when there is little interaction anyhow. But even during startup you want to at least show a progress bar to give the user an indication that things are moving along, and also most browsers will eventually warn the user about nonresponsive web pages, showing a scary "slow script" dialog with an option to cancel the script or close the page.

Asynchronize all the things..?

If you're writing a new codebase, you would indeed make everything asynchronous. All pure startup calculations would be done in background threads, and main thread events would be very short. Here is an example of such a recently-launched product: The worst main thread stall during startup seems to be about half a second, not bad at all, and a friendly progress bar updates you on the current status. When you are writing a new codebase it is straightforward to design in a way that makes nice startup like that achievable.

When you're compiling an existing codebase, things are harder, though. A normal desktop application does not need to be written in an asynchronous manner, while it might have a main loop that can easily be made asynchronous (run each main loop iteration separately), startup can be just a single continuous process of execution with some periodic notifications to update a progress meter. That is exactly the situation with Cube 2, the game engine compiled to JavaScript in the BananaBread project.

Now, there is a way to have long-running JavaScript code in browsers: Run it in a web worker. That would be the perfect solution, however, workers do not have access to WebGL or audio, and there is no way for them to send synchronous messages to the main thread, so even proxying those APIs to them is not practical. So unless you can easily box off "pure calculation" parts into workers, you do need to run most or all of your code on the main thread.

But you can still asynchronize even such a codebase: Here is what startup was like until recently: BananaBread r13, and and here is what it looks like now: BananaBread r15. The worst main thread stall is 1.4 seconds on my laptop, which is not great but definitely enough to prevent "slow script" warnings on most machines, and there is now a progress bar.

Means of asynchronization

The first important thing is to find small chunks of computation that are easily done ahead of time and their results cached for later:
  • In BananaBread jpg and png images must be decoded into pixel data. Emscripten does that during the preloading phase, each one is decoded by a separate Image element. This not only breaks things up into small pieces, it also uses the browser's native decoders, so it happens faster than if we had compiled a decoding library with the rest of the game engine. (A clever browser might also do these decodings in parallel..)
  • Crunch files need to be decompressed using a compiled JavaScript decoder. We do that during preloading as well, with the decoder running in a web worker.
  • Cube 2 levels (or maps as they are called) are gzip compressed, and the engine decompresses them during startup. I refactored that and BananaBread now decompresses them using zee.js during preloading, also in a worker.
Taking these three points together, BananaBread can use three cores during the preloading phase. This is actually an improvement on the original game engine which is single-threaded!

After preloading, the compiled engine starts to run and we are necessarily single-threaded. The important thing to do at this stage is to at least break up the startup code into small-enough pieces to avoid freezing the main thread. This requires refactoring the original source code, and is not the most fun thing in the world, but definitely possible. Emscripten provides an API to help with this (emscripten_push_main_loop_blocker etc.), you can define a queue of functions to be called in sequence, each in a separate invocation. So the tricky part is just to deal with the codebase you are porting.

Over a few days I broke up the biggest functions called during startup, getting from a maximum of 6 seconds to 1.4 seconds. Browsers seem to complain after around 10 seconds, so 1.4 isn't perfect, but on machines 7x slower than my laptop things should still be ok. Further breaking up will be hard as it starts to get into intricate parts of the game engine - it's possible, but it would take serious time and effort.

Other notes

Of course, there are other big factors with startup:
  • Download time: My personal server that hosts the BananaBread links above is not that fast, and doesn't even do gzip compression. We hope to get a more serious hosting solution before BananaBread hits 1.0.
  • GPU factors: BananaBread compiles a lot of shaders and uploads a lot of textures during startup. On the plus side, the time these take is probably not much different than a native build of the engine, but it's noticeable there too.
  • Data: Smaller levels lead to faster startup and vice versa. Our levels aren't done yet, we'll optimize them some more.
  • Subjective factors: You can't render during long-running JavaScript calculations, but music will keep playing, which makes for a less-boring experience for the user. Also, a nice splash screen during startup is a good idea, we should do that... ;)

Thursday, July 19, 2012

Experimenting with an LLVM backend for JavaScript

We have started to experiment with an LLVM backend for JavaScript. The reasons, approach and other notes are all on the relevant emscripten wiki page.

If you know LLVM and JavaScript and want to help, please get in touch!