<b>azakai's blog</b><br />
Alon Zakai aka kripken | Compiling to the Web: Emscripten, Binaryen, asm.js, WebAssembly, etc.<br />
New posts at: https://kripken.github.io/blog/<br />
<br />
<b>2 recent Emscripten stories</b> (2015-01-06)<br />
<br />
In case you missed them:<br />
<br />
First, <a href="http://www.dosbox.com/"><b>DOSBox</b></a> is an open source emulator that lets you run old DOS programs (which mostly means games ;). It simulates a PC-compatible machine, complete with an Intel CPU and the typical graphics and audio cards of the time. <a href="https://github.com/dreamlayers/em-dosbox/">DOSBox was ported by dreamlayers</a> to Emscripten a while ago - an impressive achievement that lets you run practically any old DOS game right in your browser, no plugins required. Recently, two separate projects have picked it up and are running with it:<br />
<ul>
<li><a href="https://archive.org/index.php">The Internet Archive</a> has an <a href="https://archive.org/details/softwarelibrary_msdos_games">MS-DOS showcase</a> (for more details see <a href="http://ascii.textfiles.com/archives/4487">Jason Scott's blog</a>). At the time of this post, it has over 2,400 programs up. This is amazing for helping to preserve an entire era in computing history.<br /> </li>
<li>The ever-porting <a href="https://github.com/caiiiycuk">caiiiycuk</a> created <a href="http://js-dos.com/">js-dos.com</a>. This has a much smaller selection of programs, but the site itself is a tribute to the days of MS-DOS - really nice work - and it brings back some of the feeling of those times.<br /> </li>
</ul>
Second, <a href="http://www.ogre3d.org/"><b>Ogre3D</b></a>, one of the top open source 3D rendering engines, <a href="http://www.ogre3d.org/2015/01/04/review-of-2014-outlook-into-2015">supports exporting to the web using Emscripten as of version 1.10.0</a>. This is exciting because while there are many open source 3D graphics engines, Ogre3D is one of the most general-purpose and full-featured. Also, I remember that around 3 years ago Ehsan tried to port Ogre3D, but at the time Emscripten was far too immature. It's great to see it ported now, and officially!<br />
<br />
<b>Massive, a new work-in-progress asm.js benchmark - feedback is welcome!</b> (2014-07-15)<br />
<br />
<b><a href="http://kripken.github.io/Massive/">Massive</a></b> is a new benchmark for <a href="http://asmjs.org/">asm.js</a>. While many JavaScript benchmarks already exist, asm.js - a strict subset of JavaScript, designed to be easy to optimize - poses some new challenges. In particular, asm.js is typically generated by compiling from another language, like C++, and people are using that approach to run large asm.js codebases, by porting existing large C++ codebases (for example, game engines like <a href="http://blogs.unity3d.com/2014/04/29/on-the-future-of-web-publishing-in-unity/">Unity</a> and <a href="https://blog.mozilla.org/blog/2014/03/12/mozilla-and-epic-preview-unreal-engine-4-running-in-firefox/">Unreal</a>).<br />
<br />
Very large codebases can be challenging to optimize for several reasons: they often contain very large functions, for example, which stress register allocation and other compiler optimizations. Total code size can also cause pauses while the browser parses and prepares to execute a very large script. Existing JavaScript benchmarks typically focus on small programs, and tend to measure throughput, ignoring things like how responsive the browser is (which matters a lot for the user experience). Massive does focus on those things, by running several large real-world codebases compiled to asm.js, and testing them on throughput, responsiveness, preparation time and variance. For more details, see the <a href="http://kripken.github.io/Massive/">FAQ at the bottom of the benchmark page</a>.<br />
<br />
Massive is <b>not</b> finished yet - it is a work in progress - and the results should not be taken seriously yet (bugs might cause some things to not be measured accurately, etc.). Massive is being developed as an open source project, so please test it and <b><a href="https://github.com/kripken/Massive/issues">report your feedback</a></b>. Any issues you find or suggestions for improvements are very welcome!<br />
<br />
<b>Looking through Emscripten output</b> (2014-06-03)<br />
<br />
<a href="http://emscripten.org/">Emscripten</a> compiles C and C++ into JavaScript. You are probably about as likely to want to read its output as you are to read the output of your regular C or C++ compiler - that is, you probably don't! Most likely, you just want it to work when you run it. But in case you are curious, here's a blogpost about how to do that.<br />
<br />
Imagine we have a file <span style="font-family: "Courier New",Courier,monospace;"><b>code.c</b></span> with contents<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">#include <stdio .h=""></stdio></span> <span style="font-family: "Courier New",Courier,monospace;"><stdio.h><br /><br />int double_it(int x) {<br /> return x+x;<br />}</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">int main() {</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> printf("hello, world!\n");</span><br />
<span style="font-family: "Courier New",Courier,monospace;">}</span></blockquote>
Compiling it with <span style="font-family: "Courier New",Courier,monospace;"><b>emcc code.c</b></span>, we can run it using <span style="font-family: "Courier New",Courier,monospace;"><b>node a.out.js</b></span> and we get the expected output of <span style="font-family: "Courier New",Courier,monospace;"><b>hello, world!</b></span> So far so good; now let's look at the code.<br />
<br />
The first thing you might notice is the size of the file: it's pretty big! Looking inside, the reasons become obvious:<br />
<ul>
<li>It contains comments. Those would be stripped out in an optimized build.</li>
<li>It contains runtime support code: for example, it manages function pointers between C and JS, converts between JS and C strings, provides utilities like ccall to call from JS to C, etc. An optimized build can reduce that code, especially if the closure compiler is used (<span style="font-family: "Courier New",Courier,monospace;"><b>--closure 1</b></span>): when enabled, it removes code not actually called, so if you didn't call some runtime support function, it'll be stripped.</li>
<li>It contains large parts of libc! Unlike a "normal" environment, our compiler's output can't just expect to be linked to libc as it loads. We have to provide everything we need that is not an existing web API. That means we need to provide a basic filesystem, printf/scanf/etc. ourselves, and that accounts for most of the size, in fact. The closure compiler helps with the part of this that is written in normal JS; the part that is compiled from C is stripped by LLVM and minified by our asm.js minifier in optimized builds.</li>
</ul>
For comparison, an optimized build with closure compiler, using <span style="font-family: "Courier New",Courier,monospace;"><b>-O2 --closure 1</b></span>, is 1/4 the original size. This still isn't tiny, mainly due to the libc support we have to provide. In a medium to large application, this is negligible (especially when gzipped, which is how it would be sent over the web), but for tiny "hello world" type things it is noticeable.<br />
<br />
(Side note: We could probably optimize this quite a bit more. It's been lower priority I guess because the big users of Emscripten have been things like game engines, where both the code and especially the art assets are far larger anyhow.) <br />
<br />
Ok, getting back to the naive unoptimized build - let's look for our code, the functions double_it() and main(). Searching for main leads us to<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function _main() {<br /> var $vararg_buffer = 0, label = 0, sp = 0;<br /> sp = STACKTOP;<br /> STACKTOP = STACKTOP + 16|0;<br /> $vararg_buffer = sp;<br /> (_printf((8|0),($vararg_buffer|0))|0);<br /> STACKTOP = sp;return 0;<br />} </span></blockquote>
This seems like quite a lot for just printing hello world! It's because this is unoptimized code. So let's look at an optimized build. We need to be careful, though - the optimizer will minify the code to compress it, and that makes it unreadable. So let's build with <span style="font-family: "Courier New",Courier,monospace;"><b>-O2 --profiling</b></span>, which optimizes in all the ways that do <i>not</i> interfere with inspecting the code (to profile JS, it is very helpful to be able to read it, hence that option keeps it readable but still otherwise optimized; see <span style="font-family: "Courier New",Courier,monospace;"><b>emcc --help</b></span> for the -g1, -g2, etc. options, which do related things at different levels). Looking at that code, we see
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function _main() {<br /> var i1 = 0;<br /> i1 = STACKTOP;<br /> _puts(8) | 0;<br /> STACKTOP = i1;<br /> return 0;<br />}</span></blockquote>
There is some stack handling overhead, but now it's clear that all it's doing is calling puts(). Wait, why is it calling puts() and not printf() like we asked? The LLVM optimizer does that, as puts() is faster than printf() on the input we provide (there are no variadic arguments to printf here, so puts is sufficient).<br />
<br />
<b>Keeping Code Alive</b><br />
<br />
What about the second function, double_it()? There seems to be no sign of it. The reason is that LLVM's dead code elimination got rid of it - it isn't being used by main(), which LLVM assumes is the only entry point to the entire program! Getting rid of unused code is very useful in general, but here we actually want to look at code that is dead. We can disable dead code elimination by building with <span style="font-family: "Courier New",Courier,monospace;"><b>-s LINKABLE=1</b></span> (a "linkable" program is one we might link with something else, so we assume we can't remove functions even if they aren't currently being used). We can then find<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function _double_it(i1) {<br /> i1 = i1 | 0;<br /> return i1 << 1 | 0;<br />}</span></blockquote>
(Note btw the "_" that prefixes all compiled functions. This is a convention in Emscripten output.) Ok, this is our double_it() function from before, in asm.js notation: we coerce the input to a 32-bit integer (using |0), then multiply it by two (via a left shift) and return it.<br />
<br />
We could also keep code alive by calling it, but if we called it from main() it might get inlined, so disabling dead code elimination is simplest. You can also do this in the C/C++ code, using the macro <span style="font-family: "Courier New",Courier,monospace;"><b>EMSCRIPTEN_KEEPALIVE</b></span> on the function (so, something like <b><span style="font-family: "Courier New",Courier,monospace;">int EMSCRIPTEN_KEEPALIVE double_it(int x) { </span></b>).<br />
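<br />
Once a function is kept alive, you can also call it from hand-written JS. A minimal sketch, assuming an unminified build and the default <span style="font-family: "Courier New",Courier,monospace;"><b>Module</b></span> object that emcc emits:
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">// call the compiled function directly, using the "_" prefix convention<br />var a = Module._double_it(21); // 42<br />// or use ccall, which handles argument/return type conversions<br />var b = Module.ccall('double_it', 'number', ['number'], [21]); // 42</span></blockquote>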
<br />
<b>C++ Name Mangling</b><br />
<br />
Note btw that if our file had the suffix .cpp instead of .c, things would have been less fun. In C++ files, names are mangled, which would cause us to see
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function __Z9double_iti(i1) {</span></blockquote>
You can still search for the function name and find it, but name mangling adds prefixes and suffixes. (In C++ files you can avoid the mangling by declaring a function <span style="font-family: "Courier New",Courier,monospace;"><b>extern "C"</b></span>.)<br />
<br />
<b>asm.js Stuff</b><br />
<br />
Once we can find our code, it's easy to keep poking around. For example, main() calls puts() - how is that implemented? Searching for _puts (again, remember the prefix _) shows that it is accessed from<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var asm = (function(global, env, buffer) {<br /> 'use asm';<br /> // ..<br /> var _puts=env._puts;<br /> // ..<br /> // ..main(), which uses _puts..<br /> // ..<br />})(.., { .. "_puts": _puts .. }, buffer);</span></blockquote>
All asm.js code is enclosed in a function (this makes it easier to optimize - it does not depend on variables from outside scopes, which could change). puts(), it turns out, is written not in asm.js, but in normal JS, and we pass it into the asm.js block so it is accessible - by simply storing it in a local variable also called _puts. Looking further up in the code, we can find where puts() is implemented in normal JS. As background, Emscripten allows you to implement C library APIs either in C code (which is compiled) or normal JS code, which is processed a little and then just included in the code. The latter are called "JS libraries" and puts() is an example of one.<br />
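<br />
As a rough sketch of what a JS library entry looks like - <span style="font-family: "Courier New",Courier,monospace;"><b>mergeInto</b></span> and <span style="font-family: "Courier New",Courier,monospace;"><b>LibraryManager.library</b></span> are the actual mechanism Emscripten uses, while <span style="font-family: "Courier New",Courier,monospace;"><b>my_alert</b></span> and its body are made up for illustration:
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">// in a file passed to emcc via --js-library<br />mergeInto(LibraryManager.library, {<br />  my_alert: function(ptr) {<br />    // Pointer_stringify converts a pointer to a C string into a JS string<br />    alert('the C code says: ' + Pointer_stringify(ptr));<br />  }<br />});</span></blockquote>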
<br />
<b>Conclusion</b><br />
<br />
You don't need to read the code that is output by any of the compilers you use, including Emscripten - compilers emit code meant to be <i>executed</i>, not <i>understood</i>. But still, sometimes it can be interesting to read it. And it's easier to do with a compiler that emits JavaScript, because even if it isn't typical hand-written JavaScript, it is still in a fairly human-readable format.<br />
<br />
<b>C++ to JavaScript: Emscripten, Mandreel, and now Duetto</b> (2013-11-14)<br />
<br />
There are currently at least 3 compilers from C/C++ to JavaScript: <a href="http://emscripten.org/">Emscripten</a>, <a href="http://mandreel.com/">Mandreel</a> and the just-launched <b><a href="http://leaningtech.com/duetto/">Duetto</a></b>. Now that Duetto is out (congrats!), it's possible to do a comparison between the 3, which ends up being interesting as there are some big differences but also big similarities between them.<br />
<br />
<b>Disclaimer</b>: I founded the Emscripten project and work on it, so unsurprisingly I have more knowledge about that one than the other two. It is likely that I got some stuff wrong here - please correct me if so! :)<br />
<br />
A quick overview is in the following table:<br />
<br />
<table style="border: 1px solid black;">
<tbody>
<tr><th></th><th>Emscripten</th><th>Mandreel</th><th>Duetto</th></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>License</td><td>open source (permissive)</td><td>proprietary</td><td>mixed (see below)</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Based on</td><td>LLVM</td><td>LLVM</td><td>LLVM</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Architecture</td><td>LLVM IR backend (external)</td><td>LLVM tblgen backend</td><td>LLVM IR backend</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Memory model</td><td>singleton typed array</td><td>singleton typed array</td><td>JS objects</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>C/C++ compatibility</td><td>full</td><td>full</td><td>partial</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>LLVM IR opts</td><td>full</td><td>full</td><td>partial</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Backend opts</td><td>low-level JS</td><td>LLVM</td><td>?</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Call model</td><td>JS native</td><td>C stack in typed array</td><td>JS native</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Loop recreation</td><td>Emscripten relooper</td><td>Custom (relooper-inspired)</td><td>Emscripten relooper</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>APIs</td><td>standard C (SDL, etc.)</td><td>custom</td><td>HTML5-based</td></tr>
<tr><td><hr />
</td><td><hr />
</td><td><hr />
</td><td><hr />
</td></tr>
<tr><td>Other</td><td></td><td>non-JS targets too</td><td>client-server</td></tr>
</tbody>
</table>
<br />
More detail on each of those factors:<br />
<ul>
<li><b>License</b>: Licenses range from Emscripten which is fully open source with a permissive license (MIT/LLVM), to Mandreel which is fully proprietary, to Duetto which is somewhere in the middle - the Duetto core compiler has a permissive open source license (LLVM), the Duetto libraries+headers are dual licensed GPL/proprietary, and Duetto premium features are proprietary.<br /><br />Duetto includes core system libraries that must be linked together with your code, like libc and libcxx. Those have been relicensed to the GPL (from original licenses that were permissive, like MIT for libcxx), and this implies that the Duetto output for a project will be a mix of that project's code plus GPL'd library code, so it looks like the entire thing must be GPL'd, or you must buy a proprietary license.<br /> </li>
<li><b>Based on</b>: All of these compilers use LLVM. It has become in many ways the default choice in this space.<br /> </li>
<li><b>Architecture</b>: While they all use LLVM, these compilers use it in quite different ways. Mandreel has an LLVM backend using the shared backend codegen
infrastructure, using tblgen, the selection DAG, etc. - this is the way most LLVM
backends are written, for example the x86 and ARM backends, so
Mandreel's architecture is the closest to a "typical" LLVM-based
compiler. On the other hand, Duetto implements a backend inside LLVM that does not use that shared backend code, and instead basically calls out into Duetto code that processes the llvm::Module itself into JavaScript, which means it processes LLVM IR (which is entirely separate from the backend selection DAG: LLVM IR is lowered into the selection DAG). Emscripten also processes LLVM IR, like Duetto, but does so outside of LLVM itself - it parses LLVM IR that is exported from LLVM.<br /><b> </b></li>
<li><b>Memory model</b>: Mandreel and Emscripten share a similar memory model, using a singleton typed array with aliasing views (see the sketch after this list) to emulate something extremely similar to how LLVM IR (and C programs) view memory. That includes aliasing (you can load a 32-bit value and split it into two 16-bit values, or you can read two 16-bit values and get the same result), as well as the fact that pointers are all interchangeable (aside from casting rules, they are just numeric values that refer to a place in a single contiguous memory space) and unsafe accesses (if you have a pointer to an int, even if there is no valid data after it, you can still read x[1] or x[17]; hopefully the program made sure there is in fact valid data there!). The odd one out is Duetto, which uses JS objects, so each C++ class you instantiate has its own object. This generates more "normal"-looking JS, at least at first glance - there are objects, there are properties accessed on objects - but it is complex to make it match up to the LLVM memory model (for example, pointers to objects in Duetto need to be able to represent both the JS object holding the data and a numeric offset so that pointer math works, etc.), which adds overhead (see more discussion in the Performance Analysis section below).<br /><br />Another result of this memory model is that Duetto allocates and frees JS objects all the time. This has both advantages and disadvantages: On the plus side it means that objects no longer referred to can be reclaimed by the JS engine's garbage collector (GC), so at any point in time the total memory used can be just what is currently alive - I say "can be" and not "will be" because that assumes that the JS engine has a moving GC (many do not), and even so this will only be true <i>right after</i> a GC, and not all the time. And on the minus side for Duetto it means that the overhead of object allocation and collection is present, whereas for Emscripten and Mandreel it is not - in those compilers, objects are just ranges in the singleton typed array; the JS GC is not even aware of them.<br /><br />Furthermore, each object will take more memory in the Duetto model: JS objects do not just store the raw data in them, but also have various overhead related to the VM (for example, most VMs have a pointer to "shape" info for the object, to optimize accesses to it, and other things), whereas in Emscripten and Mandreel literally the same amount of memory is used as C would use. On the other hand, Emscripten and Mandreel have a typed array that may not be fully used at all times, which can be wasteful. That limitation can be partially removed; for example, Emscripten has an option to grow the heap only when required (via the POSIX sbrk call), and similarly it could be shrunk as well, which would leave only fragmentation as a concern - but fragmentation is definitely a real concern for long-running code. So which is better will really depend on the application and use case. For short-running code it seems highly likely that Emscripten and Mandreel will use less memory, whereas for long-running code it might go either way.<br /><br />An advantage of the Duetto model is security: if you read beyond the limit of one array, you will not read data from another. Basically, Duetto is compiling C/C++ into something more like Java or C#, where you do bounds checks on each array and so forth. This can prevent lots of bugs, obviously. 
However, it is foreign to how C/C++ normally work - this is an advantage not over Emscripten and Mandreel, but over all C++ compilers; it is an inherent difference from the normal way C++ compilers are implemented. It does come with some security benefits; however, it also has downsides in terms of performance (those bounds checks just mentioned) and compatibility (see next section). Personally, I would argue that if you want that type of security, you would be better off writing code in C# or Java in the first place (which have compilers to JS, for example <a href="http://jsil.org/">JSIL</a> and <a href="http://www.gwtproject.org/">GWT</a>), but I do think that what Duetto is doing is interesting - it's almost like making a new language.<br /> </li>
<li><b>Compatibility</b>: Emscripten and Mandreel should be able to run
practically any C/C++ codebase, because as mentioned in the previous point, their memory
model is essentially identical to LLVM's, assuming the code is
reasonably portable. (As a rule of thumb, if C code is portable enough
to run on both x86 and ARM, it will work in Emscripten and Mandreel.)
Duetto however has a model which is different than LLVM's, which can
require that codebases be ported to it. How much of a
problem this is for Duetto will depend on the codebases people try to
port. There are very portable codebases like Bullet,
but much real-world code does depend on memory aliasing and so forth,
for example Cube 2. Emscripten experimented with various memory models
in the past, and we encountered significant challenges with all of them
except the C-like model that Emscripten and Mandreel currently use,
which is most compatible with normal C/C++. As a quick test out of
curiosity, I tried to build just the script module in Cube 2 using
Duetto. I received multiple errors (for example invalid field types in
unions, etc.) that due to the complexity of Cube 2 did not look trivial to fix.<br /><br />Note that for a <i>
new</i> codebase this wouldn't matter - people writing new Duetto
apps will just limit themselves to what Duetto can do.<br /> </li>
<li><b>LLVM IR optimizations</b>: The previous point spoke about compatibility with C++ code. A related matter is compatibility with LLVM IR: While all these compilers rely on a frontend to transform source code into LLVM IR - so only the subset of LLVM IR that the frontend generates is directly relevant - the LLVM optimizer can perform sophisticated transformations from LLVM IR to other LLVM IR. Therefore, to utilize the LLVM IR optimizer to its full extent, it is best to support as wide a range of LLVM IR as possible; otherwise you need to tell the optimizer to limit what it does.<br /><br />Compatibility with LLVM IR depends significantly on the memory model, just like compatibility with C++: the optimizer assumes memory aliases just like C assumes it does, and so forth. So Emscripten and Mandreel, which have a memory model that is practically identical to LLVM's, can benefit from the full range of LLVM optimizations, including
the ones that assume aliasing and other non-safe aspects of the C memory model. Duetto on the other hand has a different memory model, one not compatible with all C code, and similarly Duetto is not compatible with all LLVM IR. That is, Duetto runs into problems either when source code would need something in LLVM IR that it cannot handle, or when the LLVM optimizer generates IR it can't handle. For source code, you can port it so it is in the proper subset; for optimizations, you need to either disable or limit them.<br /><br /> It's possible that Duetto does not lose much here - most LLVM IR optimizations do not generate odd things like 512-bit integers nor do they rely heavily on C's memory model - so limiting LLVM's optimizer to what Duetto can handle might work out fine. However, a concern here is that Duetto needs to make sure that all existing and future IR optimizations do not depend on things it can't handle. Only if you have the exact same assumptions as LLVM IR has about memory and so forth, do you have a guarantee that all optimizations are valid for you.<br /> </li>
<li><b>Backend optimizations</b>: Only Mandreel benefits from LLVM's backend
optimizations (optimizations done after the LLVM IR level), because only it is implemented as an LLVM backend using the shared backend infrastructure, as discussed earlier. That gives Mandreel access to things like register allocation and various other optimizations. These can
be quite significant in native (x86, ARM, etc.) builds, but it is less
clear how crucial they are on the web, since JS engines will do their
own backend-type optimizations (register allocation, licm, etc.) anyhow. In place of those, both Emscripten and Duetto perform their own "backend" optimizations. Emscripten's work here focuses on low-level code (which makes sense given its memory model, as mentioned before), and includes JS-specific passes to remove unneeded variables, shrink overly-large functions, etc. (see <a href="http://kripken.github.io/mloc_emscripten_talk/llvm.html#/23">my talk</a> from last week at the LLVM developer's conference, also <a href="http://mozakai.blogspot.com/2013/08/outlining-workaround-for-jits-and-big.html">this</a>). I am not familiar with the details of Duetto's backend optimizations, but from their blogposts it sounds like they focus on optimizing things for their memory model (avoiding unnecessary object creation, etc.). If any Duetto people are reading this, I would love to learn more.<br /> </li>
<li><b>Call model</b>: Both Emscripten and Duetto use native-looking function calls: <span style="font-family: "Courier New",Courier,monospace;">func(param1, param2)</span> etc. This uses the JS stack and tends to be well-optimized in JS engines. Mandreel, on the other hand, uses the C stack - that is, it writes parameters to global locations before the call and reads them back inside the callee. Mandreel's approach is similar to how CPUs work, but I believe no JS engine is able to optimize the parameters in that call model into registers, so it means memory accesses, which are slower. Inlining (by LLVM or by the JS engine) will reduce that overhead for Mandreel, but it will still be noticeable in some workloads, and it is also bad for code size.<br /> </li>
<li><b>Loop recreation</b>: LLVM IR contains basic blocks and branch instructions, which are easy to model with gotos, but JS lacks goto statements. So all these compilers must either use large switch statements in loops (the closest to a goto we can get), or recreate loops and ifs. They all do the latter. Emscripten has what it calls its Relooper for that purpose, and I see that Emscripten's code is reused in Duetto, which is nice. Mandreel also recreates loops, in a way inspired by the Relooper algorithm but not using the same code.<br /> </li>
<li><b>APIs</b>: Emscripten implements typical C APIs like SDL, glut, etc., so often entire apps can just recompile and run. Mandreel, on the other hand, has its own API that apps need to be written against. It is custom and designed to be portable not just across native and JS builds, but also across the various other platforms that Mandreel can target, like Flash. Finally, Duetto supports accessing HTML APIs from C++, so they are standard APIs in the sense of the web, but not ones that existing code might already be using. For new apps, Duetto's approach is interesting, but for existing ones I think Emscripten's is preferable. Of course there is no reason not to do both in a single compiler. (In fact it could be possible for Emscripten and Duetto to collaborate on that, but it looks like Duetto's WebGL headers are GPL licensed, which would be a problem for Emscripten, which only includes permissively licensed open source code like MIT.)<br /><br />Another factor to consider regarding APIs is native builds. It is often very useful to compile the same app to both native and HTML5 builds, for performance comparisons, for benefiting from native debugging and profiling tools, etc. If I understand Duetto's approach properly, it can't compile apps natively, because Duetto apps use things like WebGL which are not present as native libraries. Emscripten and Mandreel on the other hand do allow such builds.<br /> </li>
<li><b>Other</b>: A few other points of interest are that while Emscripten is purely an LLVM to JS compiler, Mandreel and Duetto have other aspects as well. As already mentioned, Mandreel has a custom API that apps can be developed against, and it is portable across a very large set of target platforms, not just JS. And Duetto is meant not just for client JS but also to be able to run C++ on both sides of a client-server app, sort of like how node.js lets you run the same code on both sides as well, but using C++.</li>
</ul>
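As mentioned in the memory model discussion above, here is a sketch of the singleton typed array model that Emscripten and Mandreel use (the HEAP* names follow Emscripten's conventions; the sizes and values are just for illustration):
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">var buffer = new ArrayBuffer(16 * 1024 * 1024); // the entire "memory"<br />var HEAP8 = new Int8Array(buffer); // aliasing views of the same bytes<br />var HEAP16 = new Int16Array(buffer);<br />var HEAP32 = new Int32Array(buffer);<br />var ptr = 1024; // a C pointer is just a byte offset into the buffer<br />HEAP32[ptr >> 2] = 0x12345678; // like *(int*)ptr = 0x12345678 in C<br />var lo = HEAP16[ptr >> 1]; // 0x5678 on little-endian: the same bytes,<br />var hi = HEAP16[(ptr >> 1) + 1]; // 0x1234: read as two 16-bit halves</span></blockquote>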
<br />
<b>Interim Summary</b><br />
<br />
As mentioned in the introduction, there are big differences but also big similarities here. Often two out of the three compilers are very similar or identical on a particular aspect, but which two changes from aspect to aspect. Overall I think it's interesting how big the differences are, probably larger than the differences between modern JS engines or modern C compilers! I suspect that is because compiling C++ to JS is a newer problem and the industry is still figuring out the best way to do it.<br />
<br />
<b>Performance Analysis</b><br />
<br />
What you don't see in this blogpost are benchmarks. That's intentional, because it's very hard to fairly benchmark code across compilers as different as this. It's practically a given that benchmarks can be found that show each compiler to be best, so I won't bother to do that. Instead, this section contains an analysis of how the different compilers' architectures will affect performance, to the best of my understanding.<br />
<br />
Overall the biggest factor responsible for performance is likely the memory model. Over the last 3 years Emscripten tried various approaches including JS objects, arrays, typed arrays in aliased and non-aliased modes, arrays with "compressed indexes" (aka QUANTUM_SIZE=1), and finally ended up on aliased typed arrays - the model described above, and shared with Mandreel - because they were greatly superior in performance. They are better both because they model memory essentially the same way LLVM does, so the full set of LLVM optimizations can be run, and also because being low-level they are very easy for JS engines to optimize - reads and writes in LLVM IR become reads and writes of typed arrays in JS, which can become simple reads and writes from memory in modern JS engines, and so forth. When emitting the asm.js subset of JS, this is even more precise (as it avoids cases where reads can be out of bounds and return undefined, for example, which would have meant it isn't a simple memory read any more), and for those reasons this approach of compiling LLVM to JS can get <a href="http://kripken.github.io/mloc_emscripten_talk/benchmarks_nov4_2013.png">pretty close to native speed</a>.<br />
<br />
I would be surprised if Duetto gets to that level of performance. It uses property accesses on JS objects, which are far more complex than typed array reads and writes - instead of a simple read or write from memory, JS object property accesses are optimized using hidden class optimizations, PICs, and so forth. These can be fast in many cases, but as heuristic-based optimizations they are unpredictable. They also add overhead, in both time and memory. Furthermore, as we are not just reading and writing numbers (as in the Emscripten/Mandreel model) but also references to JS objects, things like write barriers may be executed in the VM (or if not, then more time will be spent GCing). And again, all of this is competing with reads and writes from typed arrays, which are among the simplest things to optimize in JS since their type is known and they are just flat. It is true that in the best case a JS VM can turn a property access on a JS object into a direct read from memory, but even a statically typed VM like the JVM generally doesn't manage to run at C-like speeds, and here we are talking about dynamically-typed VMs, which have a much more complex optimization problem before them.<br />
<br />
Perhaps the crux of the matter is that Duetto basically transforms a C++ codebase so it runs in a VM using a relatively safe memory model and GC. There have been various such attempts - Managed C++, Emscripten's and other JS compilers' attempts in this space as mentioned before, etc. - and overall we know this is possible, if you limit yourself to a subset of C++. We also know it has benefits, such as somewhat higher security as mentioned earlier on. But that increased security stems from the compiled C++ being more like compiled Java and C#, because just like them it runs in a VM, is JITed, uses a GC, benefits from PICs, does bounds checks for safety, etc. <b>In other words, I believe that the Duetto model will in the best case approach the performance of Java and C#, while the Emscripten/Mandreel model will in the best case approach the performance of C and C++</b>. Furthermore, their ability to reach the best-case performance is not identical: Approaching C and C++ speed is far simpler because the model is simpler, and we have in fact already mentioned that the Emscripten/Mandreel approach can indeed approach that limit in many cases (see link above). The Duetto model may be able to reach the speed of Java and C#, but that is both more complex and yet unproven.<br />
<br />
Perhaps I am missing something, and there is a reason Duetto-compiled C++ running on a JavaScript VM can outperform the JVM and .NET? That seems unlikely to me, because if it were possible to do so, then surely the JVM and .NET would have done whatever allows such speedups already - assuming C++ in Duetto's VM-like model is generally analogous to the JVM/.NET model, which is my best guess for the reasons mentioned before. Certainly it is possible that I overlooked something here and there is in fact a fundamental difference, please correct me if so.<br />
<br />
I do believe that Duetto can outperform typical handwritten JS. It has two advantages over that: First, it can run the LLVM optimizer on the code, and second, it generates JS that is implicitly statically typed. For example, if the code contains a.x then x should always be of the same type, since it originated from C++ code that had that property (C unions are a complication for this but that is solvable as well, with some overhead). In hand-written JS, on the other hand, it is easy to do things that break that assumption. Again, the model Duetto uses is closer to C# or Java where there are objects and a garbage collector and so forth, like JS, but at least types are static, unlike JS. And modern JS engines are quite good at looking for hidden static types, so Duetto should benefit from those optimizations.<br />
<br />
One last note on performance: In small benchmarks the difference between C/C++, Java/C#, and JS can be quite small. JS VMs as well as the JVM and CLR have JITs that can often optimize small loops and such into extremely efficient code, competitive with C/C++ (and LuaJIT can even outperform it in some cases). Things change, however, with large codebases. The larger the codebase, the harder dynamic, heuristic-based optimizations are to perfect; this is true for Java and C# and far more true for JS. For example, a small inner loop containing a temporary object allocation in JS might be analyzed to not escape, and turn into just some constant space on the stack. However, in a large codebase with many function calls, including functions too large and/or too complex to inline, etc., such analyses are often less productive, whereas in C/C++ the temporary would simply be on the stack regardless. As another example, JS engines need to analyze types dynamically, and in small functions often do so very well, but the larger the relevant code, the higher the risk for any imprecision in the analysis to "leak" everywhere else (for example, one careless function returning a double instead of an int will send a double into all the callers), whereas in C/C++ there is no such risk. (asm.js tries to get as close as possible to C/C++ by ensuring, via a type system, that there are type coercions in all necessary places, which would avoid the problem just described - that double would be immediately turned into an int in all callers.)<br />
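<br />
To make that last point concrete, here is a hand-written sketch (not actual compiler output) of how asm.js coerces every call result at the call site, so the caller's types stay fixed no matter what the callee returns:
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;">function caller() {<br />  var sum = 0;<br />  // the |0 coerces each return value to a 32-bit integer, so a careless<br />  // double returned by callee() cannot leak into sum<br />  sum = ((callee()|0) + (callee()|0))|0;<br />  return sum|0;<br />}</span></blockquote>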
<br />
For all these reasons, small benchmarks often do not show differences, but large ones can. So how fast a compiler is depends a lot on what kind of code is compiled with it. When Duetto is used on a small benchmark, I would not be surprised if it performs very well. My concerns are mainly about large codebases, which I don't think I've seen Duetto report numbers on yet. (By "large" I mean something like Cube 2 or Epic Citadel, hundreds of thousands of lines of code, consisting of many different components, etc.)<br />
<br />
<b>General thoughts on Duetto</b><br />
<br />
If Duetto can in fact get close to the speed of Java and C#, that's good enough for many applications, even if it is slower than C and C++ (and Emscripten and Mandreel). And in some cases Java and C# are fairly comparable to C and C++ anyhow. Finally, of course performance is not the only thing that matters when developing applications.<br />
<br />
Putting aside performance, Duetto's approach has some advantages. One nice thing about Duetto is that it uses Web APIs in C++, so they might avoid a little overhead that Emscripten and Mandreel have. For example, Emscripten and Mandreel translate OpenGL calls into WebGL calls, while Duetto will have what amounts to translating WebGL into WebGL. That's definitely simpler, and it might be faster in some cases.<br />
<br />
Another interesting aspect of Duetto is that since C++ objects become JS objects, this might make it easier to interoperate with normal JS outside of the compiled codebase. It seems like this could work in some cases; however, if the codebase is minified then I don't see how this is possible without either duplicating properties or using wrapper code, which ends up in basically the same position Emscripten is in, in this regard.<br />
<br />
Finally, the client-server thing that Duetto does looks quite unique, and it will be interesting to see if people flock to the idea of doing C++ for webservers, and to integrate that code with compiled C++ on the client in JS. That's an original idea AFAIK, and it's cool.<br />
<br />
In summary, congratulations to Duetto on launching! Nice to see new and interesting ideas being tried out in the C++/JS space.<br />
<br />
<b>Two Recent Emscripten Demos</b> (2013-11-12)<br />
<br />
Whenever I do a presentation I try to have at least one new demo; here are two that I made recently:<br />
<ul>
<li><b><a href="http://kripken.github.io/boon/screens/">3 screens demo</a></b>, for the Mozilla 2013 Summit. The demo is of 3 screens inside the BananaBread 3D first person shooter, a port of <a href="http://sauerbraten.org/">Sauerbraten</a>. One screen shows a list of recent tweets, another shows a video, and the third shows a <b>playable</b> Doom-based game (see details <a href="https://github.com/kripken/boon">here</a>). So you can play Doom while you play Sauerbraten ;) Controls are a little tricky, but see the demo page itself for how to play each game.<br /> </li>
<li><b><a href="http://kripken.github.io/clangor/demo.html">Clang in the browser</a></b>, for the LLVM Developer's Conference. The name says it all: this is basically the entire Clang C++ compiler running in JS. The port isn't very polished (I didn't try to optimize it, you need to reload the page to recompile, etc.) - I basically just got it running.</li>
</ul>
<b>Outlining: a workaround for JITs and big functions</b> (2013-08-22)<br />
<br />
Just In Time (JIT) compilers, like JavaScript engines, receive source code and compile it at runtime. That means that end users can notice how long things take to compile, so avoiding noticeable compilation pauses is important. But JavaScript engines and other modern JITs perform optimizations whose complexity is worse than linear, for example SSA analysis and register allocation. That means that aside from total program size, <b>function size</b> is important as well: the time it takes to compile your program may be mostly determined by a few large functions that have lots of code and variables in them. If compilation is N^2 per function (where N is the number of AST nodes or variables or such), then having function sizes<br />
<br />
<b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b>, <b>100</b> <br />
<br />
is much better than<br />
<br />
<b>1000</b><br />
<br />
even though the total amount of code is the same (10 functions of size 100 take 10 * 100^2 or 100,000 units of time, while a single function of size 1000 takes 1,000,000).<br />
<br />
Now, very large functions are generally rare, since most people wouldn't write them by hand. But automatic code generators can create them, as can compilers that perform optimizations like inlining (e.g., the Closure Compiler, or LLVM as used by Emscripten and Mandreel).<br />
<br />
What do JavaScript engines do with very large functions? Generally speaking, since compiling them takes a very long time, JITs have often simply not fully optimized them, leaving them in the interpreter or baseline JIT. This avoids the long compilation pause, but leaves the code running slower. As more JS engines add background compilation, the compilation pause is not as noticeable, but it still means a significant amount of additional time that the code executes before it is fully optimized.<br />
<br />
This problem is not unique to JavaScript engines, of course, it is a concern for any JIT that does complex optimizations, for example the JVM. However, JavaScript engines are particularly concerned with noticeable compilation pauses because they usually run on end users' machines (as opposed to remote servers), and they run arbitrary content off the web.<br />
<br />
What can we do about this? It could be solved in one of two general ways, either in the JavaScript engine, or by preprocessing the code ahead of time.<br />
<br />
<b>In the JavaScript engine</b>, as already mentioned before, background compilation makes it feasible to compile even big and slowly-compiling functions, since the pause is not directly noticeable. But this still means a long delay before the fully-optimized version runs, and the background thread consumes battery power and perhaps competes with other threads and processes. Also, even in JavaScript engines with background compilation, there are often limits still present, just to avoid the risk of a background thread running a ridiculously long time. Finally, a JavaScript engine might be able to compile separate functions in parallel but not parallelize inside a single function. So all of these are partial solutions at best.<br />
<br />
Another JavaScript engine option is to not compile entire functions at a time. SpiderMonkey experimented with this using chunked compilation, where basically parts of a function could be compiled separately. This was done successfully in JaegerMonkey, but it is my understanding that it turned out to be quite complex and bug prone, and was deemed not worth doing in the newer IonMonkey.<br />
<br />
(Of course the best JavaScript engine solution is to just make compilation faster. But huge amounts of work have already gone into that in all modern JavaScript engines, so it is not realistic to expect sudden large improvements there.)<br />
<br />
That leaves the other option, of <b>preprocessing the code ahead of time</b>. One way in emscripten for example is to tell LLVM to not perform inlining on some part of your code (if inlining is the cause of a big function), but in general there is not always a simple solution.<br />
<br />
Another preprocessing option is to work at the JavaScript level and break up huge functions into smaller pieces. I originally thought this would be too complicated and/or generate too much overhead, but I was eventually convinced to go down this route. And a good thing too ;) because even without much tuning, I think the results are promising:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYwrCdXDfzNrkJdzvzF4IVfP5TKqQf0EwKlrQnFncdP3whzAM5ITEdrI5oIC-Wg0-SNUAPvRt_HEo6ZK_T7sy22JyoK0Eb6mTT6YHUuwyGVUvmD_MNtjM5JbR3c6m5YM4ieYqu_k8Bha__/s1600/outline.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="147" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYwrCdXDfzNrkJdzvzF4IVfP5TKqQf0EwKlrQnFncdP3whzAM5ITEdrI5oIC-Wg0-SNUAPvRt_HEo6ZK_T7sy22JyoK0Eb6mTT6YHUuwyGVUvmD_MNtjM5JbR3c6m5YM4ieYqu_k8Bha__/s320/outline.png" width="320" /></a></div>
<br />
The <b>X</b> axis is the outlining limit - how large a function is allowed to be before we start to break it up into smaller pieces, by taking chunks of it and moving them outside into a new function - sort of the opposite of <i>inlining</i>, hence <i>outlining</i>. The number itself is a count of AST nodes. 0 is a special value meaning we do not do anything; as you move to the right we do more and more breaking up, aiming for smaller maximal function sizes.<br />
<br />
The <b>Y</b> axis is time (in milliseconds), and each measurement has both compilation and run time. Compilation measures the total time needed to compile all the code by the JavaScript engine, and runtime is how long the benchmark - a SQLite testcase - takes to run. (To measure compilation time, I looked at startup time in Firefox, which is convenient because Firefox does ahead of time (AOT) compilation of the asm.js subset of JavaScript, giving an easy way to measure how long it takes to fully optimize all the code in the program. In non-AOT compilers there would be less startup time of course, but as explained before the costs of long compilation time would still be felt: either functions would not be optimized at all, or compile slowly either on the main thread or a background thread, etc. - those effects are less convenient to measure, but still there.)<br />
<br />
The benchmark starts at 5 seconds to compile and 5 seconds to run (running creates a SQLite database in memory, generates a few thousand entries, then does some selects etc. on them). 5 seconds to compile all the code is obviously a long time, and most of it is from a single function, yy_reduce. An outlining limit of 160,000 is enough to break that function up into 2 pieces but do nothing else, and the result is over <b>4x faster compilation</b> (from over 5 seconds to just over 1 second) with a runtime slowdown of only 7%. Outlining more aggressively to 80,000 speeds up compilation by 6x, but the runtime slowdown increases to 35%. At the far right of the graph the overhead becomes very painful (runtime execution is over 4x slower), so the best compromise is probably somewhere in the middle to left of the graph.<br />
<br />
Overall there is definitely a tradeoff here, but luckily even just breaking up the very largest function can be very helpful. Polynomial compilation times can be disproportionately influenced by the very largest function, so that makes sense.<br />
<br />
What is the additional overhead that affects runtime as we go to the right on the graph? To understand what is going on, let's detail what the outlining optimization does. As already mentioned, the idea is to take code in the function and move it outside. To do that, it performs three stacked optimizations:<br />
<br />
<b>1. Aggressive variable elimination.</b> Emscripten normally works hard to eliminate unneeded variables, for example<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">var x = y*2;</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">f(x);</span></b><br />
<br />
would become<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">f(y*2);</span></b><br />
<br />
But there is a potential tradeoff here. While in this example we replace a variable with a small expression, if the expression is large and shows up multiple times, it can increase code size while reducing variable count. However, for outlining, we do want to remove variables at (almost) all costs, because the more local variables we have, the more variables might happen to be shared between code staying in the function and code being moved out. That means we need to ship those variables between the two functions, basically by spilling them to the stack and then reading them from there, which adds overhead (see later for more details). The first outliner pass therefore finds practically all variables that are safe to remove, and removes them, even if code size increases.<br />
<br />
<b>2. Code flattening.</b> It is straightforward to split up code that is in a straight line,<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">if (a) f(a);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (b) f(b);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (c) f(c);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (d) f(d); </span></b><br />
<br />
Here we can pick any line at which to split. But things are harder if the code is nested<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">if (a) {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> f(a);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">} else {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> if (b) {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> f(b);</span></b><br />
<div>
<b><span style="font-family: Courier New, Courier, monospace;"> } else {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> if (c) {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> f(c);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> } else {</span></b><br />
<div>
<b><span style="font-family: Courier New, Courier, monospace;"> if (d) f(d);</span></b><br />
<div>
<b><span style="font-family: Courier New, Courier, monospace;"> }</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> }</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">}</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"><br /></span></b></div>
</div>
</div>
<div>
The second pass "flattens" code by breaking up if-else chains; for example, in this case it would generate something like</div>
<div>
<br /></div>
<div>
<b><span style="font-family: Courier New, Courier, monospace;">var m = 1;</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (a) {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> m = 0;</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> f(a);</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">}</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (m && b) {</span></b><span style="font-family: Courier New, Courier, monospace;"><b> </b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b> m = 0; </b></span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><b> f(b);</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b>} </b></span><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (m && c) {</span></b><br />
<span style="font-family: Courier New, Courier, monospace;"><b> m = 0; </b></span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><b> f(c);</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b>} </b></span><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (m && d) {</span></b><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><b> f(d);</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b>}</b></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><b><br /></b></span></div>
</div>
</div>
</div>
<div>
The ifs are no longer nested, and form a straightline list of statements which is easier to process. However, we have added overhead here, both in terms of code size and in additional work that is done.</div>
<div>
<br /></div>
<div>
<b>3. Outline code.</b> Now that we minimized the number of local variables and made the code sufficiently flat to be easy to process, we recursively search the AST for straightline lists of statements where we can outline a particular range of them. When we find a suitable one (for details of the search, see <a href="https://github.com/kripken/emscripten/blob/incoming/tools/js-optimizer.js#L3460">the code</a>), we outline it: create a new function, move the code into it, and add spill/unspill code on both sides to pass over the local variables (only the ones that are necessary, of course). A further issue to handle is to "forward" control flow changes, for example if we outlined a <b>break</b> then we basically forward that to the calling function and so forth. Then in the original function the code has been replaced with something like this:<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">HEAP[x] = 0; // clear control var</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">HEAP[x+4] = a; // spill 2 locals</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">HEAP[x+8] = b;</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">outlinedCode(); // do the call</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">b = HEAP[x+8]; // read 2 locals</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">c = HEAP[x+12];</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">if (HEAP[x] == 1) {</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"> continue; // forwarded continue</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;">}</span></b><br />
<b><span style="font-family: Courier New, Courier, monospace;"><br /></span></b>
We spill some variables before the call and read some back afterwards (note that they are not necessarily the same ones - here the outlined code does not modify <b>a</b>, so there is no need to read it back). We also use a location in <b>HEAP</b> as a control flow helper variable, to tell us whether the outlined code wants us to perform, in this case, a <b>continue</b>.<br />
<br />
Overall, while there is additional size and overhead here, with this approach we can replace a very large amount of code with just a function call to it plus some bookkeeping. As the graph above shows, in some benchmarks this can decrease compilation time very significantly.<br />
<br />
To try this optimization in an emscripten-using project, just add <b><span style="font-family: Courier New, Courier, monospace;">-s OUTLINING_LIMIT=N</span></b> to your compilation flags (during conversion of bitcode to JS), where N is roughly the maximum function size you want: functions larger than that are broken up.</div>
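<div>
For example, a hypothetical invocation might look like this (the input file and the other flags here are placeholders, not requirements):<br />
<br />
<b><span style="font-family: Courier New, Courier, monospace;">emcc src.cpp -O2 -s OUTLINING_LIMIT=20000 -o src.js</span></b><br />
</div>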
<div>
<br /></div>
azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com0tag:blogger.com,1999:blog-2681864008001569004.post-43864129574771541602013-06-20T10:34:00.000-07:002013-06-20T10:34:38.893-07:00What asm.js is and what asm.js isn'tThis is going to be a bit long, so <b>tl;dr</b> asm.js is a
formalization of a pattern of JavaScript that has been developing in a
natural way over years. It isn't a new VM or a new JIT or anything like that.<br />
<br />
<a href="http://asmjs.org/">asm.js</a> is a <b>subset of JavaScript</b>, defined with the goal of being <b>easily
optimizable</b> and used primarily as a <b>compiler target</b> from languages like
C and C++. I've seen some recent online discussions where people appear
to misunderstand what those things mean, which motivated me to write this post, where I'll give my perspective on asm.js together with some context and history.<br />
<br />
The first thing worth mentioning here is that
compiling into JavaScript is nothing new. It's been done since at least
<b>2006</b> with <a href="https://developers.google.com/web-toolkit/">Google Web Toolkit</a> (GWT) which can compile Java into
JavaScript (GWT also does a lot more, it's a complete toolkit for
writing complex clientside apps, but I'll focus on the compiler part of
it). Many other compilers from various languages to JavaScript have
shown up since then, for both existing languages like C++ and C#, to new
languages like CoffeeScript, TypeScript and Dart.<br />
<br />
Compiled code
(that is, JavaScript that is the output of a compiler) can look odd.
It's often not in a form that we would be likely to write by hand. And
each compiler has a particular pattern or style of code that it emits:
For example, a compiler can translate classes in one language into
JavaScript classes using prototypal inheritance (which might look a little more like "typical" JS), or it can implement
those classes with function calls without inheritance (passing
"this" manually; this might look a little less "typical"), etc. Each compiler has a way in which it works, and
that gives it a particular pattern of output.<br />
<br />
Different patterns
of JavaScript can run at different speeds in different engines. This is
obvious of course. On one specific benchmark a JS engine can be faster than another, but in general no JS engine is "the fastest", because different
benchmarks can test different things - use of classes, garbage
collection, integer math, etc. When comparing JS engines, one can be better on
one of those aspects and slower on another, because optimizing for them can be quite separate. Look at the individual
parts of benchmarks on <a href="http://arewefastyet.com/">AWFY</a> for example (click "breakdown"), and you'll see that.<br />
<br />
The
same is true for compiled code: different JS engines can be faster or
slower on particular patterns of compiled code. For example, it was
recently noticed that <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=870627">Chrome is sometimes much faster than Firefox on GWT-generated code</a>; as you can see in the bug there, recent
work has narrowed the gap. There is nothing odd about that: one JS
engine was faster than another on a particular pattern, and work was
done to catch up. This is how things work.<br />
<br />
Aside from Java, which
GWT compiles, another important language is C++. There have been at
least two compilers from C++ to JavaScript in existence over the last
few years, <a href="http://emscripten.org/">Emscripten</a> and <a href="http://mandreel.com/">Mandreel</a>. They are separate
projects and quite different in many ways - for example, Emscripten is
open source and targets just JavaScript, while Mandreel is closed source
and can target other things as well like Flash - but over a few years
they converged on basically the same pattern of JavaScript for compiled
C++ code. That pattern involves using a <b>singleton typed array</b> to
represent memory, and <b>bitwise operations</b> (<span style="font-family: Courier New, Courier, monospace;">|0</span>, etc.) to get values to
behave like C++ integers (the JavaScript language only has doubles).<br />
<br />
We might say that Emscripten and Mandreel "discovered" a useful pattern of JavaScript, where "discovered" means the same as when Crockford discovered JSON. Of course typed arrays, bitwise ops and JavaScript Object Notation syntax all existed earlier, but noticing particular ways in which they can be especially useful is significant. Today it feels very natural to use JSON as a data interchange format - it almost feels like it was originally designed to be used in that manner, even though of course it was not - and likewise, if you are writing a compiler from C++ to JavaScript, it is natural to use <span style="font-family: Courier New, Courier, monospace;">|0</span> and a singleton typed array to do so.<br />
<br />
Luckily for the Emscripten/Mandreel pattern, that type of code benefits
from many years of work that went into JavaScript engines. For example,
they are all very good at optimizing code whose types do not change, and
the Emscripten/Mandreel pattern generates code that is implicitly
statically typed, since it originated as statically typed C++, so in
fact types should not change. Likewise, the bitwise operators that are
important in the Emscripten/Mandreel pattern appeared in crypto tests in major benchmarks like SunSpider, so they
are already well-optimized for.<br />
<br />
In other words, the
Emscripten/Mandreel pattern of code could be quite fast due to it
focusing on things that JS engines <b>already</b> did well. This isn't a
coincidence of course, as both Emscripten and Mandreel tested in
browsers and decided to generate code that ran as quickly as possible. Then, as the output of these compilers was increasingly
used on the web, browsers started to optimize for them in more specific
ways. Google added a benchmark of Mandreel code to the <a href="https://developers.google.com/octane/benchmark">Octane benchmark</a>
(the successor to the popular V8 benchmark), and both Chrome and Firefox
have optimized for both Mandreel and Emscripten for some time now (as a quick search in their bug trackers can show). So
the Emscripten/Mandreel pattern became yet another pattern of JavaScript
that JavaScript engines optimize for, alongside all the others which
are represented in the familiar benchmarks: SunSpider, Octane, Kraken,
etc. This was all a very natural process.<br />
<br />
Here is where we get to
asm.js. While Emscripten/Mandreel code can run quickly, there was still
some gap between it and native code. The gap was not huge - in many
cases it ran just <b>3x</b> slower than native code, which is pretty close to,
say, Java - but still problematic in some cases (for example,
high-performance games). So Mozilla began a research project to
investigate ways to narrow that gap. <a href="https://github.com/dherman/asm.js/">Developed in the open</a>, asm.js started to <b>formally define</b> the Emscripten/Mandreel
pattern as a type system. Formalizing it made us think about all the
corner cases, and we found various places in Emscripten/Mandreel code
where types could change at runtime, which is contrary to the goal of
the pattern, and can make things slower. asm.js's goal was to get rid of
all those pitfalls.<br />
<br />
The <a href="http://kripken.github.io/mloc_emscripten_talk/#/27">first set of benchmark results</a> was promising - execution close to <b>2x</b> slower than native, even
on larger benchmarks. Achieving that speedup in Firefox took a single
engineer only 3 months, since it <b>didn't</b> require a new VM or a new JIT -
it only required adding some additional optimizations to Firefox's
existing JS engine. Soon after, Google showed very large speedups on asm.js code as well, as noted in the IO keynote and as you can <a href="http://arewefastyet.com/#machine=11&view=breakdown&suite=asmjs-apps">see on AWFY</a>. (In fact, over the last few days there have been several speedups noticeable there on both browsers - these
improvements are happening as we speak!) Again, these large speedups in
two browsers were achieved quickly because no new VM or JIT was written,
just additional optimizations to <b>existing</b> JS VMs. So it would be
incorrect to say that asm.js is a "new VM".<br />
<br />
I've also seen people
say that asm.js is a new "web technology". Loosely speaking, I suppose
we could use that term, but only if we understand that if we say "asm.js
is a web technology" then that is totally different from when we say
"WebGL is a web technology". Something like WebGL is in fact what we
normally call a "web technology" - it was standardized, and browsers
need to decide to support it and then do work to implement it. asm.js,
on the other hand, is a certain <b>pattern</b> of
compiled code in JavaScript, which is already standardized and supported. asm.js, just like GWT's output pattern, does not need to
be standardized, nor do browsers need to "support" it, nor to
standardize whatever optimizations they perform for it. Browsers already
support JavaScript and compete on JavaScript speed, so asm.js, GWT
output, CoffeeScript output, etc., all work in them and are optimized to
varying degrees.<br />
<br />
Regarding the term "support": As before, I
suppose we can loosely talk about browsers "supporting" asm.js, but we
need to be sure what we mean. On the one hand, when we talk about WebGL
then it is clear what we mean when we say something like "IE10 does not
support WebGL". But if we say "a browser supports asm.js", we intend
something very different - I guess people using that phrase mean
something like "a browser that spent some time to optimize for asm.js".
But I don't see people saying "a browser supports GWT output", so I
suspect that using the term "support" comes from the mistaken notion
that asm.js is something more than a pattern of JavaScript, which is not
the case.<br />
<br />
Why, then, do some people apparently still think
asm.js is anything more than a pattern of JavaScript? I can't be sure,
but here are some possibilities and clarifications to them:<br />
<br />
<b>1.</b> asm.js code basically defines a low-level VM, in a sense: There is a singleton array for "memory" and all the operations are on that. While true, it is also true for Emscripten and Mandreel output, and other compilers as well. So in some sense this is valid to say, but just in the general sense that we can implement VMs in JavaScript (or any Turing-complete language), which of course we can regardless of asm.js.<br />
<br />
<b>2.</b> Some of the initial speedups from
asm.js were surprisingly large, and it felt like a big jump from the
current speed of JavaScript, not an incremental step like things
normally progress. And big jumps often require something "new",
something large and standalone. But as I mentioned above, this was
actually a very gradual process. Relevant optimizations for Emscripten/Mandreel
code have taken years; writing SSA-optimizing JITs like CrankShaft and
IonMonkey has likewise taken years (there is overlap between these two statements, of course), and the speedups on asm.js code are
due to those long-term projects being able to really shine on a code
pattern that is easy for them to optimize. In fact, many of the recent
optimizations to speed up asm.js code do not actually add new
optimizations directly, instead they change whether the browser decides
to fully optimize it - that is, they get the code to <b>actually reach</b> the
optimizing JIT (CrankShaft, IonMonkey, etc.). The power of those
optimizing JITs has been there for a while now, but sometimes heuristics
prevented it from being used.<br />
<br />
<b>3.</b> A related thing is that I sometimes see people say that asm.js code will be "unusably slow" on browsers that do not optimize for it in a special way (just today I saw such a comment on hacker news, for example). First of all this is just false: look at <a href="http://arewefastyet.com/#machine=11&view=breakdown&suite=asmjs-ubench">these</a> <a href="http://arewefastyet.com/#machine=11&view=breakdown&suite=asmjs-apps">benchmarks</a>, it is clear that on many of them Firefox and Chrome have essentially the same performance, and where there is a difference, it is noticeably decreasing. But aside from the fact that it is false, it is also disrespectful to the JavaScript VM engineers at Google, Microsoft and Apple: To say that asm.js code will be "unusably slow" on browsers other than Firefox implies that those other VM devs can't match a level of performance that was shown to be possible by their peers. That is a ridiculous thing to say given how talented those devs are, which has been proven countless times.<br />
<br />
If someone says it will be "unusably slow" not because they <i>can't</i> reach the same speed but because they <i>won't</i>, then that looks obviously false. Why would any browser decide to not optimize for a type of JavaScript that is being used - including in high-profile things like <a href="http://www.unrealengine.com/html5/">Epic Citadel</a> - and has relevant benchmarks? All browsers want to be fast on everything, this is a very competitive field, and we have already seen Google optimize for asm.js code as mentioned earlier in the IO keynote; there are also some <a href="http://channel9.msdn.com/Blogs/Charles/Anders-Hejlsberg-Steve-Lucco-and-Luke-Hoban-Inside-TypeScript-09">positive signs</a> from Microsoft as well.<br />
<br />
<b>4.</b> Another possible reason is that
the asm.js optimization module in Firefox is called OdinMonkey (mentioned for example in <a href="http://blog.mozilla.org/luke/2013/03/21/asm-js-in-firefox-nightly/">this blogpost</a>). SpiderMonkey, Firefox's JavaScript
engine, has given its JITs Monkey-related names: TraceMonkey, JaegerMonkey and
IonMonkey (although the newer Baseline JIT received no Monkey name,
so the convention is not entirely consistent). So perhaps OdinMonkey sounds like
it could be a new JIT, which seems to imply that asm.js optimizations
require a new JIT. But as mentioned in that blogpost, OdinMonkey is <b>not</b> a
new JIT, instead it is a module that sits alongside the rest of the
parts of the JavaScript engine. What OdinMonkey does is detect the
presence of asm.js code, take the parse tree that the normal parser
emitted, type check the parse tree, and then send that information into
the existing IonMonkey optimizing compiler. All the code optimization
and code generation of IonMonkey are still being used, no new JIT for
asm.js was written. That's why, as mentioned before, writing OdinMonkey only took one engineer 3 months' work (while also working on the spec). Writing a new JIT would take much more
time and effort!<br />
<br />
<b>5.</b> Also possibly related is the fact that
OdinMonkey uses the "use asm" hint to decide to type check the code.
This is indeed a bit odd, and feels wrong to some people. I certainly
understand that feeling, in fact when working on the design of asm.js I
argued against it. The alternative would be to do some heuristic check:
Does this block of code not contain anything impossible in asm.js (no
throw statements, for example), and does it contain a singleton typed
array, etc.? If so, then start to type check in OdinMonkey. This could
achieve practically the same result in my opinion. It does, however,
have some downsides: heuristics are sometimes wrong and often add overhead, and in practice this optimization won't "luckily work" on
random code - it will be expected and intended by a person using a
C++ to JS compiler, so why not have them state that intention? Those are strong arguments for using an explicit hint.<br />
<br />
The important thing is that "use asm" does not affect JS semantics (if it does in some case - and new optimizations often do cause bugs in odd corner cases - then that must be fixed just like any other correctness bug). That means that JavaScript engines can ignore it, and if it is ultimately determined to be useless by JS engines then we can just stop emitting it.<br />
<br />
<b>6. </b>asm.js has a spec, and defines a type
system. That seems much more "formal" than, say, the output pattern of
GWT, and furthermore things that <b>do</b> have specs are generally things
like IndexedDB and WebGL, that is, web technologies that need to be
standardized and so forth. I think this implies to some people that
asm.js is more like WebGL than just a pattern of JavaScript. But not
everything with a spec has it for standardization purposes. As mentioned
before, one reason for writing a spec and type system for asm.js was to
really force us to think about every detail, in order to get rid of all
the places where types can change, and other stuff that prevents
optimizations. There are a few other reasons that IMO justified the
effort to write a spec:<br />
<br />
* Having a spec makes it
easy to communicate things to other people. As I mentioned before, GWT
output often used to run (and maybe still does) faster in Chrome than
Firefox, and I am not really sure why (probably multiple reasons). GWT
and Chrome are two projects from the same company, so it would not be
surprising if their devs talk to each other privately, and there would
be nothing wrong with it if they did. Emscripten and Firefox are, like
GWT and Chromium, two open source projects primarily developed by a
single browser vendor, so there is sort of a parallel situation here. To avoid the downsides of private discussions, we felt that writing a spec for asm.js would give us the benefit of being
able to say "<b>this (with all the technical details) is why this code
runs fast in Firefox</b>." That means that if other browser vendors want to
run that code quickly, then they have all the docs they need in order to
do so, and on the other side of things, if other compilers like
Mandreel want to benefit from those speedups, then once more they have
all the information they need as well. Writing a spec (and doing so in the open) makes things public and
transparent.<br />
<br />
* Having a type system opens up the possibility to
do <b>ahead of time (AOT)</b> compilation in a reasonable way. Note that AOT
should have already been possible in the output patterns for Emscripten
and Mandreel (remember, they are generated from C++, that is AOTed), but
actually wasn't because of the few places where types could in fact
change. Assuming we did things properly, asm.js should have no such
possible type changes anymore, so AOT is possible. And a type system
makes it not just possible but quite reasonable to actually do so in
practice.<br />
<br />
AOT can be very useful in reducing the overhead and risk of optimization heuristics. As mentioned before, in some cases code never even reaches the optimizing JIT due to heuristics not making the optimal decision, and furthermore, collecting the data for those heuristics (how long a function or loop executes, type info for variables, etc.) adds overhead (during a warmup phase before full optimization, and possibly later during deoptimization and recompilation). If code can be compiled in an AOT manner, we simply avoid those two problems: we optimize everything, and we do so immediately. (There is a potential downside, however, in that fully optimizing large amounts of code can take significant time; in Firefox this is partially mitigated by things like compilation using multiple cores; another possibility is to fall back to the baseline JIT for particularly slow-to-compile functions.)<br />
<br />
Another benefit of AOT compilation is that it can avoid the "stutter" problem - where a game pauses or slows down briefly when new code is executed (such as when a new level begins that uses new mechanics or effects), because that code is only then detected as hot and optimized, at the precise time when it needs to be fast. Stutter can be quite problematic, as it tends to happen at the worst possible times - when something interesting happens - and I've often heard game devs be concerned about it. AOT compilation fully optimizes all the code ahead of time, so the stutter problem is avoided.<br />
<br />
A further benefit of AOT is that it gives predictable performance: we know that all the code will be fully optimized, so we can be reasonably confident of performance in a new benchmark we have never tested on, since we do not rely on heuristics to decide what to optimize and when. This has been shown repeatedly when AOT compilation in Firefox achieved good performance the very first time we ran it on a new codebase or benchmark, for example on <a href="http://www.unrealengine.com/html5/">Epic Citadel</a>, <a href="http://kripken.github.io/lua.vm.js/lua.vm.js.html">a compiled Lua VM</a>, and others. In my JavaScript benchmarking experiences in the past, that has been rare.<br />
<br />
AOT therefore brings several benefits. But with all that said, it is an implementation detail, one possible approach among others. As JS engines continue to investigate ways to run this type of code even better, we will likely see more experimentation in this area.<br />
<br />
*
The last reason for writing the spec and type system: asm.js began as a
research project, and we might want to publish a paper about it at some
point. (This one is much less important than the others, but since I wrote those out, I might as well be complete.)<br />
<br />
That sums up point 6, why asm.js has a spec and type system. Finally, two last possible reasons why people might mistakenly think asm.js is a new VM or something like that:<br />
<br />
<b>7. </b>Math.imul. Math.imul is something that came up during the design of asm.js, that did actually turn into a proposal for standardization (for ES6), and has been implemented in at least Firefox and Chrome so far. Some people seem to think that asm.js relies on Math.imul, which would imply that asm.js relies on new language features in JavaScript. While Math.imul is helpful, it is entirely optional. It makes only a tiny impact on benchmarks, and is super-easy to polyfill (which emscripten does). Perhaps it would have been simpler to not propose it and just use the polyfill code, to avoid any possible confusion. But Math.imul is so simple to define (multiply two 32-bit numbers properly, return the lower 32 bits), and is basically the only integer math operation that cannot be implemented in a natural way in JS (the remarkable thing is that all the others <b>do</b> have a natural way to be expressed, even though JavaScript does not have integer types!), so it just felt like a shame not to.<br />
<br />
It's important to stress that if Math.imul were opposed by the JavaScript community, it would of course not have been used by asm.js - asm.js is a subset of JavaScript, it can't use anything nonstandard. It goes without saying though that new things are being discussed for standardization in JavaScript all the time, and if there are things that could be useful for compiled C++ code, then such things are worth discussing in the JavaScript community and standards bodies.<br />
<br />
<b>8.</b>
Finally, asm.js and Google's <a href="http://www.chromium.org/nativeclient/pnacl/building-and-testing-portable-native-client">portable native client</a> (PNaCl) share the goal of
enabling C++ code to run safely on the web at near-native speeds. So it
is perhaps not surprising that some people have written blogposts
comparing them. And aside from that shared goal, there are some other
similarities, like both utilizing LLVM in some manner. But to compare
them as if they were two competing <b>products</b> - that is, as if they
directly compete with each other, and one's success may be at the
other's detriment - is, I think, irrelevant. PNaCl (and PPAPI, the pepper plugin API on which it depends) is a proposed web
technology, something that needs to be standardized with support from
multiple vendors if it is to be part of the web, and asm.js on the other
hand is just a pattern of JavaScript, something that is already
supported in all modern browsers. asm.js code is being optimized for
right now, just like many patterns of JavaScript are, driven by the
browser speed race. So asm.js will continue to run faster over time,
<b>regardless</b> of what happens with other things like PNaCl: I don't see a
realistic scenario where PNaCl somehow makes browser vendors decide to
stop competing on JavaScript speed (other plugin technologies like Flash, Java, Silverlight, Unity, etc. did not).<br />
<br />
All of this has nothing to do
with how good PNaCl is. (Worth mentioning here that I consider it to be
an impressive feat of engineering, and I have a huge amount of respect
for the engineers working on it.) It is simply operating in a different
area. The JavaScript speed race is very important right now in the
browser market: When your browser runs something slower than another
browser, that's something that you are naturally motivated to improve.
This happens with benchmarks (we already saw big speedups on two
browsers on asm.js code), and it happens with things like the Epic
Citadel demo (it initially did not run at all in Chrome, but Google
quickly fixed things). This kind of stuff will continue to happen on the
web and drive asm.js performance, and as already mentioned, this is
regardless of what happens with PNaCl.<br />
<br />
<b>Are things really that simple?</b><br />
<b><br /></b>
I've
been saying that asm.js is just a pattern of JavaScript and nothing
more, and the speedups on it are part of the normal JavaScript speed
competition that leads browser vendors to optimize various patterns. But
actually things are more complicated than that. I would not
argue that <b>any</b> pattern should be optimized for, nor that it would be
ok for a vendor to specifically optimize based on hints for an <b>arbitrary</b> subset of JavaScript.<br />
<br />
To make my
point, consider the following extreme hypothetical example: A browser
vendor decides to optimize a little subset of JS that has a JS array for
memory, and stores either integers or strings into it. For example,<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">function WTF(print) {<br /> 'use WTF';<br /> var mem = [];<br /> function run(arg) {<br /> mem[20] = arg;<br /> var a = 100;<br /> mem[5] = a;<br /> a = 'hello';<br /> mem[11] = a;<br /> print(mem[5]);<br /> mem[5] = mem[11];<br /> a = arg;<br /> return a;<br /> }<br /> return run;<br />}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
Call
this <b>WTF.js</b>. So when 'use WTF' is present, the browser checks that
there is a singleton array called mem and so forth, and can then optimize the WTF.js code very well: mem is in a closure and does not escape, so we know its
identity statically; we also know it should be implemented
as a hash table because it likely has holes in its indexes; we know it
can contain only ints or strings so some custom super-specialized data structure might be useful here; etc. etc. Again, this is an example meant to be extreme and
ridiculous, but you can imagine that such a pattern might be useful as a
compilation target for some odd language.<br />
<br />
What is wrong with
WTF.js is that while it is JavaScript, it is a <b>terrible</b> kind of
JavaScript. In particular, it violates basic principles of how current JavaScript
engines optimize: for starters, variables are given more than one type (both locals
and elements in mem). This is exactly what JS engine devs have been
telling us all <b>not</b> to do for years now. Also, this subset comes out of
nowhere - I can't think of anything like it, and when I tried to give it
a hypothetical justification in the previous paragraph, I had to really
reach. And there are obvious ways to make this more reasonable, for
example to use consecutive indexes from 0 in the "mem" array, so there are no holes - this is something else that JS engine devs have been telling us to do for a
very long time - and so forth. So WTF.js is in fact WTF-worthy.<br />
<br />
None
of that is the case with asm.js. As I detailed before, asm.js is the
latest development in a <b>natural</b> process that has been going on in the
JavaScript world for several years: Multiple compilers from C++ to JS
appeared spontaneously, later they converged on the subset of JS that
they target, and later JS engines optimized for that subset as it became more common on the web. While <span style="font-family: Courier New, Courier, monospace;">|0</span>
might look odd to you, it wasn't invented for asm.js. It was
"discovered" independently multiple times way before asm.js, found to be useful, and JS
engines optimized for it. This process was not directed by anyone; it
happened organically.<br />
<br />
These principles guided us while designing
early versions of asm.js: There were code patterns that were even easier
to optimize, but they were novel, sometimes bizarre, and most
importantly, they ran poorly without special optimizations for them,
which did not exist yet. When I felt that a proposed pattern was of that nature, I strongly opposed it. Emscripten and
Mandreel code already ran quite well; to design a new pattern that could
be much faster with special optimizations, but right now is
significantly slower without them, is a bad idea - it would feel unfair
and WTF-like.<br />
<br />
Therefore as we designed asm.js we tested on JS engines
without any special optimizations for it, primarily on Chrome and on
Firefox with the WIP optimizations turned off. asm.js is one output mode
in Emscripten, so we had a good basis for comparison: when we flip the
switch and go from the old Emscripten output pattern to asm.js, do things get
faster or slower? (Note btw that we lack something similar in the WTF.js
example from before, which is another problem with WTF.js.) What we saw was that (when we avoided the more
novel ideas I mentioned before, which were rejected), flipping the switch
from the old pattern to asm.js generally had little effect, sometimes
helping and sometimes hurting, but overall things stayed at much the same (quite good) level of performance JS engines already had. That made sense, because asm.js is very
close to the old pattern. (My guess is that the cases where it helped
were ones where asm.js's extra-careful avoidance of types changing at
runtime made a positive difference, and cases where it hurt were ones
where it happened to hit an unlucky case in the heuristics being used by
the JS engine.)<br />
<br />
A few months ago Emscripten's default code generation mode switched to asm.js. Like most significant changes it was discussed on the mailing list and IRC, and I was happy to see that we did not get reports of slowdowns in other browsers as a result of the change, which is further evidence that we got that part right. In fact (aside from the usual miscellaneous bugs following any significant change in a large project), the main regression caused by the change was a set of reports of startup slowdowns in <i>Firefox</i>, of all browsers! (These were caused by AOT compilation sometimes being sluggish; following those bug reports, however, it was easy to get large speedups there.)<br />
<br />
Developing and shipping optimizations for something bizarre like WTF.js would be selfish, since it benefits one
vendor while arbitrarily, unexpectedly and needlessly harming performance in other vendors'
browsers. But none of that is the case with asm.js, which builds on a naturally-occurring pattern of
JavaScript (the Emscripten/Mandreel pattern, already optimized for by JS engines), was designed to not harm performance on other browsers compared to that already-existing pattern, and as benchmarks have shown on two browsers already, creates
opportunities for significant speedups on the web.<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com7tag:blogger.com,1999:blog-2681864008001569004.post-2611536016033349032013-05-30T09:45:00.000-07:002013-05-30T09:45:57.583-07:00Lua in JavaScript: Running a VM in a VM<br />
<a href="http://www.lua.org/">Lua</a> is a cool language, and it would be great to run it on the web. It isn't easy to do that though, see <a href="http://stackoverflow.com/a/183273">this answer</a> (and the thread around it),<br />
<blockquote class="tr_bq">
[Converting/running Lua in JavaScript] is a recurrent question on the Lua list, i guess because of the superficial similarity of the two languages.<br />
<br />
Unfortunately, there are many important differences that are not so
obvious. Making it work need either a full-blown compiler targeting JS
instead of Lua's bytecode, or rewriting the Lua VM in JavaScript.</blockquote>
There are in fact projects taking those two approaches, for example <a href="https://github.com/mherkender/lua.js">lua.js</a> and <a href="https://github.com/humbletim/ljs">ljs</a> respectively. But as mentioned in the quote above, it is quite difficult to get the full language (including things like using functions as indexes in tables, possible in Lua but not in JavaScript) with a simple 1-to-1 translation, and writing a full VM is not a quick solution either. Any such efforts miss out on all the work that has already gone into the Lua implementation in C.<br />
<br />
There is also a third alternative, which is to compile the Lua C implementation into JavaScript, that is, to <b>run the Lua VM inside your JavaScript VM</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://knowyourmeme.com/memes/xzibit-yo-dawg"><img border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjznPb95iM1apPvGGvZ1d6QW-u45o-pBAzwgwJL_AirEpaVkJ_8K8Es5DzK8I3sHVI7fzftZKsJUZvy-y1iXicUIA0HEFz5wLikgW9oqkIsV_UV1zPdHSjCCwOO5jwYg1o1sISs_YNe-u5o/s320/38214072.jpg" width="320" /></a><span id="goog_818763101"></span><span id="goog_818763102"></span><a href="http://www.blogger.com/"></a></div>
<br />
<b><a href="https://github.com/kripken/lua.vm.js">lua.vm.js</a></b> is a project I started over the last week that does just that, here is a <b><a href="http://kripken.github.io/lua.vm.js/lua.vm.js.html">benchmark page</a></b> and here is a <b><a href="http://kripken.github.io/lua.vm.js/repl.html">REPL</a></b> for it.<br />
<br />
The Lua VM compiles out of the box using Emscripten, with only a few minor Makefile tweaks. It's straightforward to then create a REPL where you tell the VM to run some code. lua.vm.js does more than that though; as you can see in the example on the REPL page, you can interact with the page using Lua:<br />
<blockquote class="tr_bq">
<span style="font-family: "Courier New",Courier,monospace;"><span style="color: #741b47;">local</span> screen = js.global.screen<br /><span style="color: #0b5394;">print</span>(<span style="color: #990000;">"you haz "</span> ..<br /> (screen.width*screen.height)<br /> .. <span style="color: #990000;">" pixels"</span>)<br /><br /><span style="color: #741b47;">local</span> document = js.global.document<br /><span style="color: #0b5394;">print</span>(<span style="color: #990000;">"this window has title '"</span> ..<br /> document.title<br /> .. <span style="color: #990000;">"'"</span>)</span><span style="font-family: "Courier New",Courier,monospace;"><span style="color: #741b47;"><br /><br />local</span> window = js.global<br />window.alert(<span style="color: #990000;">"hello from lua!"</span>)<br />window.setTimeout(<span style="color: #741b47;">function</span>()<br /> <span style="color: #0b5394;">print</span>(<span style="color: #990000;">'hello from lua callback'</span>)<br /><span style="background-color: white;"><span style="color: #741b47;">end</span></span>, <span style="color: #38761d;">2500</span>)</span></blockquote>
Lua is given a <span style="font-family: "Courier New",Courier,monospace;"><b>js</b></span> global that lets you refer to things in the JavaScript/DOM world. You can get properties on those objects, call functions, even set callbacks back into Lua.<br />
<br />
Since this uses the full Lua VM, it means you get the full Lua language, and it took just a few days since all the work that went into that VM is reused. That includes an incremental GC and everything else.<br />
<br />
At this point you might be wondering if this isn't a crazy idea - a VM in a VM? There is definitely a lot of skepticism about that going around, but we won't know if the skepticism is justified or not if we don't try, hence this project.<br />
<br />
The first specific concern is about size. It turns out that the entire compiled Lua VM fits in 200K when gzipped. That's too much for some use cases, but certainly acceptable for others (especially with proper caching).<br />
<br />
Another concern is performance. That's what <a href="http://kripken.github.io/lua.vm.js/lua.vm.js.html">the benchmark</a> page is for. As mentioned there, the Lua VM compiled into asm.js can have similar performance to other real-world codebases, around half the speed of native execution. Again, that overhead is too much for some use cases, but certainly acceptable for others. In particular, remember that the Lua VM is often significantly faster than other dynamic languages like Python and Ruby. These languages are useful in many cases even if they are not super-fast.<br />
<br />
A third concern is compatibility. But compiling to JavaScript is the safest way to run everywhere, since it requires nothing nonstandard at all and consequently should run in all modern web browsers. Speed can differ of course, but the good thing about JavaScript performance is that if one browser is faster at something then the others always catch up.<br />
<br />
So running a VM in a VM can make sense sometimes. It seems like an odd thing to run an entire VM on top of another, but the overhead ends up smaller in reality than one might imagine. If you have a lightweight, efficient VM in portable C, like Lua does, then when compiled to the web it behaves much like other portable C/C++ codebases, and <a href="http://arstechnica.com/information-technology/2013/05/native-level-performance-on-the-web-a-brief-examination-of-asm-js/">it is possible to run compiled C code at near-native speeds on the web</a>. JavaScript VMs are very capable these days, I think more than people sometimes assume.<br />
<br />
How far can the approach of running a VM in a VM get us? Pretty far, I think: this early prototype is further along than I had expected, and this is before any specific optimizations.<br />
<br />
There are, however, some tricky issues. For example, we can't do cross-VM cycle collection - if a Lua object and a JavaScript object are both not referred to by anything, but do refer to each other, then to be able to free them we would need to traverse the entire heap on both sides, and basically do our own garbage collection in place of the browser's - <i>for normal JavaScript objects</i>, not just a new type of objects like Lua ones. JavaScript engines don't let us do that, for a combination of security and performance reasons. What we can do is allow Lua to hold strong references into the JavaScript world, and automatically free those when Lua GCs such a reference.<br />
<br />
That limits things, but it's important to remember that cross-VM cycle collection is a hard problem in CS in general - the only easy case is when one VM's objects can be implemented entirely in the other, but that isn't possible in most cases (and not in this one: for example, some Lua objects can have finalizers - <span style="font-family: "Courier New",Courier,monospace;">__gc</span> - that can't be implemented in JavaScript), and even when it is, performance is a concern. But note that this type of problem would also be present if we shipped two separate VMs in web browsers.<br />
<br />
Speaking of the idea of shipping additional VMs, that suggestion is often heard; specifically, I have seen people call for shipping Python, Mono, Lua, etc. VMs alongside JavaScript (alongside, because we need JavaScript compatibility for the existing web, none of those VMs can run JavaScript near the current speed, and we must not regress performance). This approach has appeal, but also downsides. Perhaps the main issue is that, compared to shipping additional VMs, running other VMs inside the JavaScript VM has some advantages:<br />
<ul>
<li>Security is built in, since the new VM runs inside one that has already been heavily tested and hardened. The attack surface is not increased at all.</li>
<li>VMs are downloaded like any web content, which means we can use new languages as people port them to JavaScript, we don't need to wait for browsers to decide to support and ship additional VMs, and we don't risk different browsers deciding to support different VMs.</li>
</ul>
Running in the JavaScript VM also has disadvantages, but as discussed before they do not seem to be that bad in practice, at least for the Lua VM. It looks like the two approaches are architecturally very different, but can lead to similar results. So the bottom line is that instead of shipping additional VMs in web browsers alongside the JavaScript VM, we can run them inside the JavaScript VM, and the difference could be pretty much transparent to users: In both cases you can run Lua on the web, and in both cases performance is reasonable.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhROshmw1gGPXt2qWSKL5eMWmjnoVXlUisZWpxAnFd6yrWkAjWxTMcJMdxR-kO9I3NVdu0EQS6OAsvSU5CTS3_uZR7cp4Rw-iXRPfu-EUMCXQs957sUguPGgYMYzNstPuDuB6-VnurNU_XG/s1600/vm2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="289" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhROshmw1gGPXt2qWSKL5eMWmjnoVXlUisZWpxAnFd6yrWkAjWxTMcJMdxR-kO9I3NVdu0EQS6OAsvSU5CTS3_uZR7cp4Rw-iXRPfu-EUMCXQs957sUguPGgYMYzNstPuDuB6-VnurNU_XG/s320/vm2.png" width="320" /></a></div>
<br />
Maybe we wouldn't care? ;)<br />
<br />
Last thought, what about running a really fast VM like LuaJIT on the web? That is
trickier; LuaJIT has a JIT as well as a hand-written interpreter in
assembly, so it can't just be compiled like portable C code can. The JIT
would need a JavaScript backend, and the interpreter would need to be
ported to JavaScript as well. In principle this could work, but it's
hard to say how good performance would be, and there are some
interesting questions about code invalidation in code JITed to
JavaScript, PICs, etc. This is definitely worth trying, but it isn't a
project of a few days' work like porting the Lua C implementation was.<br />
<br />
In the meantime, if you love Lua, check out <a href="https://github.com/kripken/lua.vm.js">lua.vm.js</a>. Feedback and contributions are welcome!<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com11tag:blogger.com,1999:blog-2681864008001569004.post-81078533983107756442013-05-14T11:09:00.000-07:002013-05-14T11:09:16.099-07:00The Elusive Universal Web BytecodeIt's often said that the web needs a bytecode. For example, <a href="http://arstechnica.com/information-technology/2013/05/are-video-codecs-written-in-javascript-really-the-future/?comments=1&post=24430209#comment-24430209">the very first comment in a very recent article on video codecs on the web</a> says<br />
<blockquote class="tr_bq">
A proper standardized bytecode for browsers would (most likely) allow
developers a broader range of languages to choose from as well as hiding
the source code from the browser/viewer (if that's good or not is
subjective of course).</blockquote>
And other comments continue with<br />
<blockquote class="tr_bq">
Just to throw a random idea out there: LLVM bytecode. That
infrastructure already exists, and you get to use the ton of languages
that already have a frontend for it (and more in the future, I'm sure).<br />
[..]<br />
I also despise javascript as a language and wish someone would hurry up
replacing it with a bytecode so we can use decent languages again.<br />
[..]<br />
Put a proper bytecode engine in the browser instead, and those people
that love javascript for some unknowable reason could still use it, and
the rest of us that use serious languages could use them too.<br />
[..]<br />
Honestly, .Net/Mono would probably be the best bet. It's mature, there
are tons of languages targeting it, and it runs pretty much everywhere
already as fast as native code</blockquote>
Ignoring the nonproductive JS-hating comments, basically the point is that people want to use various languages on the web, and they want those languages to run fast. Bytecode VMs have been very popular since Java in the 90's, and they show that multiple languages can run in a single VM while maintaining good performance, so asking for a bytecode for the web seems to make sense at first glance.<br />
<br />
But already in the quotes above we see the first problem: Some people want one bytecode, others want another, for various reasons. Some people just like the languages on one VM more than another. Some bytecode VMs are proprietary or patented or tightly controlled by a single corporation, and some people don't like some of those things. So we don't actually have a <b>candidate</b> for a single universal bytecode for the web. What we have is a hope for an ideal bytecode - and multiple potential candidates.<br />
<br />
Perhaps though not all of the candidates are relevant? We need to pin down the criteria for determining what is a "web bytecode". The requirements as mentioned by those requesting it include<br />
<ul>
<li><b>Support all the languages</b></li>
<b>
</b>
<li><b>Run code at high speed </b></li>
</ul>
To those we can add two additional requirements that are not mentioned in the above quotations, but are often heard: <br />
<ul>
<li><b>Be a convenient compiler target</b></li>
<li><b>Have a compact format for transfer</b> </li>
</ul>
In addition we must add the requirements that anything that runs on the web must fulfill,<br />
<ul>
<li><b>Be standardized</b></li>
<b>
</b>
<li><b>Be platform-independent</b></li>
<b>
</b>
<li><b>Be secure</b></li>
</ul>
JavaScript can already do the last three (it's already on the web, so it has to). Can it do the first four? I would say yes:<br />
<ul>
<li><b>Support all the languages: </b>A <a href="https://github.com/jashkenas/coffee-script/wiki/List-of-languages-that-compile-to-JS">huge list of languages</a> can compile into JavaScript, and that includes major ones like C, C++, Java, C#, LLVM bytecode, and so forth. There are some rough edges - often porting an app requires changing some amount of code - but nothing that can't be improved on with more work, if the relevant communities focus on it. C++ compilers into JavaScript like <a href="http://emscripten.org/">Emscripten</a> and <a href="http://mandreel.com/">Mandreel</a> have years of work put into them and are fairly mature (for example see the <a href="https://github.com/kripken/emscripten/wiki">Emscripten list of demos</a>). <a href="https://developers.google.com/web-toolkit/">GWT</a> (for Java) has likewise been used in production for many years; the situation for C# is perhaps not quite as good, but improving, and even things like <a href="http://vps2.etotheipiplusone.com:30176/redmine/projects/emscripten-qt/wiki/Demos">Qt</a> can be compiled into JavaScript. For C#, Qt, etc., it really just depends on the relevant community being focused on the web as one of its targets: We know how to do this stuff, and we know it can work.</li>
<li><b>Run code at high speed: </b>It turns out that C++ compiled to JavaScript can run at <a href="http://kripken.github.io/mloc_emscripten_talk/#/27">about half the speed of native code</a>, which in some cases <a href="http://j15r.com/blog/2013/04/25/Box2d_Revisited">outperforms Java</a>, and is <a href="https://blog.mozilla.org/luke/2013/03/21/asm-js-in-firefox-nightly/">expected to get better still</a>. Those numbers are when using the <a href="http://asmjs.org/">asm.js</a> subset of JavaScript, which basically structures the compiler output into something that is easier for a JS engine to optimize. It's still JavaScript, so it runs everywhere and has full backwards compatibility, but it can be run at near-native speed already today.</li>
<li><b>Be a convenient compiler target:</b> First of all, the long list of languages from before shows that many people have successfully targeted JavaScript. That's the best proof that JavaScript is a practical compiler target. Also, there are many languages that compile into either C or LLVM bytecode, and we have more than one compiler capable of compiling those to the web, and one of them is open source, so all those languages have an easy path. Finally, while compiling into a "high-level" language like JavaScript is quite convenient, there are downsides, in particular the lack of support for low-level control flow primitives like goto; however, this is addressed by reusable open source libraries like the <a href="https://github.com/kripken/emscripten/tree/master/src/relooper">Relooper</a>.</li>
<li><b>Have a compact format for transfer:</b> It seems intuitive that a high-level language like JavaScript cannot be compact - it's human-readable, after all. But it turns out though that JS as a compiler target is<a href="http://mozakai.blogspot.com/2011/11/code-size-when-compiling-to-javascript.html"> already quite small</a>, in fact comparable to native code when both are gzipped. Also, even in the largest and most challenging examples, like <a href="http://www.unrealengine.com/html5/">Unreal Engine 3</a>, the time spent to parse JS into an AST does not need to be high. For example, in that demo it takes just 10 seconds on my machine to both parse <b>and</b> fully optimize the output of over 1 million lines of C++ (remember that much of that optimization time would need to be done no matter what format the code is in, because it has to be a portable one). </li>
</ul>
So arguably <b>JavaScript is already very close to providing what a bytecode VM is supposed to offer</b>, as listed in the 7 requirements above. And of course this is not the first time that has been said, see <a href="http://www.aminutewithbrendan.com/pages/20101122">here</a> for a previous discussion from November 2010. In the 2.5 years since that link, the case for that approach has gotten significantly stronger, for example, JavaScript's performance on compiled code has improved substantially, and compilers to JavaScript can compile very large C++ applications like Unreal Engine 3, both as mentioned before. At this point the main missing pieces are, first (as already mentioned) improving language support for ones not yet fully mature, and second, a few platform limitations that affect performance, notably lack of SIMD and threads with shared state.<br />
<br />
Can JavaScript fill the gaps of SIMD and mutable-memory threads? Time will tell, and I think these things would take significant effort, but I believe it is clear that to standardize them would be orders of magnitude simpler and more realistic than to standardize a completely new bytecode. So a bytecode has no advantage there.<br />
<br />
Some of the motivation for a new bytecode appears to come from an <b>elegance</b> standpoint: "JavaScript is hackish", "asm.js is a hack", and so forth, but a new from-scratch bytecode would be (presumably) a thing of perfection. That's an understandable sentiment, but technology has plenty of such things, witness the persistence of x86, C++, and so forth (some would add imperative programming to that list). It's not only true of technology but human civilization as well, for example no natural language has the elegance of Esperanto, and our currently-standing legal and political systems are far from what a from-scratch redesign would arrive at. But large long-standing institutions are easier to improve continuously rather than to completely replace. I think it's not surprising that that's true for the web as well.<br />
<br />
(Note that I'm not saying we shouldn't try. We should. But we shouldn't stop trying at the same time to also improve the current situation in a gradual way. My point is that the latter is more likely to succeed.)<br />
<br />
Elegance aside, could a from-scratch VM be better than JavaScript? In some ways of course it could, like any redesign from scratch of anything. But I'm not sure that it could fundamentally be better in substantial ways. The main problem is that we just don't know how to create a <b>perfect</b> "one bytecode to rule them all<b>"</b> that is<br />
<ul>
<li><b>Fast</b> - runs all languages at their <b>maximal</b> speed</li>
<li><b>Portable</b> - runs on all CPUs and OSes</li>
<li><b>Safe</b> - sandboxable so it cannot be used to get control of users' machines</li>
</ul>
The elusive perfect universal bytecode would need to do all three, but it seems to me that we can only pick two.<br />
<br />
Why is this so, when supposedly the CLR and JVM show that the trifecta is possible? The fact is that they do not, if you really take "fast" to mean what I wrote above, which is "runs all languages at their maximal speed" - that's what I mean by "perfect" in the context of the last paragraph. For example, you can run JavaScript on the JVM, but it won't come close to the speed of a modern JS VM. (There are examples of promising work like <a href="http://research.microsoft.com/pubs/121449/techreport2.pdf">SPUR</a>, but that was done before the leaps in JS performance that came with CrankShaft, TypeInference, IonMonkey, DFG, etc.).<br />
<br />
The basic problem is that to run a dynamic language at full speed, you need to do the things that JavaScript engines, LuaJIT, etc. do, which include self-modifying code (architecture-specific PICs), or even things like entire interpreters in handwritten optimized assembly. Making those things portable and safe is quite hard - when you make them portable and safe, you make them more generic pretty much by definition. But CPUs have significant-enough differences that doing generic things can lead to slower performance.<br />
<br />
The problems don't stop there. A single "bytecode to rule them all" must make some decisions as to its basic types. LuaJIT and several JavaScript VMs represent numbers using a form of NaNboxing, which uses otherwise-invalid bit patterns in doubles to store values of other types. Deciding whether to NaNbox (and in what way) is typically a design decision for an entire VM. NaNboxing might be all well and good for JS and Lua, but it might slow down other languages. Another example is how strings are implemented: IronPython, Python on .NET, ran into issues with how Python expects strings to work as opposed to .NET.<br />
<br />
Yet another area where decisions must be made is garbage collection. Different languages have different patterns of usage, both determined by the language itself and the culture around the language. For example, <a href="http://wiki.luajit.org/New-Garbage-Collector">the new garbage collector planned for LuaJIT 3.0</a>, a complete redesign from scratch, is not going to be a copying GC, but in other VMs there are copying GCs. Another concern is finalization: Some languages allow hooking into object destruction, either before or after the object is GC'd, while others disallow such things entirely. A design decision on that matter has implications for performance. So it is doubtful that a single GC could be truly optimal for all languages, in the sense of being "perfect" and letting everything run at maximal speed.<br />
<br />
So any VM must make decisions and tradeoffs about fundamental features. There is no obvious optimal solution that is right for everything. If there were, all VMs would look the same, but they very much do not. Even relatively similar VMs like the JVM and CLR (which are similar for obvious historic reasons) have fundamental differences.<br />
<br />
Perhaps a single VM could include all the possible basic types - both "normal" doubles and ints, and NaNboxed doubles? Both Pascal-type strings and C-type strings? Both asynchronous and synchronous APIs for everything? Of course all these things are possible, but they make things much more complicated. If you really want to squeeze every last ounce of performance out of your VM, you should keep it simple - that's what LuaJIT does, and very well. Trying to support all the things will lead to compromises, which goes against the goal of a VM that "runs all languages at their maximal speed".<br />
<br />
(Of course there is one way to support all the things at maximal speed: Use a native platform as your VM. x86 can run Java, LuaJIT and JS all at maximal speed almost by definition. It can even be sandboxed in various ways. But it gives up portability.)<br />
<br />
Could we perhaps just add another VM like the CLR alongside JavaScript,
and get the best of both worlds that way, instead of putting everything we need in one VM? That sounds like an interesting idea at first, but <a href="https://lists.webkit.org/pipermail/webkit-dev/2011-December/018811.html">it has technical difficulties and downsides</a>, is complex, and would likely regress existing performance.<br />
<br />
<b>Do we actually need "maximal speed"?</b> How about just "reasonable speed"? That has to be enough - we can't hold out for some perfect VM that can do it all. In the last few paragraphs I've been talking about a "perfect" bytecode VM that can run everything at maximal speed; my point is that it's important to realize that no such VM exists. But with some compromise we definitely can have a VM that runs many things at very high speed. Examples of such VMs are the JVM, the CLR, and as mentioned before JavaScript VMs as well, since they run one very popular dynamic language at maximal speed, and they can run statically typed code compiled from C++ about as well as or even better than some bytecode VMs (with the already-discussed caveats of SIMD and shared-mutable threads).<br />
<br />
For that reason, switching from JavaScript to another VM would not be a strictly better solution in all respects, but instead just shift us to another compromise. For example, JavaScript itself would be slower on the CLR, but C# would be faster, and I'm not sure which of the two can run C++ faster, but my bet is that both can run it at about the same speed.<br />
<br />
So I don't think there is much to gain, technically speaking, from considering a <b>new</b> bytecode for the web. The only clear advantage such an approach could give is perhaps a more elegant solution, if we started from scratch and designed a new solution with less baggage. That's an appealing idea, and <b>in general</b> elegance often leads to better results, but as argued earlier there would likely be no significant technical advantages to elegance in <b>this</b> particular case - so it would be elegance for elegance's sake.<br />
<br />
I purposefully said we don't need a <i><b>new</b></i> bytecode in the last paragraph. We already have JavaScript, which I have claimed is quite close to providing all the advantages that a bytecode VM could. Note that this wasn't entirely expected - not every language can be transformed in a straightforward way into a more general target for other languages. It just so happens that JavaScript had just enough low-level support (bitwise operations being 32-bit, for example) to make it a practical C/C++/LLVM IR compiler target, which made it worth investing in projects like the Relooper that work around some of its other limitations. Combined with the already ongoing speed competition among JavaScript engines, the result is that we now have JavaScript VMs that can run multiple languages at high speed. <br />
<br />
In summary, <b>we already have what practically amounts to a bytecode VM in our browsers. </b>Work is not complete, though: While we can port many languages very well right now, support for other languages is not quite there yet. If you like a language that is not yet supported on the web, and you want it to run on the web, please contribute to the relevant open source project working on that (or start one if there isn't one already). There is no silver bullet here - no other bytecode VM such that, if only we adopted it, we would have all the languages and libraries we want on the web "for free" - there is work that needs to be done. But in recent years we have made huge amounts of progress in this area, both in infrastructure for compiling code to JavaScript and in improvements to JavaScript VMs themselves. Let's work together to finish that for all languages.azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com28tag:blogger.com,1999:blog-2681864008001569004.post-75046872161050897642013-02-21T11:49:00.000-08:002013-02-21T11:49:53.286-08:00Heap corruption checking in EmscriptenI've seen a few weird things when porting codebases, where the code just doesn't behave properly. Sometimes this is a bug in the compiler, but it can also be that the source is not fully portable. It can be hard to figure out which it is, so compilers often have automatic tools to help with that sort of thing. Emscripten has several, like SAFE_HEAP mode, which checks for alignment errors and reads beyond the heap, and lets you instrument each read and write manually. Recently I added another, CORRUPTION_CHECK, which performs simple heap corruption checking.<br />
<br />
This idea came to me during a flight, and it probably isn't original in any way. Basically, each malloc allocates additional memory, with a buffer zone before and after the "real" allocation that the program sees. The buffer zones are filled with canary values, and later on those are checked to see if they were modified. If they were, the program is writing to memory it has no business writing to, and something has gone very wrong. Then you just increase the frequency of the checks, maybe even adding a few manual calls as you narrow things down, until you see exactly where the corruption happens. With large enough buffer zones, this approach has a good probability of catching writes to random places in memory, and an even better chance of catching typical off-by-one bugs like allocating 1024 bytes and then writing 1025. Yesterday I used this tool successfully on a large C++ codebase; the bug turned out to be an incorrect use of std::vector.<br />
<br />
But what I really want to talk about here is the implementation of this heap checker tool,<br />
<br />
<a href="https://github.com/kripken/emscripten/blob/incoming/src/corruptionCheck.js">https://github.com/kripken/emscripten/blob/incoming/src/corruptionCheck.js</a><br />
<br />
The idea is fairly simple, and the implementation is less than 100 lines of JavaScript. It replaces the normal malloc and free functions, and basically just does what I said before. I think it's kind of nice how easy it is to do stuff like this on code compiled to JavaScript.<br />
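<br />
To give a feel for it, here is a minimal sketch of the canary-zone idea in JavaScript - not the actual corruptionCheck.js code. It assumes realMalloc/realFree stand in for the original allocator functions, and that HEAPU8 is the byte view of the heap, as in Emscripten:<br />
<pre>
var BUFFER = 64;      // canary zone size on each side, in bytes
var CANARY = 0x5A;    // byte value the zones are filled with
var allocations = {}; // pointer the program sees -&gt; allocation size

function checkedMalloc(size) {
  var base = realMalloc(size + 2 * BUFFER);
  for (var i = 0; i &lt; BUFFER; i++) {
    HEAPU8[base + i] = CANARY;                 // zone before
    HEAPU8[base + BUFFER + size + i] = CANARY; // zone after
  }
  allocations[base + BUFFER] = size;
  return base + BUFFER; // the program sees only the "real" allocation
}

function checkCanaries() {
  for (var ptr in allocations) {
    var size = allocations[ptr];
    var start = +ptr - BUFFER;
    for (var i = 0; i &lt; BUFFER; i++) {
      if (HEAPU8[start + i] !== CANARY || HEAPU8[+ptr + size + i] !== CANARY) {
        throw 'heap corruption detected near address ' + ptr;
      }
    }
  }
}

function checkedFree(ptr) {
  checkCanaries(); // every free is a chance to catch corruption early
  delete allocations[ptr];
  realFree(ptr - BUFFER);
}
</pre>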
<br />
And actually, in many ways it is <b>easier</b> to debug C/C++ compiled to JavaScript than C/C++ compiled to native code. For example, even if there is heap corruption in the C/C++ codebase, we know it cannot corrupt the debugging code in pure JS. JS is a safe language, so JS objects can't be randomly overwritten by the compiled code, which only accesses the typed array heap (and some side objects) - it simply can't reach the CorruptionChecker object. That's a nice guarantee to have. So we can debug the compiled code from a safe, scripted environment where it's easy to automate things by just adding some JS.<br />
azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com6tag:blogger.com,1999:blog-2681864008001569004.post-9098141520366025492012-12-21T15:57:00.000-08:002012-12-21T15:57:55.476-08:00Emscripten News: LLVM 3.2, etc.<a href="http://llvm.org/">LLVM</a> 3.2 was just released today, and as with every LLVM release <a href="http://emscripten.org/">emscripten</a> is switching to it. (We only support a single LLVM version at a time; if you don't want to upgrade to LLVM 3.2 just yet, you can use older revisions of emscripten.)<br />
<br />
LLVM 3.2 brings, as usual, a large number of general improvements and bugfixes. There isn't anything in particular that will be noticeable with emscripten, except for a change in how LLVM does linking. It now requires an explicit list of symbols to keep alive - this was quite puzzling to me at first, but respindola explained it, and it is a very nice change for LLVM to make: linking is more consistent there now. We had all the necessary symbol information in emscripten but were not passing it to LLVM; emscripten has now been modified to do so.<br />
<br />
The result is that in some cases more unneeded code can be removed, resulting in smaller generated code, which is great (for example <a href="https://github.com/kripken/ammo.js/">ammo.js</a> is 2% smaller). However, if you do not explicitly keep a function alive (either by using EMSCRIPTEN_KEEPALIVE or __used__ in the C/C++, or adding it to EXPORTED_FUNCTIONS), then LLVM may remove it.<br />
<br />
Another improvement landing in master together with this, unrelated to LLVM 3.2, is better linking of .a archives. We now only use the object files that are actually required, and will not link in others. This can also reduce the size of the generated code, but again, if you are not careful, needed functions may be removed, in particular because the link order of archives matters (given libA.a libB.a, only the parts of libA required by things appearing before it on the command line will be linked in - nothing is pulled in on behalf of libB).<br />
<br />
Finally, another change you might notice if you use emscripten is that it now has better support for systems with both Python 2 and 3 installed at the same time. ack wrote a big patch that makes our usage of python much cleaner in order to enable that. One significant consequence is that we now look for python2 in the python script shebangs. So if you run <span style="font-family: Courier New, Courier, monospace;"><b>./emcc</b></span> and do <b>not</b> have python2 in your path, you will get an error. Solutions are to either run <span style="font-family: Courier New, Courier, monospace;"><b>python emcc</b></span> or to add a python2 symlink that points to your python binary.<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com5tag:blogger.com,1999:blog-2681864008001569004.post-44961813087034913612012-11-12T12:51:00.000-08:002012-11-12T12:51:20.797-08:00Emscripten Compiler UpgradesSeveral major architecture improvements have landed in the last few weeks in <a href="http://emscripten.org/">Emscripten</a>, here is an overview.<br />
<h4>
New Eliminator</h4>
The eliminator optimization phase was originally written by Max Shawabkeh. It basically removes unneeded variables, so<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"> var x = f(a);</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> var y = x + g(b);</span><br />
<br />
can be optimized into<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"> var y = f(a) + g(b)</span>;<br />
<br />
This can greatly reduce the size of the code as well as improve performance, and was fundamental for our approach of relying on a combination of the eliminator + the closure compiler to go from LLVM's SSA representation into a register-like format: The eliminator removes large amounts of unneeded variables, and the closure compiler then reduces the number of variables further by reusing the ones that remain.<br />
<br />
The eliminator could be slow on large functions, however, because it calculated the transitive closure of dependencies between all the variables, an expensive calculation. It also missed out on some opportunities to optimize because of some simplifying assumptions it made in its design. A final downside was that it integrated poorly with the rest of our optimizations (in part because it was written in a different language, CoffeeScript).<br />
<br />
I rewrote the eliminator entirely from scratch, in order to do a more precise analysis of which variables can be removed. I also simplified the problem slightly by only eliminating variables that have a single use - this makes it far faster, and I don't see any downside in the quality of the generated code (in fact it avoids some possible bad cases, although it took a long time to figure out what was going on in them). The new version is faster in general and far faster on the rare bad cases (100x even), and generates better-performing code to boot.<br />
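<br />
To illustrate the single-use restriction with a sketch (f, g and h here are stand-ins for arbitrary functions):<br />
<pre>
// Eliminated: x has exactly one use, so its definition is inlined.
var x = f(a);
var y = x + g(b); // becomes: var y = f(a) + g(b);

// Not eliminated: t has two uses. Inlining would call h(c) twice,
// which could duplicate side effects (and be slower as well).
var t = h(c);
var u = t * t;
</pre>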
<h4>
Parallel Optimizer</h4>
With all of the emscripten optimization passes now in JavaScript, I then worked on parallelizing them. We can't start one pass before the previous one finishes, but within each pass we can work on each function separately - optimizing one function is independent of the others (we used to have some global optimization passes, but their benefit was very limited).<br />
<br />
The parallelization is done using Python's process pool. It splits up the JavaScript into chunks and runs those in multiple JavaScript optimizer instances. The speedup can be close to linear in the number of cores. On <a href="https://github.com/kripken/BananaBread/">BananaBread</a>, the optimization passes become almost twice as fast on my dual-core laptop.<br />
<h4>
Parallel Compiler</h4>
With the optimizer parallel, there remain two phases that can be slow: The compiler (the initial code conversion from LLVM to JavaScript) and Closure Compiler. We can't do much for Closure, but in the long term it will become less and less important: we are implementing specific optimizations for the things we used to rely on it for, which leaves just minifying the code.<br />
<br />
For the LLVM to JS compiler, I made the emscripten compiler parallel as well: It splits up the LLVM IR into 3 main parts: type data, function data, and globals. The function data part is unsurprisingly by far the largest in all cases I checked (95% or so), and it can in principle be parallelized - so I did that. Like in the optimizer, we use a Python process pool which feeds chunks of function data to multiple JavaScript compiler instances. There is some overhead due to chunking, and the type data and globals phases are not parallelized, but overall this can be a close to linear speedup.<br />
<br />
Overall, pretty much all the speed-intensive parts of code generation and optimization in Emscripten are now parallel, and will utilize all your CPU cores automatically. That means that if you experience slow compilation speeds, you can just throw more hardware at it. I hear they are selling 16-core CPUs now ;)<br />
<h4>
New Relooper</h4>
The relooper is an optimization performed (unlike most optimizations) during initial code generation. It takes basic blocks of code and branching information between them and generates high-level JS control flow structures like loops and ifs, which makes the code run far faster. The original relooper algorithm was developed together with the implementation I wrote in the compiler. Eventually some aspects of how it works were found to be suboptimal, so specific optimizations were added to the JS optimizer ('hoistMultiples', 'loopOptimizer'), overall giving us pretty good generated code.<br />
<br />
Meanwhile I wrote a <a href="https://github.com/kripken/Relooper">new version of the relooper</a> in C++. There were two reasons for that choice of language: First, other projects needed something like it, and C++ was a better fit for them; and second, we had plans to evaluate writing an LLVM backend for emscripten, which would also need to reloop in C++ (note: <a href="http://mozakai.blogspot.com/2012/10/emscripten-news-bananabread-nebula3-gdc.html">we decided against the LLVM backend in the end</a>). The new version avoids the limitations of the first, and generates better code. In particular it has no need for additional optimizations done after the fact. It also implements some additional tweaks missing from the first one, like node splitting in some cases and more precise removal of loop labels when they are not needed. It's also a much cleaner codebase.<br />
<br />
I brought that new version of the relooper into Emscripten by compiling it to JS and using it in the JS compiler. This makes compilation faster both because the new relooper is faster than the previous one (not surprising as often compiled code is faster than handwritten code), and because the additional later optimizations are no longer needed, for overall about a 20% speedup on compiling BananaBread. It also generates better code, for example it can avoid a lot of unneeded nesting that the previous relooper had (which caused problems for projects like <a href="http://jsmess.textfiles.com/">jsmess</a>).<br />
<br />
Note that this update makes Emscripten a 'self-hosting compiler' in a sense:
one of the major optimization passes must be compiled to JS from C++,
using Emscripten itself. Since this is an optimization pass, there is no
chicken-and-egg problem: We bootstrap the relooper by first compiling
it without optimizations, which works because we don't need to reloop
there. We then use that unoptimized build of the relooper (which reloops properly, but slowly since it itself is unoptimized) in Emscripten to compile the relooper once more,
generating the final fully-optimized version of the relooper, or "relooped
relooper" if you will.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdWdnsSc9kWeYesCnTRujIr-YV3OfhaMx7PzfehNhfrvXJF0nLtwINX-ljKPJWqFX3-Sik-lkG_beUj6UeUviNgAF9bKq1lw9nuBTeuRqqKX4eucvtBs-LPDcXva0TWIV0Cdo5B4g_itR4/s1600/29934858.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdWdnsSc9kWeYesCnTRujIr-YV3OfhaMx7PzfehNhfrvXJF0nLtwINX-ljKPJWqFX3-Sik-lkG_beUj6UeUviNgAF9bKq1lw9nuBTeuRqqKX4eucvtBs-LPDcXva0TWIV0Cdo5B4g_itR4/s320/29934858.jpg" width="320" /></a></div>
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com5tag:blogger.com,1999:blog-2681864008001569004.post-84218998981256483052012-10-25T10:14:00.000-07:002012-10-25T10:14:42.053-07:00Emscripten News: BananaBread, Nebula3, GDC Online, Websockets, Worker API, Performance, etcI haven't found time to blog about individual things, so here is one post that summarizes various <a href="http://emscripten.org/">Emscripten</a>-related things that happened over the last few months.<br />
<h4>
BananaBread</h4>
<a href="https://developer.mozilla.org/demos/detail/bananabread">BananaBread</a>, a port of the <a href="http://sauerbraten.org/">Cube 2</a> game engine to the web, was launched and then received a few minor updates with bugfixes and some additional experimental levels. Feedback was good, and it was linked to by a <a href="https://blog.mozilla.org/blog/2012/08/28/firefox-now-uses-less-memory-to-make-browsing-faster/">Firefox release announcement</a> and later a <a href="http://blog.chromium.org/2012/09/enabling-new-classes-of-applications.html">Chromium release announcement</a>, in both cases to show that each browser is now capable of running first person shooter games.<br />
<h4>
Nebula 3</h4>
Cube 2 isn't the only game engine being ported to the web using Emscripten: this <a href="http://flohofwoe.blogspot.com/2012/10/mea-culpa.html">post</a> by a Nebula 3 dev is worth reading, and check out the <a href="http://n3emscripten.appspot.com/">demo</a> it links to. Nebula is a powerful game engine that, like the id Tech engines, gets open source releases now and then, and it has been used in some impressive games (like <a href="http://en.wikipedia.org/wiki/Drakensang">this</a>). Very cool to see it working well in JS+WebGL, especially given the dev's initial skepticism - read the <a href="http://flohofwoe.blogspot.com/2012/10/mea-culpa.html">blogpost</a>! :)<br />
<h4>
GDC Online </h4>
I gave a talk together with <a href="http://www.luminance.org/">Kevin Gadd</a> at <a href="http://www.gdconline.com/">GDC Online</a>, here are my <a href="http://dl.dropbox.com/u/80664946/gdco_slides.pdf">slides</a>. We talked about compiling games to HTML5, I focused on C++ and Kevin on C#, so overall we covered a lot of potential codebases that could be automatically ported to the web.<br />
<br />
Among the demos I gave was of course BananaBread, as an example of a 3D first person shooter compiled from C++ and OpenGL to JavaScript and WebGL. Interestingly, Adobe gave a talk later that day about porting games to web browsers, which compared 4 platforms: WebGL/JS, Flash, NaCl, and Unity, and for the WebGL/JS demo they also presented BananaBread, so it ended up being shown twice ;)<br />
<h4>
Workers API</h4>
Support for worker threads is in the incoming branch; look in emscripten.h, and at the tests with "worker_api" in their names in tests/runner.py. This API basically lets you compile code into "worker libraries" that the main thread can then call and get responses from, giving you an easy way to do message-passing-style concurrency.<br />
<br />
The API is in initial stages, feedback is welcome. <br />
<h4>
Networking</h4>
Initial support for networking using websockets has also been implemented; see the tests with "websockets" in their names. Basic socket usage works, but we have had trouble setting up a testing websocket server with binary support - see <a href="https://github.com/kripken/emscripten/issues/595">the issue</a> for details. Because of that, this won't work on arbitrary binary data yet. If you know websockets and websocket servers and are interested in helping with this, that would be great.<br />
<br />
Another approach we intend to work on, and where help would be welcome, is WebRTC. WebRTC could actually be easier to work with since it supports p2p connections, so it's easy to test a connection from one page to itself. It also supports UDP-style unreliable data, so we should be able to get multiplayer working in BananaBread when that is complete. <br />
<h4>
Library Bindings to JavaScript</h4>
We currently have the "bindings generator", which is used to make ammo.js and box2d.js. It works for them, but needs manual hacking and has various limitations. A more principled approach is being worked on, contributed by <a href="http://chadaustin.me/">Chad Austin</a>, which he calls "embind". This is a more explicit, controllable approach to bindings generation, and in time it should give us big improvements in projects like ammo.js and box2d.js. If you use those projects and want them to improve, the best way is to help with the new embind bindings approach. We have some initial test infrastructure set up, and there are various bugs filed with the tag "embind" if you are interested.<br />
<h4>
LLVM backend</h4>
I did some experiments with an <a href="https://github.com/kripken/emscripten/wiki/LLVM-Backend">LLVM backend</a> for Emscripten when I had free time over the last few months. The results were interesting, and I got some "hello world" stuff working, during which I learned a lot about how LLVM backends are built.<br />
<br />
Overall this is a promising approach, and it is how pretty much all other compilers from languages like C into JS work. However, this is going to be low priority, for two main reasons. First, we simply lack the resources: There are many, many other things that are very important for me to work on in Emscripten (see other points in this blogpost for some), and we have had no luck so far interesting people in collaborating on this topic. Second, while my investigations were mostly positive, they also found downsides to going the LLVM backend route. Some types of optimizations that make sense for JavaScript are an uncomfortable fit for LLVM's backends, which is not surprising given how different JS is from most target languages. It's possible to overcome those issues, of course, but it isn't the optimal route.<br />
<br />
Why do pretty much all the other compilers go the LLVM backend route? I suspect it might have to do with the fact that they typically do not <b>just</b> compile to JS. For example, if you already have a compiler into various targets, then when you consider also compiling into JS, it is simplest to modify your existing approach to do that as well. Emscripten on the other hand is 100% focused on JS only, so that's a fundamental difference. If all you care about is targeting JS, it is not clear that an LLVM backend is the best way to go. (In fact I suspect it is not, but to be 100% sure I would need to fully implement a backend to compare it to.)<br />
<h4>
Compiler and Code Perf</h4>
To continue the previous point, there is however one aspect of an LLVM backend that is greatly beneficial: it's written in efficient C++ and will compile your code quickly. Emscripten, on the other hand, is written in JavaScript and has some complex optimization passes that do a lot of work on a JS AST, and these can take a long time. A fully optimized build of BananaBread, for example, takes about 3 minutes on my laptop, and while it's a big project, there are of course bigger ones that would take even longer. <br />
<br />
On the one hand, this doesn't matter that much - it's done offline by the compiler. People running the generated code don't notice it. But of course, making developers' lives easier is important too.<br />
<br />
In Emscripten the goal has always been to focus more on performance of the generated code rather than performance of the compiler itself, so we have added new optimization passes even when they were expensive in compilation time, as long as they made the generated code faster. And we rely on tools like Closure Compiler that take a long time to run but are worth it.<br />
<br />
But compiler perf vs. code perf isn't an all-or-nothing decision. Right now on the incoming branch there are some (not fully finished and slightly buggy, but almost ready) optimizations that improve compilation time quite a bit. And with those in place we can move towards parallel compilation in almost all of the optimization passes, so with 8 cores you might get close to an 8x speedup in compilation, etc.<br />
<br />
So the current goal is to focus on the existing compiler. It will get much faster than it is now, but it will probably never get close to the speed an LLVM backend could get, that's the tradeoff we are making in order to focus on generating faster code. An additional reason this tradeoff makes sense is that we currently have plans for several new types of optimizations to make the generated code yet faster, and it is far easier for us to work on them in the current compiler than an LLVM backend.<br />
<h4>
Record/Replay</h4>
Finally, we added a system for recording and replaying Emscripten-compiled projects (see reproduceriter.py). With it you compile your project in a special mode, run it in record mode and interact with it; you can then run the project in replay mode and it will reproduce the exact same output you saw before.<br />
<br />
The main use case for this is benchmarks: If you have a program that depends on user input and random things like timing or Math.random(), then it is very hard to generate a good benchmark from it because you get different code being run each time. With the record/replay facility you can basically make a reproducible execution trace.<br />
<br />
This has been tested on BananaBread so far, and used to create <a href="https://github.com/kripken/misc-js-benchmarks/tree/master/banana">BananaBench</a>, a benchmark based on BananaBread. You can either run it in the browser or in the shell, and hopefully a real-world benchmark like this will make it easier to optimize browsers for this kind of content.<br />
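<br />
The core trick can be sketched in a few lines of JavaScript (a simplified illustration - reproduceriter.py handles much more, such as timing and input events):<br />
<pre>
// Intercept a source of nondeterminism, here Math.random.
var recording = true; // set to false (and supply `log`) for replay mode
var log = [];
var realRandom = Math.random;

Math.random = function() {
  if (recording) {
    var value = realRandom();
    log.push(value); // record mode: remember every value we return
    return value;
  }
  return log.shift(); // replay mode: return exactly what we saw before
};
// Wrap Date.now and the input event handlers the same way, and two runs
// of the program will see identical inputs, hence identical execution.
</pre>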
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com5tag:blogger.com,1999:blog-2681864008001569004.post-16278275510331045122012-07-24T09:54:00.000-07:002012-07-24T09:54:18.427-07:00BananaBread (or any compiled codebase) Startup ExperienceThis post talks about startup experience improvements in <a href="https://github.com/kripken/BananaBread/">BananaBread</a>. BananaBread is a specific 3D game engine compiled from C++ to JavaScript, but the principles are general and could be applied to any large compiled codebase.<br />
<h4>
Starting up "nicely" </h4>
Starting up as <b>quickly</b> as possible is always best on every platform. That is a general issue, so I won't focus on it; on the web there is another important criterion as well, which is to start up in as <b>asynchronous</b> a way as possible. By asynchronous I mean to <i>not</i> run in a single large event on the main thread: Instead it is better for as much as possible to be done on background threads (web workers), and for what does run on the main thread to at least be broken up into small pieces.<br />
<br />
Why is being asynchronous important? A single long-running event makes the page unresponsive - no input events are being handled and no output is being shown. This might seem not that important for startup, when there is little interaction anyhow. But even during startup you want to at least show a progress bar to give the user an indication that things are moving along, and also most browsers will eventually warn the user about nonresponsive web pages, showing a scary "slow script" dialog with an option to cancel the script or close the page.<br />
<h4>
Asynchronize all the things..?</h4>
If you're writing a new codebase, you would indeed make everything asynchronous. All pure startup calculations would be done in background threads, and main thread events would be very short. <a href="https://its.cloudpartytime.com/">Here is an example</a> of such a recently-launched product: The worst main thread stall during startup seems to be about half a second, not bad at all, and a friendly progress bar updates you on the current status. When you are writing a new codebase it is straightforward to design in a way that makes nice startup like that achievable.<br />
<br />
When you're compiling an existing codebase, things are harder, though. A normal desktop application does not need to be written in an asynchronous manner: while it might have a main loop that can easily be made asynchronous (by running each main loop iteration separately), startup can be a single continuous stretch of execution, with only some periodic notifications to update a progress meter. That is exactly the situation with <a href="http://sauerbraten.org/">Cube 2</a>, the game engine compiled to JavaScript in the BananaBread project.<br />
<br />
Now, there is a way to have long-running JavaScript code in browsers: Run it in a web worker. That would be the perfect solution; however, workers do not have access to WebGL or audio, and there is no way for them to send synchronous messages to the main thread, so even proxying those APIs to them is not practical. So unless you can easily box off "pure calculation" parts into workers, you need to run most or all of your code on the main thread.<br />
<br />
But you can still asynchronize even such a codebase: Here is what startup was like until recently: <a href="http://www.syntensity.com/static/night13/low.html">BananaBread r13</a>, and here is what it looks like now: <a href="http://www.syntensity.com/static/night15/main.html?low,low">BananaBread r15</a>. The worst main thread stall is 1.4 seconds on my laptop, which is not great but definitely enough to prevent "slow script" warnings on most machines, and there is now a progress bar.<br />
<h4>
Means of asynchronization</h4>
The first important thing is to find small chunks of computation that can easily be done ahead of time, with their results cached for later: <br />
<ul>
<li>In BananaBread jpg and png images must be decoded into pixel data. Emscripten does that during the preloading phase, each one is decoded by a separate Image element. This not only breaks things up into small pieces, it also uses the browser's native decoders, so it happens faster than if we had compiled a decoding library with the rest of the game engine. (A clever browser might also <a href="https://twitter.com/pcwalton/status/226021433472606208">do these decodings in parallel..</a>)</li>
<li><a href="http://code.google.com/p/crunch/">Crunch</a> files need to be decompressed using a compiled JavaScript decoder. We do that during preloading as well, with the decoder running in a web worker.</li>
<li>Cube 2 levels (or maps as they are called) are gzip compressed, and the engine decompresses them during startup. I refactored that and BananaBread now decompresses them using <a href="https://github.com/kripken/zee.js">zee.js</a> during preloading, also in a worker.</li>
</ul>
Taking these three points together, BananaBread can use three cores during the preloading phase. This is actually an improvement on the original game engine, which is single-threaded!<br />
<br />
After preloading, the compiled engine starts to run and we are necessarily single-threaded. The important thing to do at this stage is to at least break up the startup code into small-enough pieces to avoid freezing the main thread. This requires refactoring the original source code, and is not the most fun thing in the world, but definitely possible. Emscripten provides an API to help with this (<span style="font-family: "Courier New",Courier,monospace;">emscripten_push_main_loop_blocker</span> etc.), you can define a queue of functions to be called in sequence, each in a separate invocation. So the tricky part is just to deal with the codebase you are porting.<br />
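<br />
The underlying idea is simple and can be sketched in plain JavaScript (the function names below are made up for illustration; the real API is the C one just mentioned):<br />
<pre>
var queue = []; // startup work, split into small pieces

function pushBlocker(func) { queue.push(func); }

function runQueue() {
  if (queue.length === 0) {
    startMainLoop(); // assumed: begins the game's normal main loop
    return;
  }
  queue.shift()();                 // run one small piece of startup work
  updateProgressBar(queue.length); // assumed: keeps the user informed
  setTimeout(runQueue, 0);         // yield to the browser between pieces
}

pushBlocker(loadTextures);   // hypothetical startup steps
pushBlocker(compileShaders);
pushBlocker(parseLevel);
runQueue();
</pre>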
<br />
Over a few days I broke up the biggest functions called during startup, getting from a maximum of 6 seconds to 1.4 seconds. Browsers seem to complain after around 10 seconds, so 1.4 isn't perfect, but on machines 7x slower than my laptop things should still be ok. Further breaking up will be hard as it starts to get into intricate parts of the game engine - it's possible, but it would take serious time and effort.<br />
<h4>
Other notes</h4>
Of course, there are other big factors with startup:<br />
<ul>
<li><b>Download time</b>: My personal server that hosts the BananaBread links above is not that fast, and doesn't even do gzip compression. We hope to get a more serious hosting solution before BananaBread hits 1.0.</li>
<li><b>GPU factors</b>: BananaBread compiles a lot of shaders and uploads a lot of textures during startup. On the plus side, the time these take is probably not much different than a native build of the engine, but it's noticeable there too.</li>
<li><b>Data</b>: Smaller levels lead to faster startup, and vice versa. Our levels aren't done yet; we'll optimize them some more.</li>
<li><b>Subjective factors</b>: You can't render during long-running JavaScript calculations, but music will keep playing, which makes for a less-boring experience for the user. Also, a nice splash screen during startup is a good idea, we should do that... ;)</li>
</ul>azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com8tag:blogger.com,1999:blog-2681864008001569004.post-3603095205000568352012-07-19T10:30:00.000-07:002012-07-19T10:30:45.662-07:00Experimenting with an LLVM backend for JavaScriptWe have started to experiment with an LLVM backend for JavaScript. The reasons, approach and other notes are all on the<b> <a href="https://github.com/kripken/emscripten/wiki/LLVM-Backend">relevant emscripten wiki page</a></b>.<br />
<br />
If you know LLVM and JavaScript and want to help, please get in touch!<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com0tag:blogger.com,1999:blog-2681864008001569004.post-57263847514510740632012-07-05T10:38:00.000-07:002012-07-05T10:38:11.702-07:00Scripting BananaBread / Using Compiled C/C++ Code in JavaScriptIn <a href="https://github.com/kripken/BananaBread/">BananaBread</a> we compile an entire 3D game engine from C++ into JavaScript. The simplest way to use it is to compile everything you need and just run it. However, it might be useful to let the compiled code be controlled from JavaScript - that way you can use normal JavaScript to control the game engine. Here is an example of that: <span style="font-size: large;"><b><a href="http://www.syntensity.com/static/night10/fireworks.html">Fireworks Demo</a></b></span>.<br />
<br />
In that demo two APIs are provided from the compiled game engine: camera control and particle effect creation. The scripting API used in them begins <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/setup.js#L119">here</a>, and an example use can be seen <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/setup_fireworks.js#L12">here</a>.<br />
<br />
How does this work? There are 4 main steps to accessing compiled C/C++ from normal JavaScript:<br />
<ul>
<li><b>Make a C API if the code is in C++</b>. You can use C++, but then you need to deal with name mangling and <i><span style="font-family: "Courier New",Courier,monospace;">this</span></i> pointers in a manual way - C is easier. <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/src/engine/renderparticles.cpp#L1097">Example</a>.</li>
<li><b>Use EMSCRIPTEN_KEEPALIVE to keep the code alive</b>. EMSCRIPTEN_KEEPALIVE is a macro that uses compiler attributes to tell the compiler not to eliminate code as dead even if it isn't used (it will be used from JavaScript, but without this the compiler doesn't know that). <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/src/engine/sound.cpp#L581">Example</a>.</li>
<li><b>Export the function through Closure Compiler</b>. In -O2 closure compiler is used to minify and optimize the code. As a consequence, the original function names are unrecognizable, and closure will also remove code it sees is never used. The way to do this is to add the function to EXPORTED_FUNCTIONS when calling emcc to compile to JavaScript. The function will then show up on the Module object even after closure compiler runs (side note, all of the exports through closure are on the Module object, for example you can access memory through Module.HEAP8, etc.). <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/src/web/Makefile#L304">Example</a> (scroll to EXPORTED_FUNCTIONS).</li>
<li><b>Access the code through ccall or cwrap</b>. ccall does a one-time call to a function, while cwrap returns a native JavaScript function that wraps the C function. Both take as arguments the return type and argument types (see <a href="https://github.com/kripken/emscripten/blob/master/src/preamble.js#L300">docs</a>). <a href="https://github.com/kripken/BananaBread/blob/3a60758086720190124f6ea0b4e91db081e4c9cb/cube2/setup.js#L108">Example</a>.</li>
</ul>
You can then access the code, and if you want, you can write a nice JavaScript API on top of the C-like interface ccall/cwrap give you.<br />
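<br />
For example, assuming a C function <b><span style="font-family: Courier New, Courier, monospace;">int add_score(int points)</span></b> (a made-up name) that has been kept alive and exported as described above, the JavaScript side would look roughly like this:<br />
<pre>
// One-off call: (function name, return type, argument types, arguments)
var result = Module.ccall('add_score', 'number', ['number'], [10]);

// Or wrap it once, then call it like any JavaScript function:
var addScore = Module.cwrap('add_score', 'number', ['number']);
addScore(10);
addScore(25);
</pre>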
<br />
Returning to BananaBread specifically, we now have the infrastructure to allow JavaScript access to all of the compiled game engine's functionality. Right now, as mentioned before, we have camera and particle effect APIs (it was very quick to start with those), but straightforward work can let us control how characters move, the rules of the game (how you earn points, get ammo, etc.), how objects behave, how weapons work, and so on. The underlying engine is mainly used for first person shooters, but it is easy to use it for other things, from visual demos like the fireworks from before to 2D games to non-game 3D virtual worlds and so forth, once you have the proper scripting APIs in place (in fact <a href="https://github.com/kripken/intensityengine">I did something very similar</a> a few years ago using the same engine).<br />
<br />
If that kind of thing is interesting to you please get in touch, ideas for how to design the JavaScript part of the API are welcome.<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com0tag:blogger.com,1999:blog-2681864008001569004.post-2514295742109800202012-06-26T11:08:00.000-07:002012-06-26T11:08:40.357-07:00BananaBread 0.2: Levels!<a href="https://github.com/kripken/BananaBread">BananaBread</a>, the port of the <a href="http://sauerbraten.org/">Sauerbraten</a> first person shooter from C++ and OpenGL to JavaScript and WebGL, is making good progress. We are starting to work on polish and our artist <a href="http://maxstudios.de/">gk</a> is in the process of making some very cool levels!<br />
<br />
<div style="text-align: center;">
<a href="http://syntensity.com/static/night5/bb.html"><b><span style="color: black;"><span style="background-color: #cfe2f3;">Link to the launch page for the 3 levels</span></span></b></a></div>
<br />
Here are some screenshots. First, here are parts of the larger of the three levels,<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpyjKZj2j9x-NDvLbwJFpbvXZq4-nrCZdbmFZo1BYihOGx1N7rBIOwpIqRJEf4ppn9xRG-_xXF6NsK61aDAanzXNdRhNYX89Ir8QphFm3zU00YyH9JhXCsryn4OV2EUcsx-6nvWEgzCIs9/s1600/large1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpyjKZj2j9x-NDvLbwJFpbvXZq4-nrCZdbmFZo1BYihOGx1N7rBIOwpIqRJEf4ppn9xRG-_xXF6NsK61aDAanzXNdRhNYX89Ir8QphFm3zU00YyH9JhXCsryn4OV2EUcsx-6nvWEgzCIs9/s320/large1.jpg" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJVHXMCnNf9ND9maCfsZkjJgLGrRgQKF1b4zpvlmKoDH96qZLpntgsOrC1cOooNiprq3N1NyaSW1w1V4FsnRHIHGd8Jbh4bU3gxL0k7FkhP4tsP-jTll4zG2DSsVjHh18nkGi_6n_Xzr0k/s1600/large2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJVHXMCnNf9ND9maCfsZkjJgLGrRgQKF1b4zpvlmKoDH96qZLpntgsOrC1cOooNiprq3N1NyaSW1w1V4FsnRHIHGd8Jbh4bU3gxL0k7FkhP4tsP-jTll4zG2DSsVjHh18nkGi_6n_Xzr0k/s320/large2.jpg" width="320" /></a></div>
<br />
and here is another part of that level, with water effects turned up to maximum (both reflection and refraction, plus glare),<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8W8ytkQvjwuZDJAhyphenhyphensSC8NnRwP-GPi70W64arcmnXDWaUjjmT1Qq29lo61jpPJfa1u-tUFH3faN6cy39l_iaywUpyGAm31TuhesQjjMlvwXRj688latSkv0x-bnTghp-YUIyD8GgRA9kH/s1600/large3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8W8ytkQvjwuZDJAhyphenhyphensSC8NnRwP-GPi70W64arcmnXDWaUjjmT1Qq29lo61jpPJfa1u-tUFH3faN6cy39l_iaywUpyGAm31TuhesQjjMlvwXRj688latSkv0x-bnTghp-YUIyD8GgRA9kH/s320/large3.jpg" width="320" /></a></div>
<br />
Here is the medium-sized level,<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5nOd8aSFHfuaBTRzziAUzVSAe1ZvjJXmCKtMIJZV7tSFfGV8XTnTy5noRzH4zc1COp0UZCyxjXvVmcXOsflwQfQhJcr_ydzwRNzpNyGdtw7Bd_Plb7uJ3MmAuxRBa8vDtqxiplmq_ort7/s1600/medium1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5nOd8aSFHfuaBTRzziAUzVSAe1ZvjJXmCKtMIJZV7tSFfGV8XTnTy5noRzH4zc1COp0UZCyxjXvVmcXOsflwQfQhJcr_ydzwRNzpNyGdtw7Bd_Plb7uJ3MmAuxRBa8vDtqxiplmq_ort7/s320/medium1.jpg" width="320" /></a></div>
<br />
which has a very different theme to it. You can also see a bot charging towards me there. Finally, here is the smaller level,<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3zFSqDPsgL6wbp3TXb4oiroSTIOeLE2aE1w3ko82ErE1jlxmFa07w_MREs-AtOjXVAjrARf7qW3ry4AOF_LijlbWA90Lhsr-Y3F_2CpNWNU8p5SOsqHcNwsx2hThSSeg59xCU2dBkhpNV/s1600/small2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3zFSqDPsgL6wbp3TXb4oiroSTIOeLE2aE1w3ko82ErE1jlxmFa07w_MREs-AtOjXVAjrARf7qW3ry4AOF_LijlbWA90Lhsr-Y3F_2CpNWNU8p5SOsqHcNwsx2hThSSeg59xCU2dBkhpNV/s320/small2.jpg" width="320" /></a></div>
<br />
and here in that level is a ferocious squirrel on the attack,<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil7XQLYfK55XDrHmD5h06OUUJM3vkzYs7mPZBZdzYev19jr71AOxhfflPC-cEyFVZRTqY_e9Ebj2bFbhWHYb4k_SlCcW4kt2xAa-CEzoeCHSKnHsshpFPruW3Cmx82IlByTVi5bgIRw443/s1600/small1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil7XQLYfK55XDrHmD5h06OUUJM3vkzYs7mPZBZdzYev19jr71AOxhfflPC-cEyFVZRTqY_e9Ebj2bFbhWHYb4k_SlCcW4kt2xAa-CEzoeCHSKnHsshpFPruW3Cmx82IlByTVi5bgIRw443/s320/small1.jpg" width="320" /></a></div>
<br />
(the <a href="http://apricot.blender.org/wp-content/uploads/2008/07/frankie.png">squirrel model</a> is from the <a href="http://www.yofrankie.org/">Yo Frankie</a> game).<br /><br />
What the screenshots can't show is that playing a first person shooter in a web browser (without any plugins!) is an interesting experience, I guess because it isn't a common thing yet. Try it :)<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com3tag:blogger.com,1999:blog-2681864008001569004.post-19884508632371254822012-06-19T21:37:00.001-07:002012-06-19T21:37:31.162-07:00StatCounter and Statistics<blockquote class="tr_bq">
"In summary - the Net Applications sample is small (it is based on
40,000 websites compared to our 3,000,000). Weighting the initial
unrepresentative data from that small sample will NOT produce meaningful
information. Instead, applying a weighting factor will simply inflate
inaccuracies due to the small sample size." <a href="http://gs.statcounter.com/press/open-letter-ms">http://gs.statcounter.com/press/open-letter-ms</a></blockquote>
Very misleading. I have no idea which of StatCounter and Net Applications is more accurate. But that argument is off.<br />
<br />
In statistics, sample size is basically irrelevant past a certain minimal size. That's how a survey of 300 people in the US can predict pretty well for 300 million. The numbers don't matter in two ways: First, the population could be 1 million or 1 billion - the actual population size is irrelevant - and second, the sample could be 3,000 or 30,000 and it would not be much better than 300. The only exception to those two facts is when the population is very small, say 100: then a sample of 100 is guaranteed to be representative, obviously. And for very small samples, like say 5, you have poor accuracy in most cases. But just 300 people is enough for any large population.<br />
<br />
The reason is the basic statistical law that the standard deviation of the sample mean - the standard error - equals the standard deviation of the population divided by the square root of the sample size. If you are measuring something like the % of people using a browser, the population's standard deviation is bounded, so the first factor doesn't matter much. That leaves the second. Happily for statistics, 1 over the square root decreases very fast: you get to an accuracy of a few percent with just a few hundred people, no matter what the population size.<br />
<br />
So that StatCounter has 3,000,000 websites and Net Applications has 40,000 means practically nothing (note that 40,000 even understates it, since those are websites. The number of people visiting those sites is likely much larger). 40,000 is definitely large enough: In fact, just a few hundred datapoints would be enough! Of course, that is only <b>if the sample is unbiased</b>. That's the crucial factor, not sample size. We don't really know which of StatCounter and Net Applications is less biased. But the difference in sample size between them is basically irrelevant. Past a minimal sample size, more doesn't matter, even if it seems intuitively like it must make you more representative.<br />
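<br />
A quick back-of-the-envelope calculation, in JavaScript: for a measured proportion, the worst-case (p = 0.5) 95% margin of error is roughly 1/sqrt(n), so<br />
<pre>
function marginOfError(n) { return 1 / Math.sqrt(n); }

marginOfError(300);     // ~0.058  - about +/- 6 percentage points
marginOfError(40000);   // ~0.005  - about +/- 0.5
marginOfError(3000000); // ~0.0006 - smaller, but 0.5 already didn't matter
</pre>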
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com8tag:blogger.com,1999:blog-2681864008001569004.post-86610632826546830242012-06-19T16:13:00.000-07:002012-06-19T16:13:28.332-07:00Many More Falling BlocksSeveral months ago I made a <a href="http://syntensity.com/static/box2d.html">demo</a> of <a href="https://github.com/kripken/box2d.js">box2d.js</a>, a port of the 2D physics engine Box2D to JavaScript. I made the demo using 15 falling blocks because that's what ran well at the time. Checking the demo now, I see
that JavaScript engine improvements allow for the possibility of
many more falling blocks. Here is <b><a href="http://syntensity.com/static/box2d_80.html">a version of that demo with 80 falling blocks</a></b>.<br />
<br />
On the current Firefox GA Release (13) the frame rate with 80 blocks often drops briefly to 20fps, which is not great. But on Firefox Nightly (16), I get very smooth frame rates, very close to 60fps! I see similar results in Chrome Dev (21) as well [1]. So it looks like it is now possible to run games on the web with lots of 2D physics in them.<br />
<br />
[1] On Opera 12 I manually enabled WebGL, but sadly the page doesn't render properly.<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com0tag:blogger.com,1999:blog-2681864008001569004.post-84003834898978859202012-06-15T16:15:00.000-07:002012-06-15T16:15:22.960-07:00Debugging JavaScript is AwesomeI sometimes find myself debugging large amounts of JavaScript, typically a combination of Emscripten-compiled C++ and handwritten JavaScript that it interfaces with. It turns out to be pretty easy and fun to do.<br />
<br />
Obviously that's a personal opinion and it depends on the type of code you debug as well as your debugging style. My C++ debugging style tends to be to add printfs in relevant places, recompile and run, only using gdb when there are odd crashes and such. So in JavaScript this turns out to be better than C++: You add your prints but you don't need to recompile, just reload the page.<br />
<br />
But it gets even better. You can use JavaScript language features to make your life easier. For example, I've been debugging OpenGL compiled to WebGL, and sometimes a test would start to fail when I made a change. It is really, really easy to add automatic logging of every single WebGL command that is generated; see the 30 or so lines of code starting <a href="https://github.com/kripken/emscripten/blob/master/src/library_browser.js#L33">here</a>. That wraps the WebGL context and logs everything it does, so I can log the output before a commit and after, diff those, and see what went wrong. I also have similar code in Emscripten to log each call to a libc function. Of course that sort of thing is possible in C++ too, but it is much trickier, while in JavaScript it is pretty simple.<br />
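<br />
The gist of that wrapping fits in a few lines; here is a simplified sketch (the real code linked above handles more details):<br />
<pre>
// Wrap a WebGL context so every call is logged before being forwarded.
function wrapContext(ctx) {
  var wrapper = {};
  for (var prop in ctx) {
    (function(name) {
      if (typeof ctx[name] === 'function') {
        wrapper[name] = function() {
          console.log('GL: ' + name + '(' +
                      Array.prototype.slice.call(arguments).join(', ') + ')');
          return ctx[name].apply(ctx, arguments);
        };
      } else {
        wrapper[name] = ctx[name]; // constants like ctx.TRIANGLES
      }
    })(prop);
  }
  return wrapper;
}
// Then render using wrapContext(canvas.getContext('experimental-webgl'))
// instead of the raw context, and diff the logs of two runs.
</pre>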
<br />
And actually it is better still. Unlike a regular debugger like gdb, when you debug JavaScript you can script debugging tasks directly in the code with immediate effect. For example, when debugging <a href="https://github.com/kripken/BananaBread/">BananaBread</a> I might see something wrong in the particle effects but nowhere else. If so I can just jump into the source code, set a variable to 1 when starting to render particles, and set it to 0 when leaving. I can then check that variable when logging GL stuff, and I'll only see the relevant code. It's also useful for more complex situations like logging specific data on the Nth call to a function or only when certain situations hold. Since reloads are so fast, this is very efficient and effective. I heard gdb has a python scripting option, and maybe other debuggers have similar tools, but really nothing can beat scripting your debug procedure using the same language as your code like you can with JS: There is nothing to learn, you have the full power of the language, and you just hit reload.<br />
<br />
And of course there are other nice things like being able to print out <b><span style="font-family: "Courier New",Courier,monospace;">new Error().stack</span></b> to get a stack trace at any point in time, <b>JSON.stringify(x, null, 2)<span style="font-family: "Courier New",Courier,monospace;"> </span></b>to get nice pretty-printed output of any object, etc.<br />
<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com0tag:blogger.com,1999:blog-2681864008001569004.post-58854926365477734162012-05-29T14:59:00.000-07:002012-05-29T15:00:45.419-07:00Reloop All The BlocksOne of the main results from the development of Emscripten was the Relooper algorithm. The Relooper takes basic blocks of code - chunks of simple code, at the end of which are branches to other blocks of code - and generates high-level structure from that using loops and ifs. This is important because LLVM gives you basic blocks, and JavaScript requires loops and ifs to be fast, so when compiling C++ to JavaScript you need to bridge the two. So if you have any sort of compiler from a representation with basic blocks into JavaScript - or for that manner any high-level language that does not have gotos, but does have labelled loops - then the Relooper might be useful for you. The Relooper is known to be used in the two main C++ to JS compilers, Emscripten and Mandreel.<br />
<br />
Emscripten's Relooper implementation is in JavaScript, which was very useful for experimenting with different approaches and developing the algorithm. However, there are two downsides to that implementation: First, that it was built for experimentation, not speed, and second, that being in JavaScript it is not easily reusable by non-JavaScript projects. So I have been working on a <b><a href="https://github.com/kripken/Relooper">C++ Relooper</a></b>, which is intended to implement a more optimized version of the Relooper algorithm, in a fast way, and to make embedding in other projects as easy as possible.<br />
<br />
That implementation is not fully optimized yet, but it has gotten to the point where it is usable by other projects. It got there after I wrote a fuzzer for it last week, which generates random basic blocks, implements them in JavaScript in the trivial switch-in-a-loop manner, and then uses the Relooper to compile them into fast JavaScript. The fuzzer then runs both programs and checks for identical output. This found a few bugs, and after fixing them the fuzzer can run for a very, very long time without finding anything, so hopefully there are no remaining bugs, or at least very few.<br />
<br />
The C++ Relooper code linked to before comes with some testcases, which are good examples of how to use it. As you can see there, using the Relooper is very simple: There are both C++ and C APIs, and what you do in them is basically<br />
<ul>
<li>Define the output buffer</li>
<li>Create the basic blocks, specifying the text they contain and which other blocks they branch to</li>
<li>Create a relooper instance and add blocks to it</li>
<li>Tell the relooper to perform its calculation on those blocks, and finally to render to the output buffer</li>
</ul>
There is also a debugging mode, in which a lot of debug info is printed out, including (when using the C API) a C program that reproduces the API calls, which is useful for generating testcases.azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com2tag:blogger.com,1999:blog-2681864008001569004.post-20349911451719903782012-05-25T07:06:00.002-07:002012-05-25T07:06:42.892-07:00Emscripten and LLVM 3.1LLVM 3.1 support for Emscripten just landed in master; all tests pass, and all benchmarks either remain the same or improve from 3.0.<br />
<br />
LLVM 3.1 is now the officially supported version; all testing from now on will be on 3.1. The Emscripten tutorial has been updated to reflect that.<br />
<br />
(3.0 <i>might</i> work - it does right now - but over time that might change.)<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com1tag:blogger.com,1999:blog-2681864008001569004.post-45793370207573616102012-05-03T14:49:00.000-07:002012-05-03T14:49:26.900-07:00Emscripten OpenGL / WebGL Conversion ProgressHere is a very early <b><a href="http://syntensity.com/static/bb/client.html">demo</a></b> of a 3D game engine, written in C++ and using OpenGL, compiled to JavaScript and WebGL using <a href="http://emscripten.org/">Emscripten</a>. The game engine is <a href="http://sauerbraten.org/">Sauerbraten</a> (aka Cube 2), one of the best open source game engines out there, and we nicknamed the port <a href="https://github.com/kripken/BananaBread/">BananaBread</a>.<br />
<br />
After loading the demo link, press the "fullscreen" button, then click "GO!" to start the game. Move with WASD, jump with space, look around with the mouse. You can shoot a little by clicking the mouse. Please note that<br />
<ul>
<li>The C++ game code has <b>not</b> been optimized in any way yet</li>
<li>The generated JavaScript is itself <b>not</b> fully optimized yet, nor even minified</li>
<li>The level you see when you press "GO!" was made by me, a person with <b>0 artistic talent</b></li>
<li>The game assets (textures) have not been optimized for faster downloads at all</li>
</ul>
So this is a very very early demo - ignore performance and content quality. Also, it might not work in all browsers yet, sorry about that: it seems fine in Firefox 15, including pointer lock and fullscreen mode, but for some reason pointer lock isn't working in Chrome 20. I do get 60fps in both of them on my two-year-old MacBook (note that the frame rate is capped at 60 by requestAnimationFrame, so the fact that it does not go higher is not an indication of anything).<br />
<br />
After the disclaimers, I did want to blog about this because, despite being very early, I think it does show the potential of this approach. We are taking a C++ game engine that uses an oldish version of OpenGL, and with almost no changes to the source code we can compile it using open source tools into something that runs on the web - thanks to modern JS engines, the fullscreen and pointer lock APIs, and WebGL - at a reasonable frame rate, even before optimizing it.<br />
<br />
A few technical details:<br />
<ul>
<li>Emscripten supports the WebGL-friendly subset of OpenGL and OpenGL ES quite well. That subset is basically OpenGL ES 2.0 minus clientside arrays (see the sketch after this list). If you are writing C++ code with the goal of compiling it to WebGL, using that subset is the best approach, since it compiles into something very efficient. We should currently support all of that subset, but most of it is untested - please submit testcases if you can.</li>
<li>Emscripten now also supports some amount of non-WebGL-friendly OpenGL stuff. I don't think we will ever support all of desktop OpenGL - that would amount to writing an OpenGL driver - but we can add the parts that are important. Note that we are doing this carefully, so that it does not affect the performance of code that uses just the WebGL-friendly subset: the additional overhead of supporting the unfriendly features is only paid if you deviate from the friendly subset.</li>
<ul>
<li>Specifically, the non-friendly features we partially support include pieces of immediate mode, clientside state and arrays, and shader conversion to WebGL's GLSL. Again, we have only partial support for those - it is best to not rely on them and to use the WebGL-friendly subset. The parts we support are motivated by what Sauerbraten's renderer requires (note that even to render the GUI, you need immediate mode support - that's all done with OpenGL and not some 2D API).</li>
</ul>
<li>The demo is the result of about a month of work. Almost all of that time was spent learning OpenGL (which I had never used before) and writing the emulation layer for OpenGL features not present in WebGL, basically proceeding testcase by testcase after generating testcases from Sauerbraten. Aside from that, everything else pretty much just worked when compiled to JS. </li>
</ul>
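Here is the sketch promised above of what "minus clientside arrays" means in practice - plain WebGL, where <span style="font-family: "Courier New",Courier,monospace;">gl</span>, <span style="font-family: "Courier New",Courier,monospace;">positionLoc</span> and <span style="font-family: "Courier New",Courier,monospace;">vertices</span> are placeholders for whatever your code has. In the friendly subset, vertex data must be uploaded into a GPU-side buffer object rather than passed straight from a CPU-side array at draw time:<br />
<div style="font-family: "Courier New",Courier,monospace;">var buf = gl.createBuffer();<br />
gl.bindBuffer(gl.ARRAY_BUFFER, buf);<br />
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(vertices), gl.STATIC_DRAW);<br />
// The last argument is a byte offset into the bound buffer - not a CPU array:<br />
gl.vertexAttribPointer(positionLoc, 3, gl.FLOAT, false, 0, 0);<br />
gl.enableVertexAttribArray(positionLoc);<br />
gl.drawArrays(gl.TRIANGLES, 0, vertices.length / 3);</div>
<br />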
The plan is to continue this port, and help is welcome. Basically we want to get the entire game working, including model rendering (the main part of the Sauerbraten renderer I haven't looked at yet), AI bots and so forth, and to use professionally designed levels and models. At some point we will probably want to optimize the code as well. The goal is to end up with a playable, good looking 3D FPS game that runs on the web, that is open source and is built using open source tools, so other people can learn from the project or even use the code directly. The game will initially be single player versus some bots, but eventually, using WebRTC, we should be able to get multiplayer mode working as well (WebRTC should land in most browsers later this year).<br />
<br />
Aside from this specific game port, Emscripten's OpenGL support has greatly improved, and other projects are using it already. If you use the WebGL-friendly subset of OpenGL, it is ready for use now, with the disclaimer that while everything should work, we have not rigorously tested it yet - help with testing and testcases would be welcome. In particular, if you have an application you want to port and you find problems in our OpenGL support, please file a bug with a testcase; for the WebGL-friendly subset those should be easy to fix, and we can add the testcase to our test suite so we don't regress on the features your project needs.<br />
<br />
<br />azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com9tag:blogger.com,1999:blog-2681864008001569004.post-71837172770463973992012-03-23T12:21:00.000-07:002012-03-23T12:21:08.657-07:00HOWTO: Port a C/C++ Library to JavaScript (xml.js)I've been porting various libraries to JavaScript recently (<a href="https://github.com/kripken/lzma.js">lzma.js</a>, <a href="https://github.com/kripken/sql.js">sql.js</a>) and I thought it might be useful to write up details about how this kind of thing is done. So here is how I ported <a href="http://xmlsoft.org/">libxml</a> - an open source library that can validate XML schemas - in response to a <a href="https://github.com/kripken/emscripten/issues/299">request</a>. Note that this isn't a general HOWTO; it's more a detailed writeup of what I did to port libxml in particular, but I hope it's useful for understanding the general technique.<br />
<br />
If you just want to see the final result, the ported project is called <a href="https://github.com/kripken/xml.js">xml.js</a>, and there is an online demo <b><a href="http://syssgx.github.com/xml.js/">here</a></b> (thanks syssgx!)<br />
<br />
<span style="font-size: large;">Part 1: Get the Source Code and Check It Natively</span><br />
<br />
I downloaded the latest libxml source code from the project's website and compiled it natively. One of the generated files is <b>xmllint</b>, a commandline tool to validate schemas. I made sure it works on a simple example. This is important, first of all, as a sanity check on the code being compiled (especially important if you are porting code you never used or looked at, which is the case here!), and second, having the testcase will let us easily check the JavaScript version later on. Running xmllint looks like this:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"> $./xmllint --noout --schema test.xsd test.xml</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> test.xml validates</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><br />
</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="font-family: Arial,Helvetica,sans-serif;">Just to be sure everything is working properly, I introduced some errors into those files, and indeed running xmllint on them produces error messages.</span></span><br />
<br />
<span style="font-size: large;">Part 2: Run Configure<br />
</span><br />
<br />
<div style="font-family: "Courier New",Courier,monospace;"> emconfigure ./configure</div><br />
emconfigure runs a command with some environment variables set to make configure use emcc, the Emscripten replacement for gcc or clang, instead of the local native compiler.<br />
<br />
When looking at the results of configure, I saw it includes a lot of functionality we don't really need, for example HTTP and FTP support (we only want to validate schemas directly given to us). So I re-ran configure with the options to disable those features. In general, it's a good idea to build just the features you need: First, unneeded code leads to larger code size, which matters on the web, and second, you will need to make sure the additional features compile properly with emcc, and sometimes headers need some modifications (mainly since we use newlib and not glibc).<br />
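<br />
For reference, the re-run looked something like this (the exact flag names come from ./configure --help and may differ between libxml versions):<br />
<div style="font-family: "Courier New",Courier,monospace;"> emconfigure ./configure --without-http --without-ftp</div>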
<br />
<span style="font-size: large;">Part 3: Build the Project </span><br />
<br />
<div style="font-family: "Courier New",Courier,monospace;"> emmake make</div><br />
<br />
emmake is similar to emconfigure in that it sets some environment variables. emconfigure sets them so that configure works, including configure's configuration tests (which build native executables), whereas emmake sets them so that actually building the project works. Specifically, it makes the project's build system use LLVM bitcode as the generated code format instead of native code. It works that way because if we generated JS for each object file, we would need to write a JS linker and so forth; this way, we can use LLVM's bitcode linking etc.<br />
<br />
<br />
Make succeeds, and there are various generated files. But they can't be run! As mentioned above, they contain LLVM bitcode (you can see that by inspecting their contents: they begin with 'BC'). So we have an additional step, described next.<br />
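A quick way to check, assuming the build produced xmllint.o in the current directory (adjust the path as needed):<br />
<div style="font-family: "Courier New",Courier,monospace;"> $ head -c 2 xmllint.o<br />
 BC</div>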
<br />
<span style="font-size: large;">Part 4: Final Conversion to JavaScript </span><br />
<br />
For xmllint, we need xmllint.o, but we also need libxml2.a. We have to specify it manually because LLVM bitcode linking does not support dynamic linking, so dynamic linking is basically ignored by emcc. In most cases, though, it's pretty obvious what you need - here, just libxml2.a.<br />
<br />
Slightly less obvious is that we also need libz (the open source compression library). Again, dynamic linking was ignored, but we can see it was in the link command. I actually missed this the first time around, but it is no big deal: you get a clear error message at runtime saying a function is not defined, in this case gzopen. A quick grep through the headers shows gzopen is in libz, so I grabbed libz.bc from the emscripten test suite (if it wasn't there, I would have had to make a quick build of it).<br />
<br />
Ok, let's convert this to JavaScript! The following will work:<br />
<div style="font-family: "Courier New",Courier,monospace;"><br />
</div><span style="font-family: "Courier New",Courier,monospace;"> emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.test.js --embed-file test.xml --embed-file test.xsd</span><span style="font-family: "Courier New",Courier,monospace;"></span><br />
<br />
<br />
Let's see what this means:<br />
<ul><li>emcc is as mentioned before a drop-in replacement for gcc or clang.</li>
<li>-O2 means to optimize. This does both LLVM optimizations and additional JS-level optimizations, including Closure Compiler advanced opts.</li>
<li>The files we want to build together are then specified.</li>
<li>The output file will be xmllint.test.js. Note that the suffix tells emcc what to generate, in this case, JavaScript.</li>
<li>Finally, the odd bit is the two --embed-file options we specify. What this does is actually embed the contents of those files into the generated code, and set up the emulated filesystem so that the files are accessible normally through stdio calls (fopen, fread, etc.). Why do we need this? It's the simplest way to just access some files from compiled code. Without this, if we run the code in a JS console shell, we are likely to run into inconsistencies in how those shells let JS read files (binary files in particular are an annoyance), and if we run the code in a web page, we have issues with synchronous binary XHRs being disallowed except in web workers. So to avoid all those issues, a simple flag to emcc lets us bundle files with the code for easy testing.</li>
</ul><span style="font-size: large;">Part 5: Test the Generated JavaScript </span><br />
<br />
A JavaScript shell like Node.js, the SpiderMonkey shell or V8's d8 console can be used to run the code. Running it gives this:<br />
<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"> $node xmllint.test.js --noout --schema test.xsd test.xml</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> test.xml validates</span><br />
<br />
<br />
Which is exactly what the native build gave us for those two files! Success :) Also, introducing intentional errors into the input files leads to the same errors as in the native build. So everything is working exactly as expected.<br />
<br />
Note that we passed the same commandline arguments to the JavaScript build as to the native build of xmllint - the two builds behave exactly the same.<br />
<br />
<span style="font-size: large;">Part 6: Make it Nice and Reusable</span><br />
<br />
What we have now is hardcoded to run on the two example files, and we want a general function that, given any XML file and schema, can validate them. This is pretty easy to do, but making sure it also works with Closure Compiler optimizations is a little trickier. Still, it's not that bad - details are below - and it's definitely worth the effort, because Closure Compiler makes the code <b>much</b> smaller.<br />
<br />
The first thing we need is to use emcc's --pre-js option. This adds some JavaScript alongside the generated code (in this case before it, because we say pre and not post). Importantly, --pre-js adds the code <b>before</b> optimizations are run. That means that the code will be minified by Closure Compiler together with the compiled code, allowing us to access the compiled code properly - otherwise, Closure Compiler might eliminate functions we need, treating them as dead code.<br />
<br />
Here are the contents of the file we will include using --pre-js:<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> Module['preRun'] = function() {<br />
FS.createDataFile(</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> '/',</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> 'test.xml',</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> Module['intArrayFromString'](Module['xml']),</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> true,</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> true);<br />
FS.createDataFile(</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> '/',</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> 'test.xsd',</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> Module['intArrayFromString'](Module['schema']),</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> true,</span></div><div style="font-family: "Courier New",Courier,monospace;"><span style="font-size: x-small;"> true);<br />
};<br />
Module['arguments'] = ['--noout', '--schema', 'test.xsd', 'test.xml'];<br />
Module['return'] = '';<br />
Module['print'] = function(text) {<br />
Module['return'] += text + '\n';<br />
};</span></div><br />
What happens there is as follows:<br />
<ul><li>Module is an object through which Emscripten-compiled code communicates with other JavaScript. By setting properties on it and reading others after the code runs, we can interact with the code.</li>
<li>Note that we use string names to access Module, Module['name'] instead of Module.name. Closure will minify the former to the latter, but importantly it will leave the name unminified.</li>
<li>Moving on to the actual code: The first thing we modify is Module.preRun, which is code that executes just before running the compiled code itself (but after we set up the runtime environment). What we do in preRun is set up two data files using the <a href="https://github.com/kripken/emscripten/wiki/Filesystem-Guide">Emscripten FileSystem API</a>. For simplicity, we use the same filenames as in the testcase from before, test.xml and test.xsd. We set the data in those files to be equal to Module['xml'] and Module['schema'], which we will explain later; for now, we assume those properties of Module have been set and contain strings with the XML and the XML schema, respectively. We need to convert those strings to arrays of values in 0-255 using intArrayFromString.</li>
<li>We set Module.arguments, which contains the commandline arguments. We want the compiled code to behave exactly as it did in the testcase! So we pass it the same arguments. The only difference will be that the files will have user-defined content in them.</li>
<li>Module.print is called when the compiled code does printf or a similar stdio call. Here we customize printing to save to a buffer. After the compiled code runs, we can then access that buffer, as we will see later.</li>
</ul>In summary, we "sandbox" the compiled code in the sense that we set up the input files to contain the data we need, and capture the output so that we can do whatever we want to with it later. <br />
<br />
We are not yet done, but we can compile the code now - the final thing that remains will be done after compiling it. Compiling can be done with this command:<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;"> emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.raw.js --pre-js pre.js</div><br />
This is basically the command from before, except we no longer embed files. Instead, we use --pre-js to include pre.js which we discussed before.<br />
<br />
After that command runs, we have an optimized and minified build of the code. We then wrap it in something we do not want optimized and minified, because we want it to be usable from normal JavaScript in a normal way:<br />
<div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> function validateXML(xml, schema) {<br />
var Module = {<br />
xml: xml,<br />
schema: schema<br />
};</div><span style="font-family: "Courier New",Courier,monospace;"> {{{ GENERATED_CODE }}}</span><br />
<span style="font-family: "Courier New",Courier,monospace;"></span><span style="font-family: "Courier New",Courier,monospace;"> return Module.return;</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> }</span><br />
<br />
GENERATED_CODE should be replaced with the compiler output we got before. So what we do here is wrap the compiled code in a function. The function receives the xml and schema and stores them in Module, from which, as we saw before, they are read in order to set up the "files" that contain their data. After the compiled code runs, we simply return Module.return, which as we set up before contains the printed output.<br />
<br />
That's it! xml.js can now be used from normal JS. All you need to do is include the final .js file (xmllint.js in the xml.js repo, for now - still need to clean that up and make a nicer function wrapping, pull requests welcome), and then call validateXML with a string representing some XML and another string representing some XML schema.<br />
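<br />
For example, calling it could look something like this (a sketch: validateXML behaves as set up above - strings in, xmllint's printed output back - and the input strings here are made up):<br />
<div style="font-family: "Courier New",Courier,monospace;">var xml = '&lt;?xml version="1.0"?&gt;&lt;note&gt;hello&lt;/note&gt;';<br />
var schema = '...an XML schema, as a string...';<br />
var output = validateXML(xml, schema); // same text the commandline xmllint printed<br />
console.log(output);</div>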
<br />
<table cellpadding="0" cellspacing="0"><tbody>
<tr data-position="27"><td width="100%"></td><td style="text-align: left;" width="100%"><br />
</td></tr>
</tbody></table>azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com9tag:blogger.com,1999:blog-2681864008001569004.post-80286629558091081912012-02-21T14:18:00.000-08:002012-02-21T14:18:10.938-08:00box2d.js: Box2D on the Web is Getting Faster<a href="http://box2d.org/">Box2D</a> is a popular open source 2D physics library, used for example in Angry Birds. It's been ported to various platforms, including JavaScript through a previous port to ActionScript. <b><a href="https://github.com/kripken/box2d.js">box2d.js</a></b> is a new port, straight from C++ to JavaScript using <a href="https://github.com/kripken/emscripten">Emscripten</a>. Here is a <b><a href="http://syntensity.com/static/box2d.html">demo</a></b>.<br />
<br />
Last December, <a href="http://blog.j15r.com/2011/12/for-those-unfamiliar-with-it-box2d-is.html">Joel Webber benchmarked</a> various versions of Box2D. Of the JavaScript versions, the best (<a href="http://www.mandreel.com/">Mandreel</a>'s build) was <b>12x</b> slower than C. Emscripten did worse, which was not surprising since back then Emscripten could not yet support all LLVM optimizations. Recently, however, that support has landed, so I ran the numbers: on the trunk version of SpiderMonkey (Firefox's JavaScript engine), Emscripten's version is now around <b>6x</b> slower than C. That's twice as fast as the previous best result from December (three times as fast as Emscripten's result at that time).<br />
<br />
That should get even faster as JavaScript engines and the compilers to JavaScript continue to improve. The rate of improvement is quite fast, in fact: you will likely see a big difference between stable and development versions of browsers when running processing-intensive code like Box2D.<br />
<br />
Aside from speed, it's important that the compiled code be easily usable. box2d.js uses the Emscripten bindings generator to wrap compiled C++ classes in friendly JavaScript classes, see the <a href="https://github.com/kripken/box2d.js/blob/master/webgl_demo/box2d.html#L16">demo code</a> for an example. Basically, you can write natural JavaScript like <b><span style="font-family: "Courier New",Courier,monospace;">new Box2D.b2Vec2(0.0, -10.0)</span></b> and it will call the compiled code for you.<br />
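<br />
For instance, setting up and stepping a world looks roughly like this (a sketch based on the demo code; the exact constructor signatures depend on the Box2D version being wrapped, so check the demo if something differs):<br />
<div style="font-family: "Courier New",Courier,monospace;">var gravity = new Box2D.b2Vec2(0.0, -10.0); // y points up, so gravity points down<br />
var world = new Box2D.b2World(gravity);<br />
// Each frame, advance the simulation by one 60Hz timestep:<br />
world.Step(1 / 60, 3, 2); // timestep, velocity iterations, position iterations</div>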
<br />
(And of course, box2d.js is zlib licensed, like Box2D - usable for free in any way.)azakaihttp://www.blogger.com/profile/00792138494525424175noreply@blogger.com2