Friday, March 23, 2012

HOWTO: Port a C/C++ Library to JavaScript (xml.js)

I've been porting various libraries to JavaScript recently (lzma.js, sql.js) and I thought it might be useful to write up details about how this kind of thing is done. So here is how I ported libxml - an open source library that can, among other things, validate XML files against schemas - in response to a request. Note that this isn't a general HOWTO; it's more a detailed writeup of what I did to port libxml in particular, but I hope it's useful for understanding the general technique.

If you just want to see the final result, the ported project is called xml.js, and there is an online demo here (thanks syssgx!)

Part 1: Get the Source Code and Check It Natively

I downloaded the latest libxml source code from the project's website and compiled it natively. One of the generated files is xmllint, a command-line tool that validates XML files against schemas. I made sure it works on a simple example. This is important first as a sanity check on the code being compiled (especially if you are porting code you never used or looked at, which is the case here!), and second because having the testcase will let us easily check the JavaScript version later on. Running xmllint looks like this:

  $ ./xmllint --noout --schema test.xsd test.xml
  test.xml validates


Just to be sure everything is working properly, I introduced some errors into those files, and indeed running xmllint on them produces error messages.

Part 2: Run Configure


  emconfigure ./configure

emconfigure runs a command with some environment variables set to make configure use emcc, the Emscripten replacement for gcc or clang, instead of the local native compiler.

Looking at configure's output, I saw that it enables a lot of functionality we don't really need, for example HTTP and FTP support (we only want to validate schemas given directly to us). So I re-ran configure with options to disable those features. In general, it's a good idea to build just the features you need: first, unneeded code leads to larger code size, which matters on the web; and second, you would need to make sure the additional features compile properly with emcc, and sometimes headers need some modifications (mainly since we use newlib and not glibc).
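
For libxml, that means re-running configure along these lines (the exact flag names can be checked with ./configure --help for your version):

  emconfigure ./configure --without-http --without-ftp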

Part 3: Build the Project

  emmake make


emmake is similar to emconfigure in that it sets some environment variables. emconfigure sets them so that configure works, including configure's configuration tests (which build native executables), whereas emmake sets them so that actually building the project works. Specifically, it makes the project's build system emit LLVM bitcode instead of native code. It works that way because if we generated JavaScript for each object file, we would need to write a JavaScript linker and so forth; this way, we can reuse LLVM's bitcode linking.


Make succeeds, and there are various generated files. But they can't be run! As mentioned above, they contain LLVM bitcode (you can see that by inspecting their contents, they begin with 'BC'). So we have an additional step as described next.

Part 4: Final Conversion to JavaScript

For xmllint, we need xmllint.o, but we also need libxml2.a. We have to specify it manually because LLVM bitcode linking does not support dynamic linking, so dynamic linking is basically ignored by emcc. In most cases, though, it's pretty obvious what you need; here, just libxml2.a.

Slightly less obvious is that we also need libz (the open source compression library). Again, the dynamic linking was ignored, but we can see libz in the native link command. I actually missed this the first time around, but it's no big deal: you get a clear error message at runtime saying a function is not defined, in this case gzopen. A quick grep through the headers shows that gzopen is in libz, so I grabbed libz.bc from the Emscripten test suite (if it weren't there, I would have had to make a quick build of it).

Ok, let's convert this to JavaScript! The following will work:

  emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.test.js --embed-file test.xml --embed-file test.xsd


Let's see what this means:
  • emcc is as mentioned before a drop-in replacement for gcc or clang.
  • -O2 means to optimize. This does both LLVM optimizations and additional JS-level optimizations, including Closure Compiler advanced opts.
  • The files we want to build together are then specified.
  • The output file will be xmllint.test.js. Note that the suffix tells emcc what to generate, in this case, JavaScript.
  • Finally, the odd bit is the two --embed-file options we specify. They embed the contents of those files into the generated code, and set up the emulated filesystem so that the files are accessible normally through stdio calls (fopen, fread, etc.). Why do we need this? It's the simplest way to access some files from compiled code. Without it, if we run the code in a JS console shell we are likely to run into inconsistencies in how those shells let JS read files (binary files in particular are an annoyance), and if we run the code in a web page we have issues with synchronous binary XHRs being disallowed except in web workers. To avoid all those issues, a simple flag to emcc lets us bundle files with the code for easy testing.

Part 5: Test the Generated JavaScript

A JavaScript shell like Node.js, the SpiderMonkey shell or V8's d8 console can be used to run the code. Running it gives this:


  $ node xmllint.test.js --noout --schema test.xsd test.xml
  test.xml validates


That is exactly what the native build gave us for those two files! Success :) Also, introducing intentional errors into the input files leads to the same error messages as in the native build. So everything is working exactly as expected.

Note that we passed the same commandline arguments to the JavaScript build as to the native build of xmllint - the two builds behave exactly the same.

Part 6: Make it Nice and Reusable

What we have now is hardcoded to run on the two example files, and we want a general function that, given any XML file and schema, can validate them. This is pretty easy to do, but making sure it also works with Closure Compiler optimizations is a little trickier. Still, it's not that bad; the details are below, and it's definitely worth the effort because Closure Compiler makes the code much smaller.

The first thing we need is to use emcc's --pre-js option. This adds some JavaScript alongside the generated code (in this case before it, because we say pre and not post). Importantly, --pre-js adds the code before optimizations are run. That means the code will be minified by Closure Compiler together with the compiled code, allowing us to access the compiled code properly - otherwise, Closure Compiler might eliminate functions we need, treating them as dead code.

Here are the contents of the file we will include using --pre-js:

  Module['preRun'] = function() {
    FS.createDataFile(
      '/',
      'test.xml',
      Module['intArrayFromString'](Module['xml']),
      true,
      true);
    FS.createDataFile(
      '/',
      'test.xsd',
      Module['intArrayFromString'](Module['schema']),
      true,
      true);
  };
  Module['arguments'] = ['--noout', '--schema', 'test.xsd', 'test.xml'];
  Module['return'] = '';
  Module['print'] = function(text) {
    Module['return'] += text + '\n';
  };

What happens there is as follows:
  • Module is an object through which Emscripten-compiled code communicates with other JavaScript. By setting properties on it and reading others after the code runs, we can interact with the code.
  • Note that we use string names to access Module, Module['name'] instead of Module.name. Closure will minify the former to the latter, but importantly it will leave the name unminified.
  • Moving on to the actual code: The first thing we modify is Module.preRun, which is code that executes just before running the compiled code itself (but after we set up the runtime environment). What we do in preRun is set up two data files using the Emscripten FileSystem API. For simplicity, we use the same filenames as in the testcase from before, test.xml and test.xsd. We set the data in those files to be equal to Module['xml'] and Module['schema'], which we will explain later; for now, assume those properties of Module have been set and contain strings with XML or an XML schema, respectively. We need to convert those strings to arrays of values in 0-255 using intArrayFromString.
  • We set Module.arguments, which contains the commandline arguments. We want the compiled code to behave exactly as it did in the testcase! So we pass it the same arguments. The only difference will be that the files will have user-defined content in them.
  • Module.print is called when the compiled code does printf or a similar stdio call. Here we customize printing to save to a buffer. After the compiled code runs, we can then access that buffer, as we will see later.
In summary, we "sandbox" the compiled code in the sense that we set up the input files to contain the data we need, and capture the output so that we can do whatever we want to with it later.

We are not yet done, but we can compile the code now - the final thing that remains will be done after we compile it. Compiling can be done with this command:

  emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.raw.js --pre-js pre.js

This is basically the command from before, except we no longer embed files. Instead, we use --pre-js to include pre.js which we discussed before.

After that command runs, we have an optimized and minified build of the code. We wrap that with something we do not want to be optimized and minified, because we want it to be usable from normal JavaScript in a normal way:

  function validateXML(xml, schema) {
    var Module = {
      xml: xml,
      schema: schema
    };
    {{{ GENERATED_CODE }}}
    return Module.return;
  }

GENERATED_CODE should be replaced with the output we got from the compiler. So what we do here is wrap the compiled code in a function. The function receives the xml and schema and stores them in Module, where, as we saw before, we access them to set up the "files" that contain their data. After the compiled code runs, we simply return Module.return, which, as we set up before, contains the printed output.

That's it! libxml.js can now be used from normal JS. All you need to do is include the final .js file (xmllint.js in the xml.js repo, for now - still need to clean that up and make a nicer function wrapping, pull requests welcome), and then call validateXML with a string representing some XML and another string representing some XML schema.
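
Calling it might look something like this (a hypothetical snippet: the XML and schema strings here are made up, and it assumes the final .js file, which defines validateXML, has already been loaded):

  var xml = '<note><to>World</to></note>';
  var schema =
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">' +
    '  <xs:element name="note">' +
    '    <xs:complexType><xs:sequence>' +
    '      <xs:element name="to" type="xs:string"/>' +
    '    </xs:sequence></xs:complexType>' +
    '  </xs:element>' +
    '</xs:schema>';
  // the returned string is whatever xmllint printed, e.g. "test.xml validates"
  console.log(validateXML(xml, schema));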


Tuesday, February 21, 2012

box2d.js: Box2D on the Web is Getting Faster

Box2D is a popular open source 2D physics library, used for example in Angry Birds. It's been ported to various platforms, including JavaScript through a previous port to ActionScript. box2d.js is a new port, straight from C++ to JavaScript using Emscripten. Here is a demo.

Last December, Joel Webber benchmarked various versions of Box2D. Of the JavaScript versions, the best (Mandreel's build) was 12x slower than C. Emscripten did worse, which was not surprising since back then Emscripten could not yet support all LLVM optimizations. Recently, however, that support has landed, so I ran the numbers again, and on the trunk version of SpiderMonkey (Firefox's JavaScript engine), Emscripten's version is now around 6x slower than C. That's twice as fast as the previous best result from December (and three times as fast as Emscripten's result at that time).

That should get even faster as JavaScript engines and the compilers to JavaScript continue to improve. The rate of improvement is quite fast, in fact: you will likely see a big difference between stable and development versions of browsers when running processing-intensive code like Box2D.

Aside from speed, it's important that the compiled code be easily usable. box2d.js uses the Emscripten bindings generator to wrap compiled C++ classes in friendly JavaScript classes, see the demo code for an example. Basically, you can write natural JavaScript like new Box2D.b2Vec2(0.0, -10.0) and it will call the compiled code for you.
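
For instance, a short sketch of what that looks like (the class and method names follow the Box2D C++ API as wrapped by the bindings generator; exact constructor signatures depend on the Box2D version that was compiled):

  var gravity = new Box2D.b2Vec2(0.0, -10.0); // a 2D vector, as in the C++ API
  var world = new Box2D.b2World(gravity);     // the physics world
  world.Step(1 / 60, 6, 2);                   // advance the simulation by one frame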

(And of course, box2d.js is zlib licensed, like Box2D - usable for free in any way.)

Saturday, December 10, 2011

Typed Arrays by Default in Emscripten

Emscripten has several ways of compiling code into JavaScript, for example, it can use typed arrays or not (for more, see Code Generation Modes). I merged the 'ta2 by default' branch into master in Emscripten just now, which makes one of the typed array modes the default. I'll explain here the reason for that, and the results of it.

Originally Emscripten did not use typed arrays. When I began to write it, typed arrays were supported only in Firefox and Chrome, and even there they were of limited benefit due to lack of optimization and incomplete implementation. Perhaps more importantly, it was not clear whether they would ever be universally supported in all browsers. So to generate code that truly runs everywhere, Emscripten did not use typed arrays, it generated "plain vanilla" JavaScript.

However, that has changed. Firefox and Chrome now have mature and well-performing implementations of typed arrays, and Opera and Safari are very close to the same. Importantly, Microsoft has said that IE10 will support typed arrays. So typed arrays are becoming ubiquitous, and have a bright future.

The main benefits of using typed arrays are speed and code compatibility. Speed comes simply from JS engines being able to optimize typed arrays better than normal arrays, both in how they are laid out in memory and in how they are accessed. Compatibility stems from the fact that by using typed arrays with a shared buffer, you can get the same memory behavior as C has; for example, you can read an 8-bit byte from the middle of a 32-bit int and get the same result C would get. It's possible to do that without typed arrays, but it would be much, much slower. (There is, however, a downside to such C-like memory access: your code, if it was not 100% portable in the first place, may depend on the CPU endianness.)
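
To illustrate the idea, here is a small standalone sketch (plain JS, not Emscripten output; the HEAP* names just echo Emscripten's naming conventions):

  var buffer = new ArrayBuffer(8);
  var HEAP32 = new Int32Array(buffer); // 32-bit view of the memory
  var HEAP8 = new Int8Array(buffer);   // 8-bit view of the same memory
  HEAP32[0] = 0x12345678;
  // read one byte from the middle of the 32-bit int, just as C would:
  // 0x56 on a little-endian machine, 0x34 on a big-endian one
  console.log(HEAP8[1]);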

Because of those benefits, I worked towards using typed arrays by default. To get there, I had to fix various issues with accessing 64-bit values, which are only a problem when doing C-like memory access, because unaligned 64-bit reads and writes do not work (due to how the typed arrays API is structured). The settings I64_MODE and DOUBLE_MODE control how those 64-bit values are accessed: if set to 1, reads and writes are done in two 32-bit parts, in a safe way.
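
Roughly speaking, the safe mode turns a 64-bit access into two aligned 32-bit accesses, along these lines (a sketch for illustration only; the helper name is made up, this is not actual Emscripten output):

  // read a 64-bit value at an address that is 4-byte aligned,
  // but possibly not 8-byte aligned
  function readI64(HEAP32, ptr) {
    var low = HEAP32[ptr >> 2];        // low 32 bits
    var high = HEAP32[(ptr + 4) >> 2]; // high 32 bits
    return [low, high];                // kept as a pair, since JS has no native int64
  }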

Another complication is that typed arrays cannot be resized. So when sbrk() is called with a value larger than the current maximum size, we can't simply enlarge the typed arrays we are using. The current implementation creates new typed arrays and copies the old values into them, which works but is potentially slow.
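
In other words, growing the heap looks roughly like this (a simplified sketch of the approach, not the actual implementation):

  function growHeap(HEAP8, newSize) {
    var newHEAP8 = new Int8Array(newSize); // allocate a new, larger buffer
    newHEAP8.set(HEAP8);                   // copy the old contents (the slow part)
    return newHEAP8;                       // all other views must be rebuilt on top of it
  }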

Typed arrays have already worked in Emscripten for a long time (in two modes, even, shared and non-shared buffers), but the issues mentioned in the previous two paragraphs limited their use in some areas. So the recent work has been to smooth over all the missing pieces, to make typed arrays ready as the default mode.

The current default in Emscripten, after the merge, is to use typed arrays (in mode 2, with a shared buffer, that is, C-like memory access), and all the other settings are set to safe values (I64_MODE and DOUBLE_MODE are both 1), etc. This means that all the code that worked out of the box before will continue to work, and additional code will now work out of the box as well. Note that this is just the defaults: If your makefile sets all the Emscripten settings itself (like defining whether to use typed arrays or not, etc.), then nothing will change.

The only thing to keep in mind with this change is that, by default, you will need typed arrays to run the generated code. If you want your code to run in as many places as possible right now, you should set USE_TYPED_ARRAYS to 0 to disable typed arrays. Another possible issue is that not all JS console environments support typed arrays: recent versions of SpiderMonkey and Node.js do, but the V8 shell has some issues (note that this is just a problem in the commandline shell, not in Chrome), so if you test your generated code using d8, it will not work. Instead, you can test it in a browser, or with Node.js or the SpiderMonkey shell for now.
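
Passing such a setting on the emcc command line looks something like the following (the file names are placeholders, and you should check the Emscripten documentation for the exact way to pass settings in your version):

  emcc -s USE_TYPED_ARRAYS=0 main.o -o main.js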

Tuesday, November 15, 2011

Code Size When Compiling to JavaScript

When compiling code to JavaScript from some other language, one of the questions is how big the code will be. This is interesting because code must be downloaded on the web, and large downloads are obviously bad. So I wanted to investigate this, to see where we stand and what we need to do (either in current compilers, or in future versions of the JavaScript language - being a better compiler target is one of the goals there).

The following is some preliminary data from two real-world codebases, the Bullet physics library (compiled to JavaScript in the ammo.js project) and Android's H264 decoder (compiled to JavaScript in the Broadway project):

Bullet

.js        19.2  MB
.js.cc      3.0  MB
.js.cc.gz   0.48 MB

.o          1.9  MB
.o.gz       0.56 MB

Android H264

.js       2,493 KB
.js.cc      265 KB
.js.cc.gz    61 KB

.o          110 KB
.o.gz        53 KB

Terms used: 

.js         Raw JS file compiled by Emscripten from LLVM bitcode
.js.cc      JS file with Closure Compiler simple opts
.js.cc.gz   JS file with Closure, gzipped

.o          Native code object file
.o.gz       Native code object file, gzipped

Notes on methodology:
  • Native code was generated with -O2. This leads to smaller code than without optimizations in both cases.
  • Closure Compiler advanced optimizations generate smaller JS code in these two cases, but not by much. While advanced mode optimizes better for size, it also does inlining, which increases code size. In any case, comparing with it would be potentially misleading, since its dead code elimination rationale is different from the one used for LLVM and native code, so I used simple opts instead.
  • gzip makes sense here because you can compress your scripts on the web using it (and probably should). You can even do gzip compression in JS itself (by compiling the decompressor).
  • Debug info was not left in any of the files compared here.
  • This calculation overstates the size of the JS files, because they have the relevant parts of Emscripten's libc implementation statically linked in. But, it isn't that much.
  • LLVM and clang 3.0-pre are used (rev 141881), Emscripten and Closure Compiler are latest trunk as of today.

Analysis

At least in these two cases, it looks like compiled, optimized and gzipped JavaScript is very close in size to (also gzipped) native object files. In other words, the effective size of the compiled code is pretty much the same as you would get when compiling natively. This was a little surprising; I was expecting the size to be bigger, and to then proceed to investigate what could be improved.

Now, the raw compiled JS is in fact very large. But that is mostly because the original variable names appear there, which is basically fixed by running Closure. After Closure, the main reason the code is large is because it's in string format, not an efficient binary format, so there are things like JavaScript keywords ('while', for example) that take a lot of space. That is basically fixed by running gzip since the same keywords repeat a lot. At that point, the size is comparable to a native binary.

Another comparison we can make is to LLVM bitcode. This isn't an apples-to-apples comparison of course, since LLVM bitcode is a compiler IR: It isn't designed as a way to actually store code in a compact way, instead it's a form that is useful for code analysis. But, it is another representation of the same code, so here are those numbers:

Bullet

.bc         3.9  MB
.bc.gz      2.2  MB

Android H264

.bc         365 KB
.bc.gz      258 KB

LLVM bitcode is fairly large, even with gzip: gzipped bitcode is roughly 4x larger than either gzipped native code or gzipped JS. I am not sure, but I believe the main reason LLVM bitcode is so large here is that it is strongly and explicitly typed. Because of that, each instruction carries explicit types for the expressions it operates on, and values of different types must be explicitly converted. For example, in both native code and compiled JS, taking a pointer of one type and converting it to another is a simple assignment (which can even be eliminated depending on where it is later used), but in LLVM bitcode the pointer must be explicitly cast to the new type, which takes an instruction.

So JS and native code are similar in their lack of explicit types, and in their gzipped sizes. This is a little ironic, since JS is a high-level language and native code is the exact opposite. But both JS and native code turn out to be pretty space-efficient, while something that seems to sit in between them - LLVM bitcode, which is higher-level than native code but lower-level than JS - ends up being much larger. Again, this actually makes sense: native code and JS are designed simply to execute, while LLVM bitcode is designed for analysis, so it really isn't in between those two.

(Note that this is in no way a criticism of LLVM bitcode! LLVM bitcode is an awesome compiler IR, which is why Emscripten and many other projects use it. It is not optimized for size, because that isn't what it is meant for, as mentioned above, it's a form that is useful for analysis, not compression. The reason I included those numbers here is that I think it's interesting seeing the size of another representation of the same compiled code.)

In summary, it looks like JavaScript is a good compilation target in terms of size, at least in these two projects. But as mentioned before, this is just a preliminary analysis (for example, it would be interesting to investigate specific compression techniques for each type of code, and not just generic gzip). If anyone has additional information about this topic, it would be much appreciated :)

Friday, October 7, 2011

JSConf.eu, Slides, SQLite on the Web

I got back from JSConf.eu a few days ago. I had never been to JSConf before, and it was very interesting! Lots of talks about important and cool stuff. The location, Berlin, was also very interesting (the mixture of new and old architecture in particular). Overall it was a very intensive two days, and the organizers deserve a ton of credit for running everything smoothly and successfully.

I was invited to give a talk about Emscripten, the LLVM to JavaScript compiler I've been working on as a side project over the last year. Here are my slides from the talk, links to demos are in them. There was also a fourth unplanned demo which isn't in the slides, here is a link to it.

If you've seen the previous Emscripten demos, then some of what I showed had new elements, like the Bullet/ammo.js demo, which shows the new bindings generator that lets you use compiled C++ objects directly from JS in a natural way. One demo was entirely new though: SQLite ported to JS. I haven't had time to do any rigorous testing of the port or to optimize it for speed. However, it appears to work properly in all the basic tests I tried: creating tables, doing selects, joins, etc. With WebSQL not moving forward as a web standard, compiling SQLite to JS directly might be the best way to get SQL on the web. The demo is just a proof of concept, but I think it shows the approach is feasible.

Monday, April 11, 2011

Rendering PDFs in JavaScript...?

I released Emscripten 1.0 over the weekend, which came with a demo of rendering PDFs entirely in JavaScript (warning: >12MB will be downloaded for that page). Emscripten is an LLVM-to-JavaScript compiler which allows running code written in C or C++ on the web. In the linked demo, Poppler and FreeType were compiled to JavaScript from C++.

The goal of the demo was to show Emscripten's capabilities. Over the last year it has gotten very usable, and can probably compile most reasonable C/C++ codebases (albeit with some manual intervention in some cases). It is my hope that Emscripten can help against the tendency to write non-web applications, such as native mobile applications (for iOS, Android, etc.) or using plugins on the web (Flash, NaCl, etc.). Simply put, the web is competing with these platforms. Emscripten can make the web a more attractive platform for developers, by letting them use their languages of choice, such as C, C++ or Python (without necessarily compromising on speed: the code generated by Emscripten can be optimized very well, and it is my hope that things like type inference will make it very fast eventually).

Meanwhile, getting back to the PDF rendering demo, I was thinking: how about making a Firefox plugin with it, so that when a PDF is clicked in Firefox it is shown in an internal PDF viewer? Aside from the novelty, I think this would be cool to do because it would be an extremely secure PDF viewer (since it would be entirely in JavaScript). If you are a plugin or frontend hacker and think it's a cool idea too, please get in touch and let's make it happen! :)

Sunday, September 12, 2010

Emscripten Updates

Minor note: Updates about Emscripten will be on my other blog - fits in better with the theme over there. Though I do believe the two should converge, when all is ready.