Friday, March 23, 2012

HOWTO: Port a C/C++ Library to JavaScript (xml.js)

I've been porting various libraries to JavaScript recently (lzma.js, sql.js) and I thought it might be useful to write up details about how this kind of thing is done. So here is how I ported libxml - an open source library that can validate XML schemas - in response to a request. Note that this isn't a general HOWTO, it's more a detailed writeup of what I did to port libxml in particular, but I hope it's useful for understanding the general technique.

If you just want to see the final result, the ported project is called xml.js, and there is an online demo here (thanks syssgx!)

Part 1: Get the Source Code and Check It Natively

I downloaded the latest libxml source code from the project's website and compiled it natively. One of the generated files is xmllint, a commandline tool to validate schemas. I made sure it works on a simple example. This is important first of all as a sanity check on the code being compiled (especially important if you are porting code you never used or looked at, which is the case here!), and second having the testcase will let us easily check the JavaScript version later on. Running xmllint looks like this:

  $./xmllint --noout --schema test.xsd test.xml
  test.xml validates

Just to be sure everything is working properly, I introduced some errors into those files, and indeed running xmllint on them produces error messages.
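
For reference, a minimal schema/document pair of the kind used above might look like this (hypothetical file contents - any valid pair will do):

```shell
# Hypothetical minimal test files; any valid schema/document pair works.
cat > test.xsd <<'EOF'
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note" type="xs:string"/>
</xs:schema>
EOF

cat > test.xml <<'EOF'
<?xml version="1.0"?>
<note>hello</note>
EOF
```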

Part 2: Run Configure

  emconfigure ./configure

emconfigure runs a command with some environment variables set to make configure use emcc, the Emscripten replacement for gcc or clang, instead of the local native compiler.

When looking at the results of configure, I saw it includes a lot of functionality we don't really need, for example HTTP and FTP support (we only want to validate schemas directly given to us). So I re-ran configure with the options to disable those features. In general, it's a good idea to build just the features you need: First, unneeded code leads to larger code size, which matters on the web, and second, you will need to make sure the additional features compile properly with emcc, and sometimes headers need some modifications (mainly since we use newlib and not glibc).
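
For libxml, those features can be switched off with configure flags; for example (flag names as I recall them from libxml's ./configure --help - verify against your version):

```shell
emconfigure ./configure --without-http --without-ftp
```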

Part 3: Build the Project

  emmake make

emmake is similar to emconfigure, in that it sets some environment variables. emconfigure sets them in order for configure to work, including configure's configuration tests (which build native executables), whereas emmake sets them in order for actually building the project to work. Specifically, it makes the project's build system use LLVM bitcode as the generated code format instead of native code. It works that way because if we generated JS for each object file, we would need to write a JS linker and so forth, whereas this way we can use LLVM's bitcode linking etc.

Make succeeds, and there are various generated files. But they can't be run! As mentioned above, they contain LLVM bitcode (you can see that by inspecting their contents, they begin with 'BC'). So we have an additional step as described next.
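
A quick way to check one of the generated files from the shell (file name is just an example):

```shell
head -c 2 xmllint.o && echo    # LLVM bitcode files begin with the magic bytes 'BC'
```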

Part 4: Final Conversion to JavaScript

For xmllint, we need xmllint.o, but we also need libxml2.a. We have to specify it manually because LLVM bitcode linking does not support dynamic linking, so dynamic libraries are basically ignored by emcc and static libraries must be listed explicitly. In most cases it's pretty obvious what you need; here, just libxml2.a.

Slightly less obvious is that we also need libz (the open source compression library). Again, the dynamic library was ignored by the bitcode linker, but we can see it in the native link command. I actually missed this the first time around, but it's no big deal: you get a clear error message at runtime saying a function is not defined, in this case gzopen. A quick grep through the headers shows that gzopen is in libz, so I grabbed libz.bc from the Emscripten test suite (if it weren't there, I would have had to make a quick build of it).

Ok, let's convert this to JavaScript! The following will work:

  emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.test.js --embed-file test.xml --embed-file test.xsd

Let's see what this means:
  • emcc is as mentioned before a drop-in replacement for gcc or clang.
  • -O2 means to optimize. This does both LLVM optimizations and additional JS-level optimizations, including Closure Compiler advanced opts.
  • The files we want to build together are then specified.
  • The output file will be xmllint.test.js. Note that the suffix tells emcc what to generate, in this case, JavaScript.
  • Finally, the odd bit is the two --embed-file options we specify. What this does is actually embed the contents of those files into the generated code, and set up the emulated filesystem so that the files are accessible normally through stdio calls (fopen, fread, etc.). Why do we need this? It's the simplest way to just access some files from compiled code. Without this, if we run the code in a JS console shell, we are likely to run into inconsistencies of how those shells let JS read files (binary files in particular are an annoyance), and if we run the code in a web page, we have issues with synchronous binary XHRs being disallowed except for web workers. So to avoid all those issues, a simple flag to emcc lets us bundle files with the code for easy testing.

Part 5: Test the Generated JavaScript

A JavaScript shell like Node.js, the SpiderMonkey shell or V8's d8 console can be used to run the code. Running it gives this:

  $node xmllint.test.js --noout --schema test.xsd test.xml
  test.xml validates

Which is exactly what the native build gave us for those two files! Success :) Also, introducing intentional errors into the input files leads to the same errors as in the native build. So everything is working exactly as expected.

Note that we passed the same commandline arguments to the JavaScript build as to the native build of xmllint - the two builds behave exactly the same.

Part 6: Make it Nice and Reusable

What we have now is hardcoded to run on the two example files, and we want a general function that given any XML file and schema, can validate them. This is pretty easy to do, but to make sure it also works with Closure Compiler optimizations is a little trickier. Still, it's not that bad, details are below, and it's definitely worth the effort because Closure Compiler makes the code much smaller.

The first thing we need is to use emcc's --pre-js option. This adds some JavaScript alongside the generated code (in this case before it because we say pre and not post). Importantly, --pre-js adds the code before optimizations are run. That means that the code will be minified by Closure Compiler together with the compiled code, allowing us to access the compiled code properly - otherwise, Closure Compiler might eliminate as dead code functions that we need.

Here are the contents of the file we will include using --pre-js:

  Module['preRun'] = function() {
    FS.createDataFile('/', 'test.xml', intArrayFromString(Module['xml']), true, true);
    FS.createDataFile('/', 'test.xsd', intArrayFromString(Module['xsd']), true, true);
  };
  Module['arguments'] = ['--noout', '--schema', 'test.xsd', 'test.xml'];
  Module['return'] = '';
  Module['print'] = function(text) {
    Module['return'] += text + '\n';
  };

What happens there is as follows:
  • Module is an object through which Emscripten-compiled code communicates with other JavaScript. By setting properties on it and reading others after the code runs, we can interact with the code.
  • Note that we use string names to access Module: Module['name'] instead of Module.name. Closure Compiler will minify dot accesses like Module.name, but it leaves string-keyed accesses alone, so unoptimized outside code can still read and write those properties after minification.
  • Moving on to the actual code: The first thing we modify is Module.preRun, which is code that executes just before running the compiled code itself (but after we set up the runtime environment). What we do in preRun is set up two data files using the Emscripten FileSystem API. For simplicity, we use the same filenames as in the testcase from before, test.xml and test.xsd. We set the data in those files to be equal to Module['xml'] and Module['xsd'], which we will explain later; for now, assume those properties of Module have been set and contain strings with XML or an XML schema, respectively. We need to convert those strings to an array of values in 0-255 using intArrayFromString.
  • We set Module.arguments, which contains the commandline arguments. We want the compiled code to behave exactly as it did in the testcase! So we pass it the same arguments. The only difference will be that the files will have user-defined content in them.
  • Module.print is called when the compiled code does printf or a similar stdio call. Here we customize printing to save to a buffer. After the compiled code runs, we can then access that buffer, as we will see later.

In summary, we "sandbox" the compiled code in the sense that we set up the input files to contain the data we need, and capture the output so that we can do whatever we want to with it later.
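
The intArrayFromString conversion mentioned above is straightforward; here is a simplified, ASCII-only sketch (the real Emscripten helper also handles UTF-8 and an optional flag to skip the trailing NUL):

```javascript
// Simplified sketch of intArrayFromString: string -> array of byte values.
function intArrayFromString(str) {
  var arr = [];
  for (var i = 0; i < str.length; i++) {
    arr.push(str.charCodeAt(i) & 0xff); // keep values in 0-255
  }
  arr.push(0); // C strings are NUL-terminated
  return arr;
}

console.log(intArrayFromString("ab")); // [ 97, 98, 0 ]
```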

We are not yet done, but we can compile the code now - the final thing that remains will be done after compiling it. Compiling can be done with this command:

  emcc -O2 xmllint.o .libs/libxml2.a libz.a -o xmllint.raw.js --pre-js pre.js

This is basically the command from before, except we no longer embed files. Instead, we use --pre-js to include pre.js which we discussed before.

After that command runs, we have an optimized and minified build of the code. We wrap that with something we do not want optimized and minified, because we want it to be usable from normal JavaScript in a normal way:

  function validateXML(xml, schema) {
    var Module = {
      xml: xml,
      xsd: schema
    };
    {{{ GENERATED_CODE }}}
    return Module.return;
  }
GENERATED_CODE should be replaced with the output we got before from the compiler. So, what we do here is wrap the compiled code in a function. The function receives the xml and schema and stores them in Module, where, as we saw before, we access them to set up the "files" that contain their data. After the compiled code runs, we simply return Module.return, which, as we set up before, contains the printed output.
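
To see the shape of the wrapper end to end, here is a toy stand-in in which a fake piece of "generated code" just prints a fixed line; in the real build, that middle section is emcc's output (together with pre.js), which uses Module in the same way:

```javascript
// Toy model of the wrapping pattern; the marked section stands in for the
// real {{{ GENERATED_CODE }}}, which would read and write Module the same way.
function validateXML(xml, schema) {
  var Module = {
    xml: xml,
    xsd: schema,
    'return': ''
  };
  // --- stand-in for {{{ GENERATED_CODE }}} ---
  Module['print'] = function(text) { Module['return'] += text + '\n'; };
  Module['print']('test.xml validates'); // pretend the compiled xmllint ran
  // --- end stand-in ---
  return Module['return'];
}

console.log(validateXML('<note>hi</note>', '<xs:schema/>'));
```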

That's it! libxml.js can now be used from normal JS. All you need to do is include the final .js file (xmllint.js in the xml.js repo, for now - still need to clean that up and make a nicer function wrapping, pull requests welcome), and then call validateXML with a string representing some XML and another string representing some XML schema.