Tuesday, July 13, 2010

Experiments with 'Static' JavaScript: As Fast As Native Code?

I don't really know much about computer language theory. I just mess around with it in my spare time (weekends mostly). Here is one thing I've been thinking about: I want the speed of native code on the web - because I want to run things like game engines there - but I don't want Java, or NaCl, or some plugin. I want to use standard, platform-agnostic web technologies.

So, how about if we could compile JavaScript into native code? Of course, in general we can't, at least not very well. But if a JavaScript program - or part of a program - happens to be implicitly statically typed, and in other ways 'performance-friendly', then we should be able to compile that code and run it very quickly.
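To make "implicitly statically typed" concrete, here is a small Python sketch (Python, since that's where the translated code ends up below) in the spirit of the fannkuch kernel. There are no type annotations, but every variable keeps a single inferable type, which is what lets a translator like PyPy's assign static types and emit native code:

```python
def flips(perm):
    # perm is always a list of ints; count is always an int.
    # Because the types never change, this code is implicitly
    # statically typed and a compiler can specialize it fully.
    count = 0
    while perm[0] != 0:
        k = perm[0]
        perm[:k + 1] = perm[k::-1]  # reverse the first k+1 elements
        count += 1
    return count
```

The same property fails as soon as a variable holds, say, an int on one path and a string on another - which is exactly the kind of code this approach cannot handle.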

In fact PyPy does basically that with RPython - a subset of Python that it can translate into C. So, building on that, I made a demo of the following process:
  • Begin with some benchmark in JavaScript
  • Parse the JavaScript using the SpiderMonkey parser (a small hack of the code was needed in order to get it to dump the syntax tree in a nice format)
  • Translate the parsed code into Python (using a Python script, see below)
  • If the generated code happens to be RPython, happily run that through PyPy's RPython translator to get native code, and execute that
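As a hypothetical illustration of step 3 (this is the flavor of output, not the actual translator script), the JavaScript-to-Python step is mostly syntax rewriting - for example, lowering a C-style for loop into init / while-test / step, which stays inside RPython's comfort zone:

```python
# JS input (hypothetical example):
#   function sumTo(n) { var s = 0; for (var i = 0; i < n; i++) s += i; return s; }
# A direct translation keeps the structure but uses Python syntax:

def sumTo(n):
    s = 0
    i = 0            # for-loop init hoisted out
    while i < n:     # for-loop test
        s += i
        i += 1       # for-loop step
    return s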
The code I wrote for this is horribly ugly, but if you must look at it, here it is (if I decide this idea is worth doing any more work on, I will almost certainly rewrite it from scratch). Here are the results for the fannkuch benchmark (run on the value n=10, on a Core 2 Duo laptop):

[Chart: running times on the fannkuch benchmark, n=10]

The first row in the chart shows a C++ implementation of fannkuch, compiled with -O3. All the other rows start with a JavaScript implementation and proceed from there. First there is simply the result of running V8 and SpiderMonkey on the benchmark. Then, the results of automatically converting the JavaScript to Python and running CPython on it are shown - very slow. Finally, that Python is run through PyPy's RPython translator, which generates native code, which runs very fast.

The final result is that the JavaScript => Python => PyPy pipeline runs the code only 26% slower than a C++ implementation of the same benchmark compiled at -O3. For comparison, V8 run on that same JavaScript is 3.76 times slower. So, it seems there is potential here to get (some) JavaScript running very fast, pretty much as fast as possible.

Of course, there is a big startup cost here - the conversion process takes a lot of time (at least PyPy is fun to watch as it runs ;). And JavaScript on the web needs to start up quickly. But there should be technical solutions for this, for example, running the code normally while attempting the conversion process in a separate thread (so, on another CPU core). If the process finishes successfully, swap in the faster version - or, if you can't hot-swap it, at least remember and use the compiled version the next time you visit that website, which would still help for sites you visit a lot.
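The run-now, swap-later idea can be sketched in a few lines. This is a toy model with stand-in functions: slow_version plays the interpreter, and compile_to_fast (hypothetical, here it just sleeps) plays the JavaScript => Python => PyPy pipeline running on another core:

```python
import threading
import time

def slow_version(n):
    return sum(range(n))            # the "interpreted" implementation

def compile_to_fast():
    time.sleep(0.05)                # pretend compilation takes a while
    return lambda n: n * (n - 1) // 2   # the "native" implementation

class HotSwappable:
    def __init__(self):
        self.impl = slow_version    # start running normally right away
        self._thread = threading.Thread(target=self._background_compile)
        self._thread.daemon = True
        self._thread.start()

    def _background_compile(self):
        self.impl = compile_to_fast()   # swap when compilation finishes

    def __call__(self, n):
        return self.impl(n)

f = HotSwappable()
early = f(10)        # served by the slow version while compiling
f._thread.join()
late = f(10)         # served by the compiled version afterwards
```

Callers never notice the swap; they just get faster answers once compilation finishes.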

So, as I said, this is just a little demo I worked on in my spare time, to see if it can even work, and to learn more about the topic. It actually seems to sort of work (I was not very optimistic before I started), but I'm still unsure whether it makes sense to pursue. Feedback is welcome.

(I also have a few other experiments I'm working on, that are closely related, and involve LLVM. I'll leave those for a future post.)


Edit: Jernej asked in a comment about Rhino. So for comparison I also ran this code in Rhino and Jython. Rhino took 22.51 seconds - more than twice as slow as SpiderMonkey - and Jython took 74.53 seconds, closer to the slowness of CPython than to the next slowest result (Rhino). Neither Rhino nor Jython is particularly fast in general, but they do give other significant benefits (integration with the JVM, proper multithreading, security, etc.).

Edit 2: Following comments by Sayre here and dmandelin elsewhere, here are the results of JaegerMonkey on this benchmark: 4.62 seconds. So, much faster than TraceMonkey, and almost as fast as V8 (very cool!).

13 comments:

  1. btw: jaegermonkey is already quite close to v8's performance.

    ReplyDelete
  2. http://www.mozilla.org/rhino/jsc.html

    just to compare ? (java recompiling to native on a given platform...)

    ReplyDelete
  3. I understand the preference for standards, but I wouldn't discount the role of plugins completely. I consider plugins to be a sort of proving ground for what might become standards.

    In that respect I think Google would like some descendant of nacl to become a standard. At the moment it's moving in too many directions simultaneously (llvm? native but multi architecture?)

    I have my own toy project in this area. It's similar to NaCl, but approaches the code as a bytecode for an x86 subset. The proof of the pudding will be whether I can get that x86 subset running at good speeds on ARM (which might happen after I get my hands on a decent ARM device; my Pandora is a tad late). It's a plugin that shows the possibilities. Your own work is in a similar vein. We're looking for a better way to do things.

    ReplyDelete
  4. V8 compiles JS directly to machine code (no interpreter). I think that's as fast as you can go.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. @tapir I remember reading some comments by Mike Pall (author of LuaJIT) indicating that he thought that multistage compilers can (always?) beat out a single stage compiler (such as v8); I can't find them now, but there's a lot of related interesting commentary from him here: http://www.reddit.com/user/mikemike

    V8's single stage design probably makes startup very fast, but probably doesn't allow for all of the optimization you might want to do (but I'm not a compiler guy, and the V8 guys seem to keep pulling the performance rabbit out of the hat...).

    ReplyDelete
  7. Couldn't someone write a plug-in that does the compiling/execution of the javascript? Then if that plug-in isn't present, the javascript simply runs as usual.

    ReplyDelete
  8. I don't get where you're going with this. RPython is a statically typed language like C++ and is compiled just like it (even using the same compiler in the end), so it should have the same speed. I don't think you can statically type most of the code on the web though, so this speed advantage isn't very relevant - or am I missing something?

    ReplyDelete
  9. @Leonardo:

    Yes, RPython is basically the same. And yes, you can't statically analyze most of the code on the web. What I was investigating, though, was whether *some* of the code on the web was statically analyzable. Turns out that the first benchmark I pulled off of the language shootout was of that kind.

    In general most code isn't. But, the speed-intensive parts of a lot of things might be. So it could be useful for JavaScript engines to *try* to statically analyze, which currently none of them do.
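    To make "*try* to statically analyze" concrete, here is a toy check - far cruder than real type inference, and purely illustrative - that flags a variable whose assigned literals change type, using Python's own ast module on translated code:

```python
import ast

def stays_one_type(src):
    # Walk simple assignments of literals; if any variable is ever
    # assigned literals of two different types, it is not statically
    # typeable by this (very naive) check.
    seen = {}
    for node in ast.walk(ast.parse(src)):
        if (isinstance(node, ast.Assign)
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Constant)):
            name = node.targets[0].id
            t = type(node.value.value).__name__
            if seen.setdefault(name, t) != t:
                return False
    return True
```

    A real analysis would of course have to track types through expressions, calls, and control flow, which is exactly what PyPy's RPython translator does.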

    ReplyDelete
  10. tapir, it's a misconception that "machine code is as fast as you can go." Compiler developers spend just about all their time making some batch of machine code run faster than another. The speedup between one set of machine code and another (both implementing the same algorithm) can be orders of magnitude. I would imagine JITed code suffers from most of the typical problems, including high overhead of many method invocations, blown caches, and complex and unpredictable control flow, causing pipeline bubbles.

    I've implemented a static subset of Ruby before, and indeed the trivial cases (if foo; expr; end) can run as fast as (or faster than) C/C++. The "or faster" part comes from the fact that scripts are frequently alias-free, whereas C variables can have their addresses taken, significantly reducing the amount of optimization available. The merest risk of aliasing makes optimizing C very difficult in comparison to, say, Fortran, where pointer targets have to be declared explicitly by the programmer. In such an environment, Fortran code can be 50% faster on the same algorithm.

    C++ introduced references to help with this problem, but as always, the language lets you sidestep the intention.

    ReplyDelete
  11. This comment has been removed by the author.

    ReplyDelete
  12. I doubt V8-generated machine code is "as fast as you can go." Even in the 6502/8080 days, the optimization difference was +100%. Now, with much more complex instruction sets with hordes of specialized extensions, multi-level caches, and CPU-specific differences, I wouldn't be surprised if the maximum difference turns out to be 100x.

    ReplyDelete
  13. Great post, very interesting.

    ReplyDelete