Thursday, February 21, 2013

Heap corruption checking in Emscripten

I've seen a few weird things when porting codebases, where the ported code just doesn't behave properly. Sometimes this is a bug in the compiler, but it can also be that the source is not fully portable. It can be hard to figure out which it is, so compilers often have automatic tools to help with that sort of thing. Emscripten has several, like SAFE_HEAP mode, which checks for alignment errors and reads from beyond the heap, and lets you instrument each read and write manually. Recently I added another, CORRUPTION_CHECK, which performs simple heap corruption checking.

This idea came to me during a flight, and it probably isn't original in any way. Basically, each malloc allocates additional memory, with a buffer zone before and after the "real" allocation that the program sees. The buffer zones are filled with canary values, and later on those are checked to see if they were modified. If they were, the program is writing to memory it has no business writing to, and something has gone very wrong. Then you just increase the frequency of the checks, maybe even adding a few manual calls as you narrow things down, until you see exactly where the corruption happens. With large enough buffer zones, this approach has a good probability of catching writes to random places in memory, and an even better chance of catching typical bugs like allocating 1024 bytes and writing 1025 values. Yesterday I used this tool successfully on a large C++ codebase; the bug turned out to be an incorrect use of std::vector.
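The buffer-zone idea above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the code in corruptionCheck.js: the names (HEAP, checkedMalloc, checkedFree, checkAll), the 16-byte zone size, the canary byte, and the bump allocator standing in for the real malloc are all made up for the example.

```javascript
// Hypothetical sketch; none of these names are Emscripten's actual API.
const HEAP = new Uint8Array(64 * 1024); // stand-in for the compiled program's heap
const BUFFER = 16;      // bytes of buffer zone on each side (assumed size)
const CANARY = 0xA5;    // assumed canary byte
let nextFree = 8;       // trivial bump allocator standing in for the real malloc
const live = new Map(); // user pointer -> requested size, kept in plain JS

function rawMalloc(size) {
  const ptr = nextFree;
  nextFree += (size + 7) & ~7; // round up to keep 8-byte alignment
  if (nextFree > HEAP.length) throw new Error('out of memory');
  return ptr;
}

// Allocate extra room and fill the zones before and after with canaries.
function checkedMalloc(size) {
  const base = rawMalloc(size + 2 * BUFFER);
  const ptr = base + BUFFER; // the "real" allocation the program sees
  HEAP.fill(CANARY, base, ptr);
  HEAP.fill(CANARY, ptr + size, ptr + size + BUFFER);
  live.set(ptr, size);
  return ptr;
}

// True if both buffer zones around one allocation are still intact.
function checkOne(ptr, size) {
  for (let i = ptr - BUFFER; i < ptr; i++) {
    if (HEAP[i] !== CANARY) return false;
  }
  for (let i = ptr + size; i < ptr + size + BUFFER; i++) {
    if (HEAP[i] !== CANARY) return false;
  }
  return true;
}

// Returns the user pointers whose buffer zones were modified.
function checkAll() {
  const bad = [];
  for (const [ptr, size] of live) {
    if (!checkOne(ptr, size)) bad.push(ptr);
  }
  return bad;
}

function checkedFree(ptr) {
  const size = live.get(ptr);
  if (size === undefined) throw new Error('free of unknown pointer ' + ptr);
  if (!checkOne(ptr, size)) throw new Error('corruption detected near ' + ptr);
  live.delete(ptr);
}

// Usage: allocate 1024 bytes, then write 1025 of them.
const p = checkedMalloc(1024);
for (let i = 0; i <= 1024; i++) HEAP[p + i] = 0; // off-by-one overrun
console.log('corrupted allocations:', checkAll()); // the list contains p
```

Calling checkAll() more often, or sprinkling manual calls around suspicious code, narrows down where the bad write happens. A wild write that lands in the middle of some other allocation's payload would not be caught by this sketch, which is why larger zones improve the odds.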

But what I really want to talk about here is the implementation of this heap checker tool:

https://github.com/kripken/emscripten/blob/incoming/src/corruptionCheck.js

The idea is fairly simple, and the implementation is less than 100 lines of JavaScript. It replaces the normal malloc and free functions, and basically just does what I said before. I think it's kind of nice how easy it is to do stuff like this on code compiled to JavaScript.

And actually, in many ways it is easier to debug C/C++ compiled to JavaScript than C/C++ compiled to native code. For example, even if there is heap corruption in the C/C++ codebase, we know it cannot corrupt the debugging code, which is pure JS. JS is a safe language, so JS objects can't be randomly overwritten by the compiled code, which just accesses the typed array heap and some side objects; it simply can't get to the CorruptionChecker object. That's a nice guarantee to have. So we can debug the compiled code from a safe, scripted environment where it's easy to automate things by just adding some JS.
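A small illustration of that guarantee, again as a hypothetical sketch: buggyCompiledCode stands in for compiled output that only ever touches the typed array it is given, and the checker object here is a made-up stand-in for the real CorruptionChecker.

```javascript
// Hypothetical sketch; the names are illustrative only.
const heap = new Uint8Array(256); // the only memory "compiled" code can reach

// Bookkeeping lives in an ordinary JS object, outside the typed array.
const checker = { canaryAt: 100, canaryValue: 0x5A };
heap[checker.canaryAt] = checker.canaryValue;

// Stand-in for badly broken compiled code: scribbles over the entire heap.
function buggyCompiledCode(HEAP8) {
  for (let i = 0; i < HEAP8.length; i++) HEAP8[i] = 0xFF;
}

buggyCompiledCode(heap);

// The canary in the heap is gone, but the checker's own record of where it
// was and what it held is untouched, so the corruption is still detected.
const corrupted = heap[checker.canaryAt] !== checker.canaryValue;
console.log(corrupted, checker.canaryAt, checker.canaryValue);
```

In a native build the equivalent bookkeeping would usually sit in the same address space as the buggy code, so a wild enough write could damage the checker itself.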

6 comments:

  1. Now I'm curious, does Emscripten implement signal handlers?

  2. We can't really implement signals in JS, we just have stubs for those functions, without content.

  3. It's a good idea, and an old one. Valgrind does exactly this, though it also instruments all memory accesses so it can tell you immediately when a block overrun/underrun happens. In fact, if you're on an amenable platform (Linux or Mac) it might be worth running the C/C++ program under Valgrind in order to find this kind of thing.

  4. Oh, I thought valgrind only did direct checks at each read and write. Cool, thanks for the info!

  5. Hi,

    Here is an interesting app made with Emscripten: http://www-fourier.ujf-grenoble.fr/~parisse/webxcas.html . It's impressive that Emscripten can convert C++ into JavaScript. Its benefits seem to be:

    - Entirely in javascript (no need to worry about security or compatibility)
    - Quite powerful being based on Xcas
    - Simple to use
    - No internet connection needed

    Its only downside seems to be that it doesn't look user friendly yet (the layout seems a bit squashed, the text maybe a bit too small, the appearance possibly a little bland), but these are all minor issues. Also, the generated code is difficult to decipher; it might be worthwhile putting it side by side with the C++ code.

    We think that to use something like that in a JavaScript app, we will need to make a second script that imports all of those functions (factor, solve, etc.). Is that possible with Emscripten?

  6. No one will see your question here, please open a github issue or use the emscripten mailing list (see emscripten.org).
