Thursday, September 23, 2010

Visualizing IPC Messages in Fennec

More progress on IPC profiling tools, this time a visualization generator. Here is what happens during loading of cnn.com:


On the left you see Fennec before dougt's prefs optimizing patch, and on the right after it. On both sides, the data is as follows:
  • Each line is a single IPC message. Messages are ordered by send time. The width of the line is how long it took until the message was received. The X axis is linear in time.
  • The Y axis also progresses forward in time (since each line is one message, and they are ordered). But it isn't linear (a lot of short messages will take a lot of vertical space, even though little time passed). The numbers show how much time actually passed. So the Y axis really indicates how many messages have been sent.
  • Colors indicate the type of message: green = http, red = prefs, blue = cookies, black = other.
That patch basically removes all the prefs messages, and the result is a 12% speedup (yay!). There were a lot of prefs messages, and even though they were very short and fast to process, they were synchronous (i.e., blocking).

The visualization tool can also show messages for the child and parent process separately (in the image above, they are all shown together).
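
For the curious, the drawing side of this is quite simple with the Python Imaging Library. Here is a rough sketch of the general approach - not the actual profiler.py, and the message tuples are made up - just to show the idea:

    # Rough sketch of the drawing approach (not the real profiler.py).
    # Assumes the log has already been parsed into (send, receive, kind) tuples.
    from PIL import Image, ImageDraw

    # made-up data: (send time, receive time, message kind), in seconds
    messages = [(0.00, 0.01, 'prefs'), (0.02, 0.15, 'http'), (0.03, 0.04, 'cookies')]
    messages.sort(key=lambda m: m[0])       # order by send time

    COLORS = {'http': 'green', 'prefs': 'red', 'cookies': 'blue'}
    WIDTH, ROW = 800, 2
    start = min(m[0] for m in messages)
    scale = WIDTH / (max(m[1] for m in messages) - start)   # X axis is linear in time

    img = Image.new('RGB', (WIDTH, ROW * len(messages)), 'white')
    draw = ImageDraw.Draw(img)
    for i, (send, receive, kind) in enumerate(messages):
        x0 = (send - start) * scale         # line begins when the message is sent
        x1 = (receive - start) * scale      # ...and ends when it is received
        y = i * ROW                         # one row per message: Y counts messages
        draw.line([(x0, y), (x1, y)], fill=COLORS.get(kind, 'black'), width=ROW)
    img.save('both.png')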

Instructions for using it:
  • Apply the patch.
  • Run Fennec with something like this (from $objdir/dist/bin):
    • MOZ_IPC_MESSAGE_LOG=1 MOZ_REDIRECT_CHILD_STDOUT=co MOZ_REDIRECT_CHILD_STDERR=ce ./fennec www.cnn.com &> o
  • Run the visualizer with something like this (you need the Python Imaging Library installed):
    • python ../../../mozilla-central/tools/ipcprof/profiler.py o ce
  • Images should have been created: parent.png, child.png, and both.png.
A perhaps more easily viewable image can be found here.

    Wednesday, September 15, 2010

    Profiling IPC Messages in Fennec

    Bug 596725 has a Python script that can profile the IPC messages log. Here's the output when run on icanhascheezburger: link. Some first impressions:
    • A few sync/rpc methods are used quite frequently, like preferences, cookies, permissions. Thankfully these complete very quickly. Still, they might be worth optimizing.
    • Lots of http is going on, so lots of http messages. Maybe nothing we can do about that?
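
    The script itself is attached to the bug, but the core of this kind of profiling is just counting messages per type and flagging the synchronous ones. A rough sketch (the log line format here is invented - the real MOZ_IPC_MESSAGE_LOG output looks different):

        # Rough sketch of per-message-type aggregation. The log format assumed
        # here is invented; the real MOZ_IPC_MESSAGE_LOG output is different.
        import sys
        from collections import Counter

        counts = Counter()
        sync_messages = set()
        for line in open(sys.argv[1]):
            # assumed format: "SEND sync PPreferences::Msg_getPref ..."
            parts = line.split()
            if len(parts) < 3 or parts[0] != 'SEND':
                continue
            kind, name = parts[1], parts[2]
            counts[name] += 1
            if kind in ('sync', 'rpc'):
                sync_messages.add(name)

        for name, n in counts.most_common(20):
            print('%6d  %s%s' % (n, name, '  (sync/rpc)' if name in sync_messages else ''))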

    Sunday, September 12, 2010

    Emscripten Updates

    Minor note: Updates about Emscripten will be on my other blog - fits in better with the theme over there. Though I do believe the two should converge, when all is ready.

    Thursday, September 9, 2010

    Tools For Debugging Wakeups

    Removing unnecessary CPU wakeups in Fennec is very important, in order to reduce the amount of power used. The main tool used for that is powertop, which basically tells you how many wakeups each process has. But how can you tell what specifically in your app is causing those wakeups?

    There doesn't seem to be a simple solution for that. One idea is to use the debugger, manually. Perhaps that can be automated using debugger scripts, but I'm not sure, and in any case it would be very specific to the debugger and OS.

    So I've tried a different route - automatic rewriting of the source code. I've got it sort of working so far. The idea is this:
    • First you apply this patch. It modifies some critical files, in particular in the condition variable handling code and in the timer thread code. It adds a parameter to both, so that when you call them, you must identify (by a string) who you are.
    • Then, you run python timeout_rewriter.py (that script is included in the patch). This goes through the entire codebase, rewriting all the calls to condition variables (PR_WaitCondVar etc.) and timer creation (->initWithCallback etc.) to add that parameter. It modifies both C++ and JS code. Over 100 files are modified. How it does the rewriting is sort of hackish: It doesn't use anything fancy, just some heuristics derived from trial and error ;)
    • Then you build Firefox normally, and run it. All timeouts will then be printed with explanations, for example:
    -------------------------------------------------

    WaitCondVar: ./js/src/xpconnect/src/xpcjsruntime.cpp:820 (0.818000 seconds since previous; timeout: 1000)
    WaitCondVar: ./js/src/xpconnect/src/xpcjsruntime.cpp:820 (1.000000 seconds since previous; timeout: 1000)
    TimerThread: ./dom/base/nsJSEnvironment.cpp:3733
    WaitCondVar: ./xpcom/threads/TimerThread.cpp:347 (0.000000 seconds since previous; timeout: 997)
    WaitCondVar: ./js/src/jstask.cpp:95 (0.058000 seconds since previous; timeout: -1)
    WaitCondVar: ./js/src/jstask.cpp:95 (0.045000 seconds since previous; timeout: -1)
    WaitCondVar: ./js/src/xpconnect/src/xpcjsruntime.cpp:820 (0.825000 seconds since previous; timeout: 1000)
    TimerThread: ./content/events/src/nsEventStateManager.cpp:793
    WaitCondVar: ./xpcom/threads/TimerThread.cpp:347 (0.000000 seconds since previous; timeout: 329)
    TimerThread: ./mobile/components/SessionStore.js:345
    WaitCondVar: ./xpcom/threads/TimerThread.cpp:347 (0.329000 seconds since previous; timeout: 532)
    -------------------------------------------------


    Explanations:
    • The 1st line shows that the JS watchdog thread woke up - you can just look at the source line mentioned to see what it is. Additional info is that the previous CPU wakeup was ~0.8 seconds before it, and that this particular call to WaitCondVar was with a timeout of 1 second (so, it would like to be woken up again in 1 second).
    • The 2nd line shows the same place waking up, after 1 second.
    • The 3rd line shows the DOM GC timer being fired. The next line is the condition variable that the timer thread uses.
    • The 5th line is the JS background thread.
    • etc.
    I might write some tools to parse the output (like profilers do), but so far it already seems useful. Some conclusions: when the browser is idle, we have two timers firing every 1/5 of a second, the IdleService (should be fixed soon) and nsEventStateManager (need to look into that).
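
    A first cut at such a tool could simply count wakeups per source location. A rough sketch that parses output in the format shown above:

        # Rough sketch: count CPU wakeups per source location, given
        # output in the WaitCondVar/TimerThread format shown above.
        import re, sys
        from collections import Counter

        wakeups = Counter()
        line_re = re.compile(r'^(WaitCondVar|TimerThread):\s+(\S+)')
        for line in open(sys.argv[1]):
            m = line_re.match(line.strip())
            if m:
                wakeups[m.group(1) + ' ' + m.group(2)] += 1

        for location, count in wakeups.most_common():
            print('%6d  %s' % (count, location))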

    The reasons for rewriting the source code automatically are that, as mentioned, it's over 100 files - so a lot to do by hand. But also, we don't want those changes to go into our actual tree (even with ifdefs, it would be horrible), and maintaining such a big patch as the actual tree changes would be very problematic. So automatic source code rewriting seemed the best approach. Just remember to throw away all those changes, and not commit them by mistake ;)
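
    To give a feel for what the rewriting looks like, here is a toy version of the idea for one case only - the real timeout_rewriter.py is in the patch and handles many more cases. Note that the extra string argument exists only because the patch changes the PR_WaitCondVar signature; stock NSPR does not take it.

        # Toy illustration of the automatic rewriting idea, for one case only.
        # Assumes the patched PR_WaitCondVar accepts an extra trailing string
        # identifying the caller (a signature change made by the patch, not
        # something stock NSPR supports).
        import re, sys

        def rewrite(path):
            lines = open(path).readlines()
            out = []
            for i, line in enumerate(lines, 1):
                # crude heuristic: tack a "file:line" string onto single-line
                # PR_WaitCondVar(...) calls
                out.append(re.sub(r'PR_WaitCondVar\((.*)\)',
                                  r'PR_WaitCondVar(\1, "%s:%d")' % (path, i),
                                  line))
            open(path, 'w').write(''.join(out))

        for path in sys.argv[1:]:
            rewrite(path)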

    Wednesday, September 8, 2010

    iX script

    I wrote a small script for investigating xpconnect objects at runtime, iX (investigate Xpconnect).

    If you paste that code in a JavaScript file, and run iX(something), it will dump all the info it can find about the object: all regular JavaScript properties, as well as all valid interfaces that you can QueryInterface and GetInterface on that object, and also the node hierarchy. And it tries not to crash when doing so.

    (There is probably already something to do this that I am not aware of.)

    Tuesday, August 3, 2010

    Mozilla Try Servers on Demand?

    Has having try servers run on demand, say in Amazon EC2, been discussed in the past? Seems like it might make sense. We'd be able to have as many try servers as we need (especially important when there is a lot of load, like before a code freeze - but also in general, to reduce wait times). And we'd be able to not run any try servers when we don't need them.

    I guess it can't handle all cases - EC2 can't run all OSes, and the machines are servers, so no normal screen rendering, etc. Also their hardware and performance might not be standardized enough for performance results to be informative. But certainly for building and for a lot of the tests (xpcshell, etc.), it seems like it could speed development up. Maybe building could be sent to EC2 servers on demand, and the results sent back to normal servers for the tests?

    Monday, August 2, 2010

    Power Update

    An update on power-related stuff:
    • Patch to remove the 1/sec JavaScript timer should be pushed soon. This is one of the biggest sources of wakeups, and can drain your battery over time, so fixing this is very cool.
    • Patch to fix unnecessary image animation timers awaiting review. On certain websites, this can leave many, many useless timers, so on those websites this is even more important than the previous point.
    • Still working on the patch for grouping timers - rewrote it with a new and better approach; it works well locally, but I'm seeing some odd oranges on the try server, not sure if they're random or not. Once this works, it will reduce the number of wakeups due to animations, probably by something like 10-20%.
    • Patch with a nice scriptable API for freezing/throttling the content process waiting for feedback. Two main uses:
      • Freezing the content process when the OS tells us to. Still working out what the various OSes want us to do and when, see wiki page here.
      • Throttling websites that use too much power. As a first iteration of this idea, I was thinking about an addon that has a blacklist of websites that are known CPU hogs, and how much to throttle them.
    • Not strictly a power-related issue, but preloading the browser with SIGSTOP/SIGCONT is also being worked on. What is really nice is that this lets the browser start almost instantly from the user's perspective, while keeping it frozen until then prevents it from using any power in the meantime. Still need to work out the details of this though.

    Tuesday, July 13, 2010

    Experiments with 'Static' JavaScript: As Fast As Native Code?

    I don't really know much about computer language theory. I just mess around with it in my spare time (weekends mostly). Here is one thing I've been thinking about: I want the speed of native code on the web - because I want to run things like game engines there - but I don't want Java, or NaCl, or some plugin. I want to use standard, platform-agnostic web technologies.

    So, how about if we could compile JavaScript into native code? Of course, in general we can't, at least not very well. But if a JavaScript program - or part of a program - happens to be implicitly statically typed, and in other ways 'performance-friendly',  then we should be able to compile at least such code and run it very quickly.

    In fact PyPy does basically that with RPython - a subset of Python that it can translate into C. So, building on that, I made a demo of the following process:
    • Begin with some benchmark in JavaScript
    • Parse the JavaScript using the SpiderMonkey parser (a small hack of the code was needed in order to get it to dump the syntax tree in a nice format)
    • Translate the parsed code into Python (using a Python script, see below)
    • If the generated code happens to be RPython, happily run that through PyPy's RPython translator to get native code, and execute that
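    To give a feel for the translation step, here is a hand-written illustration of the kind of mapping involved (this is not actual output of my script, and the fannkuch kernel is simplified):

        # Hand-written illustration of the JS -> Python mapping (not actual
        # output of the translator).  JavaScript like
        #
        #   function countFlips(perm) {
        #     var flips = 0;
        #     while (perm[0] != 0) {
        #       var k = perm[0];
        #       for (var i = 0, j = k; i < j; i++, j--) {  // reverse first k+1
        #         var t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        #       }
        #       flips++;
        #     }
        #     return flips;
        #   }
        #
        # maps naturally onto Python that is implicitly statically typed
        # (perm is always a list of ints, flips always an int), which makes
        # it a candidate for the RPython subset:

        def countFlips(perm):
            flips = 0
            while perm[0] != 0:
                k = perm[0]
                i, j = 0, k
                while i < j:                  # reverse the first k+1 elements
                    perm[i], perm[j] = perm[j], perm[i]
                    i += 1
                    j -= 1
                flips += 1
            return flips

        print(countFlips([4, 2, 1, 5, 3, 0]))    # prints 4
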
    The code I wrote for this is horribly ugly, but if you must look at it, here it is (if I decide this idea is worth doing any more work on, I will almost certainly rewrite it from scratch). Here are the results for the fannkuch benchmark (run on the value n=10, on a Core 2 Duo laptop):


    The first row in the chart shows a C++ implementation of fannkuch, compiled with -O3. All the other rows start with a JavaScript implementation and proceed from there. First there is simply the result of running V8 and SpiderMonkey on the benchmark. Then, the results of automatically converting the JavaScript to Python and running CPython on it are shown - very slow. Finally, that Python is run through PyPy's RPython translator, which generates native code, which runs very fast.

    The final result is that the JavaScript => Python => PyPy pipeline runs the code only 26% slower than a C++ implementation of the same benchmark compiled at -O3. For comparison, V8 run on that same JavaScript is 3.76 times slower. So, it seems there is potential here to get (some) JavaScript running very fast, pretty much as fast as possible.

    Of course, there is a big startup cost here - the conversion process takes a lot of time (at least PyPy is fun to watch as it runs ;). And JavaScript on the web needs to start up quickly. But there should be technical solutions for this, for example, running the code normally while trying the conversion process in a separate thread (so, running on another CPU core). If the process finishes successfully, swap out the code for the faster version (or if you can't hot-swap it, at least remember and use the compiled version next time you visit that website - would still help for websites you visit a lot).
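
    A crude sketch of that hot-swap idea (in Python, purely to show the shape of it - in a browser this would of course live inside the JS engine):

        # Crude sketch of the "compile in the background, swap in when ready"
        # idea.  Pure illustration; in a browser this would live inside the
        # JS engine, not in Python.
        import threading

        class HotSwappable(object):
            def __init__(self, interpreted_fn, try_compile):
                self.fn = interpreted_fn                  # slow but available now
                t = threading.Thread(target=self._compile, args=(try_compile,))
                t.daemon = True
                t.start()

            def _compile(self, try_compile):
                fast = try_compile()                      # slow: parse, translate, compile
                if fast is not None:
                    self.fn = fast                        # swap in the fast version

            def __call__(self, *args):
                return self.fn(*args)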

    So, as I said this is just a little demo I worked on in my spare time, to see if it can even work, and to learn more about the topic. It seems to sort of work actually (I was not very optimistic before I started), but I'm still unsure if it makes sense to do. Feedback is welcome.

    (I also have a few other experiments I'm working on, that are closely related, and involve LLVM. I'll leave those for a future post.)


    Edit: Jernej asked in a comment about Rhino. So for comparison I also ran this code in Rhino and Jython. Rhino took 22.51 seconds - more than twice as slow as SpiderMonkey - and Jython took 74.53 seconds, closer to the slowness of CPython than the next slowest result (Rhino). Neither of these - Rhino or Jython - is particularly fast in general, but they do give other significant benefits (integration with the JVM, proper multithreading, security, etc.).

    Edit 2: Following comments by Sayre here and dmandelin elsewhere, here are the results of JaegerMonkey on this benchmark: 4.62 seconds. So, much faster than TraceMonkey, and almost as fast as V8 (very cool!).

    Thursday, June 17, 2010

    Power Measuring & Saving

    With help from some friendly people in the Maemo IRC (thanks SpeedEvil and DocScrutinizer51!), I was able to get realtime power usage measurements from Fennec on the N900. The goal was to see how much clumping together CPU wakeups would help save power, the idea being that even if the same amount of CPU calculation is done, doing it together in 'clumps' is better than spread out, as each wakeup means the CPU will be in a non-sleep mode for longer. Here are the figures:


    (click for fullsize version)

    The settings are:
    • Baseline: Fennec is not run at all. This is for comparison, to see how much power is drawn by the N900 when idle
    • Fennec 200 clumped wakeups: 200 wakeups/second are scheduled, and all of them are scheduled at the same time. In other words, 200 tiny operations are done, then the CPU can sleep for almost 1 second.
    • Fennec 200 separate wakeups: As before 200 wakeups/second are scheduled, but at equal intervals over the 1-second period. So the longest the CPU has to sleep is about 1/200th of a second.
    • All settings were tested both with the screen off (left chart) or screen on (right chart).
    Why were 200 wakeups used here? No special reason. As an absolute # of wakeups it's a large number - few websites would have that many animated images/setIntervals()/etc. - but these wakeups were very very short (just incrementing a counter). So this could be similar to having fewer but longer wakeups. But in any case, the results should be treated as just a general indication. The goal is to see whether reducing timeouts is worth doing or not, and not to calculate exactly how much it will improve battery life (that would depend on the specific device, websites visited, usage style, how much clumping we do, etc. etc.). Note also that clumping has a price - it adds latency, as timers don't fire at their normal times. When it is worthwhile, and how much, is a hard question.

    For now, though, the results show that clumping wakeups can be noticeable: 38% savings when the screen is off, and 8% when on. As expected much less is saved when the screen is on, because the screen itself drains a lot of power (so our savings become less significant). Which of the 38% or 8% figures is more relevant is a good question - it depends on how often the user takes a 'break' from browsing (so the screen goes off) but leaves the browser on. But anyhow, again, these figures can't be taken literally.

    FWIW, to put this in battery life terms, if you save 38% of your power drain then your battery life would be 60% longer, and if you save 8% of your power drain then your battery life would be 8% longer. Again, this is just a general indication, but I think it does show that implementing timeout clumping (properly) in Fennec can be useful.
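
    (That conversion is just battery life factor = 1 / (1 - fraction of power saved); a quick check:)

        # battery life factor = old drain / new drain = 1 / (1 - savings)
        for savings in (0.38, 0.08):
            factor = 1.0 / (1.0 - savings)
            print('%.0f%% power savings -> %.1f%% longer battery life'
                  % (savings * 100, (factor - 1) * 100))
        # prints 61.3% and 8.7%, i.e. roughly the 60% and 8% quoted above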



    Other Recent Work

    I've also submitted patches for some other bugs related to saving power:
    • 567339 / 359608 - A longstanding issue where image animation timers continue to fire even after browsing to another page. The patch uses the refresh driver to figure out when to suspend timers (but it doesn't use the refresh driver for the actual animations, which might cause more wakeups as it uses a fixed 50Hz timer).
    • 568730 - Stop the JavaScript engine's GC (& other stuff) timer when no JS is running, by suspending it if enough time has passed since the last JS call context was exited.

    EDIT: Improved charts

      Friday, May 28, 2010

      Powersaving Bughunting

      I spent most of yesterday working on Bug 567339, which is about removing unnecessary CPU wakeups in order to save power (which is very important on mobile devices - the fewer times the CPU is woken up from sleep, the longer your battery will last, since each wakeup disturbs sleep mode for a short while after the actual activity).

      So, long story short, it seems the bug is that timers for image animations are not canceled. In other words, if you view a page with animated images, a timer is created for each image and it does the rendering at the proper rate, but the timer keeps running after the image is no longer needed. So you keep getting CPU wakeups at the animation rate, wasting power.

      (This is actually an issue on normal Firefox as well, not just mobile - some CPU cycles are being spent on these timers. It is probably negligible, but still wasteful.)

      The longer story of hunting the cause of the bug is as follows:
      • A helpful person reports the bug, and mentions a suspicion of https being the cause (since enabling/disabling https affects the symptoms). At issue here are 4 wakeups/second which should not be happening.
      • Trying to reproduce the bug, it turns out that it happens in nightly builds, but not when building from source in trunk. Hmm.
      • Trying to figure out the difference, one noticeable change between the nightly builds and building from source is that the start page shows some suggestions of addons in the nightly, but building from source does not. Turns out this is because /trunk reports itself as 2.0, and no addons exist for that yet. Changing the reported version to 1.1 leads to addons being suggested, and the bug in fact occurs.
      • Since fetching addon suggestions is done from AMO through https, and the original bug report strongly suspected https as the culprit, it would seem we are almost done here. Or are we?
      • Actually inspecting all the wakeups (this is where I spent most of my time on this bug; I put up a wiki page with some details of the methodology I used) eventually reveals that the vast majority of wakeups are in fact of imgContainer::notify()... and not https or anything network related.
      • To confirm this, removing the animated image that was shown when loading addon suggestions removes the wakeups... and indeed, the image had 4 frames/second, exactly the number of wakeups we were hunting!
      Still working on figuring out why this happens and how it can be resolved.

        Monday, May 24, 2010

        How to copy files to an N900

        Today was my first interaction with an N900, and it was not smooth at all. After several hours of messing around, it seems the 'proper' way to copy files from your desktop to the N900 is:
        • Connect N900 via USB, in 'mass storage' mode
        • Open the mounted drive in your desktop, and copy files
        • Unmount the drive
        • Disconnect the USB cable
        • If you have a terminal open in the N900, close it
        • Open a terminal in the N900 and go to ~/MyDocs. The files you transferred will be there.
        Thanks go out to the people that helped me on IRC.

        Wednesday, May 19, 2010

        Electrolify()?

        I proposed an idea for e10s-enabling existing code - that is, making existing code work with the new electrolysis framework in Firefox, which splits things into separate processes. A bug to track the issue is here, and it includes a working patch I wrote (well, working just well enough to show the kind of JavaScript code this approach would allow).

        Friday, May 7, 2010

        Lesson #1: Improper XPCOM definitions can lead to silent failure

        I've been tasked with implementing electrolysis (e10s) support for FTP. That means letting child processes (content) send requests to the parent process (chrome), which actually does the networking - while from the perspective of the child process, the interface to an FTP channel is the same.

        As this is just my second week at Mozilla, I am still learning the build process, fundamental technologies like XPCOM, and other stuff about the codebase. So I ran into some problems, which I'll mention here on the blog in hopes that they help other people that search for the same keywords.

        So, lesson #1 is: Improper XPCOM definitions can lead to silent failure.

        Specifically, the issue was that this:

          NS_IMPL_ADDREF_INHERITED(FTPChannelChild, FTPBaseChannel)
          NS_IMPL_RELEASE_INHERITED(FTPChannelChild, FTPBaseChannel)

          NS_INTERFACE_MAP_BEGIN(FTPChannelChild)
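            // nsIRequest is a parent interface of nsIChannel, but it still needs
            // its own entry here - this is the line that was missing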
            NS_INTERFACE_MAP_ENTRY(nsIRequest)
            NS_INTERFACE_MAP_ENTRY(nsIChannel) 
            NS_INTERFACE_MAP_ENTRY(nsIFTPChannel)
          NS_INTERFACE_MAP_END_INHERITING(FTPBaseChannel)

        was needed, and I was missing the nsIRequest line. I missed it because nsIRequest is a parent interface, and I didn't realize it was also necessary. And, indeed, when such an object is sent through JavaScript to a function that expects to receive an object with that interface, it fails. What was tricky was that it failed silently - I called a JavaScript function that was known to work, and it simply returned a failure code without even starting the function. Only by stepping through all the JS embedding code did it become clear to me that a call to QueryInterface was failing, so the JS embedding code was failing to set up the parameters for the JS function, and never even calling it.