Friday, March 18, 2011

massdiff - Diff for Massif Snapshots

Massif (part of Valgrind) is a super-useful tool to find out how a program allocates memory. It gives very detailed graphs about exactly which lines of code do allocations that are not free'd later. For example,

96.42% (22,771,482B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->39.75% (9,388,032B) 0x8058AC1: arena_malloc_large (jemalloc.c:3831)
| ->39.75% (9,388,032B) 0x8058DB9: arena_malloc (jemalloc.c:3856)
|   ->34.74% (8,204,288B) 0x8058E98: imalloc (jemalloc.c:3866)
|   | ->34.27% (8,093,696B) 0x805D833: malloc (jemalloc.c:5882)
|   | | ->12.23% (2,887,680B) 0x6BBB63F: sqlite3MemMalloc (sqlite3.c:14221)
  ...

This shows a snapshot at a particular point in time, during which the whole program allocated ~22MB of memory, of which almost 3MB was due to SQLite.

One limitation though is you can't easily see what changes from one snapshot to another, which is important in order to see if there are any gradual memory increases (these may or may not be actual leaks). So I wrote a small python script, massdiff, which diffs two Massif snapshots. Here is an example of the output:


Diffing snapshots 30 50

- (heap allocation functions) malloc/new/new[], --alloc-fns, etc. - 22,286,738
    [ diff: +502,088 ]
+ (heap allocation functions) malloc/new/new[], --alloc-fns, etc. - 22,788,826

-   0x8058927: arena_malloc_small (jemalloc.c:3794) -  7,494,594
      [ diff: +499,552 ]
+   0x8058927: arena_malloc_small (jemalloc.c:3794) -  7,994,146

-     0x8058D9D: arena_malloc (jemalloc.c:3854) -  7,494,594
        [ diff: +499,552 ]
+     0x8058D9D: arena_malloc (jemalloc.c:3854) -  7,994,146

-       0x8058E98: imalloc (jemalloc.c:3866) -  6,055,702
          [ diff: +499,232 ]
+       0x8058E98: imalloc (jemalloc.c:3866) -  6,554,934

-         0x805D833: malloc (jemalloc.c:5882) -  5,552,542
            [ diff: +499,232 ]
+         0x805D833: malloc (jemalloc.c:5882) -  6,051,774

-           0x52D2FA3: js_malloc (jsutil.h:213) -  1,635,252
              [ diff: +432,064 ]
+           0x52D2FA3: js_malloc (jsutil.h:213) -  2,067,316

-             0x52D9D27: JSRuntime::malloc(unsigned int, JSContext*) (jscntxt.h:1358) -  1,635,252
                [ diff: +432,064 ]
+             0x52D9D27: JSRuntime::malloc(unsigned int, JSContext*) (jscntxt.h:1358) -  2,067,316

-               0x52D9DEC: JSContext::malloc(unsigned int) (jscntxt.h:2027) -  1,600,532
                  [ diff: +431,904 ]
+               0x52D9DEC: JSContext::malloc(unsigned int) (jscntxt.h:2027) -  2,032,436

-                 0x5D5B41D: JSObject::allocSlots(JSContext*, unsigned int) (jsobj.cpp:4032) -    161,456
                    [ diff: +344,832 ]
+                 0x5D5B41D: JSObject::allocSlots(JSContext*, unsigned int) (jsobj.cpp:4032) -    506,288

-                   0x5D5B594: JSObject::growSlots(JSContext*, unsigned int) (jsobj.cpp:4078) -    161,456
                      [ diff: +344,832 ]
+                   0x5D5B594: JSObject::growSlots(JSContext*, unsigned int) (jsobj.cpp:4078) -    506,288


The diff only shows what changed between the two revisions, and shows that in in tree format, just like Massif's ms_print does. Each group of three lines shows a line from the first snapshot on top, the corresponding line from the later snapshot on bottom, and between them the difference.

The output here is from loading about:blank 500 times in Fennec. There is overall ~500K of additional allocation (so, ~1K per page load), of which JSObject::growSlots is responsible for ~340K (this is later all deallocated at once, presumably due to GC being run).

So far this has been useful in helping discover one specific case, bug 641663, and I'm still investigating some additional issues. Hopefully it can be a useful tool for other people too. To use it,
  • Run Massif on your program, see here and here
  • Run ms_print on the output
  • Run massdiff on that file, with the two snapshot numbers you want to diff as parameters

3 comments:

  1. The script doesn't work when you take 100 or more snapshot as it matches one or more spaces at the beginning of the snapshot title - instead it should match zero or more spaces.

    i.e.

    # Snapshot title
    - m = re.match(' +([..regex elided...]')
    + m = re.match(' *([..regex elided...]')
    if m:
    snapshot = Snapshot()
    for i in ['n', 'time', 'total', 'useful_heap', 'extra_heap', 'stacks']:

    ReplyDelete
  2. Your script is interesting, but doesn't work when valgrind collects more than 100 snapshots. The problem is on line 42 - when you attempt to match the snapshot title it should have zero or more spaces at the beginning, not one or more.

    ReplyDelete