More about Memory Fragmentation in BDW GC

Paul Bone — Dec 29, 2016

This is a follow up article to Memory Fragmentation in The Boehm-Demers-Weiser Garbage Collector. After writing that article Bruce Hoult pointed out that the article stops before getting to the interesting part. I agree completely, it would be very interesting to see the heap utilisation for different sizes and object kinds over time. That is what I will explore in this article. I suggest reading the first article before reading this one.

Also in this article I have not set a heap size limit and therefore Prince was able to run until completion, rather than crash when it couldn’t allocate new memory. Each snapshot of the heap usage is taken immediately after a collection. The reported times are based upon the elapsed time of the mutator (Prince excluding the collector itself). It’s worth noting that BDW GC is a mark+incremental sweep collector, and that generational collection is not enabled, so the only collections are major collections.

I also ran Prince with a more typical document in addition to the large and complex document from the last article.

The previous article was also investigative: I was trying to answer the question of how can we be out of memory when the heap is far from full? This article is not attempting to answer any specific question. Instead I am using it to demonstrate the kind of information we can extract from the collector once armed with: an understanding of BDW GC’s heap layout, a small change to the collector to make the GC_dump() output more parseable, and a python script to parse that output and generate tables and graphs. It’s worth noting that until about 18 months ago, I had thought that the collector was totally inscrutable. This article shows that that assumption was totally false.

Some basic over time measurements

The first plots show the heap size and utilisation over time.

Utilisation over time - Typical

Utilisation over time - High usage

Early during the program’s execution the reported utilisation is high, however the heap is still expanding and once it becomes larger (allowing less frequent collections) the reported utilisation is lower. The actual average utilisation will be higher, we see a low reported utilisation because the heap information is reported immediately after each collection.

Part-way through the program’s execution there is a drop in used memory and a corresponding drop in total heap size. This is roughly when Prince finishes performing layout and begins writing out the PDF file, it can discard temporally structures used during layout. The reduction in total heap size doesn’t necessarily correspond to memory released back to the OS. Doing so is complicated by a number of factors (including fragmentation) and support for unmapping pages isn’t always compiled into the collector. Instead these blocks are most-likely placed into a block free list, which can be re-used to store allocations of any size less than the block size.

The other odd-looking thing is that for the typical program the total heap usage increases in jumps, rather than smoothly. This is caused by BDW GC tuning heap size, collection frequency and collection pause times. At several times it allows the heap size to grow significantly before performing a full collection.

None of these things indicate a problem, but they’re interesting to see graphed nevertheless.

Block utilisation for different allocation sizes

These graphs show the relative proportions of the total heap size given to objects of different sizes and how that changes over time. Each area in the graph represents objects of different size classes, with the lowest area representing objects greater than 64 bytes. Unfortunately the trends are difficult to interpret because they are accumulated on each lower area in the graph. However it’s the relative areas that I wanted to show. It would be simple enough to make line-graphs from the same data.

Total usage by object size - Typical

Total usage by object size - High usage

Obviously the 16 byte objects take up the largest amount of the heap (almost) all the time in both profiles. Also 48 byte objects take the second largest amount of room in the "High usage" profile. This was true in the data from the previous article, however we can see that it’s true for (almost) all of the program’s runtime.

In both profiles, almost all the time that the total heap size changes significantly it is the combined effect of all the different size classes changing, with two exceptions. First, in the "High usage" profile, the memory used by the "other" sizes decreases while the usage for 16 and 48 byte objects increases dramatically. Inspecting the tabular version of the data (very verbose and not shown here) shows that a single 98MB buffer was freed at this time which caused the drop in the "other" size class. Second, also in the "High usage" profile, the decrease in heap usage in the middle of the execution is mostly due to the decrease in blocks reserved for 16 byte pages, the others contribute, but not as much.

Utilisation for 16 and 48 byte object classes

Let’s look at the same charts, but this time we will show utilisation within the 16 and 48 byte size classes.

Utilisation by object size - Typical

Utilisation by object size - High usage

Well, the first graph looks pretty normal, the utilisation within these to areas is nothing to get excited about, but it increases steadily, there are no dramatic changes. However there are three interesting things in the "High usage" graph.

Easy one first, the utilisation is much worse within the 16 byte blocks than it is in the 48 byte blocks. Even though these are the two largest groups.

The most interesting thing is that at about 7 seconds, the utilisation within the 16 byte objects becomes dramatically worse and stays that way. This is most likely what Bruce suggested, that many 16 byte objects are being allocated, some short-lived and others long-lived but both on the same blocks. After the short-lived objects become garbage, the long lived objects populate these blocks sparsely. Bruce proposed solving this type of problem by allocating one of these two groups from a separate pre-allocated pool, so that they are not intermixed within blocks. This might be worth trying if we wanted to improve memory utilisation.

However this utilisation might not actually be a problem. Recall that these heap snapshots are gathered immediately after a collection. If the allocation rate for 16 objects is very high, higher than other object sizes then this reported utilisation is just a lower bound, and harmless. Before attempting a fix we would want to measure and analyse how utilisation changes between two full collections. Unfortunately doing so involves marking the heap, a major part of doing a collection, this would certainly change the result that we’re trying to measure! We could however measure the allocation rate for different sizes and use that to determine the maximum utilisation (immediately before a collection event).

We would also need to determine what is happening within Prince at this time, and therefore which structures are potentially long or short lived.

Of course, the "simpler" answer is to use a moving collector with better generational support, which would have many additional benefits. This is simple to say, but writing and integrating a new collector into Mercury is a significant amount of work. Nevertheless it’s worth acknowledging that this is a solution to this and other problems.

It’s also worth noting that Bruce was commenting on the poor utilisation of 32 byte objects, rather than the fair utilisation of 16 byte objects. Nevertheless his suggestions apply equally to 16 byte objects.

Utilisation for 32 byte objects

Finally since the first article found a very low utilisation for 32 byte objects I’d like to show their utilisation over time. To make the detail easier to see I’ve shown this on separate graphs.

Utilisation for 32 byte objects - Typical

Utilisation for 32 byte objects - High usage

There’s relatively little to say about these plots. Yes the utilisation is poor, it’s not clear why but I have no reason to doubt Bruce’s suggestion. The only thing that looks a little odd is how jagged the top of the graphs look - there is a lot of variation in the total heap area used for 32 byte objects. This is most likely a factor of how allocation rates for different size classes change over time and add up to total heap capacity (which is when collections are usually triggered).

Final remarks

I set out to demonstrate that BDW GC is not as inscrutable as I had previously thought, if there’s a simple message here it’s that "GC isn’t magical". In doing so I unexpectedly found a potential opportunity to reduce Prince’s memory footprint, although fixing this will still require a fair amount of effort.

I will be updating this article with a link to my python script for generating the graphs and links to BDW GC patches that make this output more parse able.

To use my analyse script with Mercury, start with a Mercury version that includes the above changes to BDW GC (not in master yet). Then be sure not to use the gcd grades, they seem to add extra data to each object but I don’t know why. Instead use a regular gc grade and edit Makefile.direct, look for CFLAGS += -DNO_DEBUGGING and uncomment that line.