Paul Bone

Memory Fragmentation in The Boehm-Demers-Weiser Garbage Collector

The Boehm-Demers-Weiser Garbage Collector (BDWGC) is a garbage collector for "uncooperative environments", usually meaning C/C++ and other languages that don’t provide type information. To quote their website:

The Boehm-Demers-Weiser conservative garbage collector can be used as a garbage collecting replacement for C malloc or C++ new.

and…​

The collector is also used by a number of programming language implementations that either use C as intermediate code, want to facilitate easier interoperation with C libraries, or just prefer the simple collector interface.

In Mercury's case we started using BDWGC because it worked well enough, and we simply haven't yet implemented a better collector ourselves.

However, this convenience comes with costs. BDWGC is a non-moving collector: without type information it is not safe to move objects, because moving an object requires updating every pointer to it found in other objects (something copying collectors must do). Non-moving collection can lead to memory fragmentation.

Before we get started I'd like to provide some taxonomy and definitions for especially curious readers; these details aren't essential to the article, so I'll keep them brief. BDWGC is a mark-sweep collector: the mark phase stops the world and can perform parallel marking, while the sweep phase can run lazily. There is some support for incremental collection, but I'm not sure whether it works with a parallel mutator. BDWGC is also a conservative collector: it will never deallocate a reachable object, but it might fail to deallocate a genuinely unreachable one.

I work for YesLogic; we develop Prince, an HTML→PDF tool written in Mercury. In Prince's case memory usage is usually pretty reasonable, but occasionally a customer sends us a large document and asks us if we could reduce memory usage. In this case the document is 98MB of uncompressed HTML; it generates a 1,500-page PDF using 1.5GB of memory. 1.5GB is not a lot of memory by today's standards, especially for the kind of work that Prince is doing for this document. However, if we want to run multiple Prince instances on a busy server, or run Prince alongside other software, then reducing memory usage is a good thing. It would also allow the client to process larger and larger documents. My boss Michael wrote "…​it would be nice if we could use 1GB max, ideally.".

Mercury has some memory profiling support that will dump information about the reachable objects and their types to a file (report_memory_attribution/4). Michael used this and found that the total size of all reachable objects (736MB) was less than the restricted heap size (1GB) just before the program crashed when it could not allocate a 160-byte object. The conservative nature of BDWGC cannot account for this: there were only 736MB of reachable objects, not 1GB, and yet the program could not satisfy a memory allocation request and crashed.

Heap organisation

To understand why, we need to understand heap organisation in BDWGC. The heap is divided into 4KB-sized and -aligned blocks, and each block is used for a single object type and size; multiple blocks may be used for each type and size combination. Object types are: pointer-free, normal, and uncollectable. Sizes are first rounded up to the nearest 16 bytes (on 64-bit systems) and then may be rounded up further so that objects of similar sizes can be allocated together, reducing waste. This arrangement is very convenient: it allows an object's size to be determined by rounding its pointer down to its heap block address and inspecting the block's header, as all the objects in that block have the same size. Other information is also located in the header, including the mark bits.
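
The size-class rounding and the pointer-to-block mapping can be sketched as follows. This is a minimal model for illustration, not BDWGC's actual code; the 16-byte granule and 4KB block size are the figures from the text above.

```python
BLOCK_SIZE = 4096   # blocks are 4KB sized and aligned
GRANULE = 16        # smallest size class on 64-bit systems

def size_class(request):
    # round a request up to the nearest 16-byte granule; BDWGC may
    # round further so that objects of similar sizes share a class
    return ((request + GRANULE - 1) // GRANULE) * GRANULE

def block_base(addr):
    # an object's block header is found by rounding its address down
    # to the 4KB block boundary; every object in that block has the
    # same size, which is recorded in the header
    return addr & ~(BLOCK_SIZE - 1)
```

For example, a 10-byte request falls in the 16-byte class, and an object at address 0x1234 lives in the block based at 0x1000.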

Allocation is done using free lists, one free list per size and type. If the free list for the desired size and type is empty, the collector will try to get a free block, initialise it, and add it to the free list for that size and type. If there are no free blocks, a collection is triggered. The collector may also perform lazy sweeping, but I'm skipping that detail.
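
The allocation path can be sketched with a toy model (not BDWGC's API): one free list per size class, refilled by carving a fresh block into equal-sized cells. Triggering a collection when no free blocks remain is not modelled here.

```python
BLOCK_SIZE = 4096

class ToyHeap:
    """Toy model of free-list allocation, one free list per size class."""

    def __init__(self):
        self.free_lists = {}    # size class -> list of free cell addresses
        self.next_block = 0     # address of the next fresh block

    def _carve_block(self, size):
        # initialise a fresh block by splitting it into equal-sized cells
        base = self.next_block
        self.next_block += BLOCK_SIZE
        return [base + off for off in range(0, BLOCK_SIZE - size + 1, size)]

    def alloc(self, request):
        size = ((request + 15) // 16) * 16          # round to the size class
        cells = self.free_lists.setdefault(size, [])
        if not cells:
            # empty free list: initialise a fresh block for this size;
            # the real collector would try sweeping or a full collection
            # before growing the heap
            cells.extend(self._carve_block(size))
        return cells.pop()
```

In this model, allocating 256 16-byte objects consumes exactly one 4KB block; the 257th forces a new block.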

Heap usage or "Where has my memory gone?"

Let's get an idea of how we're using the heap, starting with the distribution of object sizes and the number of blocks used for each object size.

Table 1. Heap usage by object size

Object size (bytes) | No. objects | Size of objects | No. blocks | Size of blocks | Utilisation
16    | 23,254,210 | 354MB | 146,825 | 573MB | 61%
32    |    747,766 | 22MB  |  16,185 | 63MB  | 36%
48    |  4,258,800 | 194MB |  54,243 | 211MB | 92%
64    |    268,858 | 16MB  |   6,070 | 23MB  | 69%
80    |      7,589 | 592KB |     150 | 600KB | 98%
96    |         18 | 1KB   |       2 | 8KB   | 21%
112   |    410,040 | 43MB  |  11,391 | 44MB  | 98%
128   |      2,599 | 324KB |      89 | 356KB | 91%
144   |     68,513 | 9MB   |   2,447 | 9MB   | 98%
160   |    401,600 | 61MB  |  16,064 | 62MB  | 97%
176   |     95,011 | 15MB  |   4,132 | 16MB  | 98%
192   |      2,520 | 472KB |     120 | 480KB | 98%
208   |      7,503 | 1MB   |     395 | 1MB   | 96%
224   |        504 | 110KB |      28 | 112KB | 98%
240   |      7,513 | 1MB   |     442 | 1MB   | 99%
256   |      2,016 | 504KB |     127 | 508KB | 99%
272   |     17,010 | 4MB   |   1,134 | 4MB   | 99%
288   |      3,010 | 846KB |     215 | 860KB | 98%
320   |        504 | 157KB |      42 | 168KB | 93%
336   |      1,500 | 492KB |     125 | 500KB | 98%
352   |      1,001 | 344KB |      91 | 364KB | 94%
448   |        501 | 219KB |      57 | 228KB | 96%
512   |          6 | 3KB   |       1 | 4KB   | 75%
800   |      7,053 | 5MB   |   1,780 | 6MB   | 77%
1,024 |          1 | 1KB   |       1 | 4KB   | 25%
1,344 |          0 | 0B    |       1 | 4KB   | 0%
2,048 |          2 | 4KB   |       1 | 4KB   | 100%
2,064 |          8 | 16KB  |       8 | 32KB  | 50%
2,736 |          1 | 2KB   |       1 | 4KB   | 66%
3,584 |          2 | 7KB   |       2 | 8KB   | 87%
6,160 |          4 | 24KB  |       4 | ?     | ?
8,464 |          7 | 57KB  |       7 | ?     | ?

Small objects (16 and 48 bytes) dominate the heap. Objects 48 bytes and smaller account for 95% of the objects by count, or 77% by total size, and 83% of the heap is usable only for objects of these sizes. This is not very surprising, at least for a language like Mercury where many objects are only a few words long. For example, a cons cell is the most common object (according to the heap attribution profile) and is exactly 16 bytes long.

These sizes also have some of the lowest heap utilisation percentages; utilisation is the percentage of used space within the blocks for objects of a given size. A low percentage indicates a lot of empty space that can only be used by objects of that size. The question marks indicate that I don't know how BDWGC stores large objects, and so I don't know the sizes of their heap blocks.
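
Utilisation here is simply the live bytes in a size class divided by the bytes occupied by its blocks. Checking a couple of Table 1's rows:

```python
BLOCK_SIZE = 4096

def utilisation(num_objects, object_size, num_blocks):
    # fraction of the blocks' space actually occupied by objects
    return (num_objects * object_size) / (num_blocks * BLOCK_SIZE)

# 16-byte row of Table 1: 23,254,210 objects in 146,825 blocks
u = utilisation(23_254_210, 16, 146_825)
print(f"{int(u * 100)}%")  # 61%, matching the table
```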

Recall that the program attempted to allocate a 160-byte object. The utilisation of the 160-byte size class was 97%; this figure is an artifact of how I collected the data, since I've coalesced blocks of different types together in these tables. The program crashed when the collector could not find a block of the correct size and type, and there were no empty blocks that it could initialise for that size and type.

Well, that's the explanation of why the program crashed, a bit dull. The more interesting story is what is going on with the small object sizes. We can't allocate 160 bytes and yet there's 288MB free, 277MB of which is reserved for objects 48 bytes or smaller! What's going on with the spare memory in those smaller objects' blocks?

Block usage distribution for 16 byte objects

Block usage distribution for 32 byte objects

Block usage distribution for 48 byte objects

The first thing to note about these histograms is that I've had to scale the graphs: for example, there are 78,355 blocks with 256 objects in the first graph. You can see a thin bar on the far right of the graph; it extends far beyond the top border, and scaling the graph and cutting it off was necessary to see the rest of the data.

Excluding the full blocks, the block utilisation looks like a normal distribution, skewed towards mostly empty blocks for 16 and 32 byte objects, and towards mostly full blocks for 48 byte objects. Blocks for larger objects haven't been shown; they are generally full or close to full. There are no empty blocks in any size category; presumably they are returned to the free block list immediately when they become free. The following table provides a little more information about block utilisation.

Table 2. Block usage statistics

Object size | Max objs/block | Partially full blocks | % of blocks partially full | Mean objects | Mean utilisation
16 | 256 | 68,470 | 46% | 46 | 18%
32 | 128 | 12,501 | 77% | 22 | 17%
48 |  85 | 14,710 | 27% | 61 | 72%

The graphs show that there are a number of blocks with very few 16 and 32 byte objects on them. We know that there are 160 contiguous free bytes within these blocks, yet they're totally unavailable for the allocation of 160-byte objects: while there is even a single object in a block of a given size, that block cannot be used for objects of other sizes. It might be tempting to call this memory "wasted", however it is still available for objects of the correct size. It is, however, inefficient: roughly 219MB of memory could be made available for blocks of practically any size if there were no fragmentation within the 16-byte blocks.
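
The 219MB figure can be reproduced from Table 2's numbers for 16-byte objects: if the live objects in the partially full blocks were packed densely, most of those blocks could be released.

```python
BLOCK_SIZE = 4096
objs_per_block = 256        # 16-byte objects per 4KB block
partial_blocks = 68_470     # partially full blocks (Table 2)
mean_objs = 46              # mean live objects in each of them

live = partial_blocks * mean_objs                 # live objects to repack
needed = -(-live // objs_per_block)               # ceiling division: blocks after packing
freed = (partial_blocks - needed) * BLOCK_SIZE    # bytes released
print(freed // (1024 * 1024))                     # 219 (MB)
```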

What can we do about fragmentation

The simple answer is to use a moving collector. The simplest moving collector, a semi-space copying collector, uses twice as much memory as the program's working set. While it's a reasonable GC algorithm, it would not help in this particular situation, as the original request was to reduce exactly this kind of overhead. A mark-compact collector is a moving collector that would address this specific fragmentation problem. Furthermore, it's easy to imagine keeping the same heap organisation and compacting only objects of the same size. This can enable the use of a much faster compaction algorithm.
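
A sketch of why same-size compaction can be fast: because every cell in a size class has the same size, compaction is just renumbering live cells densely and recording a forwarding table for the pointer-update pass. This is an illustrative model, not an algorithm BDWGC implements.

```python
def compact_size_class(blocks, cells_per_block):
    # blocks: dict of block id -> set of occupied cell slots.
    # Returns the compacted layout and a forwarding table mapping old
    # (block, slot) locations to new ones. With equal-sized cells there
    # is no size arithmetic and no best-fit search: live objects simply
    # slide into the lowest free cells.
    live = [(b, s) for b in sorted(blocks) for s in sorted(blocks[b])]
    forwarding, layout = {}, {}
    for i, old in enumerate(live):
        new = (i // cells_per_block, i % cells_per_block)
        forwarding[old] = new
        layout.setdefault(new[0], set()).add(new[1])
    return layout, forwarding
```

For example, three live objects scattered over two four-cell blocks compact into a single block, freeing the other.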

Using a moving collector in an "uncooperative environment" isn’t possible, and doing so for Mercury would be a lot of work. So what can we do with BDWGC?

  • Arrange free lists to allocate from mostly-full blocks first, thereby reducing fragmentation. It's possible that BDWGC already does this; I haven't checked. This would need to interact with lazy sweeping.

  • If a request fails, and continues to fail after lazy sweeping and a full collection, then round the request size up and retry it. This doesn't address fragmentation; it would be a work-around. I attempted to implement this but it led to terrible performance: after 12 hours the program was still running and I terminated it. Perhaps I introduced a bug, or perhaps it quickly led to repeated redundant full collections that caused performance to plummet.

  • Another work-around is to pre-reserve more blocks of this object size. This is workload-specific, and we would probably need to do it for multiple object sizes. The intuition is that fewer blocks for smaller objects would be allocated, and fragmentation would be reduced for those object sizes.

At YesLogic, we stopped pursuing these ideas and instead recommended that our client do something else, such as running fewer concurrent tasks or purchasing more memory. We continue to make changes to Prince itself to reduce memory usage and churn.

A little more about fragmentation

In the opening of this article I said that 1.5GB is not a lot by today's standards for such a task, generating a 1,500-page PDF. It's also true that most developers, most of the time, do not need to worry about memory fragmentation. That being said, I'd like to describe two other effects of memory fragmentation.

Cache locality

A modern CPU has multiple layers of memory: while we have multiple gigabytes of slow RAM, we have only megabytes of fast last-level cache and kilobytes of very fast first-level cache. Fitting as much information as possible within a smaller amount of memory increases the chance that other relevant information is also cached, due to its locality. A typical processor has 64-byte cache lines, so this only applies directly to object sizes less than 64 bytes (assuming they're aligned). Nevertheless, as long as 16-byte objects are the most common this is very relevant: if a 16-byte object is loaded into the cache but its neighbouring cells are empty, then three quarters of that cache line is wasted.

Page locality

A similar problem occurs at the level of pages (which correspond to blocks in BDWGC). These days, thanks to very large memories, page faults are rare, so many developers assume paging is not a problem in practice. This is inaccurate. When the processor translates a virtual memory address into a physical address, it does so using page tables configured by the OS. Reading page tables involves reading memory, which we know is slow. A special cache called a Translation Lookaside Buffer (TLB) is used to speed up address translation. TLBs are very limited in size, so reducing the number of pages your program touches will improve performance by reducing TLB misses.

References

  • Mercury supports memory profiling when using a .memprof grade. In such a grade the standard library predicate benchmark.report_memory_attribution/4 will write to a file a list of types, and the number of values of each type that are currently reachable in memory. This is useful for learning about long-lived memory allocations.

  • The deep profiler and memory profiler both provide information on per-predicate memory allocation. These show all allocations, which is useful for understanding memory churn.

  • BDWGC provides a GC_dump() function which will dump various information, including a list of all blocks with their basic information. This list is provided in a CSV format that, when put into a separate file, can be processed by a simple Python script to generate the above tables, graphs and more.

  • Richard Jones' Garbage Collection Handbook is a great general resource about garbage collection and also provides detailed information on many influential collectors. I reviewed it for a friend and colleague’s website.

  • Update: I linked this article on the BDWGC mailing list and Bruce Hoult replied with some helpful suggestions. Thanks Bruce.