An optimization of free() using boundary tags

Last week, I was concerned about the performance of the coalescing algorithm. In particular - could I be smarter about coalescing by trading off memory for time? Memory is “cheap”, so I’m happy to store some extra bytes that facilitate the coalescing process.

As a reminder, the two embedded data structures in the heap:

struct FreeBlockHeader {
    size_t size; // size of free space
    FreeBlockHeader* next; // location of next free block
};

struct AllocBlockHeader {
    size_t size; // size of allocated space
};

The coalescing algorithm works by repeatedly comparing the head of the free list against every other node, merging adjacent blocks, until no more coalescing is possible.
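As a baseline, that quadratic approach can be sketched like this. The struct is the one from above; merge_once and coalesce are names I made up for the sketch, and only merges of blocks that sit after the head are shown:

```cpp
#include <cstddef>
#include <cstdint>

struct FreeBlockHeader {
    size_t size;               // size of free space (payload, excluding header)
    FreeBlockHeader* next;     // location of next free block
};

// One pass: scan the whole free list for a node that sits immediately
// after the head in memory, and absorb it. Returns true if a merge happened.
bool merge_once(FreeBlockHeader* head) {
    uint8_t* head_end = reinterpret_cast<uint8_t*>(head + 1) + head->size;
    FreeBlockHeader* prev = head;
    for (FreeBlockHeader* cur = head->next; cur; prev = cur, cur = cur->next) {
        if (reinterpret_cast<uint8_t*>(cur) == head_end) {
            head->size += sizeof(FreeBlockHeader) + cur->size; // absorb cur
            prev->next = cur->next;                            // unlink cur
            return true;
        }
    }
    return false;
}

// Repeat full scans until nothing merges: worst case O(n) passes of O(n) work.
void coalesce(FreeBlockHeader* head) {
    while (head && merge_once(head)) {}
}
```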

Rather than iterating through the free list, we can jump directly to the blocks immediately before and after the newly freed block and check whether they are free.

  • if the block before is free, then merge the newly freed block into the block before.
  • if the block after is free, then merge the block after into the newly freed block.
  • if both blocks before and after are free, then merge everything into the block before.

The problem, then, is checking if the previous and following blocks are free. Instead of iterating through the memory area, let’s just use some extra memory in the headers and also create a footer.

struct FreeBlockHeader {
    size_t size;
    bool free;
    FreeBlockHeader* next;
};

struct AllocBlockHeader {
    size_t size;
    bool free;
};

struct BlockMeta { // place at the end of every block (free or not)
    size_t size;
    bool free;
};

This way - we can jump efficiently to the blocks straddling the newly freed block.

  • To check if the previous block is free, move the pointer back to the previous block’s footer and check its free bit.
    • If the previous block is free, then move the pointer back by size to the header of the previous block and coalesce. Rewrite the header and footer with the appropriate post-coalescing information.
  • To check if the following block is free, move the pointer forward past the freed block and its footer to the following block’s header, and check its free bit.
    • If the following block is free, then rewrite the header and footer with the appropriate information. Remember to copy the next pointer into the header of the new coalesced block.

Each iteration of free() + coalesce() guarantees a coalesce-free free list (i.e. a free list with no remaining coalesce-able blocks).
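Under some stated assumptions (my own names and layout: one BlockMeta at each end of every block, header and footer kept in sync, free-list link maintenance omitted to keep the sketch short), the pointer arithmetic might look like:

```cpp
#include <cstddef>
#include <cstdint>

// Layout assumed for this sketch: [meta][payload...][meta].
struct BlockMeta {
    size_t size;   // payload size in bytes
    bool   free;
};

uint8_t* payload_of(BlockMeta* h) { return reinterpret_cast<uint8_t*>(h + 1); }

BlockMeta* footer_of(BlockMeta* h) {
    return reinterpret_cast<BlockMeta*>(payload_of(h) + h->size);
}

// The footer of the block physically before h sits immediately below h.
BlockMeta* prev_footer(BlockMeta* h) {
    return reinterpret_cast<BlockMeta*>(h) - 1;
}

// The header of the block physically after h sits just past h's footer.
BlockMeta* next_header(BlockMeta* h) { return footer_of(h) + 1; }

void write_meta(BlockMeta* h, size_t size, bool free) {
    h->size = size;
    h->free = free;
    *footer_of(h) = *h;   // keep the header and footer copies in sync
}

// Coalesce the newly freed block at h with its physical neighbours.
// heap_lo/heap_hi bound the heap so we never read past its edges.
BlockMeta* coalesce(BlockMeta* h, uint8_t* heap_lo, uint8_t* heap_hi) {
    // Block after is free -> merge it into h.
    BlockMeta* nxt = next_header(h);
    if (reinterpret_cast<uint8_t*>(nxt) < heap_hi && nxt->free) {
        write_meta(h, h->size + 2 * sizeof(BlockMeta) + nxt->size, true);
    }
    // Block before is free -> merge h (possibly already grown) into it.
    if (reinterpret_cast<uint8_t*>(h) > heap_lo) {
        BlockMeta* pf = prev_footer(h);
        if (pf->free) {
            // Jump back over the previous block's payload to its header.
            BlockMeta* prev = reinterpret_cast<BlockMeta*>(
                reinterpret_cast<uint8_t*>(pf) - pf->size) - 1;
            write_meta(prev, prev->size + 2 * sizeof(BlockMeta) + h->size, true);
            h = prev;
        }
    }
    return h;
}
```

Note that the both-neighbours-free case falls out naturally: the following block is first merged into h, and the grown h is then merged into the previous block.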

Notice how each Header/Footer has the same information. We can simplify:

struct BlockMeta { // place at header + footer of every block (free or not)
    size_t size;
    bool free;
};

struct FreeBlockHeader {
    BlockMeta meta;
    FreeBlockHeader* next;
};
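A nice side effect of putting meta first (a small sketch; this follows from how struct layout works): a free block’s header and its meta live at the same address, so code that only cares about size and the free bit can treat every block header uniformly as a BlockMeta, free or allocated.

```cpp
#include <cstddef>

struct BlockMeta { // shared by every block's header and footer
    size_t size;
    bool   free;
};

// meta is the first member, so &header and &header.meta are the same
// address; an allocated block's header is just a bare BlockMeta.
struct FreeBlockHeader {
    BlockMeta meta;
    FreeBlockHeader* next;
};
```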

The code is mostly tedious pointer arithmetic and copying – so I chose not to write out the full optimized implementation.

Something to remember: the addition of the extra memoized information is a tradeoff. I’m buying time at the cost of two things:

  1. the extra overhead of maintaining copies of information (the header and footer free booleans + sizes need to match), and
  2. more pure memory overhead (there’s less memory available for use by the application now).

Allocators in the wild - multithreading

“Reality has a surprising amount of detail” - always. My simple memory allocator barely scratches the surface of what a production-grade memory allocator might consider e.g. cache locality, multithreading support, etc.

The most common keyword I’ve encountered while researching memory allocators is “multithreading”. Before multi-core processors were ubiquitous, memory allocators didn’t need to think about multithreading - memory would be accessed by one thread, and one thread only.

The naive way of extending the memory allocation model to support multithreading is by putting the free list behind a lock; only one thread is allowed to manipulate the free list at a time. But that comes with a serious performance degradation - if only one thread can manipulate memory at a time, what’s the point of multithreading?
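The naive lock approach can be sketched in a few lines. Here std::malloc/std::free stand in for the free-list code from earlier, and my_malloc/my_free are names I made up; the point is only the locking pattern:

```cpp
#include <cstdlib>
#include <mutex>

// One global lock around the whole allocator: correct, but every
// allocating thread serializes on g_heap_lock.
static std::mutex g_heap_lock;

void* my_malloc(std::size_t n) {
    std::lock_guard<std::mutex> guard(g_heap_lock);
    return std::malloc(n);  // stand-in for free-list allocation
}

void my_free(void* p) {
    std::lock_guard<std::mutex> guard(g_heap_lock);
    std::free(p);           // stand-in for free() + coalesce()
}
```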

So let’s take a look at ptmalloc2 to see how it tackles the multithreading problem.

ptmalloc2

The core idea of ptmalloc2 is that rather than placing a singular free list behind a lock, why not create multiple free lists – one for each thread?

ptmalloc2 creates a memory arena for each thread. The memory of the main thread is managed by the main memory arena. Thread 1’s memory is managed by the thread 1 arena. The idea is then simple: treat each arena-thread pair as a single-threaded ecosystem.

But what happens when there are more threads than arenas? Only then do we resort to the lock method - put each arena behind a lock. This way, we leverage multiple cores effectively - instead of pumping every allocation request through a single lock, we shard the requests across different spaces so that we can service multiple requests at once.
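The sharding idea can be sketched like this. To be clear, this is a simplification in the spirit of ptmalloc2, not its actual mechanism - real ptmalloc2 attaches an arena to each thread and tries other arenas when its own is contended - and all the names here are mine:

```cpp
#include <cstdlib>
#include <functional>
#include <mutex>
#include <thread>

// A fixed pool of arenas, each with its own lock; a thread hashes its id
// to pick an arena, so contention only happens within one shard.
constexpr int kArenas = 8;

struct Arena {
    std::mutex lock;
    // ...this arena's free list would live here...
};

static Arena g_arenas[kArenas];

Arena& arena_for_this_thread() {
    std::size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
    return g_arenas[h % kArenas];
}

void* sharded_malloc(std::size_t n) {
    Arena& a = arena_for_this_thread();
    std::lock_guard<std::mutex> guard(a.lock);
    return std::malloc(n);  // stand-in for allocating from a's free list
}
```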

Next week, we’ll look more deeply into allocators in the wild, and see what tradeoffs each allocator makes.
