# How snmalloc Manages Address Space

Like any modern, high-performance allocator, `snmalloc` contains multiple layers of allocation.
We give here some notes on the internal orchestration.

## From platform to malloc

Consider a first, "small" allocation (typically less than a platform page); such allocations showcase more of the machinery.
For simplicity, we assume that

- this is not an `OPEN_ENCLAVE` build,
- the `BackendAllocator` has not been told to use a `fixed_range`,
- this is not a `SNMALLOC_CHECK_CLIENT` build, and
- (as a consequence of the above) `SNMALLOC_META_PROTECTED` is not `#define`-d.

Since this is the first allocation, all the internal caches will be empty, and so we will hit all the slow paths.
For simplicity, we gloss over much of the "lazy initialization" that would actually be implied by a first allocation.

1. `LocalAlloc::small_alloc` finds that it cannot satisfy the request because its `LocalCache` lacks a free list for this size class.
   The request is delegated, unchanged, to `CoreAllocator::small_alloc`.

2. The `CoreAllocator` has no active slab for this sizeclass, so `CoreAllocator::small_alloc_slow` delegates to `BackendAllocator::alloc_chunk`.
   At this point, the allocation request is enlarged to one or a few chunks (a small integer multiple of `MIN_CHUNK_SIZE`, which is typically 16KiB); see `sizeclass_to_slab_size`.

3. `BackendAllocator::alloc_chunk` now splits the allocation request in two, allocating both the chunk's metadata structure (of size `PAGEMAP_METADATA_STRUCT_SIZE`) and the chunk itself (a multiple of `MIN_CHUNK_SIZE`).
   Because the two exercise similar machinery, we track them in parallel in prose despite their sequential nature.

4. The `BackendAllocator` has a chain of "range" types that it uses to manage address space.
   By default (and in the case we are considering), that chain begins with a per-thread "small buddy allocator range".

   1. For the metadata allocation, the size is (well) below `MIN_CHUNK_SIZE`, and so this allocator, which by supposition is empty, attempts to `refill` itself from its parent.
      This results in a request for a `MIN_CHUNK_SIZE` chunk from the parent allocator.

   2. For the chunk allocation, the size is `MIN_CHUNK_SIZE` or larger, so this allocator immediately forwards the request to its parent.

5. The next range allocator in the chain is a per-thread *large* buddy allocator that refills in 2 MiB granules.
   (2 MiB is chosen because it is a typical superpage size.)
   At this point, both requests are for at least one and no more than a few times `MIN_CHUNK_SIZE` bytes.

   1. The first request will `refill` this empty allocator by making a request for 2 MiB to its parent.

   2. The second request will stop here, as the allocator will no longer be empty.

6. The chain continues with a `CommitRange`, which simply forwards all allocation requests and (upon unwinding) ensures that the address space is mapped.

7. The chain now transitions from thread-local to global; the `GlobalRange` simply serves to acquire a lock around the rest of the chain.

8. The next entry in the chain is a `StatsRange`, which serves to accumulate statistics.
   We ignore this stage and continue onwards.

9. The next entry in the chain is another *large* buddy allocator, which refills at 16 MiB but can hold regions of any size up to the entire address space.
   The first request triggers a `refill`, continuing along the chain as a 16 MiB request.
   (Recall that the second allocation will be handled at an earlier point on the chain.)

10. The penultimate entry in the chain is a `PagemapRegisterRange`, which always forwards allocations along the chain.

11. At long last, we have arrived at the last entry in the chain, a `PalRange`.
    This delegates the actual allocation, of 16 MiB, to either the `reserve_aligned` or `reserve` method of the Platform Abstraction Layer (PAL).

12. Having wound the chain onto our stack, we now unwind!
    The `PagemapRegisterRange` ensures that the Pagemap entries for allocations passing through it are mapped and returns the allocation unaltered.

13. The global large buddy allocator splits the 16 MiB refill into 8, 4, and 2 MiB regions, which it retains, and returns the remaining 2 MiB back along the chain.

14. The `StatsRange` makes its observations, the `GlobalRange` now unlocks the global component of the chain, and the `CommitRange` ensures that the allocation is mapped.
    Aside from these side effects, these stages propagate the allocation along the chain unaltered.

15. We now arrive back at the thread-local large buddy allocator, which takes its 2 MiB refill and breaks it down into powers of two down to the requested `MIN_CHUNK_SIZE`.
    The second allocation (of the chunk) will either return or again break down one of these intermediate chunks.

16. For the first (metadata) allocation, the thread-local *small* buddy allocator breaks the `MIN_CHUNK_SIZE` allocation down into powers of two down to `PAGEMAP_METADATA_STRUCT_SIZE` and returns one of that size.
    The second allocation will have been forwarded and so is not additionally handled here.

Exciting, no?
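
The repeated power-of-two breakdown performed in steps 13, 15, and 16 can be modelled in a few lines. This is a sketch, not snmalloc's actual `refill` logic; the sizes and the free-list representation are simplifying assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a buddy allocator's refill path: repeatedly halve a
// power-of-two refill until it matches the requested power-of-two size,
// retaining one buddy from each split for later requests. The leftover
// piece (of size `request`) satisfies the original allocation.
std::vector<size_t> split_refill(size_t refill, size_t request)
{
  std::vector<size_t> retained;
  while (refill > request)
  {
    refill /= 2;                // split the region into two buddies
    retained.push_back(refill); // keep one half on the free list
  }
  return retained;
}
```

With a 16 MiB refill and a 2 MiB request (step 13), this retains 8, 4, and 2 MiB regions and leaves 2 MiB to return; with a 2 MiB refill and a 16 KiB (`MIN_CHUNK_SIZE`) request (step 15), it retains regions of 1 MiB down through 16 KiB.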

## What Can I Learn from the Pagemap?

### Decoding a MetaEntry

The centerpiece of `snmalloc`'s metadata is its `PageMap`, which associates each "chunk" of the address space (~16KiB; see `MIN_CHUNK_BITS`) with a `MetaEntry`.
A `MetaEntry` is a pair of pointers, suggestively named `meta` and `remote_and_sizeclass`.
In more detail, `MetaEntry`s are better represented by Sigma and Pi types, all packed into two pointer-sized words in ways that preserve pointer provenance on CHERI.

To begin decoding, a bit (`REMOTE_BACKEND_MARKER`) in `remote_and_sizeclass` distinguishes chunks owned by frontend and backend allocators.

For chunks owned by the *frontend* (`REMOTE_BACKEND_MARKER` not asserted),

1. the `remote_and_sizeclass` field is a product of

   1. a `RemoteAllocator*` indicating the `LocalAlloc` that owns the region of memory, and

   2. a "full sizeclass" value (itself a tagged sum type between large and small sizeclasses);

2. the `meta` pointer is a bit-stuffed pair of

   1. a pointer to a larger metadata structure, whose type depends on the role of this chunk, and

   2. a bit (`META_BOUNDARY_BIT`) that serves to limit chunk coalescing on platforms where that may not be possible, such as CHERI.

See `src/backend/metatypes.h` and `src/mem/metaslab.h`.
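
As a toy illustration of this kind of bit-stuffing (the field width and mask here are assumptions for exposition, not snmalloc's actual encoding, which lives in the headers above), a sizeclass can ride in the low bits of a suitably aligned pointer:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of packing a "full sizeclass" into the low bits of an
// aligned RemoteAllocator pointer. The 7-bit mask is an assumption;
// it requires the pointer to be at least 128-byte aligned so its low
// bits are free for stuffing.
constexpr uintptr_t SIZECLASS_MASK = 0x7F;

uintptr_t pack(uintptr_t remote, uintptr_t sizeclass)
{
  return remote | sizeclass; // low bits of `remote` are zero by alignment
}

uintptr_t unpack_remote(uintptr_t packed)
{
  return packed & ~SIZECLASS_MASK;
}

uintptr_t unpack_sizeclass(uintptr_t packed)
{
  return packed & SIZECLASS_MASK;
}
```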

For chunks owned by a *backend* (`REMOTE_BACKEND_MARKER` asserted), there are again multiple possibilities.

For chunks owned by a *small buddy allocator*, the remainder of the `MetaEntry` is zero.
That is, it appears to have small sizeclass 0 and an implausible `RemoteAllocator*`.

For chunks owned by a *large buddy allocator*, the `MetaEntry` is instead a node in a red-black tree of all such chunks.
Its contents can be decoded as follows:

1. The `meta` field's `META_BOUNDARY_BIT` is preserved, with the same meaning as in the frontend case, above.

2. `meta` (resp. `remote_and_sizeclass`) includes a pointer to the left (resp. right) *chunk* of address space.
   (The corresponding child *node* in this tree is found by taking the *address* of this chunk and looking up the `MetaEntry` in the Pagemap.
   This trick of pointing at the child's chunk rather than at the child `MetaEntry` is particularly useful on CHERI:
   it allows us to capture the authority to the chunk without needing another pointer and costs just a shift and add.)

3. The `meta` field's `LargeBuddyRep::RED_BIT` is used to carry the red/black color of this node.

See `src/backend/largebuddyrange.h`.
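
The "shift and add" lookup described above can be sketched as follows; the flat-array pagemap and the names here are simplifying assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of finding a child tree node from the child chunk's address:
// shift the address down by MIN_CHUNK_BITS to obtain a chunk index,
// then add that (scaled) offset to the pagemap base.
constexpr uintptr_t MIN_CHUNK_BITS = 14; // 16 KiB chunks

size_t chunk_index(uintptr_t chunk_address)
{
  return static_cast<size_t>(chunk_address >> MIN_CHUNK_BITS);
}

// node = &pagemap[chunk_index(addr)]: one shift plus one (scaled) add,
// and on CHERI the chunk pointer itself carries the needed authority.
```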

### Encoding a MetaEntry

We can also consider the process for generating a `MetaEntry` for a chunk of the address space given its state.
The following cases apply:

1. The address is not associated with `snmalloc`:
   Here, the `MetaEntry`, if it is mapped, is all zeros, and so it...
   * has `REMOTE_BACKEND_MARKER` clear in `remote_and_sizeclass`.
   * appears to be owned by a frontend `RemoteAllocator` at address 0 (probably, but not certainly, `nullptr`).
   * has "small" sizeclass 0, which has size 0.
   * has no associated metadata structure.

2. The address is part of a free chunk in a backend's Large Buddy Allocator:
   The `MetaEntry`...
   * has `REMOTE_BACKEND_MARKER` asserted in `remote_and_sizeclass`.
   * has "small" sizeclass 0, which has size 0.
   * has the remainder of its structure interpreted as a Large Buddy Allocator red-black tree node.
   * has no associated metadata structure.

3. The address is part of a free chunk inside a backend's Small Buddy Allocator:
   Here, the `MetaEntry` is zero aside from the asserted `REMOTE_BACKEND_MARKER` bit, and so it...
   * has "small" sizeclass 0, which has size 0.
   * has no associated metadata structure.

4. The address is part of a live large allocation (spanning one or more 16KiB chunks):
   Here, the `MetaEntry`...
   * has `REMOTE_BACKEND_MARKER` clear in `remote_and_sizeclass`.
   * has a *large* sizeclass value.
   * has an associated `RemoteAllocator*` and `Metaslab*` metadata structure
     (holding just the original chunk pointer in its `MetaCommon` substructure;
     it is configured to always trigger the deallocation slow path, skipping the logic used when a chunk is in use as a slab).

5. The address, whether or not it is presently within an allocated object, is part of an active slab.
   Here, the `MetaEntry`...
   * encodes the *small* sizeclass of all objects in the slab.
   * has a `RemoteAllocator*` referencing the owning `LocalAlloc`'s message queue.
   * points to the slab's `Metaslab` structure containing additional metadata (e.g., the free list).
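
A hypothetical introspection routine could dispatch over these cases roughly as follows. The bit position of `REMOTE_BACKEND_MARKER` and the plain-word treatment of the fields are invented for illustration; only the case structure comes from the text above:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified dispatch over the MetaEntry states enumerated above.
// REMOTE_BACKEND_MARKER's actual position is defined by snmalloc;
// the value here is an assumption for illustration only.
constexpr uintptr_t REMOTE_BACKEND_MARKER = uintptr_t(1) << 62;

std::string classify(uintptr_t remote_and_sizeclass, uintptr_t meta)
{
  if (remote_and_sizeclass & REMOTE_BACKEND_MARKER)
  {
    // Backend-owned (cases 2 and 3): a large buddy allocator leaves
    // red-black tree state in the entry; a small buddy allocator
    // leaves the remainder zero.
    if ((remote_and_sizeclass & ~REMOTE_BACKEND_MARKER) == 0 && meta == 0)
      return "free in a small buddy allocator";
    return "free in a large buddy allocator";
  }
  if (remote_and_sizeclass == 0 && meta == 0)
    return "not snmalloc memory"; // case 1: an all-zero MetaEntry
  return "live frontend memory";  // cases 4 and 5: large alloc or slab
}
```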