the seduction
someone on your team has an idea.
"what if we stored the linked list... in arrays?"
instead of nodes scattered across the heap, connected by pointers that chase each other through memory like a drunk trying to find their keys. we use three contiguous arrays:
data[] = [ X, A, B, C, D, E, F, Y ]
next[] = [ 1, 2, 3, 4, 5, 6, 7, -1]
prev[] = [-1, 0, 1, 2, 3, 4, 5, 6 ]
data[i] holds the value. next[i] tells you the index of the next element. prev[i] tells you the previous.
- no malloc/free per operation: memory is pre-allocated
- cache-friendly: arrays are contiguous, right?
- O(1) everything: deletion is just pointer rewiring
- index stability: elements don't move, just get "unlinked"
this is called a static linked list (the textbook name; you'll also see "cursor implementation") or an array-backed linked list. john carmack used variants of this in quake's entity system. the linux kernel's hlist is a cousin. it's real, it's battle-tested, and it's everywhere heap fragmentation is terrifying.
and sometimes it's exactly right.
but the reasons people reach for it are almost never the reasons it works.
the promise (a deletion with actual numbers)
trace a deletion.
say we want to delete index 3 (the node holding 'C'). the operation is:
- look up
prev[3]→ 2 (B is our left neighbor) - look up
next[3]→ 4 (D is our right neighbor) - set
next[2] = 4(B now points past C to D) - set
prev[4] = 2(D now points back past C to B)
two array reads, two array writes. no memory allocation. no pointer chasing across cache lines. no calling free() and praying the allocator doesn't fragment.
the data at index 3 is still there. we just routed around it. if you walked the raw array with a debugger, you'd see C sitting between B and D, fully intact, no indication anything happened. it is present in every physical sense and absent in every logical one.
play with the visualizer above. navigate to any node. hit "unlink". watch the next[] and prev[] arrays get rewritten. watch the "logical" row at the bottom update. then hit "relink" and watch the node step back into the chain.
the LRU cache: a perfect marriage
an LRU (least recently used) cache needs two operations to be O(1):
- lookup by key → return value instantly
- touch → move accessed item to "most recently used" position
the classic implementation: a hash map for lookup + a doubly linked list for ordering.
with heap-allocated nodes, every access does:
- hash lookup → O(1)
- unlink node from current position → O(1)
- link node at head → O(1)
but each pointer dereference is a potential cache miss. the node could be anywhere on the heap, and you can't read its neighbors until the node itself arrives from memory. those dependent loads are the expensive part: a trip to RAM runs roughly 200 to 400 cycles on modern hardware, and the CPU mostly just waits. (the writes are cheaper than they look. stores get buffered; loads stall.)
with array-backed:
keys[] = [ "A", "B", "C" ]
vals[] = [ 100, 200, 300 ]
next[] = [ 1, -1, 0 ] // C -> A -> B -> end
prev[] = [ 2, 0, -1 ]
touching "A" means:
- unlink index 0 (2 array writes:
next[2] = 1,prev[1] = 2) - reinsert at the head (3 or 4 more:
next[0] = 2,prev[2] = 0,prev[0] = -1, plus updating the head index)
all of it lands in the same few cache lines. the data never moves. all that changes is the routing tables of a fixed topology.
for hot caches (millions of ops per second), you're trading pointer chasing for cache-line-local integer arithmetic. array-backed LRU can be substantially faster than pointer-based, depending on cache size and access pattern.
the swiss cheese problem
delete a couple of nodes. then a couple more from the middle.
what does your memory look like now?
data[] = [ X, -, B, -, -, E, -, Y ]
(- = dead slot, still allocated)
you have 8 slots allocated. 4 are actually in use. memory utilization: 50%, and it only degrades from here.
you can't reclaim those slots without:
- tracking free slots in a free list (another linked list, often implemented... the same way)
- or compacting the array (which invalidates all external indices)
"just use a free list," you say. fine. now you have two parallel-array linked lists to maintain. allocation pops the head of the free list (O(1), to be fair). deletion pushes onto it. the asymptotics survive. the bookkeeping is now yours: another structure to initialize, thread through the holes, and keep correct.
and if you're doing this to avoid heap fragmentation, you've recreated heap fragmentation inside your array. unless you are the allocator.
the memory allocator: where this pattern was born
doug lea wrote dlmalloc in 1987. it became the basis for glibc's malloc (ptmalloc2), which means every C program on linux has been using this pattern for decades. the dlmalloc/ptmalloc family threads explicit free lists through the freed blocks themselves. the newer allocators (jemalloc in firefox and redis, tcmalloc at google, mimalloc at microsoft) mix in other machinery (jemalloc tracks its small size classes with bitmaps, for instance), but freed-memory-as-data-structure is the shared lineage.
the core idea: each free block contains:
size(how big is this chunk)prev_free(offset to previous free block)next_free(offset to next free block)
when you free(ptr):
- mark the block as free
- link it into the free list
- coalesce with adjacent free blocks
no malloc/free calls. you're managing memory using the memory you're managing.
why does it work this way? because of a recursive dependency.
to track fragmented memory, you need a data structure (a list) that grows and shrinks. if you stored this list in external memory, adding a node would require an allocation.
but you are the allocator.
the allocator needs to track a newly freed block. to track it, it needs a node for its free list. to create that node, it needs to allocate memory. to allocate memory, it needs to consult the free list. to consult the free list, it needs the node it hasn't created yet. every solution requires the solution.
the solution is intrusive metadata.
the free list doesn't live about the free blocks; it lives within them. the next and prev pointers occupy the exact bytes that will eventually hold user data. when a block is free, it is structure. when it is allocated, it is raw payload.
it is the classic way to break the recursion. (not the only way: bitmap allocators and out-of-band metadata exist. but it is the oldest and the most direct.)
there's no layer beneath you. the holes are your problem. the coalescing is your job. the free list threading through dead space is your responsibility.
are you handling this as well as the people who've spent 40 years on allocator design?
the cache you don't get (and its exception)
"but arrays are contiguous! cache-friendly!"
yes, data[] is contiguous. but the traversal doesn't go in order. it follows next[] pointers, which might say:
next[] = [ 7, 3, 7, 5, -1, 7, 1, -1 ]
your traversal order through data[] is: 0 → 7 → (end). or maybe 1 → 3 → 5 → 7 → (end).
you're still jumping around. the array is contiguous, but your access pattern isn't.
compare to a plain array where you just do:
for i in 0..n:
process(data[i])
that's sequential access. the prefetcher loves it. cache lines are fully utilized.
with array-backed linked lists, you're doing:
i = head
while i != -1:
process(data[i])
i = next[i]
every next[i] is a data-dependent load. the CPU can't predict where you're going next. it has to wait for next[i] to resolve before it can even issue the fetch for data[next[i]]. when the line isn't already cached, that wait is the same 200-to-400-cycle trip to RAM from §1.2, except now it's serialized: one full trip per hop. (the viz below charges a simplified flat 100 cycles per miss; the ratio is what matters.)
say each element is 8 bytes, so a 64-byte cache line holds 8 of them. if your hops through data[] average a stride of 8 elements or more, each line you fetch contributes one useful element: 8 bytes out of 64 = 12.5% efficiency.
a plain array traversal? 100%.
but what about the game engine?
doom (1993) had hard caps: 128 vissprites, 128 visplanes, 256 drawsegs per rendered scene. quake had entity limits. even modern engines like unity and unreal use pooled allocators for hot paths.
games typically have entity caps:
- max 256 enemies
- max 1024 projectiles
- max 64 players
you pre-allocate pools for each type:
struct EntityPool {
Entity data[256];
int16_t next[256]; // free list threading
int16_t prev[256]; // active list threading
int16_t free_head;
int16_t active_head;
};games don't just traverse the linked list. they ALSO iterate the array sequentially.
every frame:
// physics update: sequential iteration (cache-perfect)
for (int i = 0; i < 256; i++) {
if (alive[i]) updatePhysics(&data[i]);
}
// spatial query: linked list traversal (via the structure)
for (int i = near_head; i != -1; i = near_next[i]) {
checkCollision(&data[i]);
}the sequential update is O(n) through the array. the linked structure is for selective operations: "give me all entities near the player."
you get both access patterns from the same data. sequential when you want it (iterate the array). selective when you need it (follow the links). the structure provides an alternative access pattern that IS cache-friendly: sequential iteration over the raw array.
and there's more. want a "near player" list for AI updates? add near_next[], near_prev[]. want a "needs render" list for culling? add render_next[], render_prev[].
each entity can be a member of multiple linked lists simultaneously. the lists are external to the data.
// Entity 47 is simultaneously in:
// - the active list (enemies that exist)
// - the near-player list (within 50 meters)
// - the render list (on screen)
// - the damage-over-time list (poisoned)
// One entity, four list memberships,
// 16 bytes of indices in the side arraysintrusive lists can do this too. the linux kernel embeds several list_heads in one struct all the time. what the parallel-array version buys you is different. it's non-invasive: adding a list never touches the entity struct, so you can thread a new list through a type you don't control. the links are 2-byte indices instead of 8-byte pointers, so those four memberships cost 16 bytes instead of 64. and a list you aren't currently walking lives in its own arrays, off to the side, so cold lists don't take up room in the entity's cache lines.
with parallel arrays, you just add more arrays. the entity doesn't know or care how many lists it's in.
indices that move anyway (and dancing links)
"but at least indices are stable! external code can cache index 5 and know it always means the same element."
until you compact.
eventually the swiss cheese gets so bad you have to defragment. when you compact:
- element at index 7 moves to index 3
- element at index 5 moves to index 4
- every external reference is now wrong
the options are:
- never compact (memory bloats forever)
- invalidate all external references (break all callers)
- add an indirection layer (index → generation → actual slot)
option 3 is called generational indices or slot maps. it's the right answer for long-lived systems. but now your "simple array index" is a struct with an index AND a generation counter.
dancing links: where stability is the whole point
in 2000, donald knuth published a paper with one of the best names in computer science: "Dancing Links."
the problem: exact cover. given a matrix of 0s and 1s, find a subset of rows such that each column has exactly one 1. sounds abstract until you realize: sudoku is exact cover. pentomino puzzles are exact cover. N-queens is generalized exact cover (the diagonal constraints become secondary columns that may be covered at most once). scheduling problems, bin packing, graph coloring. all reduce to this.
knuth's solving algorithm is brute-force backtracking: pick a column, try each row with a 1 in it, recurse on the reduced matrix, undo if stuck. he called it Algorithm X. just X. no name, no acronym, no flourish. the search is brute force. what makes it fast is how you undo.
represent the sparse matrix as a grid of doubly-linked nodes. each node links to its neighbors in four directions (left, right, up, down). Algorithm X + dancing links = DLX.
the key operation: temporarily remove a row/column, then restore it perfectly.
remove column c:
next[prev[c]] = next[c]
prev[next[c]] = prev[c]
// c still points to its old neighbors
// they just don't point back
restore column c:
next[prev[c]] = c
prev[next[c]] = c
// c's pointers never changed, so this works
no information is lost. the node still points to its old neighbors. they just don't point back. unwind the recursion, and the structure reassembles.
one rule is load-bearing: restores must happen in the exact reverse order of the removals, or the structure corrupts. each unlinked node is a snapshot of the chain at the moment it left, so the last node out has to be the first one back in; restore out of order and the splices write into a chain that no longer matches anyone's snapshot. backtracking gives you the ordering for free, because recursion unwinds LIFO. the algorithm and the data structure share a spine.
knuth calls this "dancing." nodes are removed and restored, but they never move. they're just temporarily invisible.
with array-backed:
// remove column 2:
next_col[1] = next_col[2] // 1 now points past 2
prev_col[3] = prev_col[2] // 3 now points back past 2
// column 2's values are unchanged
// next_col[2] still = 3, prev_col[2] still = 1
// restore column 2:
next_col[1] = 2
prev_col[3] = 2
because indices are stable AND the unlinked element remembers its neighbors, restoration is trivial. there is no searching: the node's index never changed. there is nothing to remember: it still points where it always did.
the unlinked node's pointers are, by any normal definition, stale. they point to neighbors that no longer acknowledge the node's existence. in any other data structure, stale pointers are a bug. here, the staleness is what makes restoring a node O(1). the information persists because nobody cleaned it up.
this is why DLX implementations prefer parallel arrays. the index stability isn't a weakness. it's the entire mechanism. the "dancing" metaphor becomes literal: elements step out of the chain, but their positions are frozen. step back in, and the choreography resumes. (the visualizer in §1.1 does exactly this: unlink a node, hit relink, watch it step back in.)
the structure enables backtracking that would be awkward with heap nodes (where "restore" might mean re-allocating or scanning for insertion point).
fast sudoku solvers use DLX. so do polyomino tiling programs. if you've ever wondered how those "solve any sudoku in milliseconds" tools work: this is how. the dancing is the algorithm.
the undo stack: same dance, different music
text editors have tried everything: gap buffers (emacs), piece tables (vscode, originally from bravo at xerox parc in 1974), ropes. but for certain operations, the parallel-array linked list shines.
imagine a text editor buffer:
chars[] = [ 'H', 'e', 'l', 'l', 'o' ]
next[] = [ 1, 2, 3, 4, -1 ]
prev[] = [ -1, 0, 1, 2, 3 ]
delete 'l' at index 2:
- save:
{ type: "delete", idx: 2, old_next: 3, old_prev: 1 } - relink:
next[1] = 3,prev[3] = 1
the character is still there. index 2 is still 'l'. we just routed around it.
undo:
- restore:
next[1] = 2,prev[3] = 2
no copying. no reallocation. the undo operation is exactly as cheap as the original edit.
this extends to collaborative editing. operations become index-based patches that can be serialized, transmitted, and inverted. the structure wants to be versioned.
the other costs
three more problems, quickly.
fixed size. the array has a fixed capacity. when you need element N+1, you either fail, pre-allocate huge, or grow the array. growth means reallocation, copying all data, invalidating raw pointers. geometric doubling keeps the total copying linear ( element copies to reach slots, amortized per insert), but each individual doubling is a full-array copy at a moment you don't get to pick. for a data structure that promised "no allocation overhead." (unless you have a hard cap: 128 audio voices, 4096 rigid bodies, 256 sensors. then fixed size is a contract, not a limitation.)
false sharing. each core keeps its own private copy of any cache line it touches, and a write by one core forces every other core to throw its copy away and refetch. now pack next[] into a contiguous array and hand different indices to different threads. next[4] and next[5] are 4 bytes apart, same cache line. thread 1 writes next[4], invalidating thread 2's cache. thread 2 writes next[5], invalidating thread 1's. your "independent" operations are ping-ponging cache lines between cores. with pointer-based nodes on separate allocations this is less likely, though not impossible: two small nodes from the same size class can still land on one line.
debugging. a pointer-based list is self-describing. follow node->next, see the memory. an array-backed list requires the arrays AND the indexing scheme. if next[5] = 3 and data[3] looks corrupted, is it a stale index? a reused slot? a misinterpreted byte? the relationships are implicit in the arithmetic, not explicit in the data.
the topology of data
the array-backed linked list is often sold as a performance hack. a way to get the locality of an array with the flexibility of a list.
you don't get the locality. you get the storage of an array (contiguous allocation) but the access pattern of a list (random hops). you are fighting the hardware's desire to stream.
so why use it? topology.
the practical test is three conditions at once: you need O(1) unlinking (elements come and go constantly), you can't afford heap fragmentation (or you are the heap), and the size of the world is known. 128 voices, 4096 bodies, 81 sudoku cells. when the cap is a fact about the domain rather than a budget you picked, the fixed array stops being a limitation and becomes the point.
standard arrays are about values. vec[i] is just a bucket. if you swap index 4 and 5, the values move. the identity of "slot 4" is generic.
linked lists are about relationships. node->next is a specific pathway. the node exists independently of its location.
the array-backed list is a hybrid: stationary data, moving relationships.
it shines when the rewiring of pointers is the work the algorithm actually does.
- dancing links: the algorithm is the rewiring of neighbors. the data (the matrix 1s and 0s) never moves; only the visibility changes.
- LRU cache: the cache lines are fixed in hardware (or memory slots). the "recency" is an overlay of pointers that shuffles constantly.
- the allocator: the free blocks are the memory. you cannot move them to organize them. you must thread the organization through them.
you are avoiding allocation because the structure must live within the data it organizes.
for everyone else (building a todo app, a particle system, a generic container) it is usually a mistake. you are trading the honest cost of memmove (which CPUs destroy) or malloc (which allocators solve) for the hidden cost of a fragmentation engine you have to maintain yourself.
the structure creates a self-referential universe where data can reorganize itself without ever leaving its home. it is a closed system.
references
the dancing links paper:
- Knuth, Donald E. "Dancing Links." arXiv:cs/0011047, November 2000. Presented at the Oxford-Microsoft Symposium, 1999. The paper that named the technique and showed how it solves exact cover problems (sudoku, polyomino tiling, and, via secondary columns, N-queens).
memory allocators:
- Lea, Doug. "A Memory Allocator." The original documentation for dlmalloc (1987), which became the basis for glibc's ptmalloc2.
- dlmalloc source code. The actual implementation. Last released in 2012, still studied.
- Durner, Dominik, Viktor Leis, and Thomas Neumann. "On the Impact of Memory Allocation on High-Performance Query Processing." DaMoN 2019. Benchmarks jemalloc, tcmalloc, TBBmalloc, and others.
- Leijen, Daan, Benjamin Zorn, and Leonardo de Moura. "Mimalloc: Free List Sharding in Action." Microsoft Research, 2019.
linux kernel linked lists:
- Linked Lists in the Linux Kernel. Official kernel documentation on
list_headandhlist. - linux/include/linux/list.h. The actual source.
hlistis the hash-optimized variant with a single-pointer head.
game engines:
- Sanglard, Fabien. "Quake 3 Source Code Review: Architecture." Deep dive into Carmack's engine design, including entity management.
- Quake source code release. Released under GPL in 1999.
text editor data structures:
- "Piece Table." Wikipedia. The structure is credited to J Strother Moore; its first famous use was the Bravo editor (Lampson and Simonyi, Xerox PARC, 1974).
- "Data Structures for Text Sequences." Crowley, 1998. A comparative survey of buffer structures: gap buffers, piece tables, and friends. (VSCode's "piece tree" is its own later redesign, a piece table whose pieces live in a red-black tree.)
- "Gap Buffer." The technique Emacs uses.
further reading:
- Wilson, et al. "Dynamic Storage Allocation: A Survey and Critical Review." International Workshop on Memory Management, 1995. The definitive survey on allocator design.