the seduction
someone on your team has an idea.
"what if we stored the linked list... in arrays?"
instead of nodes scattered across the heap, connected by pointers that chase each other through memory like a drunk trying to find their keys, we use three contiguous arrays:
data[] = [ X, A, B, C, D, E, F, Y ]
next[] = [ 1, 2, 3, 4, 5, 6, 7, -1]
prev[] = [-1, 0, 1, 2, 3, 4, 5, 6 ]
data[i] holds the value. next[i] tells you the index of the next element. prev[i] tells you the previous.
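in C, the whole structure is just three parallel arrays plus a head index. a minimal sketch (the struct, names, and the fixed capacity are mine, purely for illustration):

#define CAP 8

typedef struct {
    char data[CAP];   // the values (one char per slot here)
    int  next[CAP];   // index of the next element, -1 marks the end
    int  prev[CAP];   // index of the previous element, -1 marks the start
    int  head;        // index of the first element, -1 when empty
} ArrayBackedList;

the pitch: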
- no malloc/free per operation — memory is pre-allocated
- cache-friendly — arrays are contiguous, right?
- O(1) everything — deletion is just pointer rewiring
- index stability — elements don't move, just get "unlinked"
this is called an implicit linked list or array-backed linked list. john carmack used variants of this in quake's entity system. redis uses it for LRU eviction. the linux kernel's hlist is a cousin. it's real, it's battle-tested, and it's everywhere heap fragmentation is terrifying.
and sometimes it's exactly right.
but the reasons people reach for it are almost never the reasons it works.
the promise (with a real masterpiece)
trace a deletion.
say we want to delete index 3 (the node holding 'C'). the operation is:
- look up prev[3] → 2 (B is our left neighbor)
- look up next[3] → 4 (D is our right neighbor)
- set next[2] = 4 (B now points past C to D)
- set prev[4] = 2 (D now points back past C to B)
four array accesses, two reads and two writes. no memory allocation. no pointer chasing across cache lines. no calling free() and praying the allocator doesn't fragment.
the data at index 3 is still there. we just routed around it. if you walked the raw array with a debugger, you'd see C sitting between B and D, fully intact, no indication anything happened. it is present in every physical sense and absent in every logical one.
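in code, the unlink is exactly those four steps. a sketch against bare parallel arrays (head bookkeeping included, everything else omitted):

// logically remove slot i; data[i] is left untouched
void unlink_slot(int next[], int prev[], int *head, int i) {
    int p = prev[i];                  // left neighbor, -1 if i is the head
    int n = next[i];                  // right neighbor, -1 if i is the tail
    if (p != -1) next[p] = n; else *head = n;   // left neighbor skips past i
    if (n != -1) prev[n] = p;                   // right neighbor skips back past i
    // next[i] and prev[i] keep their old values: i is unreachable, not erased
}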
play with the visualizer above. navigate to any node. hit "Unlink". watch the next[] and prev[] arrays get rewritten. watch the "Computed Logical Order" update.
the LRU cache: a perfect marriage
an LRU (least recently used) cache needs two operations to be O(1):
- lookup by key → return value instantly
- touch → move accessed item to "most recently used" position
the classic implementation: a hash map for lookup + a doubly linked list for ordering.
with heap-allocated nodes, every access does:
- hash lookup → O(1)
- unlink node from current position → O(1)
- link node at head → O(1)
but each pointer dereference is a potential cache miss. the node could be anywhere on the heap. 4 pointer writes mean potentially 4 round trips to RAM, ~100 cycles each.
with array-backed:
keys[] = [ "A", "B", "C" ]
vals[] = [ 100, 200, 300 ]
next[] = [ 1, -1, 0 ] // C -> A -> B -> end
prev[] = [ 2, 0, -1 ]
touching "A" means:
- unlink index 0 from its current position (a couple of array writes, all in the same or adjacent cache lines)
- relink it at the head (a few more array writes)
the data never moves. you're not shuffling memory. you're rewriting the routing tables of a fixed topology.
for hot caches (thousands of ops/second), you're trading pointer chasing for cache-line-local integer arithmetic. array-backed LRU can be 2-3x faster than pointer-based.
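a sketch of the touch operation on those parallel arrays (the key → index hash map is assumed and not shown; the head/tail bookkeeping is my own illustration):

// move slot i to the most-recently-used position
void lru_touch(int next[], int prev[], int *head, int *tail, int i) {
    if (*head == i) return;                        // already most recent
    // unlink i from wherever it currently sits
    if (prev[i] != -1) next[prev[i]] = next[i];
    if (next[i] != -1) prev[next[i]] = prev[i];
    if (*tail == i)    *tail = prev[i];
    // relink i at the head
    prev[i] = -1;
    next[i] = *head;
    if (*head != -1) prev[*head] = i;
    *head = i;
}

every read and write lands in two small integer arrays instead of chasing nodes across the heap.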
the swiss cheese problem
delete five nodes. then delete five more. then delete three from the middle.
what does your memory look like now?
data[] = [ X, -, B, -, -, E, -, Y ]
(- = dead slot, still allocated)
you have 8 slots allocated. 4 are still in use. memory utilization: 50%.
you can't reclaim those slots without:
- tracking free slots in a free list (another linked list, often implemented... the same way)
- or compacting the array (which invalidates all external indices)
"just use a free list," you say. fine. now you have two parallel-array linked lists to maintain. every allocation walks the free list. every deletion appends to it. your "O(1)" is actually "O(1) + bookkeeping that adds up."
and if you're doing this to avoid heap fragmentation, you've recreated heap fragmentation inside your array. unless you are the allocator.
the memory allocator: where this pattern was born
doug lea wrote dlmalloc in 1987. it became the basis for glibc's malloc (ptmalloc2), which means every C program on linux has been using this pattern for decades. jemalloc (firefox, redis), tcmalloc (google), mimalloc (microsoft). different lineages, but all variants of the same idea: free lists threaded through the memory they manage.
the core idea: each free block contains:
- size (how big is this chunk)
- prev_free (offset to the previous free block)
- next_free (offset to the next free block)
when you free(ptr):
- mark the block as free
- link it into the free list
- coalesce with adjacent free blocks
no malloc/free calls. you're managing memory using the memory you're managing.
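a toy version of that intrusive layout (a sketch, not dlmalloc's actual header; real allocators pack size bits and flags together, keep size classes, and worry about alignment):

#include <stddef.h>

// the bookkeeping lives inside the block itself; when the block is handed out,
// these same bytes become the user's payload
typedef struct FreeBlock {
    size_t            size;        // how big this chunk is
    struct FreeBlock *prev_free;   // previous block on the free list
    struct FreeBlock *next_free;   // next block on the free list
} FreeBlock;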
why is this the only way? because of a recursive dependency.
to track fragmented memory, you need a data structure (a list) that grows and shrinks. if you stored this list in external memory, adding a node would require an allocation.
but you are the allocator.
the allocator needs to track a newly freed block. to track it, it needs a node for its free list. to create that node, it needs to allocate memory. to allocate memory, it needs to consult the free list. to consult the free list, it needs the node it hasn't created yet. every solution requires the solution.
the solution is intrusive metadata.
the free list doesn't live above the free blocks; it lives within them. the next and prev pointers occupy the exact bytes that will eventually hold user data. when a block is free, it is structure. when it is allocated, it is raw payload.
it is the only way to break the recursion.
there's no layer beneath you. the holes are your problem. the coalescing is your job. the free list threading through dead space is your responsibility.
are you handling this as well as the people who've spent 40 years on allocator design?
the cache you don't get (and its exception)
"but arrays are contiguous! cache-friendly!"
yes, data[] is contiguous. but you're not traversing data[] sequentially. you're following next[] pointers, which might say:
next[] = [ 7, 3, 7, 5, -1, 7, 1, -1 ]
your traversal order through data[] is: 0 → 7 → (end). or maybe 1 → 3 → 5 → 7 → (end).
you're still jumping around. the array is contiguous, but your access pattern isn't.
compare to a plain array where you just do:
for i in 0..n:
process(data[i])
that's sequential access. the prefetcher loves it. cache lines are fully utilized.
with array-backed linked lists, you're doing:
i = head
while i != -1:
process(data[i])
i = next[i]
every next[i] is a data-dependent load. the CPU can't predict where you're going next. it has to wait for next[i] to resolve before it can even issue the fetch for data[next[i]].
if your hops through data[] land on roughly one 8-byte element per 64-byte cache line, you're using 8 of the 64 bytes you fetched: 12.5% efficiency.
a plain array traversal? 100%.
but what about the game engine?
doom (1993) had hard caps: 128 things visible, 256 linedefs per subsector. quake had entity limits. even modern engines like unity and unreal use pooled allocators for hot paths.
games typically have entity caps:
- max 256 enemies
- max 1024 projectiles
- max 64 players
you pre-allocate pools for each type:
struct EntityPool {
    Entity  data[256];
    int16_t next[256]; // free list threading
    int16_t prev[256]; // active list threading
    int16_t free_head;
    int16_t active_head;
};

games don't just traverse the linked list. they ALSO iterate the array sequentially.
every frame:
// physics update: sequential iteration (cache-perfect)
for (int i = 0; i < 256; i++) {
    if (alive[i]) updatePhysics(&data[i]);
}

// spatial query: linked list traversal (via the structure)
for (int i = near_head; i != -1; i = near_next[i]) {
    checkCollision(&data[i]);
}

the sequential update is O(n) through the array. the linked structure is for selective operations: "give me all entities near the player."
you get both access patterns from the same data. sequential when you want it (iterate the array). selective when you need it (follow the links). the structure provides an alternative access pattern that IS cache-friendly: sequential iteration over the raw array.
and there's more. want a "near player" list for AI updates? add near_next[], near_prev[]. want a "needs render" list for culling? add render_next[], render_prev[].
each entity can be a member of multiple linked lists simultaneously without storing multiple pointers per object. the lists are external to the data.
// Entity 47 is simultaneously in:
// - the active list (enemies that exist)
// - the near-player list (within 50 meters)
// - the render list (on screen)
// - the damage-over-time list (poisoned)
// One entity, four list memberships, zero pointers stored in the entity itself

this is impossible with intrusive linked lists (where next/prev live in the node). the node would need N sets of pointers for N lists.
with parallel arrays, you just add more arrays. the entity doesn't know or care how many lists it's in.
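a sketch of what "just add more arrays" looks like (array and function names are mine; the int16_t indices match the pool above):

#include <stdint.h>

#define MAX_ENTITIES 256

// one extra pair of arrays per list; the Entity struct itself stores nothing
int16_t near_next[MAX_ENTITIES], near_prev[MAX_ENTITIES];
int16_t near_head = -1;

// put entity e on the near-player list; its membership in every other list is untouched
void near_list_push(int16_t e) {
    near_prev[e] = -1;
    near_next[e] = near_head;
    if (near_head != -1) near_prev[near_head] = e;
    near_head = e;
}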
indices that move anyway (and dancing links)
"but at least indices are stable! external code can cache index 5 and know it always means the same element."
until you compact.
eventually the swiss cheese gets so bad you have to defragment. when you compact:
- element at index 7 moves to index 3
- element at index 5 moves to index 4
- every external reference is now wrong
the options are:
- never compact (memory bloats forever)
- invalidate all external references (break all callers)
- add an indirection layer (index → generation → actual slot)
option 3 is called generational indices or slot maps. it's the right answer for long-lived systems. but now your "simple array index" is a struct with an index AND a generation counter.
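a sketch of what that indirection looks like (struct and field names are mine):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t index;        // which slot in the arrays
    uint32_t generation;   // which lifetime of that slot
} Handle;

// each slot keeps a generation counter that gets bumped when the slot is reused;
// a handle is valid only while it still matches
bool handle_valid(Handle h, const uint32_t slot_generation[]) {
    return slot_generation[h.index] == h.generation;
}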
dancing links: where stability is the whole point
in 2000, donald knuth published a paper with one of the best names in computer science: "Dancing Links."
the problem: exact cover. given a matrix of 0s and 1s, find a subset of rows such that each column has exactly one 1. sounds abstract until you realize: sudoku is exact cover. pentomino puzzles are exact cover. N-queens is exact cover. scheduling problems, bin packing, graph coloring. all reduce to this.
knuth's solving algorithm is brute-force backtracking: pick a column, try each row with a 1 in it, recurse on the reduced matrix, undo if stuck. he called it Algorithm X. just X. no name, no acronym, no flourish. the search is brute force. what makes it fast is how you undo.
represent the sparse matrix as a grid of doubly-linked nodes. each node links to its neighbors in four directions (left, right, up, down). Algorithm X + dancing links = DLX.
the key operation: temporarily remove a row/column, then restore it perfectly.
remove column c:
next[prev[c]] = next[c]
prev[next[c]] = prev[c]
// c still points to its old neighbors
// they just don't point back
restore column c:
next[prev[c]] = c
prev[next[c]] = c
// c's pointers never changed, so this works
no information is lost. the node still points to its old neighbors. they just don't point back. unwind the recursion, and the structure reassembles.
knuth calls this "dancing." nodes are removed and restored, but they never move. they're just temporarily invisible.
with array-backed:
// remove column 2:
next_col[1] = next_col[2] // 1 now points past 2
prev_col[3] = prev_col[2] // 3 now points back past 2
// column 2's values are unchanged
// next_col[2] still = 3, prev_col[2] still = 1
// restore column 2:
next_col[1] = 2
prev_col[3] = 2
because indices are stable AND the unlinked element remembers its neighbors, restoration is trivial. you're not finding the node. you know exactly where it is. you're not remembering what it pointed to. it still points there.
the unlinked node's pointers are, by any normal definition, stale. they point to neighbors that no longer acknowledge the node's existence. in any other data structure, stale pointers are a bug. here, the staleness is what makes restoration O(1). the information persists because nobody cleaned it up.
this is why DLX implementations prefer parallel arrays. the index stability isn't a weakness. it's the entire mechanism. the "dancing" metaphor becomes literal: elements step out of the chain, but their positions are frozen. step back in, and the choreography resumes.
the structure enables backtracking that would be awkward with heap nodes (where "restore" might mean re-allocating or scanning for insertion point).
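the shape of that backtracking, as a hedged sketch (the helpers are assumed and not shown; full DLX also covers every other column of the chosen row and uncovers them in reverse order, which is omitted here):

#include <stdbool.h>

int  choose_column(void);         // assumed: -1 when no columns remain
int  next_row_in(int c, int r);   // assumed: iterate rows with a 1 in column c; start with r = -1
void cover(int c);                // the O(1) unlink pair shown above
void uncover(int c);              // the O(1) relink pair shown above

bool search(void) {
    int c = choose_column();
    if (c == -1) return true;                          // no columns left: exact cover found
    cover(c);
    for (int r = next_row_in(c, -1); r != -1; r = next_row_in(c, r)) {
        // full DLX: record r, then cover every other column that row r touches
        if (search()) return true;
        // full DLX: uncover those columns in reverse order; nothing is lost,
        // because the unlinked nodes still point at their old neighbors
    }
    uncover(c);                                        // c steps back into the chain
    return false;
}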
fast sudoku solvers use DLX. so do polyomino tiling programs. if you've ever wondered how those "solve any sudoku in milliseconds" tools work: this is how. the dancing is the algorithm.
the undo stack: same dance, different music
text editors have tried everything: gap buffers (emacs), piece tables (vscode, originally from bravo at xerox parc in 1974), ropes. but for certain operations, the parallel-array linked list shines.
imagine a text editor buffer:
chars[] = [ 'H', 'e', 'l', 'l', 'o' ]
next[] = [ 1, 2, 3, 4, -1 ]
prev[] = [ -1, 0, 1, 2, 3 ]
delete 'l' at index 2:
- save: { type: "delete", idx: 2, old_next: 3, old_prev: 1 }
- relink: next[1] = 3, prev[3] = 1
the character is still there. index 2 is still 'l'. we just routed around it.
undo:
- restore: next[1] = 2, prev[3] = 2
no copying. no reallocation. the undo operation is exactly as cheap as the original edit.
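a sketch of the edit/undo pair (struct and function names are mine; head bookkeeping omitted):

typedef struct {
    int idx;        // which slot was unlinked
    int old_next;   // what it pointed to at the time
    int old_prev;
} DeleteOp;

// delete = unlink + remember the slot; the character itself never moves
DeleteOp delete_at(int next[], int prev[], int idx) {
    DeleteOp op = { idx, next[idx], prev[idx] };
    if (prev[idx] != -1) next[prev[idx]] = next[idx];
    if (next[idx] != -1) prev[next[idx]] = prev[idx];
    return op;
}

// undo = point the neighbors back at the slot; its own pointers never changed
void undo_delete(int next[], int prev[], DeleteOp op) {
    if (op.old_prev != -1) next[op.old_prev] = op.idx;
    if (op.old_next != -1) prev[op.old_next] = op.idx;
}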
this extends to collaborative editing. operations become index-based patches that can be serialized, transmitted, and inverted. the structure wants to be versioned.
the other costs
three more problems, quickly.
fixed size. the array has a fixed capacity. when you need element N+1, you either fail, pre-allocate huge, or grow the array. growth means reallocation, copying all data, invalidating raw pointers. even with geometric doubling, every growth step copies the whole array. for a data structure that promised "no allocation overhead." (unless you have a hard cap: 128 audio voices, 4096 rigid bodies, 256 sensors. then fixed size is a contract, not a limitation.)
false sharing. pack next[] into a contiguous array, then hand different indices to different threads. next[4] and next[5] are 4 bytes apart, same cache line. thread 1 writes next[4], invalidating thread 2's cache. thread 2 writes next[5], invalidating thread 1's. your "independent" operations are ping-ponging cache lines between cores. with pointer-based nodes on separate allocations, this rarely happens.
debugging. a pointer-based list is self-describing. follow node->next, see the memory. an array-backed list requires the arrays AND the indexing scheme. if next[5] = 3 and data[3] looks corrupted, is it a stale index? a reused slot? a misinterpreted byte? the relationships are implicit in the arithmetic, not explicit in the data.
the topology of data
the array-backed linked list is often sold as a performance hack. a way to get the locality of an array with the flexibility of a list.
you don't get the locality. you get the storage of an array (contiguous allocation) but the access pattern of a list (random hops). you are fighting the hardware's desire to stream.
so why use it? topology.
the practical test is three conditions at once: you need O(1) unlinking (elements come and go constantly), you can't afford heap fragmentation (or you are the heap), and the size of the world is known. 128 voices, 4096 bodies, 81 sudoku cells. when the cap isn't a constraint but a fact about the domain, the fixed array stops being a limitation and becomes the point. you're not managing memory. you're sculpting a closed universe where things rearrange without ever leaving home.
standard arrays are about values. vec[i] is just a bucket. if you swap index 4 and 5, the values move. the identity of "slot 4" is generic.
linked lists are about relationships. node->next is a specific pathway. the node exists independently of its location.
the array-backed list is a hybrid: stationary data, moving relationships.
it shines when the "dance" of the pointers is the algorithm itself.
- dancing links: the algorithm is the rewiring of neighbors. the data (the matrix 1s and 0s) never moves; only the visibility changes.
- LRU cache: the cache lines are fixed in hardware (or memory slots). the "recency" is an overlay of pointers that shuffles constantly.
- the allocator: the free blocks are the memory. you cannot move them to organize them. you must thread the organization through them.
you are avoiding allocation because the structure must live within the data it organizes.
for everyone else (building a todo app, a particle system, a generic container) it is usually a mistake. you are trading the honest cost of memmove (which CPUs destroy) or malloc (which allocators solve) for the hidden cost of a fragmentation engine you have to maintain yourself.
the structure creates a self-referential universe where data can reorganize itself without ever leaving its home. it is a closed system.
references
the dancing links paper:
- Knuth, Donald E. "Dancing Links." arXiv:cs/0011047, November 2000. Presented at the Oxford-Microsoft Symposium, 1999. The paper that named the technique and showed how it solves exact cover problems (sudoku, N-queens, polyomino tiling).
memory allocators:
- Lea, Doug. "A Memory Allocator." The original documentation for dlmalloc (1987), which became the basis for glibc's ptmalloc2.
- dlmalloc source code — the actual implementation, still maintained.
- Durner, Dominik, Viktor Leis, and Thomas Neumann. "On the Impact of Memory Allocation on High-Performance Query Processing." arXiv, 2019. Benchmarks jemalloc, tcmalloc, mimalloc, and others.
- Leijen, Daan. "Mimalloc: Free List Sharding in Action." Microsoft Research, 2019.
linux kernel linked lists:
- Linked Lists in the Linux Kernel — official kernel documentation on list_head and hlist.
- linux/include/linux/list.h — the actual source. hlist is the hash-optimized variant with a single-pointer head.
game engines:
- Sanglard, Fabien. "Quake 3 Source Code Review: Architecture." Deep dive into Carmack's engine design, including entity management.
- Quake source code release — released under GPL in 1999.
redis eviction:
- "Key Eviction." Redis official docs on LRU/LFU implementation. Redis uses approximate LRU with an eviction pool (a linked list) and sampling.
- The implementation lives in evict.c in the Redis source.
text editor data structures:
- "Piece Table." Wikipedia. Invented at Xerox PARC in 1974 for the Bravo editor by Butler Lampson and Charles Simonyi, based on J. Strother Moore's structure-sharing ideas.
- "Data Structures for Text Sequences." Crowley, 1998. The paper that suggested using trees for piece tables—which VSCode eventually implemented as "piece trees."
- "Gap Buffer." The technique Emacs uses.
further reading:
- Wilson, et al. "Dynamic Storage Allocation: A Survey and Critical Review." International Workshop on Memory Management, 1995. The definitive survey on allocator design.