multiplying two numbers on a modern chip costs about four picojoules — a picojoule is a trillionth of a joule. fetching one of those numbers from main memory costs 640. all of the chip's arithmetic lives inside that imbalance: the math is nearly free, and the bill is the fetch.
this essay watches the imbalance produce a strange machine. the job, small enough that every number fits on screen: multiply two 3×3 grids, and . nine answer cells, three multiplies each — 9 × 3 = 27 multiplies. 27 × 4 ≈ 100 pJ. one fetch from main memory: 640 pJ. ('s numbers are drawn green throughout; 's, blue.)
the geography (where a number lives)
main memory — DRAM, the RAM in every laptop — is a separate piece of silicon. a number stored there sits in a row of thousands of bits, millimetres of copper away from the arithmetic. on the processor die itself there is a smaller, faster memory (SRAM — caches are made of it), and inside each math unit a register, where a value rests within reach of the multiplier.
a DRAM read is an event with consequences. to deliver one value, the chip wakes a row of 8–16 kilobytes; sensing the bits drains the charges that store them, so the entire row must be written back; the surviving bits then drive off-die, across the copper. 640 pJ, most of it distance.
a 32-bit multiply costs ~4 pJ. 640 ÷ 4 = 160: one fetch, 160 multiplies. every machine that does arithmetic for a living is some answer to that ratio.
the obvious machine (a GPU, more or less)
start where software starts: cores and a memory. nine cores, one bank, no wires between the cores — a real GPU reduced to its essential arrangement. each core computes one answer cell, and since no core can hand anything to a neighbour, each fetches its own copies of the six numbers it needs — numbers it shares, every one of them, with two other cores.
one clock drives every panel from here down, including the running bill at the bottom. the GPU counts its turns in rounds; the machine that answers it counts in ticks — the same beat either way.
the top-left 2 of is needed by 3 different cores, so it leaves the bank 3 times. so does every other number. 18 numbers × 3 fetches = 54 reads, finished in 3 rounds. the design is fast. it is also 54 payments at the ladder's top rung for 18 distinct values.
(real GPUs stage re-reads in SRAM — a technique called tiling; the ladder at the bottom prices that escape route. the 54 full-price reads are the untiled worst case, the bare shape of the problem before software fights it.)
the ceiling (one fetch, many uses)
the GPU's three rounds hide a harder limit than the energy bill. every number it multiplies arrives through the same memory pipe, and a pipe delivers only so many words a second. each multiply eats two words, so with every word used once, a pipe that delivers a billion words a second feeds at most half a billion multiplies — no matter how many cores wait on the other side. adding cores to a starved machine adds waiting.
count what each machine extracts per fetch. the GPU: 27 multiplies from 54 fetches — 27 ÷ 54 = 0.5 multiplies per word. read each word once instead, and the same 27 multiplies need 18 fetches — 27 ÷ 18 = 1.5 per word. the same pipe now feeds 3× the arithmetic. the escape from the ceiling is not a faster pipe; it is using each word more times between trips. the energy ratio and the throughput ratio are the same number, because they are the same fact.
reuse, though, needs a verb software does not have: one core handing a number to another without going back through memory.
porting a number (a verb software doesn't have)
in software, two cores share data exactly one way: through the memory system. core a stores; core b loads. there is no instruction that says give my register to core b — registers are private by construction (a GPU warp can shuffle registers among its own 32 lanes, a wired exception that proves how much the verb is worth; beyond that island, the only public place on the machine is memory). every hand-off is a round trip toward the top of the ladder, paid in the same coin as any other fetch.
the array's cells are built differently. the output register of each cell is wired — physically, ~50 µm of metal — to the input register of its neighbour. on every clock edge, every cell latches whatever sits on its neighbour's output. that is the entire mechanism: no address, no instruction, no arbitration. while it travels, the moving copy exists only as charge on a wire. a hop costs ~2 pJ because that is what 50 µm of wire and one latch cost.
the price of the magic is rigidity. the wires are drawn at fabrication, so data can flow only where the wires point — a machine like this multiplies matrices and does nothing else. general-purpose machines keep memory as the meeting place precisely because an address can name anyone; a wire names exactly one neighbour, forever.
the answer (a systolic array)
wire nine multiply-accumulate cells — MACs, the nine boxes on the geography map — into a 3×3 grid, each passing to the next. 's rows enter from the left and march right, one cell per tick. 's columns fall from the top. each cell multiplies whatever pair lands on it and adds the product to a running sum; cells ahead of the wave sit idle — their wires carry nothing until the queues reach them, and an empty tick adds zero.
the queues are staggered: row waits ticks before it starts, column waits . the stagger is the entire design.
the marked 2 leaves memory once — its cell in the bank flashes, then fades — and reaches all 3 cells that need it by hand-off. the bank serves 18 reads, one per number; after the last (t=4) the machine spends nothing but hops, 2 pJ each.
why the stagger works (the two 2s, replayed)
the diagonal wave has the look of luck. it is bookkeeping.
the marked 2's last meeting is with a blue 2 — the top-right value of — in cell (0,2). green has the longer trip, two cells right of where it enters against blue's none, and the stagger sends it off earlier: it departs at t=0 and arrives at 0+2 = 2. blue falls into the same cell from directly above, no walk at all; its queue holds it two ticks and releases it at t=2. arrival, 2+0 = 2.
0+2 = 2+0 — the same sum in the opposite order. counting each queue's numbers from the front: green number of row departs at and walks cells; blue number of column departs at and falls cells. both clocks read . each value departs exactly as many ticks before the meeting as it has cells to cross. no pair in the grid can miss.
the uneven case behaves the same way. cell (2,1)'s first pair: green departs at 2 and walks 1; blue departs at 1 and falls 2. 2+1 = 1+2 = 3.
and the wave is this arithmetic seen whole. cell fires its -th multiply at , so cells with equal fire together, and equal is an anti-diagonal. the stripe is the formula, drawn.
the receipt (the energy ladder)
the bill is the fetch, not the math. the array pays the top rung 18 times, once per operand, then lives on the hop rung: 18 × 640 + 36 × 2 = 11,592 pJ ≈ 11.6 nJ. the GPU: 54 × 640 = 34,560 pJ ≈ 34.6 nJ. identical arithmetic, a third of the bill.
the toll is time. the wave takes ticks to drain, where is the grid size: 3 × 3 − 2 = 7 here, against the GPU's 3 rounds. energy, bought with latency.
the heartbeat
in 1978, h. t. kung and charles leiserson drew this machine — a grid of cells through which data is pumped on a shared beat — and named it for the contraction phase of the heartbeat: systolic. memory is the heart, the numbers are the blood, and the array keeps them circulating so that the heart pumps each one exactly once. kung's argument, made before the energy wall had a name, is the one §1.3 counted out: a machine's arithmetic is capped by its memory pipe times its reuse, so wire the reuse in.
google's TPU carries a systolic array at its center, 256 cells on a side. at that size the wave takes 3 × 256 − 2 = 766 ticks to fill and drain, which is why the machine is fed in long uninterrupted streams — the heart is not allowed to stop between beats — and why the trade keeps being made: the ratio only widens with size — at 256 cells across, each fetched word feeds 128 multiplies, not one and a half — and energy is the thing datacenters run out of.
tl;dr (for the adhd crew)
- a 32-bit multiply costs ~4 pJ; a DRAM fetch of its operand, 640. 640 ÷ 4 = 160 — the bill is the fetch, not the math.
- DRAM is expensive because it is far — millimetres off-chip — and each read wakes an 8–16 KB row to deliver one value.
- nine cores with no wires between them fetch each number once per core that needs it: 18 × 3 = 54 reads, 3 rounds.
- the memory pipe caps compute: multiplies per second = words per second × multiplies per word. GPU, 27 ÷ 54 = 0.5 multiplies per word; read-once, 27 ÷ 18 = 1.5 — 3× the ceiling from the same pipe.
- software moves a value between cores only through memory — outside a warp's own 32 lanes, no instruction gives a register to a neighbour. the array wires output register to input register: a hop is one latch on one clock edge, ~2 pJ, and the wiring is fixed at fabrication.
- a systolic array reads each number once and passes it cell to cell: 18 reads + 36 hops, 3N−2 = 7 ticks.
- the stagger is bookkeeping, not luck: green number of row departs at and walks ; blue departs at and falls . both clocks read . no pair can miss.
- cells with equal fire together — the diagonal wave is the formula, drawn.
- identical arithmetic at a third of the energy (11.6 nJ vs 34.6), paid for in latency — the trade the TPU makes at 256 × 256, where the ratio is no longer 3× but hundreds.
references
- mark horowitz, "computing's energy problem (and what we can do about it)", ISSCC 2014 — the per-operation energy prices on the ladder.
- h. t. kung, "why systolic architectures?", IEEE Computer, 1982 — the balance argument of §1.3: computation is capped by I/O, so make multiple use of every word fetched.
- norman p. jouppi et al., "in-datacenter performance analysis of a tensor processing unit", ISCA 2017 — the TPU's 256×256 systolic matrix unit, this essay's grid at production scale.