Notes · Dissecting Real Systems

evergreen

The Grain of the Machine

A processor isn't a featureless calculator — it has a shape: words, cache lines, vector lanes. Code that moves with that grain runs many times faster than code that fights it, on the very same data. Here's the shape, and a measured case where the same arithmetic runs 22× slower against it.


hardware, performance, cpu, cache, simd, dissecting-systems

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

— Richard P. Feynman, "Personal Observations on the Reliability of the Shuttle", Rogers Commission Report, Appendix F (1986)

Cite this
APA
Mangalapilly, Y. J. (2026, June). The Grain of the Machine. Saṃhitā Notes. https://yesudeep.com/blog/the-grain-of-the-machine/
BibTeX
@online{mangalapilly2026the,
          author  = {Yesudeep Jose Mangalapilly},
          title   = {The Grain of the Machine},
          journal = {Sa\d{m}hit\=a Notes},
          year    = {2026},
          month   = {June},
          url     = {https://yesudeep.com/blog/the-grain-of-the-machine/},
          urldate = {2026-07-02},
        }
Plain
Yesudeep Jose Mangalapilly. “The Grain of the Machine.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/the-grain-of-the-machine/.
RIS
TY  - ELEC
        AU  - Mangalapilly, Yesudeep Jose
        TI  - The Grain of the Machine
        T2  - Saṃhitā Notes
        PY  - 2026
        UR  - https://yesudeep.com/blog/the-grain-of-the-machine/
        Y2  - 2026-07-02
        ER  - 

The hardware companion to With the Grain. That piece was about the grain of the interpreter: doing less in Python, more in C. Below the interpreter is the silicon, and it has a grain of its own, sharper and less forgiving. By the end you'll have a working picture of four features of a real CPU (words, the memory hierarchy, cache lines, and vector lanes) and a measured case where the same arithmetic on the same data runs $22\times$ slower for moving against the grain.

In the previous piece I sped up a function and was careful to say the win lived in the layer above the metal — fewer trips through the Python interpreter, the real work handed to C. The data was small; it never left the fastest cache; nothing there was about the hardware. That was the honest scope.

This is the piece that goes down to the hardware. Because the metal has a grain too, and when your code actually reaches it — in C, in Rust, in a tight numeric kernel — that grain is the difference between fast and embarrassing. The penalties are bigger down here, and they hide behind code that looks innocent.

Words: the unit the machine thinks in

Start with the smallest grain, the one the last piece leaned on. A processor holds its working values in registers — a handful of named, word-wide slots on the chip itself, 64 bits on a modern core — and its arithmetic operates on full registers. (To be exact: it can load and add single bytes, at the same per-instruction cost as words — the byte-at-a-time tax in the last piece was paid in instruction count, one loop turn per byte instead of one per eight, not in a per-byte surcharge on each instruction.) Feed the machine words and each instruction moves eight bytes of work; feed it bytes and you execute eight times the instructions for the same bytes. That's the grain at its finest scale, and it's the easiest to cut along — most of the time the compiler does it for you.
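To make the instruction-count point concrete, here is a minimal sketch in C. The function names xor_bytes and xor_words are hypothetical, and the sketch assumes a 64-bit word; an optimizing compiler will often do this widening on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte at a time: one loop turn per byte. */
uint8_t xor_bytes(const uint8_t *p, size_t n) {
    assert(p != NULL || n == 0);
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++) acc ^= p[i];
    return acc;
}

/* Word at a time: one loop turn per eight bytes, then fold the word down. */
uint8_t xor_words(const uint8_t *p, size_t n) {
    assert(p != NULL || n == 0);
    uint64_t acc = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, sizeof w);  /* alignment-safe 8-byte load */
        acc ^= w;
    }
    /* Fold the eight byte lanes of the accumulator into the low byte. */
    acc ^= acc >> 32;
    acc ^= acc >> 16;
    acc ^= acc >> 8;
    uint8_t out = (uint8_t)acc;
    for (; i < n; i++) out ^= p[i];   /* leftover tail bytes */
    return out;
}
```

Both functions return the same checksum; the second simply reaches it in roughly an eighth of the loop turns.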

The interesting grain is coarser, and it's where the surprises live: how the machine fetches those words from memory.

The memory hierarchy

A register holds a word for essentially nothing — it's already on the chip. Main memory is hundreds of times farther away. To paper over that gulf the CPU keeps a ladder of caches, each smaller and faster than the one below it:

The memory hierarchy. Latencies are order-of-magnitude, in CPU cycles (a cycle is well under a nanosecond); the capacities drawn are x86-class — an M-series core has a larger L1 and a big shared L2 instead. A register is essentially free; a trip to RAM costs a few hundred cycles.

The numbers are the whole story. A read from L1 is a rounding error; a read from RAM stalls a load for a few hundred cycles — in which time a wide out-of-order core could have issued hundreds of instructions, even a couple of thousand. (The core will race ahead and overlap other misses where it can — that's memory-level parallelism, and it's the difference between "a few hundred cycles" and a few hundred cycles of dead silence.) So the entire performance game, once you're at the metal, is a single question: *is the data you need already in a near cache, or are you about to pay full freight for it?* And the answer depends almost entirely on how you walk through memory — which brings us to the grain that matters most.
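A rough check on that claim, treating the issue width $w$ (instructions per cycle) and the stall length $c$ (cycles) as illustrative assumptions rather than measurements of any particular core:

```latex
\[
  \text{issue slots forgone} \approx w \times c, \qquad
  4 \times 300 = 1200, \qquad
  8 \times 300 = 2400 .
\]
```

So a core that can issue four to eight instructions per cycle gives up on the order of one to two thousand issue slots during a 300-cycle miss, unless it finds independent work to overlap.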

Cache lines

A cache line is the unit memory moves in: 64 bytes on x86-64, 128 on Apple silicon (sysctl hw.cachelinesize reports it; Lemire measured it). You never fetch a byte; you fetch the line it lives in. Reading the rest of the line afterward is free, because it came along for the ride.

Here is the fact that governs everything: the cache doesn't deal in bytes, it deals in lines. Ask for one byte and the machine fetches the whole 64-byte line around it — and the next 63 bytes are then yours for free, already hot in cache.

Memory is a warehouse, and the forklift only moves whole pallets — never a single box. Read the items off one pallet in order and you pay for one trip. Read one item from each of a hundred pallets and you've made a hundred trips to fetch a hundred items, hauling six thousand boxes you'll never open. Same shopping list; a hundred times the freight.

So sequential access goes with the grain — every fetched line is used to the last byte — and strided or random access cuts straight across it, paying for a full line to use a fraction of it:

The same bytes, two access patterns. Sequential reads use every byte of each line they pay for; strided reads pull a fresh line for each access and discard seven-eighths of it.

Feel it before the measurement. Sixty-four reads of an 8-byte element, with the stride between them yours to drive:

The stride walker. Each cell is one 64-byte cache line; accent marks the 8 bytes a read actually used. At stride ×1 every fetched line is fully spent; at stride ×8 each read drags in a fresh line and throws seven-eighths of it away. The readout does the arithmetic on the addresses, not on the prose.
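The arithmetic behind the walker can be sketched in a few lines of C. The function name lines_touched is hypothetical, and the 64-byte line and 8-byte element are the assumptions stated in the caption, not properties of every machine.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE_BYTES = 64, ELEM_BYTES = 8 };  /* assumed: x86-64 line, 8-byte element */

/* Distinct cache lines touched by `reads` accesses spaced `stride` elements
   apart, starting at a line-aligned address. Addresses only increase, so
   counting changes of line index is enough. */
size_t lines_touched(size_t reads, size_t stride) {
    assert(stride >= 1);
    size_t count = 0;
    size_t last_line = (size_t)-1;
    for (size_t k = 0; k < reads; k++) {
        size_t addr = k * stride * ELEM_BYTES;
        size_t line = addr / LINE_BYTES;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

Under these assumptions lines_touched(64, 1) is 8 and lines_touched(64, 8) is 64: the same 512 useful bytes cost 512 bytes of line traffic in one case and 4,096 in the other.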

This isn't a small effect, and it isn't hypothetical. Take a large 2-D array and sum it two ways — along its rows (sequential in memory) and down its columns (a stride the width of a row, a fresh cache line every step):

/* row-major: with the grain — consecutive addresses */
for (i = 0; i < N; i++) for (j = 0; j < N; j++) sum += a[i][j];

/* column-major: against it — a cache line apart each step */
for (j = 0; j < N; j++) for (i = 0; i < N; i++) sum += a[i][j];

Identical arithmetic, identical data, a one-line difference in loop order. On an $8192 \times 8192$ array of int (256 MB, far larger than any cache), compiled with -O2:

Summing a 256 MB array row-wise vs. column-wise. The only difference is the order memory is walked — the two loops above are the whole experiment.
| Traversal | Apple silicon (M-series) | x86-64 (Ryzen 9) |
| --- | --- | --- |
| row-major (with the grain) | 0.064 s | 0.038 s |
| column-major (against it) | 0.352 s | 0.861 s |
| slowdown | $5.5\times$ | $22.7\times$ |

Nothing changed but the direction of the walk, and the x86 box ran twenty-two times slower. That is the grain of the machine, charging you for the lie that memory is flat.

Be careful, though, with the easy explanation. "You fetch a 64-byte line and use four bytes of it" is real, but on its own it predicts a factor nearer $16\times$ in memory traffic, and it can't explain why the same walk costs only $5.5\times$ on the Apple core. Against the grain, the machine charges you on several meters at once:

  • Line waste — each fetched line yields one 4-byte int before the walk leaves it behind.
  • TLB thrash — a 32 KB stride lands every single access on a fresh virtual-memory page. One column pass touches 8,192 distinct pages, more than any TLB holds, so a page-table walk shadows read after read.
  • Defeated prefetchers — the hardware prefetchers that make the row walk fly won't follow you: Intel's optimization manual is explicit that prefetched lines "must be in the same 4K page," so a stride that changes page every access gets no prefetch at all.
  • Conflict misses — 32 KB is a power of two, so consecutive rows map onto the same cache sets and evict each other long before the cache is full. Drepper: "Laying out data at boundaries that are powers of two happens often enough in the real world, but this is exactly the situation which can easily lead to … degraded performance."
  • And the compiler happily vectorizes the row loop while the column loop stays scalar.
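Three of those meters can be read straight off the addresses. The sketch below is illustrative only: the helper names are hypothetical, the page and line sizes are the x86-class values from the text, and the 64-set L1 (a 32 KB, 8-way cache) is an assumed configuration rather than a measured one.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-class parameters, for illustration only. */
enum {
    N          = 8192,  /* matrix is N x N ints, as in the measurement */
    ELEM_BYTES = 4,     /* sizeof(int) on the measured machines */
    LINE_BYTES = 64,
    PAGE_BYTES = 4096,
    L1_SETS    = 64     /* assumed 32 KB, 8-way L1: 32768 / (64 * 8) */
};

/* Byte distance between a[i][j] and a[i+1][j]. */
size_t column_stride_bytes(void) { return (size_t)N * ELEM_BYTES; }

/* Bytes of line traffic per useful byte on a column walk. */
size_t traffic_factor(void) { return LINE_BYTES / ELEM_BYTES; }

/* Distinct pages touched by one pass down a column. */
size_t pages_per_column(void) {
    assert(column_stride_bytes() >= PAGE_BYTES);  /* else rows would share pages */
    return N;
}

/* L1 set that row i of column 0 maps to. With 64 sets of 64-byte lines the
   index bits lie inside the 4 KB page offset, so virtual addresses suffice. */
size_t l1_set_of_row(size_t i) {
    return (i * column_stride_bytes() / LINE_BYTES) % L1_SETS;
}
```

With these numbers the stride is 32,768 bytes, the traffic factor is 16, one column pass touches 8,192 pages, and every row of a column lands in the same L1 set.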

The gap between the two machines is the same anatomy read in reverse. Apple silicon uses 16 KB pages (the IOMMU won't even do 4 K, per the Asahi notes), so the 256 MB matrix needs a quarter as many page-table entries, and its several-thousand-entry TLB covers roughly 48 MB of address space against a Ryzen's $\sim 10$ MB. The misses still happen, but the walks are shorter and hit cache-resident page tables; and each 128-byte line, once fetched, feeds the next 32 columns instead of 16. Cheaper misses, amortized twice as widely: $5.5\times$ instead of $22.7\times$. The lesson underneath the lesson: "against the grain" is rarely one penalty; it's the same wrong shape billed by the line, the page, the prefetcher, and the cache-set map simultaneously.
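The page arithmetic behind that comparison uses only the figures quoted above. The entry count in the last step is back-derived from the 48 MB reach, so treat it as an inference, not a published specification:

```latex
\[
  \frac{256\ \text{MB}}{4\ \text{KB}} = 65{,}536\ \text{pages},
  \qquad
  \frac{256\ \text{MB}}{16\ \text{KB}} = 16{,}384\ \text{pages},
\]
\[
  \text{TLB reach} = \text{entries} \times \text{page size}
  \;\Longrightarrow\;
  \frac{48\ \text{MB}}{16\ \text{KB}} = 3{,}072\ \text{entries}.
\]
```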

False sharing: the grain bites two cores at once

False sharing: two cores write two different variables that happen to sit in the same cache line. The hardware keeps lines coherent, so each write invalidates the other core's copy — and the line ping-pongs between them. They share no data, only its line; the fix is to pad the variables onto lines of their own.

The line — not the byte — is also the unit of sharing between cores, and that makes a subtler trap. Two threads on two cores, each updating its own counter, no shared variable between them — and yet it crawls, because the two counters landed in the same cache line.

Two clerks share one ledger page, each writing in a different column. They never touch the same number — but there's only one page, so every time one writes, the other has to hand it over and wait to get it back. They aren't sharing data, only the page it sits on, and that page shuttles across the room all day.

The cores aren't sharing anything that means anything; they're sharing the line the hardware moves in. Pad the two variables apart so each owns its own line and the contention vanishes. It's the same grain as before — the 64-byte line — felt through the coherence machinery instead of the cache latency.
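One way to express that padding in C11 is sketched below. The struct names are hypothetical and CACHE_LINE is assumed to be 64; on Apple silicon the 128-byte line would call for a larger value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed; 128 would suit Apple silicon */

/* Packed: the two counters sit 8 bytes apart and will typically share a
   line, so two writers contend for it. */
struct counters_packed {
    uint64_t a;
    uint64_t b;
};

/* Padded: each counter is aligned to a line boundary and so owns its line. */
struct counters_padded {
    _Alignas(CACHE_LINE) uint64_t a;
    _Alignas(CACHE_LINE) uint64_t b;
};

static_assert(offsetof(struct counters_padded, b) == CACHE_LINE,
              "b must start on its own cache line");

/* Line index of a member offset, relative to a line-aligned struct base. */
size_t line_of(size_t offset) { return offset / CACHE_LINE; }
```

The cost is space: the padded struct occupies two full lines to hold sixteen bytes of counters, which is the price of keeping the two writers apart.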

SIMD: when the lanes line up

SIMD — single instruction, multiple data. A vector register holds several values (lanes); one instruction operates on all of them at once. A 256-bit unit is eight 32-bit lanes; AVX-512 is sixteen. The catch is that the data has to be laid out so the lanes can stride through it contiguously.

The last grain runs the other way — not how data comes in, but how much the machine can chew per instruction. A scalar add does one pair of numbers. A vector unit holds a register of lanes and adds them all in a single instruction:

Scalar vs. vector. The same four additions are one SIMD instruction instead of four, provided the four (or eight, or sixteen) operands sit contiguously, so the lanes can stride straight through them.

A normal cashier scans one item per beep. A SIMD cashier has eight scanners on a single trigger — pull it once and eight items go through together. But only if the items are lined up so all eight scanners can reach them at once; hand them over in a jumbled pile and seven scanners sit idle.

That proviso is the grain again. The vector unit wants its operands laid out in a straight, contiguous run — the same sequential layout the cache wanted. Structure your data as parallel arrays and the lanes (and the cache lines) stride through it cleanly; scatter it into tangled objects and both the vector unit and the cache stall, waiting on memory that arrives one useless line at a time.
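The two layouts can be set side by side in a short sketch. The particle type and both function names are hypothetical, invented for illustration; whether a given compiler vectorizes either loop depends on its flags and target, so read this as a picture of the layouts and not as a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures: consecutive x values are 16 bytes apart,
   with y, vx, vy packed between them. */
struct particle { float x, y, vx, vy; };

void step_aos(struct particle *p, size_t n, float dt) {
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        p[i].x += p[i].vx * dt;       /* stride: sizeof(struct particle) */
}

/* Parallel arrays: every x is adjacent to the next x, every vx to the
   next vx, so a vector load fills all its lanes with useful values. */
void step_soa(float *restrict x, const float *restrict vx, size_t n, float dt) {
    assert((x != NULL && vx != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] += vx[i] * dt;           /* stride: sizeof(float) */
}
```

The arithmetic is identical in both. What changes is that the second loop reads and writes contiguous runs, which is the shape both the cache line and the vector lane are built for.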

The grain is fractal

Step back and the four features are one idea wearing four masks. A word is the grain of the register; a cache line is the grain of memory; coherence is that same line seen by many cores; a vector lane is the grain of the arithmetic. Every one of them rewards the same posture — contiguous, sequential, aligned — and punishes the same one: scattered, strided, fighting the shape of the thing.

And it nests. The interpreter has a grain (the last piece); under it the silicon has a grain (this one); under that are page tables and memory controllers and DRAM rows with grains of their own. You don't have to descend every level. You have to know which level your code actually lives at — and cut along the grain of that one.

The machine has a shape. Fast code isn't clever; it's shaped to fit — contiguous, sequential, aligned — at whatever level it runs. The rest is fighting the grain, and the grain always wins.

The cache benchmark is a 30-line C program; the row-major / column-major loops above are the whole of it. Compile with -O2, size the array well past your last-level cache, and you'll land in the same order of magnitude. The exact factor depends on your page size, TLB (the CPU's small cache of page-address lookups, which column order thrashes too), prefetchers (the hardware guessing your next read, a guess column order defeats), and cache associativity. Those differ chip to chip, which is why the two columns of the table differ by $4\times$: same experiment, different machinery around the cache line.
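For readers who want to try it, here is one possible shape for such a harness. It is a sketch under stated assumptions, not the original program: it uses a flat heap allocation indexed as a[i * n + j] instead of a[i][j], times with the standard clock() because the original timing method isn't given, and every function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Sum an n x n matrix stored row-major in one flat allocation. */
long long sum_rows(const int *a, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) sum += a[i * n + j];  /* consecutive */
    return sum;
}

long long sum_cols(const int *a, size_t n) {
    long long sum = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) sum += a[i * n + j];  /* n ints apart */
    return sum;
}

/* CPU seconds for one traversal. The sum goes out through *out so the
   compiler cannot discard the loop as dead code. */
double time_traversal(long long (*walk)(const int *, size_t),
                      const int *a, size_t n, long long *out) {
    assert(a != NULL && out != NULL);
    clock_t t0 = clock();
    *out = walk(a, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Allocate and fill the matrix so every page is touched before timing.
   Returns NULL if the allocation fails. */
int *make_matrix(size_t n) {
    int *a = malloc(n * n * sizeof *a);
    if (a != NULL)
        for (size_t k = 0; k < n * n; k++) a[k] = (int)(k & 0xff);
    return a;
}
```

Calling make_matrix(8192), then time_traversal once with sum_rows and once with sum_cols, reproduces the shape of the experiment. The two sums must agree; only the times should differ.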

Lessons

  • A CPU is not a featureless calculator; it has physical structure: word registers, memory hierarchies, cache lines, and vector lanes.
  • Cut along the grain of the metal: design structures for contiguous layout, sequential access, and proper alignment.
  • The memory system moves data in cache lines (64 bytes on x86-64, 128 on Apple silicon). Striding column-wise through a large array paid $5.5\times$ on an M-series and $22.7\times$ on a Ryzen in the measurement above, and the bill is itemized across line waste, TLB misses, dead prefetchers, and cache-set conflicts, not the line alone.
  • Vector lanes (SIMD) execute arithmetic in parallel across registers — and the compiler auto-vectorizes exactly the loops that already run with the grain: contiguous, sequential, no cross-iteration dependence.
  • False sharing occurs when independent threads write to distinct variables residing on the same cache line, triggering expensive coherency bounces.

References

  1. “CPU Cache.” How the cache hierarchy hides DRAM latency.
  2. “SIMD.” Single Instruction, Multiple Data, and vectorization.
  3. Ulrich Drepper. “What Every Programmer Should Know About Memory.” LWN. The classic text; part 5 covers the power-of-two conflict-miss pathology.
  4. “Intel 64 and IA-32 Architectures Optimization Reference Manual.” Intel. Source for the prefetchers' same-4K-page rule.
  5. “With the Grain.” The predecessor to this post, focusing on interpreter overhead.
