Notes · Dissecting Real Systems

evergreen

With the Grain

The original turned a big integer into bytes one byte at a time, fighting the machine. The rewrite went with its grain — whole machine words, packed in C. That structural choice is why it's still ~15× faster fifteen years on, even as CPython sped up underneath it.

Yesudeep Jose Mangalapilly

Published 7 days ago · Updated 2 days ago · 9 min read

python, performance, cpython, optimization, dissecting-systems

…wood is allowed to be wood. It is the most humanly intimate of all materials. Man loves his association with it, likes to feel it under his hand, sympathetic to his touch and to his eye.

— Frank Lloyd Wright, "In the Cause of Architecture, IV: The Meaning of Materials—Wood", Architectural Record (1928)

Cite this

APA

Mangalapilly, Y. J. (2026, June). With the Grain. Saṃhitā Notes. https://yesudeep.com/blog/with-the-grain/

BibTeX

@online{mangalapilly2026with,
          author  = {Yesudeep Jose Mangalapilly},
          title   = {With the Grain},
          journal = {Sa\d{m}hit\=a Notes},
          year    = {2026},
          month   = {June},
          url     = {https://yesudeep.com/blog/with-the-grain/},
          urldate = {2026-07-01},
        }

Plain

Yesudeep Jose Mangalapilly. “With the Grain.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/with-the-grain/.

RIS

TY  - ELEC
        AU  - Mangalapilly, Yesudeep Jose
        TI  - With the Grain
        T2  - Saṃhitā Notes
        PY  - 2026
        UR  - https://yesudeep.com/blog/with-the-grain/
        Y2  - 2026-07-01
        ER  -

A principle, demonstrated on fifteen-year-old code. A cabinetmaker cuts with the grain — fighting it splinters the wood and dulls the blade. Code has a grain too: the shape the machine and its runtime want data to move in. The optimizations that last are the ones cut along it. Here is that idea on a real function I rewrote in python-rsa in 2011, measured again on two machines today.

A machine has a grain. Its arithmetic works in words — 32 or 64 bits at a stroke — and its fastest primitives are the ones that handle a word at a time. Go along that grain and the work is easy. Cut across it — a byte here, a byte there, in a hand-run loop — and you splinter: more passes, more waste, a duller edge. Most of the fast code I've written, I eventually realized, wasn't ingenuity. It was just me finding the grain and stopping fighting it.

Run a hand plane along a board's grain and the blade peels off one long, clean ribbon — the wood guides the cut. Run it across the grain and the same blade catches, tears, and leaves you sanding out the splinters. The slow function planed across the grain: a byte at a time, fighting the machine for every chip. The fast one planed along it: whole machine words, in long clean strokes the hardware was built to make.

The original: int2bytes in python-rsa

In 2011 I spent a few weeks working on optimizing the core cryptographic functions inside python-rsa, a small pure-Python RSA library. RSA spends its life converting between big integers and byte strings, and the function that did the converting — int2bytes — was a bottleneck. I rewrote it. Sybren Stüvel, the library's maintainer, put it more warmly than I would have, in the project changelog:

Big, big credits to Yesudeep Mangalapilly for all the changes listed below!

Added ability to generate keys on multiple cores simultaneously.

Massive speedup

Partial Python 3.2 compatibility (core functionality works, but saving or loading keys doesn't, for that the pyasn1 package needs to be ported to Python 3 first)

Lots of bug fixes

— Sybren Stüvel, python-rsa CHANGELOG

He called the result a "massive speedup." My own note at the time put it near $40×$.

Across the grain

Here is the original — preserved in the tree as _int2bytes at the version-3.4 tag. It walks the number one byte at a time, prepending each to a list:

# _int2bytes — against the grain (rsa/transform.py @ version-3.4)
raw_bytes = []
while number > 0:
    raw_bytes.insert(0, byte(number & 0xFF))   # one byte; insert at the FRONT
    number >>= 8
return padding + b('').join(raw_bytes)

Everything here cuts across the grain. A 4096-bit key takes 512 turns of the loop. Each list.insert(0, …) shifts the entire list to make room, so the whole thing is quietly $O(n^2)$ . Each turn boxes a fresh one-byte object. And the size arithmetic that fed it leaned on floating-point division — a continuous approximation doing an exact, discrete job. None of it is wrong; it's just sawing perpendicular to the way the wood wants to part.

$O(n^2)$ ("quadratic") — cost that grows with the square of the input: double the bytes, roughly quadruple the work. Here it's the shifting — $n$ inserts, each pushing up to $n$ earlier bytes aside, and $n \times n = n^2$ . The notation is Big-O. Learn more.

Count the shifting yourself — it is the arithmetic the notation compresses:

The shift tax, counted. Step the original loop: each insert(0, …) pushes every byte already in the list one slot right before the new byte lands in front — the pushed cells dim. The counter is computed from the steps, not asserted: finish a run of n and the shifts total n(n − 1)/2; double n with the slider and the total roughly quadruples. The append row below does the same job with nothing moving.

With it

A machine word is the unit a processor — and struct.pack — handle in one go: not a single byte, but four or eight at once, depending on the platform's widest lane.

A 32-bit machine word is four bytes handled together. struct.pack writes them in a single C step; the old loop touched each byte in turn, from Python.

So the rewrite takes the number a word at a time — asking the platform for its widest packing lane — and hands the cut to C. The shipped code, from the same rsa/transform.py:

# int2bytes — along the grain (rsa/transform.py @ version-3.4)
num = number
word_bits, _, max_uint, pack_type = get_word_alignment(num)
pack_format = ">%s" % pack_type          # '>Q', '>I', '>H', or '>B'
while num > 0:
    raw_bytes = pack(pack_format, num & max_uint) + raw_bytes
    num >>= word_bits

A word per turn instead of a byte means a quarter (on 64-bit lanes, an eighth) of the passes, and struct.pack does the packing in C rather than in a Python loop boxing bytes by hand. One honest concession before admiring it: the rewrite keeps a quadratic of its own — each pack(…) + raw_bytes re-copies the accumulated byte string, so both functions are $O(n^2)$ in principle. The win is the constant factor: an eighth as many interpreter turns, each doing its byte work in C. At key sizes the constant factor is the whole game; the complexity-class fix comes later in this story. And none of this is the hardware moving memory differently — the win lives entirely in the layer above: a quarter as many trips through the Python interpreter, with the byte-work itself handed to C. I also re-implemented the byte_size and bit_size helpers with divmod, taking floating-point out of the hot path entirely — discrete math for a discrete problem. Same output, far less resistance: the new function and the old one still produce byte-for-byte identical results, which the benchmark checks before it times anything.

This is interpreter territory, not the metal. A pure-Python loop's cost is bytecode dispatch and allocation, not memory latency — and a few-hundred-byte number lives entirely in L1 cache anyway, so cache lines never come into it. The hardware grain — cache lines, SIMD, false sharing — is a different story, about code that reaches down to the silicon.

The same number, cut two ways. The original makes a pass per byte; the rewrite makes a pass per 32-bit machine word — a quarter as many cuts, each handed to C. Fewer, wider strokes, along the grain.

What it measures today

Both functions live side by side at the version-3.4 tag — int2bytes and its byte-at-a-time predecessor _int2bytes, kept precisely so the two could be raced — and that tag ships the harness, speed.sh. (On today's main you'll find neither: the story has a better ending, below.) I checked out the tag and ran the race on an Apple-silicon laptop and an x86-64 box:

**`int2bytes` vs.** `_int2bytes` at the `version-3.4` tag, 20,000 iterations each, output verified identical. Speedup is old ÷ new. Reproducible from that tag's own `speed.sh`.
Integer size	Apple silicon (Python 3.14)	x86-64 / Ryzen (Python 3.13)
1024-bit	$13.1×$	$12.4×$
2048-bit	$15.7×$	$17.4×$
4096-bit	$19.0×$	$20.2×$

Still worth its keep — $\sim 15\times$ at a standard 2048-bit key, $\sim 20\times$ for larger numbers, on both architectures. But not the $\sim 40\times$ I remember.

The grain holds; the margin doesn't

The honest caveat first: I haven't re-run the 2011 measurement. It was on Python 2.5–2.7 on hardware long gone — CPython 2.7 still compiles if you're determined, but the original machine and interpreter build don't exist to race. So I can't prove it was $40\times$ . But the direction has a mechanism, and the mechanism is the whole point of this note.

A substrate has a grain too, and unlike a plank's, it changes. Look at where each function spends its time. The rewrite is dominated by struct.pack — a C routine whose cost barely moves between interpreter versions. The original is dominated by the interpreter itself: a long Python loop, a method call and an allocation every turn, an $O(n^2)$ insert. That is exactly the code modern CPython got dramatically better at running — the 3.11 "Faster CPython" work, the specializing interpreter, cheaper integers and bytes. So over fifteen years the two improved unequally: the C-bound rewrite stood still while the interpreter-bound original sped up. The ratio between them is the speedup, and it narrowed — not because my code got slower, but because the thing it was beating got faster. The interpreter's own grain had sharpened. (The comparison is confounded, and honesty says so: Python 2 → 3 rebuilt int and bytes wholesale, and my two 2026 machines run different interpreter versions. The mechanism, though, is not speculative: interpreter work speeds up interpreter-bound code and leaves C-bound code alone.)

And yet $\sim 15\times$ survived. That is the part worth keeping. A win that comes from working with the grain — fewer passes, word-aligned, the cut handed to C, no floating point in an integer routine — is structural. It doesn't depend on any one interpreter being slow; it depends on the shape of the machine, which doesn't change. Had the original speedup come from incidental cleverness — micro-tuning the Python loop, exploiting some quirk of CPython 2.6 — it would have evaporated the moment the quirk did. Instead it compressed and held. The margin is the runtime's to give and take; the alignment is mine to keep.

The grain won so hard it became a builtin

The story has a better ending than "my function survived." It didn't survive — it was retired, by the strongest possible vindication. Python 3.2 (2011, the same year as the rewrite) added int.to_bytes: one C call that converts directly from CPython's internal big-integer digits to bytes, no interpreter loop at all, and $O(n)$ — the complexity-class fix the word-packing loop never made. Upstream python-rsa eventually deleted both racers; int2bytes on today's main is a thin wrapper around number.to_bytes(…). The grain-following idea — do the byte-work word-wise, in C, below the interpreter — won so completely that it moved into the language, one layer deeper than I could reach in 2011. That's the trajectory of a with-the-grain optimization: it doesn't rot; it gets absorbed.

That is the rule I've trusted ever since, in code and out of it: work with the grain, and you fight less, splinter less, waste less.

Find the grain — of the machine, of the language, of the problem — and cut along it. When the ground shifts, as it always does, the cuts you made with the grain are the ones still holding.

The credit lives in python-rsa's CHANGELOG; both racers live at the version-3.4 tag with its speed.sh. Check out the tag and run the numbers — that's the whole point.

Lessons

In an interpreted language the grain is the C boundary: make fewer interpreter turns and hand the byte-work to code that's already compiled.
Structural optimizations — fewer, wider passes along the machine's word size — compress but survive runtime upgrades; quirk-targeting micro-optimizations evaporate when the runtime matures.
Say the honest part: the rewrite was a constant-factor win, not a complexity-class one — both loops were $O(n^2)$ . The $O(n)$ fix arrived as int.to_bytes, one layer deeper.
The best with-the-grain optimizations don't rot; they get absorbed into the platform. Upstream deleted the racers and now wraps the builtin.

References

“python-rsa.” — the library; both racers preserved at the version-3.4 tag
“int.to_bytes.” Python docs. — the builtin that retired the whole contest (added 3.2)
“Python struct module.” — documentation for C-based struct packing and unpacking
“Faster CPython Project.” — overview of the specialization and performance improvements in Python 3.11+
“The Grain of the Machine.” — the hardware-level companion to this post

How to cite

APA

Mangalapilly, Y. J. (2026, June). With the Grain. Saṃhitā Notes. https://yesudeep.com/blog/with-the-grain/