Notes · Dissecting Real Systems

growing

A Little Uncertainty Buys a Lot of Space

Bloom filters trade a small chance of being wrong for an enormous saving in memory — a bargain storage engines take and build systems, so far, refuse.

Yesudeep Jose Mangalapilly

Published 7 days ago · Updated yesterday · 12 min read

probabilistic-data-structures, bloom-filters, build-systems, content-addressing, dissecting-systems

Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.

— Burton H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," CACM 13(7), 1970

Cite this

APA

Mangalapilly, Y. J. (2026, June). A Little Uncertainty Buys a Lot of Space. Saṃhitā Notes. https://yesudeep.com/blog/a-little-uncertainty-buys-a-lot-of-space/

BibTeX

@online{mangalapilly2026a,
          author  = {Yesudeep Jose Mangalapilly},
          title   = {A Little Uncertainty Buys a Lot of Space},
          journal = {Sa\d{m}hit\=a Notes},
          year    = {2026},
          month   = {June},
          url     = {https://yesudeep.com/blog/a-little-uncertainty-buys-a-lot-of-space/},
          urldate = {2026-07-02},
        }

Plain

Yesudeep Jose Mangalapilly. “A Little Uncertainty Buys a Lot of Space.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/a-little-uncertainty-buys-a-lot-of-space/.

RIS

TY  - ELEC
        AU  - Mangalapilly, Yesudeep Jose
        TI  - A Little Uncertainty Buys a Lot of Space
        T2  - Saṃhitā Notes
        PY  - 2026
        UR  - https://yesudeep.com/blog/a-little-uncertainty-buys-a-lot-of-space/
        Y2  - 2026-07-02
        ER  -

A forward-looking coda to the build-systems series. Everything before this insisted on exact answers: a content hash is an identity, a Merkle root names a tree, FindMissingBlobs tells you precisely what's absent. This piece asks what happens if you give up exactness on purpose. By the end you'll understand how a Bloom filter works, why storage engines lean on it everywhere, why build systems deliberately don't, and where — if anywhere — a little uncertainty might still pay.

The whole series has been a story about exact identity. A file's content hash is its identity; a Merkle root names an entire tree to the byte; a remote cache answers "do you have this?" with a precise digest lookup. Exactness is the source of every guarantee we've relied on.

So it's worth meeting the opposite idea squarely, because it's one of the most beautiful results in computer science — and because knowing exactly why build systems decline it tells you something about what they guarantee. In 1970, Burton H. Bloom published a five-page paper with a startling proposition: if you can tolerate being occasionally, one-sidedly wrong about set membership, you can answer "is X in this set?" using a few bits per element — no matter how large the elements themselves are.

A Bloom filter answers "definitely not" exactly, and "probably yes" approximately — and that asymmetry is the whole trick.

How a Bloom filter works

Take a bit array of $m$ bits, all zero, and $k$ independent hash functions. To insert an element, hash it $k$ ways to get $k$ positions, and set those bits to 1. To query an element, hash it the same $k$ ways and look at those $k$ bits. If any of them is 0, the element was certainly never inserted — setting it would have set that bit. If all $k$ are 1, the element is probably present: but those bits might have been set by other elements, so you might be wrong.

Hash function — here, a function mapping an element to a position in $[0, m)$ , ideally spreading inputs uniformly. The $k$ functions are independent so a single element lights up $k$ scattered bits. Learn more.

Inserting "lib.o" sets three bits. A query for "foo.o" probes its own three positions: any 0 means definitely-absent; all-1s means probably-present. There are no false negatives, only false positives.

This is the asymmetry the whole structure rests on, and the canonical statement is worth quoting exactly:

False positive matches are possible, but false negatives are not — in other words, a query returns either "possibly in set" or "definitely not in set."

Think of a long row of light switches, all off. To remember a word, you flip on three switches chosen by spelling it through three fixed rules. To check a word, you look at its three switches: if even one is off, you never saw that word. If all three are on, you probably did — though some other words might have flipped those same three by coincidence. You can be falsely sure, but never falsely doubtful.

Here is that row of switches, live — 32 bits, 3 probes per key. Insert the build keys one at a time, then run the queries.

The filter, live. Each insert lights its three probe bits; each query checks its own three. One of the query words was never inserted, yet finds all three of its bits already lit by other keys — see if you can spot the false positive before the readout names it.

The bargain, quantified

The reason the structure matters is the size of the trade. The false-positive probability after inserting $n$ elements into $m$ bits with $k$ hashes is

$\varepsilon = \left(1 - e^{-kn/m}\right)^{k}, \qquad \text{minimized when} \quad k = \frac{m}{n}\ln 2.$

Read the formula as a tug-of-war over the row of switches. Every key you insert ( $n$ ) turns a few more on, so a stranger's switches are more likely to be all-lit by coincidence — the part in parentheses is just "how crowded the board is." Demanding more switches per key ( $k$ , the exponent) means a false alarm needs more coincidences at once, which helps — but every key now crowds the board faster, which hurts. The sweet spot $k^{*}$ is where the two pulls balance, and it lands exactly where the board stays about half lit.

Work the algebra at the optimal $k$ and a remarkable number falls out: the space cost is about $1.44 \log_2(1/\varepsilon)$ bits per element, where the $1.44$ is exactly $\log_2 e = 1/\ln 2$ . For a 1% false-positive rate that is roughly 9.6 bits per element — the reference rounds it to "fewer than 10 bits per element." Not 10 bits per byte of the element; ten bits per element, whether the element is a short key or a megabyte blob. A filter for ten million items at 1% error fits in about twelve megabytes, and never grows with the size of the items it screens.

The $1.44$ constant is $1/\ln 2 = \log_2 e$ . The bits-per-element figure depends only on the target error rate $\varepsilon$ , not on $n$ — which is why a Bloom filter's footprint is set by how wrong you'll tolerate being, not by how big the elements are. And $\log_2(1/\varepsilon)$ has a plain reading: each ten-fold cut in the error rate — 1% to 0.1% to 0.01% — costs the same flat surcharge, about $4.8$ bits per element, whether it's your first factor of ten or your fourth. A logarithm is a price that's flat per factor.

The formula rewards driving. Push $n$ up with $m$ fixed and watch the filter saturate; buy more bits and watch it recover; slide $k$ and watch the optimum emerge exactly where $\frac{m}{n}\ln 2$ says it must:

The trade, drivable. The curve is ε(k) at the current m/n; the dot is your k, the dashed line the optimum k^* = (m/n)ln 2. The readout computes the displayed formula live — bits per item, and how many queries buy one false "probably."

That is the bargain: a fixed, tiny budget of memory, in exchange for a controllable sliver of uncertainty. The structure has well-known limits — a standard Bloom filter cannot delete (you can't clear a bit without risking other elements that share it), which is the counting Bloom filter's reason to exist — and a family of sharper relatives has grown up around it: the cuckoo filter (Fan, Andersen, Kaminsky, Mitzenmacher, 2014), which supports deletion and is often smaller; the XOR filter (Graf & Lemire, 2020); and the ribbon filter (Dillinger & Walzer, 2021), now shipping in RocksDB. The idea is alive and improving.

Where this bargain is taken every day

Storage engines take Bloom's deal constantly, and always for the same reason: to avoid a disk read that would almost certainly come back empty. A log-structured merge tree — the shape under most modern key-value stores — keeps data in many on-disk sorted files. To find a key you might have to check each one. A Bloom filter per file lets you skip the ones that definitely don't have it.

Google's Bigtable paper (OSDI 2006) put it plainly:

A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. … Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.

The same move appears in Apache Cassandra, in RocksDB and LevelDB (a Bloom filter per SSTable), and historically in Chrome's Safe Browsing, which previously screened URLs against a local Bloom filter before any network check (it has since moved to an exact prefix set). Everywhere the pattern is identical: a false positive costs only a wasted lookup you were about to do anyway, while the filter saves you from the overwhelming majority of pointless reads. When the penalty for being wrong is small and the saved work is large, uncertainty is a bargain.

Why build systems refuse it

Now the turn. With Bloom filters proven and ubiquitous next door in storage, you might expect build systems to use them too — to test "is this artifact already cached?" cheaply. They essentially don't, and the reason is exact and worth stating, because it's not an oversight.

A storage engine's false positive costs a wasted disk read. A build cache's false positive would cost you the wrong binary.

Recall what a build cache's membership test means. From the remote-execution piece: the cache is keyed by content digests, and FindMissingBlobs is an exact digest query — it tells you precisely which inputs the store lacks so you upload only those. If that test were probabilistic and returned a false "yes, present," the build would skip uploading an input the server doesn't actually have, or — worse — treat a different artifact as a cache hit. A storage engine recovers from a false positive by reading the file and finding the key absent. A build cache built on content identity has no such recovery step: the whole point of "the hash is the identity" is that a digest match is taken as proof of sameness. Make that match probabilistic and you've reintroduced exactly the silent-corruption risk the hermeticity discipline existed to banish.

Two membership tests. Exact (left): a content-digest lookup, large keys, an authoritative yes/no — what build caches use. Probabilistic (right): a tiny fixed filter where "no" is exact but "yes" is only probable. Build systems need the left because a wrong "yes" means a wrong artifact.

So build systems use the exact cousins of the Bloom filter's world: content hashes, Merkle trees, and CAS digest lookups. The hash is doing a similar job — a compact fingerprint standing in for a large object — but it is a fingerprint you can trust as identity, because a digest match means byte-identical content (collisions aside), whereas a Bloom "yes" means only "maybe." The build world chose the structure whose "yes" is as trustworthy as its "no."

The Remote Execution API lets the server pick the digest function — it lists MD5, SHA-1, SHA-256, SHA-512 and others — so "build caches use SHA-256" is the common default, not a protocol mandate. The point is that whatever the function, the match is treated as exact identity.

The state of the art: making exactness cheaper and finer

If probabilistic membership is off the table for correctness, where is the research frontier actually moving? Toward making exact incremental builds cheaper, finer-grained, and more clearly understood — not toward tolerating error.

The cleanest map of the territory is "Build Systems à la Carte" (Mokhov, Mitchell, Peyton Jones, ICFP 2018), which shows that essentially every build system factors into two independent choices — a scheduler (what order to build in) and a rebuilder (how to decide something needs rebuilding):

…the part of the build system responsible for scheduling tasks in the dependency order (a "scheduler") can be cleanly separated from the part responsible for deciding whether a key needs to be rebuilt (a "rebuilder"). … we can combine any scheduler with any rebuilder, and obtain a correct build system.

From that frame, Make, Shake, Bazel, Buck, Nix, and even Excel are points in one design space — and the paper's "constructive trace" rebuilders, which memoize a key's result by the hash of its inputs and share those traces in a cloud cache, are precisely the exact-hash caching the series has been describing. Nix calls its version a build trace; Unison pushes the same content-addressing down to individual functions.

Two further directions are worth a forward glance, both still about exactness, just earned differently:

Dependencies discovered by tracing, not declared. Rattle (Spall, Mitchell, Tobin-Hochstadt, OOPSLA 2020) runs a build script and observes which files each command actually reads and writes, hashing them to skip unchanged work and speculating to recover parallelism — "perfect dependencies" without a human writing them down.
Finer-than-file incrementality. The self-adjusting computation lineage (Acar, 2005) and Adapton (Hammer et al., PLDI 2014) make individual computations incremental at sub-file granularity — the research basis for build graphs that rebuild a function, not a file, the floor the utils piece noted most toolchains can't yet reach.

Where a little uncertainty might still fit

Let me be honest about the speculative part, and label it as such. I found no production build system that uses Bloom-style filters, and the correctness argument above says the answer a build returns must stay exact. But there is a narrow seam where probabilistic structures might help without touching correctness: not in deciding what an artifact is, but in cheaply pruning work before an exact check.

A remote-execution client about to call FindMissingBlobs for thousands of inputs could, in principle, keep a Bloom filter of digests it has recently uploaded, to skip asking about blobs it almost certainly already sent — falling back to the exact RPC on a "maybe," so a false positive costs only a redundant exact check, never a wrong result. That is the storage-engine pattern exactly: use the filter to avoid work, keep the authoritative test for the answer. Whether the saved round-trips justify the machinery is an open, measurable question — and I'm flagging it as a proposal, not something the tools do today.

Probabilistic structures belong wherever a wrong "maybe" only costs a second look — and nowhere that a wrong "yes" costs a wrong answer.

That line is the whole lesson of the series, stated from the other side. We spent five pieces showing how build systems earn trustworthy identity — content hashes, Merkle trees, hermetic actions, shared CAS — precisely so that a match can be believed. Bloom's bargain is the counterexample that proves the rule: a glorious trade when being wrong is cheap, and exactly the trade you must refuse when being wrong is silent.

Lessons

A Bloom filter tests set membership in tiny fixed memory, with no false negatives and a tunable rate of false positives: "definitely not" is exact, "probably yes" is approximate.
The cost is about $1.44 \log_2(1/\varepsilon)$ bits per element — $\approx 9.6$ (under 10) bits each at 1% error — independent of how large the elements are.
Storage engines (Bigtable, Cassandra, RocksDB, LevelDB) use them everywhere, because a false positive costs only a disk read they'd have done anyway.
Build systems refuse them, because their membership test means identity: a false "present" would yield the wrong artifact, with no recovery step. They use exact content hashes, Merkle trees, and CAS lookups instead.
The frontier (Build Systems à la Carte, Rattle, self-adjusting computation) is about making exact builds cheaper and finer-grained — not about tolerating error. Probabilistic structures fit only where a wrong "maybe" triggers a harmless second look.

References

Bloom. “Space/Time Trade-offs in Hash Coding with Allowable Errors.” CACM, 1970. — the original
“Bloom filter.” — · cuckoo filter · XOR filter · ribbon filter — the family
“Bigtable.” OSDI, 2006. — and RocksDB — Bloom filters in real storage engines
“Build Systems à la Carte.” ICFP, 2018. — and Rattle (OOPSLA 2020) — the build-systems frontier
“The Hash Is the Identity.” — why build caches need exact membership

How to cite

APA

Mangalapilly, Y. J. (2026, June). A Little Uncertainty Buys a Lot of Space. Saṃhitā Notes. https://yesudeep.com/blog/a-little-uncertainty-buys-a-lot-of-space/