Notes · Dissecting Real Systems

The Hash Is the Identity

Content-addressing in the build cache — content-addressed storage and Merkle trees turn a build cache into a shared resource, so your build is proportional to anyone's change.

15 min read

build-systems, remote-execution, bazel, buck2, content-addressing, dissecting-systems

git - the stupid content tracker

— the git(1) man page, NAME line, since Linus Torvalds's 2005 README

Cite this
APA
Mangalapilly, Y. J. (2026, June). The Hash Is the Identity. Saṃhitā Notes. https://yesudeep.com/blog/the-hash-is-the-identity/
BibTeX
@online{mangalapilly2026the,
          author  = {Yesudeep Jose Mangalapilly},
          title   = {The Hash Is the Identity},
          journal = {Sa\d{m}hit\=a Notes},
          year    = {2026},
          month   = {June},
          url     = {https://yesudeep.com/blog/the-hash-is-the-identity/},
          urldate = {2026-07-02},
        }
Plain
Yesudeep Jose Mangalapilly. “The Hash Is the Identity.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/the-hash-is-the-identity/.
RIS
TY  - ELEC
        AU  - Mangalapilly, Yesudeep Jose
        TI  - The Hash Is the Identity
        T2  - Saṃhitā Notes
        PY  - 2026
        UR  - https://yesudeep.com/blog/the-hash-is-the-identity/
        Y2  - 2026-07-02
        ER  - 

The last of five pieces on build systems — it carries the series' promise outward to a whole team. By the end you'll understand how content-addressed storage makes a file's hash its identity, how a Merkle tree names an entire input directory, how the Remote Execution protocol turns one machine's result into everyone's cache hit, and why hermeticity was the precondition all along.

The whole series has circled one promise: a build's cost should track what changed. Skyframe and DICE make that true on one machine. This final piece extends it across a whole team, a whole CI fleet, a whole company — so that when your colleague compiled lib an hour ago, your build doesn't compile it again. It just takes their result. The mechanism that makes this safe is a single, beautiful idea:

A thing's content hash is its identity.

The leap from "my changes" to "anyone's changes"

On one machine, the cache key for an action is a fingerprint of its inputs: the command line, the environment, and the contents of every input file. Run the action once, store the outputs under that fingerprint, and you never run it again with the same inputs. That's local caching.
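A minimal sketch of that local cache, in Python. The names `action_key`, `LocalActionCache`, and `fake_compile` are hypothetical, and JSON with sorted keys is only a stand-in for whatever canonical encoding a real build tool uses; this illustrates the idea and is not how Bazel or Buck2 implement it.

```python
import hashlib
import json


def action_key(argv, env, input_files):
    """Fingerprint an action from its command line, environment, and input contents.

    `input_files` maps a path to that file's bytes.
    """
    payload = {
        "argv": list(argv),
        "env": dict(env),
        "inputs": {path: hashlib.sha256(data).hexdigest()
                   for path, data in input_files.items()},
    }
    # sort_keys makes the encoding independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class LocalActionCache:
    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times an action really ran

    def run(self, argv, env, input_files, execute):
        key = action_key(argv, env, input_files)
        if key not in self._results:
            self.executions += 1
            self._results[key] = execute(argv, env, input_files)
        return self._results[key]


def fake_compile(argv, env, input_files):
    # Hypothetical stand-in for a compiler: output depends only on the inputs.
    return b"obj:" + b"".join(input_files[p] for p in sorted(input_files))


cache = LocalActionCache()
argv, env = ["cc", "-c", "foo.cc"], {"LANG": "C"}
first = cache.run(argv, env, {"foo.cc": b"int main() { return 0; }"}, fake_compile)
second = cache.run(argv, env, {"foo.cc": b"int main() { return 0; }"}, fake_compile)
edited = cache.run(argv, env, {"foo.cc": b"int main() { return 1; }"}, fake_compile)
```

The second call never reaches `fake_compile`; only the edited source produces a new key and a new execution.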

Now make the cache a server that everyone shares. The instant a build action is keyed by a hash of its inputs rather than by who ran it or where, it stops mattering who built something. If your action's input fingerprint matches one already in the shared cache, the outputs are already there — produced by someone else's machine, minutes or days ago. You download the result instead of computing it. The build is now proportional not to your changes but to the changes nobody on the team has built yet.

This is what Bazel and Buck2 mean by remote caching and remote execution. And it only works because of content addressing.

A blob is its hash

Read the Remote Execution v2 protocol — canonically remote_execution.proto in the bazelbuild/remote-apis repo, vendored into Bazel's tree — and the foundation is the Digest message:

message Digest {
  string hash = 1;        // lowercase hex of the content hash
  int64 size_bytes = 2;   // the size of the blob
}

Which hash? The protocol deliberately doesn't hardcode one — "the hash algorithm to use is defined by the server," negotiated through CacheCapabilities.digest_functions — but in practice the answer is SHA-256: it is Bazel's default and the first named value in the protocol's DigestFunction enum.

That's it. Every file, every directory, every command, every action is named by the hash of its bytes plus its length. The name of a blob is a deterministic function of its content. Two files with identical content have identical digests, necessarily; two files that differ in a single byte have unrelated digests. There is no separate identity to assign or collide on — the content is the address. This is the same idea behind Git's object store, IPFS, and Nix; the build world calls its version a Content-Addressable Storage, or CAS.
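As a rough illustration, the digest of a blob can be computed in a few lines of Python. The `Digest` tuple below mirrors the two proto fields, `digest_of` is a hypothetical helper name, and SHA-256 is assumed as the digest function.

```python
import hashlib
from typing import NamedTuple


class Digest(NamedTuple):
    hash: str        # lowercase hex of the content hash
    size_bytes: int  # the size of the blob


def digest_of(blob: bytes) -> Digest:
    # Assumes SHA-256; a server may negotiate a different digest function.
    return Digest(hashlib.sha256(blob).hexdigest(), len(blob))


a = digest_of(b"hello\n")
b = digest_of(b"hello\n")   # same content, computed independently
c = digest_of(b"hellp\n")   # one byte differs
```

Here `a == b` holds with no coordination between the two computations, while `c` shares nothing recognizable with either.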

Imagine naming a moving box by an exact fingerprint of its contents — not "kitchen stuff," but a name that changes if a single spoon changes. Two boxes with the same name are interchangeable without opening either. Now name the whole room by the fingerprints of its boxes: change one spoon and the spoon's box gets a new name, so the room does too. That chain of renames — spoon, box, room — is the Merkle tree the next section builds.

Hashing a whole tree: Merkle all the way down

A single file is easy to address by hash. A build action needs to address its entire input tree — a directory of sources, headers, toolchains — as one identity. The protocol does this with a Merkle tree.

Merkle tree: a tree in which every node's label is a hash of its children's labels, so a single root hash fingerprints the entire structure. Change any leaf and the root changes. Ralph Merkle described the construction at CRYPTO '87; the same idea underpins Git commits, IPFS, and blockchains.

A directory is a list of (name, digest) entries. Each entry's digest is the hash of that file's content, or — for a subdirectory — the hash of that directory's own listing. Hash the listing and you get the directory's digest; that digest folds into its parent's listing; and so on up to a single root digest that uniquely identifies the entire tree. Change one byte in one deep file and its digest changes, which changes its directory's listing, which changes that digest, all the way up — the root digest is different. Leave the tree untouched and the root is bit-for-bit identical, every time, on every machine.

A Merkle tree: each file is its content hash; each directory is the hash of its listing; one root digest names the whole input tree. Change one leaf and the root changes; otherwise it's identical everywhere.
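The folding described above can be sketched in Python over a nested dict, where a `bytes` value is a file and a `dict` value is a subdirectory. The text listing format is an assumption made for readability; the real protocol hashes a protobuf-encoded directory message, and `tree_digest` is a hypothetical name.

```python
import hashlib


def file_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def tree_digest(tree: dict) -> str:
    """Hash a directory: `tree` maps a name to bytes (file) or dict (subdirectory)."""
    lines = []
    for name in sorted(tree):  # sorted so the listing is canonical
        node = tree[name]
        if isinstance(node, dict):
            lines.append(f"dir {name} {tree_digest(node)}")
        else:
            lines.append(f"file {name} {file_digest(node)}")
    # Toy listing format: assumes names contain no newlines.
    listing = "\n".join(lines).encode("utf-8")
    return hashlib.sha256(listing).hexdigest()


src = {
    "main.cc": b"int main() {}\n",
    "lib": {"a.h": b"#pragma once\n", "a.cc": b"int a() { return 1; }\n"},
}
# Same content, built up in a different order on "another machine".
same = {
    "lib": {"a.cc": b"int a() { return 1; }\n", "a.h": b"#pragma once\n"},
    "main.cc": b"int main() {}\n",
}
# One byte changed in one deep file.
changed = {
    "main.cc": b"int main() {}\n",
    "lib": {"a.h": b"#pragma once\n", "a.cc": b"int a() { return 2; }\n"},
}

root = tree_digest(src)
root_same = tree_digest(same)
root_changed = tree_digest(changed)
```

The one-byte edit in `lib/a.cc` changes that file's digest, then the digest of `lib`, then the root, while the untouched copy reproduces the root exactly.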

So an action's identity is computable, deterministic, and global. The protocol's Action message is precise about what goes into it: the digest of the Command (which contains the arguments and the environment variables; the proto requires the environment variables to be "lexicographically sorted by name" exactly so that equivalent commands hash identically), the digest of the input tree's root, the timeout, and, since v2.2, the platform. Hash that message and you have the action's digest. Per the proto, "Actions can be succinctly identified by the digest of their wire format encoding," and that digest means the same thing on your laptop, in CI, and on a build farm in another datacenter.

The protocol, in three services

With identities settled, the Remote Execution API is small. The proto defines three gRPC services, and you can read the whole story from their names and signatures.

ContentAddressableStorage is the shared blob store. Its key RPC is the one that makes sharing cheap — FindMissingBlobs, whose own comment explains the trick:

Clients can use this API before uploading blobs to determine which ones are already present in the CAS and do not need to be uploaded again.

You don't push your inputs and hope. You ask the store which of these digests you're missing, and upload only those. Everything the cache has already seen — from anyone — you skip. Identical content is transferred once, ever.
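A small in-memory model shows the effect. `InMemoryCAS` and `upload_inputs` are hypothetical names; a real client would issue the FindMissingBlobs RPC over gRPC rather than call a method on a local object.

```python
import hashlib


class InMemoryCAS:
    """Toy stand-in for a shared content-addressable store."""

    def __init__(self):
        self._blobs = {}
        self.bytes_uploaded = 0

    def find_missing_blobs(self, digests):
        return [d for d in digests if d not in self._blobs]

    def upload(self, blob: bytes) -> str:
        digest = hashlib.sha256(blob).hexdigest()
        self._blobs[digest] = blob
        self.bytes_uploaded += len(blob)
        return digest


def upload_inputs(cas, blobs):
    """Ask which digests are missing, then upload only those blobs."""
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blobs}
    missing = cas.find_missing_blobs(list(by_digest))
    for digest in missing:
        cas.upload(by_digest[digest])
    return missing


cas = InMemoryCAS()
header = b"// shared header used by everyone\n"
alice_src = b"int alice() { return 1; }\n"
bob_src = b"int bob() { return 2; }\n"

alice_missing = upload_inputs(cas, [header, alice_src])  # cold store: both go up
bob_missing = upload_inputs(cas, [header, bob_src])      # header is already there
```

Bob's upload carries only his own source file; the shared header crossed the network once, when Alice sent it.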

ActionCache maps an action's digest to its result. GetActionResult takes an action digest and returns the outputs, if some machine has run that exact action before. A hit here is the whole game: you compiled nothing, you downloaded a finished artifact.

Execution is the fallback for a miss. Execute ships the action (by its digests, not its bytes, since the CAS already holds the bytes) to a remote worker, which runs it hermetically, stores each output blob in the CAS under its own content digest, and records the result in the ActionCache under the action digest. The next person to need that exact action gets a cache hit.

The remote build flow. Ask the action cache; on a miss, ask the CAS which inputs it lacks, upload only those, execute remotely, and store the result for everyone next.

Notice the loop closes on itself. A cache miss for you becomes a cache hit for the next person, because the result is stored under the same content-derived identity they'll compute. The cache fills in exactly as fast as the team produces genuinely new work — and only that fast.
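The whole loop fits in a short simulation. Everything here is an in-memory assumption: `SharedCache`, `build`, and `fake_compile` are hypothetical, JSON stands in for the protobuf wire encoding, and "remote execution" is just a local function call.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class SharedCache:
    """In-memory stand-in for a remote CAS plus ActionCache."""

    def __init__(self):
        self.cas = {}           # content digest -> bytes
        self.action_cache = {}  # action digest -> output content digest


def build(shared, argv, inputs, execute):
    """Return (output_bytes, "hit" or "miss") for one action."""
    input_digests = {path: sha256_hex(data) for path, data in inputs.items()}
    action = json.dumps({"argv": argv, "inputs": input_digests}, sort_keys=True)
    action_digest = sha256_hex(action.encode("utf-8"))

    # 1. Ask the action cache.
    output_digest = shared.action_cache.get(action_digest)
    if output_digest is not None and output_digest in shared.cas:
        return shared.cas[output_digest], "hit"

    # 2. Upload only the inputs the CAS lacks.
    for path, digest in input_digests.items():
        if digest not in shared.cas:
            shared.cas[digest] = inputs[path]

    # 3. Execute (a local call here; a remote worker in a real deployment).
    output = execute(argv, inputs)

    # 4. Store the output blob and record the result for everyone.
    output_digest = sha256_hex(output)
    shared.cas[output_digest] = output
    shared.action_cache[action_digest] = output_digest
    return output, "miss"


def fake_compile(argv, inputs):
    # Hypothetical compiler: a pure function of its declared inputs.
    return b"obj:" + b"".join(inputs[p] for p in sorted(inputs))


shared = SharedCache()
sources = {"lib.cc": b"int lib() { return 42; }\n"}
alice_out, alice_status = build(shared, ["cc", "-c", "lib.cc"], sources, fake_compile)
bob_out, bob_status = build(shared, ["cc", "-c", "lib.cc"], dict(sources), fake_compile)
```

Alice pays for the miss; Bob, computing the same action digest on his own, gets her result back as a hit.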

Why hermeticity was the precondition all along

This is where the series' recurring insistence on hermeticity finally pays off, and it's worth making explicit. Sharing results across machines is only safe if running the same action always produces the same output. If an action secretly read the wall clock, a machine's hostname, or a file outside its declared inputs, then two machines with the "same" action digest could legitimately produce different results — and a shared cache would serve one machine the other's wrong answer.

Content addressing is what makes the cache addressable; hermeticity is what makes it correct. That's why Buck2 is honest that local-only builds aren't yet hermetic and reserves the guarantee for remote execution, and why Skyframe tags every function hermetic or not. The sandbox isn't bureaucracy. It's the thing that lets "this action's identity is the hash of its inputs" be true — that an action is a pure function of its declared inputs and nothing else. Take hermeticity away and the identity is a lie, and a shared cache built on a lie corrupts silently.

Two words are worth keeping apart here, because they fail differently. Hermeticity is a property of the inputs: everything the action reads is declared and isolated, so the cache key is honest. Determinism is a property of the function: the same inputs produce bit-identical outputs, so the cached value is stable. A hermetic action can still be nondeterministic — a compiler that stamps a timestamp into its output, an archiver that orders entries by directory-walk order, a codegen step with a race — and then two honest keys map to two different values, and which one the team gets depends on who built first. Hermeticity you get from sandboxes and declared inputs; determinism you have to get from the tools themselves.
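The archiver case can be made concrete with a toy example. The `pack` format below is invented for illustration and is not any real archive format; the point is only that identical declared inputs can still yield different bytes when the tool depends on walk order, and that sorting inside the tool removes the difference.

```python
import hashlib


def pack(entries, order):
    """Concatenate name/length/data records in the given order (toy archive format)."""
    out = bytearray()
    for name in order:
        data = entries[name]
        out += f"{name}:{len(data)}:".encode("utf-8") + data
    return bytes(out)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


entries = {"a.o": b"AAAA", "b.o": b"BBBB"}  # identical inputs on both hosts
walk_on_host_1 = ["a.o", "b.o"]
walk_on_host_2 = ["b.o", "a.o"]  # same files, different directory-walk order

# Walk-order archiver: same inputs, different outputs.
unsorted_1 = sha256_hex(pack(entries, walk_on_host_1))
unsorted_2 = sha256_hex(pack(entries, walk_on_host_2))

# Deterministic archiver: sort entries before packing.
sorted_1 = sha256_hex(pack(entries, sorted(walk_on_host_1)))
sorted_2 = sha256_hex(pack(entries, sorted(walk_on_host_2)))
```

Both hosts had an honest cache key in the first case, and they still disagreed about the value; only the change inside the tool fixed that.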

Hermeticity kills a combinatorial explosion

There's a second payoff, quieter than correctness but just as important, and it's the real reason "works on my machine" is a structural problem rather than a personal failing. A non-hermetic action's output isn't a function of its declared inputs alone — it's a function of those inputs plus every ambient variable it secretly reads: $PATH, the locale, the compiler version on this host, an environment flag, the phase of the moon. Each of those variables is a hidden extra input, and the true space of distinct "environments" the action might run in is their Cartesian product.

Cartesian product: combine \(n\) variables with \(k\) values each and you get \(k^n\) combinations. Three binary environment leaks already make eight distinct environments; a dozen makes thousands. The blow-up is multiplicative, which is why it's an explosion and not a list.

That product is the explosion. In a Make-style build, every variable your environment leaks into the build multiplies the number of states a "build" can be in — and because nothing is keyed on those variables, two builds that are secretly in different states look identical and yet produce different artifacts. Caching is hopeless: you can't safely reuse a result when you can't even enumerate the conditions it was produced under. The combinatorial space isn't tracked, so it can't be shared, and it can't be reproduced.

Each variable a non-hermetic action leaks multiplies its environments — three binary leaks already make eight. Hermeticity collapses the whole product back to one declared input set with one identity on every machine.

Abstract \(k^n\) is easy to nod along to, so make it concrete with the toolchain that taught everyone this lesson: C++. Take the most innocent command in the world, gcc -c foo.cc, on a stock machine, and count the inputs nobody declared. Which binary runs at all is decided by $PATH: /usr/bin/gcc, a Homebrew clang, or a ccache shim wearing the compiler's name (that's three). The compiler's major version changes codegen and semantics: GCC 11 silently moved the default dialect from gnu++14 to gnu++17, so the same file can parse differently on two hosts (call it four versions in a team's wild). Which standard library, libstdc++ or libc++, is two more; and libstdc++ itself ships a dual ABI, where the _GLIBCXX_USE_CXX11_ABI macro flips the memory layout of std::string, and objects compiled on opposite sides of it link into crashes (two more). The SDK or sysroot the headers come from: two Xcode versions and a Linux sysroot make three. And GCC reads environment variables most people have never audited: CPATH, CPLUS_INCLUDE_PATH, and LIBRARY_PATH silently prepend include and link paths (set or unset: two), plus the locale (two). Multiply the column:

\[ 3 \times 4 \times 2 \times 2 \times 3 \times 2 \times 2 = 576 \]

One compile step, five hundred seventy-six distinct environments it might silently be in — and a timestamp-keyed build treats all of them as "the file hasn't changed." A team of twenty laptops is a random sample from that space. That is "works on my machine," written as arithmetic.

The explosion, live. Every toggle is a real ambient input of a stock gcc -c foo.cc, at its real cardinality; the grid is the Cartesian product of whatever is currently leaking. Turn everything on and one compile step has 576 possible environments — then flip to declared, and the product collapses to the one environment that is hashed into the action key.
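The multiplication can be checked by enumerating the product directly. The cardinalities below are the ones counted in the text; the value labels (such as "version 1" or "Xcode A") are placeholders invented for the sketch, not real version names.

```python
from itertools import product
from math import prod

# One entry per ambient input of a stock `gcc -c foo.cc`, at the
# cardinality the text assigns it. Labels are illustrative placeholders.
ambient_inputs = {
    "compiler binary found on $PATH": ["/usr/bin/gcc", "Homebrew clang", "ccache shim"],
    "compiler major version": ["version 1", "version 2", "version 3", "version 4"],
    "standard library": ["libstdc++", "libc++"],
    "_GLIBCXX_USE_CXX11_ABI": [0, 1],
    "SDK or sysroot": ["Xcode A", "Xcode B", "Linux sysroot"],
    "CPATH and friends": ["unset", "set"],
    "locale": ["C", "other"],
}

# Every combination is a distinct environment the compile might run in.
environments = list(product(*ambient_inputs.values()))
count = prod(len(values) for values in ambient_inputs.values())

# Declaring the inputs pins each column to one value: a single environment.
declared = {name: values[:1] for name, values in ambient_inputs.items()}
declared_count = len(list(product(*declared.values())))
```

Running it gives 576 undeclared environments and exactly one declared environment.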

Now the design question — why does Bazel do what it does — answers itself, mechanism by mechanism, because each one deletes a column of that multiplication. Actions run with a scrubbed environment: the --incompatible_strict_action_env flag (on by default) "uses an environment with a static value for PATH and does not inherit LD_LIBRARY_PATH," per its own help text in the source — which warns, in the same breath, that leaking client variables back in "can prevent cross-user caching." There goes $PATH, $CPATH, and the locale. The sandbox shows the action only its declared inputs, so an undeclared header can't be found even if it exists on the host. And the compiler itself stops being ambient: a cc_toolchain names it, and the strict form — a pinned LLVM or a zig-based cross toolchain downloaded by the build — makes the compiler binary just another hashed input, the same as any source file. Version, standard library, ABI flag, sysroot: all of it rides inside the toolchain's digest.

Hermeticity collapses that product to a point, per action. By declaring the toolchain, the locale, and the environment (making them inputs instead of leaks), Bazel turns the unbounded ambient state space into a named input set. For a given action key there are no longer \(k^n\) possible environments; there is the environment, hashed into the action's identity. (To be precise about what survives: platforms, toolchains, and build flags still multiply configurations, and a big Bazel project can suffer real configuration explosion, but those combinations are declared and keyed, so each is a distinct honest identity rather than an untracked ambient state.) The explosion doesn't get managed; it stops existing as a source of lies. And only once it's gone can the survey's promise, a build proportional to the change, extend across machines at all, because only then does the same input mean the same output, everywhere.

The honest limits

Three cautions keep the shared-cache story truthful. First, the ActionCache is a security boundary. A CAS write is self-verifying — the content must hash to its own name — but an ActionCache write is an unverifiable claim: "this action produced these outputs." Anyone who can write it can poison every machine's build, which is why Bazel's own remote-caching docs warn: "Take care in who has the ability to write to the remote cache. You may want only your CI system to be able to write." Real deployments give developers read-only access (--remote_upload_local_results=false). Second, the store is not forever. Bazel 7 defaults to "Builds without the Bytes" (--remote_download_toplevel), keeping intermediate outputs only in the remote CAS — and if the server evicts a blob mid-build, Bazel "may throw a CacheNotFoundException and exit with code 39," per the Bazel blog. The identity is eternal; the bytes have a TTL. Third, the round-trip isn't free: for small, fast actions the network can cost more than recomputing, which is why remote execution earns its keep on expensive actions and why dynamic (racing local-vs-remote) execution exists.
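The asymmetry in the first caution is easy to see in a toy model. `VerifyingCAS` and `ActionCache` below are hypothetical in-memory classes, not a real server; the sketch only shows why one kind of write can be checked on arrival and the other cannot.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class VerifyingCAS:
    """A CAS write is self-verifying: the bytes must hash to their name."""

    def __init__(self):
        self._blobs = {}

    def put(self, claimed_hash: str, blob: bytes) -> None:
        if sha256_hex(blob) != claimed_hash:
            raise ValueError("blob does not hash to its claimed name")
        self._blobs[claimed_hash] = blob


class ActionCache:
    """An ActionCache write is a claim: this action produced that output."""

    def __init__(self):
        self._results = {}

    def put(self, action_hash: str, output_hash: str) -> None:
        # Nothing here can be checked without re-running the action.
        self._results[action_hash] = output_hash

    def get(self, action_hash: str):
        return self._results.get(action_hash)


cas = VerifyingCAS()
honest = b"honest object file"
tampered = b"tampered object file"
cas.put(sha256_hex(honest), honest)

# Forging a blob under someone else's name fails.
try:
    cas.put(sha256_hex(honest), tampered)
    forged_blob_rejected = False
except ValueError:
    forged_blob_rejected = True

# Pointing an action at a validly named but wrong blob succeeds.
cas.put(sha256_hex(tampered), tampered)
action_cache = ActionCache()
action_cache.put("digest-of-some-action", sha256_hex(tampered))
poisoned = action_cache.get("digest-of-some-action")
```

The store rejects the forged blob on its own, but it accepts the poisoned action result, which is why write access to the action cache is the thing to restrict.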

The whole arc, in one sentence

Step back across the series and the structure is one idea elaborated. A build is a graph (the survey). Skyframe computes that graph incrementally by letting nodes demand their dependencies. DICE collapses the phases so the graph is uniform. Starlark stays bounded so evaluating the graph is itself a safe, terminating analysis. And content addressing makes each node's identity a function of its content, so the graph's results can be shared by everyone, forever.

The thread is the promise the series started with, now in its strongest form. A build's cost should be proportional to the change — and once a node's identity is the hash of its inputs, the change it's proportional to isn't yours. It's everyone's, together: the first time anyone, anywhere, builds something new, and never again.

Everything in this series insisted on exact answers — a digest either matches or it doesn't. A coda asks the forward-looking question: what do you get if you give up exactness on purpose — and why do storage engines take that bargain while build systems refuse it?

Lessons

  • Content addressing: a blob's name is the hash of its bytes. Identical content has identical identity, automatically, on every machine — no coordination required.
  • A Merkle tree extends that to an entire input directory: one root digest names the whole tree, and changes anywhere ripple to the root.
  • The Remote Execution protocol is three small services — a content store (CAS), an action cache, and execution — and FindMissingBlobs means identical inputs are uploaded once, ever.
  • A cache miss for you becomes a cache hit for the next person, because the result is stored under the same content-derived identity they'll compute.
  • Hermeticity is the precondition. Content addressing makes the cache addressable; hermeticity makes it correct. A shared cache built on non-hermetic actions serves wrong answers silently.
  • Hermeticity also kills a combinatorial explosion. Every variable a non-hermetic build leaks ($PATH, locale, compiler) is a hidden input; the true state space is their \(k^n\) Cartesian product, the structural form of "works on my machine." Declaring the environment collapses the product to one named input set per action, so the explosion stops existing.
  • The cache has boundaries: ActionCache write access is a trust decision (CI-only in serious deployments), remotely-kept outputs can be evicted out from under a build, and small actions can cost more to fetch than to rerun.

References

  1. “The Remote Execution API (REv2).” The protocol, in protobuf.
  2. “Bazel: remote build execution.” Bazel. · Buck2: remote execution.
  3. “Bazel: remote caching.” Bazel. Includes the who-may-write warning.
  4. “Builds without the Bytes in Bazel 7.” Bazel blog. The eviction failure mode, from the source.
  5. Merkle. “A Digital Signature Based on a Conventional Encryption Function.” CRYPTO '87, 1988. The tree, from its inventor. · Merkle trees and content-addressable storage, for the general ideas.
  6. Dolstra. “The Purely Functional Software Deployment Model.” PhD thesis, Utrecht, 2006. Hermeticity as content-addressed store paths, worked out as a dissertation; the theory Nix is built on.
  7. Besta, Miretskiy & Cox. “Build in the Cloud: Distributing Build Outputs.” Google Eng Tools blog, 2011. Google describing its content-addressed output cache, twelve years before REv2 readers met it.
  8. “Nix.” · Git's object model. The same hash-is-identity principle elsewhere.
