The URL Is the Hash

Content-addressing on the wire — how the web quietly became a content-addressed store, where fingerprinted URLs and Subresource Integrity are real Merkle edges and the cache that looks most like one isn't.

Yesudeep Jose Mangalapilly

Published 3 days ago · Updated 2 days ago · 17 min read

http, caching, content-addressing, subresource-integrity, merkle, cdn, web-performance, security, dissecting-systems

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton, as remembered by his son

Cite this

APA

Mangalapilly, Y. J. (2026, June). The URL Is the Hash. Saṃhitā Notes. https://yesudeep.com/blog/the-url-is-the-hash/

BibTeX

@online{mangalapilly2026the,
  author  = {Yesudeep Jose Mangalapilly},
  title   = {The URL Is the Hash},
  journal = {Sa\d{m}hit\=a Notes},
  year    = {2026},
  month   = {June},
  url     = {https://yesudeep.com/blog/the-url-is-the-hash/},
  urldate = {2026-07-01},
}

Plain

Yesudeep Jose Mangalapilly. “The URL Is the Hash.” Saṃhitā Notes, 2026. https://yesudeep.com/blog/the-url-is-the-hash/.

RIS

TY  - ELEC
AU  - Mangalapilly, Yesudeep Jose
TI  - The URL Is the Hash
T2  - Saṃhitā Notes
PY  - 2026
UR  - https://yesudeep.com/blog/the-url-is-the-hash/
Y2  - 2026-07-01
ER  -

A standalone piece in the Dissecting Real Systems series, sharing the content-addressing lens with You Don't Want Separate Repos. That essay argued a repository is content-addressed and that publishing trades a content hash for a version string. This one follows the same idea out onto the wire: by the end you'll see that the modern web has quietly grown a content-addressed layer — where a file's name is a hash of its bytes — and you'll be able to say exactly where that layer is a genuine Merkle structure and where the part that looks most like one, the humble 304 Not Modified, isn't content-addressed at all.

Karlton's joke is funny because both halves are the same problem. Naming a thing means choosing a key to find it by; invalidating a cache means knowing when the thing behind a key has changed. Pick the wrong kind of key (a mutable, human-assigned name) and you own both problems forever: the name can point at different bytes over time, so every cache has to keep asking whether its copy is stale. Pick a key derived from the bytes themselves and both problems dissolve at once: the name changes if and only if the content changes, so a cached copy is never stale, and different content is simply a different name.

That second kind of key is a content address, and the web is now full of them.

Content address: an identifier computed as a hash of the content it names, so identical bytes always get the identical key and the key doubles as an integrity check. (The term is the analytic lens here; the specs below speak of "hashes" and "fingerprints," not "content addresses.")
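To make the definition concrete, here is a minimal sketch of a content-addressed store. It assumes SHA-256 as the hash and an in-memory dictionary as the backing store; the class and method names are illustrative and do not come from any spec or tool discussed here.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the SHA-256 of the value."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the bytes, so identical bytes get the same key.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # The key doubles as an integrity check: re-hash and compare on read.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored bytes do not match their address")
        return data


store = ContentStore()
first = store.put(b"console.log('v1');")
again = store.put(b"console.log('v1');")   # same bytes, same key
second = store.put(b"console.log('v2');")  # different bytes, different key
```

Nothing in the store ever needs invalidating: a key either resolves to exactly the bytes it names or it does not resolve at all.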

The fingerprinted URL is a content address

Open the page source of almost any site built this decade and you'll find it: app.7e2c49a6.js, main.b3f01dc8.css. That hex string isn't a version or a date; it's a hash of the file's own bytes, stamped into the name by the build. Webpack spells it out: the [contenthash] placeholder "will add a unique hash based on the content of an asset… When the asset's content changes, [contenthash] will change as well." esbuild says the same of its [hash]: it is "the content hash of the asset." Change one byte of the file and its URL changes; leave it untouched and the URL is byte-for-byte the same as in the last deploy. (Know what these hashes are, though: build-tool fingerprints, not cryptography. webpack defaults to MD4 truncated to a handful of characters, and it computes the hash from internal module state unless optimization.realContentHash re-hashes the emitted bytes, which production mode does by default since webpack 5. esbuild's is XXH64, eight base32 characters, mixed with chunk metadata. That is collision-avoidance quality: plenty for cache busting, not an integrity guarantee.)
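The shape of the build step can be sketched in a few lines. This is not webpack's or esbuild's algorithm: it assumes SHA-256 truncated to eight hex characters purely for illustration, and the function name is hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprint_name(filename: str, data: bytes, length: int = 8) -> str:
    """Return a name like 'app.<hash>.js' where <hash> is derived from the bytes.

    Assumption: SHA-256 truncated to `length` hex characters. Real bundlers
    use their own hash functions and inputs; this only shows the shape.
    """
    digest = hashlib.sha256(data).hexdigest()[:length]
    path = PurePosixPath(filename)
    return f"{path.stem}.{digest}{path.suffix}"


old = fingerprint_name("app.js", b"console.log('v1');")
same = fingerprint_name("app.js", b"console.log('v1');")  # untouched file
new = fingerprint_name("app.js", b"console.log('v2');")   # one byte changed
```

Here `old` and `same` are the identical string, while `new` differs, which is the whole property the filename needs.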

Think of two ways to name a song file. You could call it single.mp3 and keep re-recording over it — then "do you have the latest single.mp3?" is a real question you must keep asking, because the name stays put while the bytes move under it. Or you could name every recording by a fingerprint of its own sound, so a new take automatically gets a new name. Now "do you have this file?" is answered by the name alone: if you have a file with that fingerprint, it is that file — there is nothing to re-check. The fingerprinted filename turns "is my copy current?" from a question into a non-question.

This is why a fingerprinted asset can be cached essentially forever. The canonical rule, per MDN, is Cache-Control: max-age=31536000, immutable — cache for a year, and don't even bother revalidating. You can promise that safely precisely because the URL is a content address: there will never be a different app.7e2c49a6.js, because different content would be a different URL. The deploy publishes app.9f1b20e4.js alongside it and rewrites the HTML to point at the new name; the old bytes keep their old name, undisturbed.

Be precise about who promises what. The HTTP immutable directive (RFC 8246) does not say "this URL's bytes never change"; it narrowly tells caches to "skip conditionally revalidating fresh responses." "Never changes" is a discipline you enforce by never overwriting a hashed URL — the content address is what makes the discipline safe, not the header. And note the looser cousin: main.v3.css is "cache-busting" too, but a version number is a mutable name, not a content address — it can be repointed, so it carries none of the guarantee. Support is uneven, too: Firefox and Safari honor immutable; Chrome never implemented the directive — it changed reload semantics instead (2017), to similar effect on the reload case.
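A server-side policy that follows this bargain might look like the sketch below. The filename pattern (eight lowercase hex characters between two dots) and the choice of `no-cache` for everything else are assumptions made for the example, not rules taken from MDN or RFC 8246.

```python
import re

# Assumption: fingerprinted names look like "name.<8 hex chars>.ext".
FINGERPRINTED = re.compile(r"\.[0-9a-f]{8}\.[A-Za-z0-9]+$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value based on whether the URL is fingerprinted."""
    if FINGERPRINTED.search(path):
        # Safe only because the deploy never overwrites a hashed URL.
        return "max-age=31536000, immutable"
    # Mutable names (HTML, main.v3.css) are revalidated before each reuse.
    return "no-cache"
```

With this pattern, `/js/app.9f1b20e4.js` gets the year-long immutable policy, while `/index.html` and `/css/main.v3.css` fall through to revalidation, since a version number is still a mutable name.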

So the first move is already content addressing: the URL stopped being a location and became a fingerprint. The name is the hash.

Subresource Integrity makes the HTML a Merkle node

Now look one level up, at the HTML that references those assets. A page can do better than linking a script by URL; it can pin the script's hash. (Candor about adoption: per the 2024 Web Almanac, SRI appears on about 23% of pages, mostly copy-pasted CDN snippets, and covers a median of ~3% of a page's scripts. The mechanism is real; the habit is rare.) The pinned form looks like this:

<script src="/js/app.9f1b20e4.js"
        integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
        crossorigin="anonymous"></script>

(That crossorigin attribute is mandatory for cross-origin SRI; the same-origin path in this example would verify without it. The reason is telling: per MDN, browsers "will not allow no-cors requests to use subresource integrity" because a hash check against an opaque response would let an attacker probe the content of cross-site resources. Integrity verification is powerful enough to be an oracle, so it requires the resource's consent via CORS.)

That integrity attribute is Subresource Integrity: "a cryptographic hash of the representation of the resource the author expects to load." Before the browser runs the script, it fetches the bytes, hashes them, and compares; on any mismatch it "will refuse to render or execute" the resource "and return a network error." The parent document, in other words, commits to the hash of its child. It is the same fingerprint-the-bytes move a CSP script hash makes — there, to authorize an inline script; here, to verify an external one. One primitive, two jobs.
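The value in the integrity attribute is the algorithm name, a hyphen, and the base64 encoding of the raw digest. A minimal sketch of producing and checking one follows; it handles a single sha384 entry only, whereas browsers accept several algorithms and multiple hashes per attribute, and the function names are illustrative.

```python
import base64
import hashlib


def sri_value(data: bytes) -> str:
    """Build an integrity value: 'sha384-' + base64(SHA-384(data))."""
    digest = hashlib.sha384(data).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def sri_matches(data: bytes, integrity: str) -> bool:
    """Re-hash the fetched bytes and compare with the pinned value.

    Simplification: one sha384 entry only, no algorithm negotiation.
    """
    return sri_value(data) == integrity


script = b"console.log('hello');"
pinned = sri_value(script)                    # what the HTML would carry
accepted = sri_matches(script, pinned)        # unmodified bytes pass
rejected = not sri_matches(script + b" ", pinned)  # one extra byte fails
```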

That is exactly a Merkle edge — the same structure as a Git commit naming its tree, or a tree naming its blobs by hash. The HTML is a node; each integrity attribute is an edge labeled with the child's content hash; the browser walks the edge and verifies it by re-hashing. Follow the chain and the page is the root of a little hash tree: change a script's bytes and either its fingerprinted URL no longer resolves or its SRI hash no longer matches — the tampering is caught at the edge, cryptographically, before a line of it executes.

The page is a shallow Merkle tree. The HTML root names each asset by a hash of its bytes — the fingerprint in the URL, pinned again in integrity. The browser re-hashes every child and refuses it on mismatch, so changing one byte changes the name and breaks the edge.

The same shape, one layer down. Git's object graph: a commit names its tree by the tree's hash; the tree names each blob by the blob's hash. Every edge is a SHA, so changing one blob changes its hash, the tree's, and the commit's — the structure the page above mirrors.

Merkle tree: a tree in which each node carries a hash computed from its children's hashes, so a single root hash fingerprints the entire structure and any change anywhere bubbles up to it. Git's object graph is one; an SRI-pinned page is a shallow one.
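The "bubbles up" property can be seen in a two-level sketch. Here a manifest that lists each asset beside its content hash stands in for the HTML root; the manifest format, SHA-256, and the function names are assumptions for illustration, not how a browser or Git serializes anything.

```python
import hashlib


def h(data: bytes) -> str:
    """Hex SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def page_root(assets: dict[str, bytes]) -> str:
    """Hash a manifest that lists each asset name next to its content hash."""
    lines = [f"{name} {h(data)}" for name, data in sorted(assets.items())]
    return h("\n".join(lines).encode("utf-8"))


assets = {"app.js": b"console.log('v1');", "main.css": b"body { margin: 0 }"}
root_before = page_root(assets)

# Change one byte of one child: its hash changes, and so does the root.
assets["app.js"] = b"console.log('v2');"
root_after = page_root(assets)
```

The root never sees the asset bytes directly. It commits to them only through their hashes, which is what makes each line of the manifest an edge rather than a copy.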

And this buys something TLS alone cannot. TLS secures the channel — it proves the bytes arrived unmodified from the server you connected to. It says nothing about whether that server served the right bytes. A CDN hosting a shared library can be compromised and serve malicious JavaScript over a perfectly valid certificate. SRI is the stated defense: "Compromise of a third-party service should not automatically mean compromise of every site which includes its scripts." The content hash, pinned in the parent, makes the child self-proving regardless of who serves it.

Notice the rhyme with the repository essay. There, the danger of a version string was that it could lie — the bytes behind 1.5.0 can change under you, which is the supply-chain attack surface. SRI is the fix stated in reverse: name the child by its hash, and the name can't lie. A content address is its own integrity check; a human-assigned name needs one bolted on, and usually doesn't have one.

A fingerprinted URL plus an SRI hash is a genuine Merkle edge: the URL half names the bytes, and the SRI half verifies them — no client ever recomputes a filename hash, so without integrity the fingerprint is bookkeeping, not proof. Together, the page's delivery layer becomes a content-addressed store: names that are fingerprints, verified on arrival, immutable by discipline.

The honest break: `304 Not Modified` is not content-addressed

Here is where intuition overreaches, and where the analogy has to be policed. The mechanism that feels most like a hash-tree sync — you ask for a page, the server says 304 Not Modified, you reuse your copy — is not content addressing. It is worth getting this exactly right, because it's the seductive part.

The machinery is the conditional request. The server tags a response with an ETag; the client sends it back in If-None-Match on the next request; if it still matches, the server returns 304 and skips re-sending the body. It looks like "do you still have the right hash?" — and if ETag were a content hash, it would be exactly that.

It isn't, and the spec is emphatic. RFC 9110 defines an entity-tag as "an opaque validator." How the server generates it is entirely the server's business: it "might use an internal revision number, … a combination of various file attributes, or a modification timestamp that has sub-second resolution." A collision-resistant hash of the content is one permitted option among several — not the definition.

And in the wild the validators are all over the map, which is the point. nginx builds its ETag from the file's modification time and length — pure metadata, not the bytes. Amazon S3 is genuinely content-derived for a single-PUT object (the ETag is the MD5 of the bytes) — but a multipart or encrypted upload gets an ETag that is deliberately not a hash of the content. So whether a 304 reflects the actual bytes depends on which server you drew, and even on how you uploaded the file — exactly the ambiguity a content address doesn't have.
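The difference between the two kinds of validator is easy to sketch. The metadata format below (hex modification time, a hyphen, hex length) is illustrative of the metadata-only style described above, and the timestamp is hypothetical; the second function mirrors the single-PUT case where the tag is the MD5 of the bytes.

```python
import hashlib


def metadata_etag(mtime: int, size: int) -> str:
    """ETag built from modification time and length only, never the bytes."""
    return f'"{mtime:x}-{size:x}"'


def content_etag(data: bytes) -> str:
    """ETag that is the MD5 of the bytes (used as a checksum, not for security)."""
    return f'"{hashlib.md5(data, usedforsecurity=False).hexdigest()}"'


original = b"console.log('ok');"
tampered = b"console.log('no');"  # same length, different bytes
mtime = 1_700_000_000             # hypothetical timestamp, left unchanged

meta_same = metadata_etag(mtime, len(original)) == metadata_etag(mtime, len(tampered))
content_same = content_etag(original) == content_etag(tampered)
```

With equal length and an untouched timestamp, the metadata tags agree even though the bytes differ, while the content-derived tags do not.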

Strong vs. weak validators — a strong ETag changes on any byte-observable change; a weak one (prefixed W/) may stay the same across changes the server deems equivalent, so it "can't be used for… byte-range requests." Either way the comparison is a character-by-character token match, never a re-hash of the body.

And the validation itself is not cryptographic. When your If-None-Match arrives, the server doesn't re-hash anything; it compares the two ETag tokens "character-by-character." A 304 means the tokens matched (a freshness check, a currency check), not that the bytes were re-verified. RFC 9111 is candid about what this is for: "The goal of HTTP caching is significantly improving performance by reusing a prior response." It is a reuse-and-freshness protocol optimized for latency and bandwidth, not an integrity protocol.

So the analogy holds at the edge and breaks at the tree. A fingerprinted URL plus an SRI hash is a real Merkle edge: content-derived and self-verifying. But the cache as a whole is not a Merkle tree: there is no single root hash that cryptographically commits to everything beneath it; validation is per-resource token-matching, the tokens needn't be content hashes, and a 304 proves currency, not authenticity. The web has content-addressed pieces bolted onto a cache that merely checks whether names still agree.

Two checks, one tampered file. Suppose the served bytes change but the server doesn't bump its ETag. The content-address path re-hashes and catches it (blocked); the 304 path compares an opaque token that didn't change and waves the tampered copy through. The first verifies content; the second only checks a name.
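The two paths can be put side by side in a short sketch. It simplifies heavily: a real If-None-Match can carry a list of tags and uses the weak comparison function, and the ETag value and function names here are hypothetical.

```python
import base64
import hashlib


def respond(if_none_match: str, current_etag: str) -> int:
    """Server side of a conditional GET: compare tokens, never re-hash."""
    return 304 if if_none_match == current_etag else 200


def integrity_ok(data: bytes, pinned: str) -> bool:
    """Client side of SRI: re-hash the bytes and compare with the pinned value."""
    digest = base64.b64encode(hashlib.sha384(data).digest()).decode("ascii")
    return f"sha384-{digest}" == pinned


original = b"console.log('ok');"
tampered = b"console.log('no');"
etag = '"v1"'  # opaque token; hypothetical, and not bumped when the bytes change
pinned = "sha384-" + base64.b64encode(hashlib.sha384(original).digest()).decode("ascii")

status = respond(if_none_match=etag, current_etag=etag)  # token path
blocked = not integrity_ok(tampered, pinned)              # content path
```

The token path answers 304 because the two strings agree; the content path refuses the tampered bytes because the hash it recomputes no longer matches the pinned one.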

Two ways a librarian can tell you a book is unchanged. One: she re-reads every page against a master copy — slow, but she can swear the words are right. Two: she checks whether the little "last updated" slip in the cover still reads the same date as yours, and if so says "same book." The slip is fast and almost always right — but it's a note about the book, not the book itself. If someone swapped the pages and forgot to update the slip, she'd never know. ETag + 304 is the slip. SRI is re-reading the pages.

The same shape, everywhere you look

Once you have the lens, the pattern is hard to stop seeing — a content-derived, self-verifying, immutable key, with mutable human names layered on top for discovery:

Container images. An OCI digest "uniquely identifies content by taking a collision-resistant hash of the bytes," so you "can verify content from an insecure source by recalculating the digest." A tag like latest is the mutable name on top — it can be repointed at new bytes; the digest can't. Pin by digest and you've content-addressed your deploy.
Git. Every object — blob, tree, commit — is keyed by the hash of its content; a branch name is the mutable pointer you layer on top. The repository essay's whole argument lives here.
Nix, IPFS, BitTorrent. A Nix store path, an IPFS CID, a BitTorrent piece hash (SHA-1 per piece in the classic v1 protocol; v2 upgrades to per-file SHA-256 Merkle trees) — each names data by a hash of the data, so a fetch from anywhere is verifiable. Deduplication follows where the chunking agrees — IPFS itself is frank that the same file under a different chunker or CID version gets a different CID.
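Two of these identifiers are simple enough to compute by hand. The sketch below assumes a SHA-1 Git repository and a SHA-256 OCI digest; the function names are illustrative.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object ID of a blob in a SHA-1 Git repository.

    Git hashes a short header ('blob', a space, the size in bytes, and a
    NUL byte) followed by the file content.
    """
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


def oci_digest(data: bytes) -> str:
    """Digest string in the 'algorithm:hex' form used by OCI descriptors."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


empty_blob = git_blob_id(b"")  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

For a file on disk, `git_blob_id` should agree with what `git hash-object` prints in a SHA-1 repository. In both cases the identifier is a pure function of the bytes, so anyone holding the bytes can recompute it without trusting the source.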

In every case the division of labor is the same one Karlton's joke points at: the content hash handles identity and integrity (the "naming things" half done right), and a thin mutable-name layer handles discovery and "what's the latest" (the "cache invalidation" half, quarantined to where it's unavoidable). The web's asset layer simply adopted the same split — fingerprint for identity, Cache-Control for the rest.

The gap is closing

If the seam in all this is that HTTP's caching validators check currency but not content, the obvious fix is to give HTTP a real content digest — and that is exactly what's landing. RFC 9530 (2024) defines Content-Digest and Repr-Digest: header fields carrying "a digest… calculated using a hashing algorithm applied to the actual message content," for a recipient "to detect data corruption." The RFC is blunt that this was missing — "HTTP does not define the means to protect the data integrity of content" — and just as blunt that it's a separate concern from the ETag: a digest is for integrity, the validator is for caching. The content hash is being added back to the protocol as a first-class field, right next to the opaque token that was never meant to carry it.
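A Content-Digest value is a small dictionary entry: the algorithm name, an equals sign, and the base64 digest wrapped in colons (the Structured Fields byte-sequence form). The sketch below covers a single sha-256 entry only; the field may list several algorithms, and the function names are illustrative.

```python
import base64
import hashlib


def content_digest(body: bytes) -> str:
    """Build a Content-Digest field value using sha-256."""
    b64 = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    return f"sha-256=:{b64}:"


def digest_matches(body: bytes, field_value: str) -> bool:
    """Recipient side: recompute the digest and compare.

    Simplification: expects exactly one sha-256 entry in the field.
    """
    return content_digest(body) == field_value


body = b'{"hello": "world"}'
header_value = content_digest(body)
intact = digest_matches(body, header_value)
corrupted = not digest_matches(body + b"\n", header_value)
```

Unlike the ETag comparison, the recipient here recomputes a hash over the bytes it actually received, which is what lets it detect corruption.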

Why not just make every ETag a content hash? Because the ETag has a different job: it's a cheap currency check the server can compute from a timestamp, and forcing a full re-hash on every request would tax exactly the hot path caching exists to speed up. The cleaner design is two fields: a fast opaque validator for freshness, and an explicit digest for integrity. RFC 9530 takes that two-field path.

The other frontier is the dynamic resource. Classic SRI pins a hash of fixed bytes, so it can't protect a script that legitimately changes — a SaaS widget, an auto-updating bundle. Signature-based SRI closes that gap by pinning a public key instead of a hash: the server signs each version with its private key (HTTP Message Signatures, Ed25519), and the browser verifies the signature against the pinned key before executing. (No longer just a proposal: Chrome shipped it in 141, late 2025, after an origin trial spanning 135–141; other engines haven't yet, and the spec itself is still a WICG draft.) It's the same content-vs-name move at one remove — you stop trusting the name a resource arrives under and start trusting a cryptographic property of the bytes, even when the bytes are allowed to change. The direction of travel is consistent: wherever the web has leaned on a mutable name, it keeps reaching for something derived from, and verifiable against, the content itself.

The limitations

"Content address" is a gloss, not the vocabulary of the specs. MDN and the RFCs speak of "hashes," "fingerprints," and "validators"; none says "content address." The reading earns its keep as a unifying frame, not as a term anyone shipped.

Most of the caching layer is not content-addressed, and that's by design. The 304 path is the common case for HTML and other resources that must be revalidated before reuse, and it is a freshness protocol, not a Merkle tree. Strong wording about "the web is content-addressed" should be confined to the fingerprinted-asset + SRI layer, which is real, and kept away from the cache machinery, which isn't.

And "immutable" is a promise the naming discipline keeps, not the header. RFC 8246's directive only suppresses revalidation while a response is fresh; overwrite a hashed URL in place and you've broken the world's caches with no protocol-level complaint. The content address is load-bearing; the header just trusts it.

What survives all of that is the same modest, durable claim the repository essay made: when a system needs to know whether two things are really the same, it reaches for a hash of the bytes, not a name someone typed. The web reached for it too — in the URL, and in the integrity attribute right next to it.

Lessons

A fingerprinted URL is a content address: app.[contenthash].js changes iff the bytes change, which is what makes max-age=31536000, immutable safe — the name can't point at different content, so a cached copy is never stale.
SRI is a real Merkle edge: the parent HTML pins a cryptographic hash of each child and re-hashes to verify, refusing to execute on mismatch. The page is the root of a shallow hash tree, self-verifying against a tampered CDN in a way TLS cannot be.
A content address can't lie; a version string or a tag can. SRI and OCI digests are the same fix the repository essay wanted: name the child by its hash, not by a mutable handle someone can repoint.
ETag / 304 is not content addressing. RFC 9110 makes the entity-tag an opaque validator (revision number, file attributes, timestamp — content hash is merely one option), and 304 is decided by character-by-character token comparison, never by re-hashing. It checks currency, not authenticity.
So the analogy holds at the edge, breaks at the tree: the web has content-addressed pieces (hashed URLs + SRI) bolted onto a cache that only checks whether names still agree. Honor the seam.
The seam is closing: Content-Digest (RFC 9530) adds a real content hash to HTTP as a field distinct from the ETag, and signature-based SRI extends content-pinning from fixed bytes to dynamic resources by pinning a public key. Wherever the web leaned on a mutable name, it keeps reaching back for the bytes.

References

“webpack: Caching.” webpack. Also esbuild: API. [contenthash] / [hash], the build step that turns a URL into a content address.
“MDN: Cache-Control.” MDN. Also RFC 8246 (immutable). The fingerprinted-URL bargain and what immutable does and doesn't promise.
“W3C: Subresource Integrity.” W3C. Also MDN: SRI. The integrity attribute, the verification algorithm, and the CDN-tampering threat model.
“RFC 9110 §8.8 (validators).” IETF. Also RFC 9111 (caching). Why an ETag is opaque, how 304 is decided, and that caching's goal is reuse, not integrity.
“RFC 9530 (Content-Digest).” IETF. HTTP's real content-integrity field, separate from the caching validator.
“Signature-based SRI.” WICG. Also RFC 9421 (HTTP Message Signatures). Extending content-pinning to dynamic resources by pinning a key, not a hash.
“OCI Image Spec: Descriptor.” OCI. The digest as a content identifier, and why you pin by digest rather than by latest.
“You Don't Want Separate Repos.” The companion essay: a content hash for a version string, at repository granularity.
