You Don't Want Separate Repos
A repository is a database; splitting a subproject out trades a content hash for a version string — and there's only one case where that trade is actually forced.
You think you want a stable kernel interface, but you really do not, and you don't even know it. What you want is a stable running driver, and you get that only if your driver is in the main kernel tree.
— Greg Kroah-Hartman, The Linux Kernel Driver Interface
Cite this: Mangalapilly, Y. J. (2026, June). You Don't Want Separate Repos. Saṃhitā Notes. https://yesudeep.com/blog/you-dont-want-separate-repos/
A standalone piece in the Dissecting Real Systems series, sharing the cost-of-change lens with the build survey and the utils essay. By the end you'll be able to read a multi-project repository as a database, see exactly what a team gives up when it moves a subproject into its own repo and registry, explain why SemVer is strictly less information than the VCS already had, and name the one situation — only one — where the split is genuinely forced.
Grab a cup of coffee and sit down to read because this requires patience.
A team has one repository with several subprojects — a handful of pnpm packages, say, that import each other. Someone proposes moving one of them into its own repository: cleaner ownership, independent releases, a published version other teams can depend on. It sounds like tidiness, and tidiness sounds free.
A word on words. This essay is not an argument for the "monorepo" in the Google sense — one repository for an entire organization's code. It is about the much smaller and more common case: a coherent set of tightly-coupled projects that import each other. The claim is "don't shard things that are coupled," not "put the whole world in one repo." A team can run many repositories and still keep each cluster of coupled projects together — that is the right design. People call any multi-project repository a "monorepo" and any multi-repository org a "polyrepo," but the boundary that matters is coupling, not count: one repo per unit of code that changes together.
Tidiness isn't free. The split changes how versioning works, moving it from the version control system to a package registry, and that single change cascades into how your builds find breakage. To see why, you have to stop thinking of the repository as a folder and start thinking of it as what it actually is.
Much of what follows came out of experience building Web app security tooling at Google Fiber and Firebase Genkit, an AI framework with a small core and a large plugin ecosystem, where keeping the plugins composable hinged on exactly this decision — and where the pull to split the repository is a live and recurring one. The argument below is the general case; a later section returns to the specifics.
A repository is a database. A content hash is its primary key, a commit is a versioned snapshot, and one commit is one consistent read across every subproject at once.
A repository is a database
This is not a metaphor reaching for color. Git is, in its own documentation's words, "a content-addressable filesystem… a simple key-value data store." Pro Git is explicit: you put content in, and Git "will hand you back a unique key" — the SHA-1 of the content (hardened post-SHAttered; a SHA-256 object format exists but is opt-in, with no interoperability between the two yet) — "you can use later to retrieve" it [1]. GitHub's own engineers put the database reading plainly: "the object store is like a database table with two columns: the object ID and the object content" [2]. The object ID is computed from the bytes, so it is a key that doubles as an integrity check.
Content-addressed — stored and retrieved by a hash of the content itself, not by a name or location. Identical content always gets the identical key, so the key is also a tamper-check.
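The key-value reading can be made concrete with a minimal sketch. It assumes the default SHA-1 object format and models only blobs; the `ObjectStore` class is a hypothetical toy for illustration, not Git's implementation.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """A toy two-column table: object ID -> object content (hypothetical)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = git_blob_id(content)
        self._objects[key] = content  # identical content lands on the same key
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        # The key doubles as an integrity check: re-derive it from the value.
        if git_blob_id(content) != key:
            raise ValueError(f"object {key} is corrupt")
        return content


store = ObjectStore()
key_a = store.put(b"test content\n")
key_b = store.put(b"test content\n")
print(key_a == key_b)  # same bytes, same key
```

Nothing in the store chooses a name: the key is a pure function of the value, which is the property the rest of the argument leans on.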
Once you see the key-value store, the rest of the mapping falls out. A commit "takes a picture of what all your files look like at that moment" — a versioned snapshot. Each commit points at its parents, so the history is an append-only log; nothing is mutated in place, because "it's impossible to change the contents of any file or directory without Git knowing about it." An internal import — package A reaching for package B in the same tree — is a foreign key, resolved against that one snapshot.
The load-bearing consequence is the last one. Because every subproject lives at the same commit, one commit is one consistent read of the entire codebase — the database property a single transaction gives you: a snapshot in which everything you read is mutually consistent. Hold onto that. It is the thing the split quietly destroys.
Snapshot isolation — a database reads from one consistent point-in-time view, so it never sees a half-applied change from someone else. A monorepo commit is exactly this, across all subprojects at once.
An internal dependency is resolved by content, not by version
Inside a monorepo, when A depends on B, which B does it get? The current one. Not a published version — the B that exists at this commit.
The build tools are unambiguous about this. Bazel draws the sharpest line: "there's no notion of 'version' for internal dependencies — a target and all of its internal dependencies are always built at the same commit/revision in the repository." Internal targets reference each other by label into a dependency graph; only external dependencies "have versions, and those versions exist independently of the project's source code," resolved against a Bazel registry.
DAG — a directed acyclic graph: nodes with one-way edges and no cycles. A build's dependency graph is a DAG, which is what guarantees the build terminates.
pnpm makes the same point at the package-manager level, and makes the seam visible. Inside a workspace you write workspace:*, and "pnpm will refuse to resolve to anything other than a local workspace package." The dependency is local, by content, whatever is here now. Then watch what happens at the boundary: when you publish, pnpm "dynamically replaces" workspace:* with a frozen version string — workspace:* becomes "1.5.0". The conversion from content-reference to version-string happens at exactly the moment the package leaves the repository. Nx and Turborepo build their internal graphs the same way — from workspace links and source imports, never from versions. Python's uv workspace has the exact same primitive: a member declares genkit = { workspace = true } and resolves the core package from the tree, not the index — a content reference, until you publish.
Inside the repo, a dependency is a reference resolved at a snapshot. The moment you publish, it becomes a version string resolved against a registry. The split is that conversion, made permanent.
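The publish-time conversion can be sketched as a small function. This is a simplified model of the behavior pnpm documents for the workspace protocol, not pnpm's code; the package names and the `rewrite_for_publish` helper are hypothetical.

```python
def rewrite_for_publish(
    dependencies: dict[str, str],
    local_versions: dict[str, str],
) -> dict[str, str]:
    """Replace workspace: references with frozen version strings."""
    published: dict[str, str] = {}
    for name, spec in dependencies.items():
        if not spec.startswith("workspace:"):
            published[name] = spec  # already a registry range; leave it alone
            continue
        rest = spec[len("workspace:"):]
        version = local_versions[name]  # whatever the tree holds right now
        if rest == "*":
            published[name] = version  # exact version
        elif rest in ("^", "~"):
            published[name] = rest + version  # range operator + current version
        else:
            published[name] = rest  # e.g. workspace:^1.5.0 -> ^1.5.0
    return published


# Hypothetical workspace: two local packages and one external dependency.
manifest = {
    "@acme/core": "workspace:*",
    "@acme/utils": "workspace:^",
    "lodash": "^4.17.21",
}
tree = {"@acme/core": "1.5.0", "@acme/utils": "2.1.0"}
print(rewrite_for_publish(manifest, tree))
```

Before the call, the two local entries mean "whatever is in the tree"; after it, they are strings a registry resolver has to interpret.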
The split, in three precise words
So what is the split, in database terms? It is tempting to reach for one word, but the honest answer needs three, and naming all three is what makes the argument hold up.
First, structurally, the split is federation: you partition the data across two databases (the repo and the registry), and the version string becomes a federated foreign key pointing across the boundary. Second, and most importantly, for consistency, the split is a loss of the global snapshot: where one commit gave you a single consistent read across every subproject, you now have per-repo consistency plus eventual consistency between them, mediated by versions. There is no longer any commit that atomically reads "A in this state and B in this state." Third, mechanically, to operate without the global join, the ecosystem denormalizes — it stores a cached copy of B keyed by a version range instead of resolving B by reference every time.
Denormalization — deliberately storing a redundant copy of data to avoid a costly join at read time, accepting the risk that the copy drifts out of sync. The classic consistency-for-locality trade.
"Denormalization" is the memorable handle, and it's why the title could as easily have been A Split Is a Denormalization. But the federation is the structure and the lost snapshot is the wound. Keep all three in view.
There is a tempting objection to the federation framing — "a monorepo has no registry, so you're not federating databases, you're adding one." But look at what the repo's object store actually is: a content-addressed store is a registry, just one keyed by a hash of the bytes instead of by a human-chosen name. Both are key-to-value stores from an identifier to an artifact; the only thing that changes across the split is the key function. Inside the repo the key is the content hash — derived from the value, injective, self-verifying, resolved by exact lookup. Publish, and the key becomes a version string — assigned to the value by a person, lossy, resolved by an NP-complete search, and able to lie (npm itself never lets 1.5.0 be reused once published, but the name a build actually resolves still moves: latest is a mutable pointer, a range admits whatever new version appears, a package can vanish — the supply-chain attack surface). So the split doesn't add a registry to a repo that lacked one; it swaps a content-derived key for a human-asserted one over a store that was content-addressed all along. The version string is the new key; everything that goes wrong downstream is a property of that key, not of having a registry.
NP-complete — informally, the class of problems for which no known method beats "try the possibilities," and which are all secretly the same problem: a fast solution to one would solve them all. Version selection is in it — the EDOS report [10] proved package installability NP-complete, and Russ Cox's reduction shows the version puzzle directly encodes 3-SAT, the canonical hard problem.
A registry layers two things on a content store: naming (a memorable handle, "the latest lodash") and resolution (turn a range into bytes). A bare CAS keeps only the store — you can't ask it "what's the newest X?", because every distinct content is just a distinct opaque key. The split's cost lives entirely in that added naming-and-resolution layer; the storage underneath was content-addressed either way — which is why the lockfile has to pin the hash back.
SemVer is a lossy hash of a diff the VCS already had
Here is the part most teams never quite articulate. When you depend on B@^1.2.0, you depend on a version string. What is a version string?
Semantic Versioning defines it as a human's three-bucket classification of a changeset. The spec is precise: increment MAJOR "when you make incompatible API changes," MINOR "when you add functionality in a backward compatible manner," PATCH "when you make backward compatible bug fixes" [3]. Every clause turns on backward compatibility — a property the author asserts about a diff. The version number is not the change; it is one of three labels a person stuck on the change.
A commit hash, by contrast, is the change — the exact bytes, losslessly. So the published version throws information away twice over. It discards which symbols changed and whether your call sites are among them; and it adds a promise it cannot keep, because — by Hyrum's Law — with enough consumers, every observable behavior is depended on by somebody [4], so "backward compatible" is not actually decidable. "Minor" means "the author believes this won't break you," which is a hope, not a guarantee.
SemVer is a lossy three-bucket hash of a diff the version control system already had, losslessly, for free.
The "lossy hash" framing is mine; the premises are the sources'. SemVer calls the buckets a compatibility judgment; Hyrum's Law makes that judgment undecidable at scale. Put together, the three buckets are a lossy projection of a diff.
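One way to see the loss is to count how many distinct byte contents a single range admits. The sketch below assumes plain `MAJOR.MINOR.PATCH` strings with major version 1 or higher (it skips the special 0.x caret rules and pre-release tags), and the release history is invented for illustration.

```python
import hashlib


def parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version: str, base: str) -> bool:
    """True if `version` is admitted by the caret range ^base (major >= 1 only)."""
    v, b = parse(version), parse(base)
    if b[0] == 0:
        raise ValueError("this sketch does not model the special 0.x caret rules")
    return v[0] == b[0] and v >= b


# Hypothetical release history of a package B: version string -> published bytes.
releases = {
    "1.2.0": b"def area(w, h): return w * h\n",
    "1.2.1": b"def area(w, h): return abs(w * h)\n",
    "1.3.0": b"def area(w, h, scale=1): return abs(w * h) * scale\n",
    "2.0.0": b"def area(size): return size.w * size.h\n",
}

# Every release the range ^1.2.0 admits, with a short content digest for each.
admitted = {
    version: hashlib.sha256(source).hexdigest()[:12]
    for version, source in releases.items()
    if satisfies_caret(version, "1.2.0")
}
for version, digest in admitted.items():
    print(version, digest)
```

The range is one string, but it stands for several different sets of bytes; which one a consumer gets depends on when they resolve, not on anything the string itself says.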
And SemVer is only the most common projection, not the only one — every versioning scheme is a different lossy encoding of the same underlying diff, each discarding something different and keeping a different social signal:
- CalVer (2024.06, Ubuntu, pip) encodes when a release shipped and nothing about compatibility — useful when cadence matters more than a break promise, honest precisely because it stops pretending the number predicts breakage.
- ZeroVer (0.x forever) is the joke that names a real practice: stay below 1.0 so SemVer's rules formally don't apply and every release can break. The scheme is a lossy hash whose buckets you've opted out of reading.
- Commit-derived pseudo-versions are the interesting case. A Go module pseudo-version like v0.0.0-20191109021931-daa7c04131f5 is literally vX.0.0 + a UTC timestamp + a 12-character prefix of the commit hash [5]. When no human has tagged a release, the tool reaches back for the one lossless identifier that was always there — the commit — and wraps just enough version-string scaffolding around it to satisfy a resolver. The closer a scheme drifts toward "embed the hash," the more it is admitting the VCS already had the answer.
The ranking is the point: a commit hash is the change; CalVer keeps time; a pseudo-version keeps the hash and a sortable order; SemVer keeps a three-bucket guess. Pick the scheme by which signal your consumers actually need — but none of them recovers the per-symbol diff the monorepo never threw away.
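The pseudo-version case can be taken apart mechanically. This is a minimal sketch that handles only the basic `vX.0.0-timestamp-hash` form quoted in the list above; Go defines other pseudo-version forms that this regular expression deliberately rejects, and `parse_pseudo_version` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Basic form only: vX.0.0-yyyymmddhhmmss-<12 hex characters of the commit hash>.
_PSEUDO = re.compile(r"^v(\d+)\.0\.0-(\d{14})-([0-9a-f]{12})$")


def parse_pseudo_version(text: str) -> tuple[int, datetime, str]:
    """Split a basic pseudo-version into (major, commit time in UTC, hash prefix)."""
    match = _PSEUDO.match(text)
    if match is None:
        raise ValueError(f"not a basic pseudo-version: {text!r}")
    major, stamp, revision = match.groups()
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(major), when, revision


major, when, revision = parse_pseudo_version("v0.0.0-20191109021931-daa7c04131f5")
print(major, when.isoformat(), revision)
```

Two of the three fields are properties of the commit itself; only the leading `vX.0.0` is version-string scaffolding.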
The cost: a compile-time error becomes a post-release error
Now the two halves meet, and the price of the split becomes concrete.
In a monorepo, a breaking change to B and all the call-site fixes across every A that uses it land in a single commit. This is the discipline Google calls the One-Version Rule — in Software Engineering at Google, ch. 16: "Developers must never have a choice of 'What version of this component should I depend upon?'" Because the change can be atomic — the same chapter's definition of version control at this scale: "changes to a collection of files submitted at once are treated as a single unit" — breakage is not merely caught early; it can be made unobservable, because there need be no moment when B has changed and A hasn't. The fix is part of the same read. Scope that claim precisely, because its strongest critics won't: it holds in full for build-time consumers — code compiled and linked together. Where A and B are separately deployed services, an atomic commit does not by itself give an atomic deploy: the running fleet spans versions while a rollout proceeds (Snellman makes this the centerpiece of the best-known rebuttal: repos with atomic commits "have atomic commits across projects. But the two facts have nothing to do with each other" when it comes to runtime migration).
One-Version Policy — every dependency exists at exactly one version in the repo, so no two parts of the build can disagree about which B they mean. The rule that makes atomic cross-cutting changes possible.
The graph claws back a real part of even that, though — and this is a monorepo advantage the objection usually misses. Because the build graph is queryable, the deployment blast radius of a change is a computed artifact, not folklore: bazel query "rdeps(…)" names exactly the services that must move together. A topology-aware publisher can order rollouts along those edges when the change is wire-compatible in one direction — or blue/green the entire affected subgraph behind one ingress and flip traffic as a unit, which for stateless request/response services makes the deploy effectively atomic: the skew window collapses to in-flight requests. Polyrepo shops can't reliably compute that set, let alone flip it.
Figure. Left: rdeps on the changed core names exactly the cone of services that must move; the provably-unaffected service never deploys. Right: stage the new cone beside the old and flip traffic once at the ingress — for stateless request/response, version skew collapses to the requests in flight at the flip.
What no topology can flip, and where the critics keep their point: state and clients that outlive the deploy. Rows and queue messages written by the old version are still there after the cutover, so durable-data changes want expand–migrate–contract no matter how synchronized the binaries are; yesterday's browser tab and last quarter's mobile app span versions for months and appear in no dependency graph; and rolling back one service in a flipped subgraph reintroduces the skew unless you roll back the co-versioned set — while blue/greening a large subgraph costs double capacity exactly when the change touches the widely-depended-on core. The monorepo removes the source-level version-negotiation step and turns deploy-time coordination into a computable problem; the residue it cannot remove is whatever outlives the deploy.
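The blast-radius computation behind the rdeps query can be sketched as a reverse traversal of the dependency graph. This illustrates the idea only and is not Bazel's implementation; the target names and the `rdeps` function below are hypothetical.

```python
from collections import deque


def rdeps(deps: dict[str, set[str]], changed: str) -> set[str]:
    """Return `changed` plus everything that transitively depends on it."""
    # Invert the edges: for each dependency, record who depends on it.
    dependants: dict[str, set[str]] = {}
    for target, requires in deps.items():
        dependants.setdefault(target, set())
        for dep in requires:
            dependants.setdefault(dep, set()).add(target)

    affected = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependant in dependants.get(node, ()):
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected


# Hypothetical build graph: target -> the targets it depends on.
graph = {
    "core": set(),
    "lib_auth": {"core"},
    "svc_api": {"lib_auth"},
    "svc_billing": {"core"},
    "svc_static": set(),
}
print(sorted(rdeps(graph, "core")))
```

In this toy graph `svc_static` never appears in the result, which is the "provably unaffected service never deploys" half of the claim. The set is computable only because every edge lives in one queryable graph.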
A scaling caveat, in fairness: at the largest sizes even a monorepo can't always land a sweeping migration in one commit. Software Engineering at Google is candid that updating a third-party library across all its uses "in a single atomic change" may be "infeasible," so such changes are staged — add the new version, fence off the old, migrate callers incrementally — as a multi-commit large-scale change [8]. The point that survives is not "always one commit" but the sharper one: the migration happens inside the repo, with no version-negotiation step across a boundary. Whether it lands in one commit or fifty, every step is a consistent read, and there is never a published version skew for a downstream to resolve. Atomicity is the clean limit; in-repo coordination is the property.
A common objection lands right here: isn't a single commit that touches B and forty call sites bad practice — a giant, hard-to-review, hard-to-revert blob that "should have been split up"? It feels like it violates the rule that a commit should do one thing. But that intuition has the accounting backwards. The change is one thing — "make B's interface different, and keep every caller working" — and it is irreducibly cross-cutting: there is no smaller correct unit, because a commit that changes B without its call sites is precisely the broken-in-between state the atomic commit exists to forbid. You are not choosing between a big change and a small one. You are choosing where the cross-cutting work lives: in one commit a reviewer can read, a revert can undo, and a bisect can land on — or smeared across repositories, releases, and a version-resolution problem, discovered after the fact. The atomic commit didn't create the coupling; it gave the coupling its cheapest possible encoding. Calling that bad practice is doing the ergonomics without doing the math.
One honest residual, though: "a commit a reviewer can read" is doing work in that sentence. A diff that touches a core type and forty call sites is correct as one unit, but it is also genuinely tiring to review — the cross-cutting change is exactly the shape that invites cognitive fatigue, and a tired reviewer waves through subtle logic bugs. This is a real cost, and it is the strongest version of the objection. But notice which direction the tooling is moving: the part of that review that is mechanical — "did every call site get updated consistently with the new signature?" — is precisely what a structural diff tool, or an AI-assisted review pass, does tirelessly and at scale. The honest comparison is about kinds of cost, not a promise of zero: the big commit's review fatigue is mechanical and keeps getting cheaper as tooling improves (and better tooling also eases the split's version bumps — Renovate exists), while the split's coordination tax is structural and doesn't yield to better tools. One cost trends down; the other is a property of the shape.
The "one commit, one logical change" rule is real and good — but the logical change here genuinely is "evolve B and its callers together." Splitting that into "change B" + forty separate "fix caller" commits doesn't make each do one thing; it makes forty of them describe half a thing, none of which is independently correct. Atomicity is the A in ACID applied to a refactor.
The Linux kernel runs the largest live demonstration of this, and it's the source of this essay's epigraph. Greg Kroah-Hartman's argument that you want "a stable running driver," not a stable interface, is the monorepo thesis stated in kernel terms: the kernel deliberately has no stable in-kernel API or ABI for drivers — and the thousands of in-tree drivers are exactly why it can get away with it. The docs put the bargain in one sentence:
If your driver is in the tree, and a kernel interface changes, it will be fixed up by the person who did the kernel change in the first place.
Refactor a core interface and every caller is "fixed up at the same time" — one changeset, no version skew, breakage unobservable. That is the One-Version Rule, enforced across an operating system. The flip side is the proof: drivers kept out of tree must chase "an ever changing kernel interface" on their own, which the same docs flatly call "a rough job." Now imagine the kernel sharding its drivers into thousands of separately-versioned repos — every core change becomes a SemVer release the driver maintainers discover later, and building a coherent kernel turns into a dependency-resolution problem nobody could win. The in-tree policy is not bureaucratic taste; it is the only thing keeping that problem from existing.
Split B out and the same change becomes a sequence: B publishes release N+1, and downstream discovers the break later — at their next bump, in CI, or in production.
By splitting, you have moved a failure from compile time to post-release — the single most expensive direction a failure can move.
Imagine a recipe book where step 12 says "add the sauce from page 40." In a monorepo, page 40 is in the book — you flip to it and there it is, current. Rewrite the sauce and you fix every recipe that points at it in the same edit. Split page 40 into its own book and recipes now say "add the sauce from Sauces, 2nd edition." Publish a 3rd edition that drops an ingredient, and the old recipes don't know until someone actually cooks one — and finds the dish broken on the plate, not on the page.
Interactive figure: the same breaking change to the shared lib D, played out in each topology (Monorepo and Split).
This is the diamond dependency in the flesh, and it is not theoretical. Block (Cash App) wrote up exactly this when they consolidated a fragmented polyrepo back into a monorepo. Having replaced versions with SHA-plus-timestamp auto-updates, they found it "removed the forcing function for backward compatibility." The result: "App A depends on lib B and C, which depend on incompatible versions of D. NoSuchMethodError and NoClassDefFoundError became familiar failures — sometimes caught in CI, and sometimes surfacing as SEVs." Dependencies "drifted months or even years behind." [24]
Diamond dependency — A depends on B and C, which both depend on D, but on different versions of D. For internal code, a single source tree dissolves it — there is one D, the current one, no version to disagree on. Be precise about why, though: it's the one-version policy, not mere co-location, that does the work. Airbnb found subprojects inside one monorepo pinned to different versions of the same third-party library [16] — diamonds survive co-location when the policy doesn't follow. Across repos the conflict is the default; in one repo it's a policy you must still choose to enforce.
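A diamond is easy to detect once the pins are written down. The sketch below assumes exact pins rather than ranges, which keeps the check to a set comparison; the package names, versions, and the `find_conflicts` helper are hypothetical.

```python
def find_conflicts(
    requirements: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Map each dependency pinned at more than one version to who pins what."""
    pins: dict[str, dict[str, str]] = {}
    for consumer, deps in requirements.items():
        for dep, version in deps.items():
            pins.setdefault(dep, {})[consumer] = version
    return {
        dep: by_consumer
        for dep, by_consumer in pins.items()
        if len(set(by_consumer.values())) > 1
    }


# Split world: B and C each froze D at the version current when they last bumped.
split = {
    "A": {"B": "3.0.0", "C": "1.1.0"},
    "B": {"D": "1.4.0"},
    "C": {"D": "2.0.0"},
}
print(find_conflicts(split))

# One-version world: every consumer resolves D to the same thing, the tree.
one_version = {
    "A": {"B": "tree", "C": "tree"},
    "B": {"D": "tree"},
    "C": {"D": "tree"},
}
print(find_conflicts(one_version))
```

With ranges instead of exact pins the check stops being a set comparison and becomes the resolution problem discussed later in the essay.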
The cost is the same cost the utils essay measured, lifted to repo granularity: a change's damage is its probability of breaking times the number of downstream consumers that learn about it late, $P(\text{break}) \times |\text{late consumers}|$ (read the bars as "the count of"). The monorepo drives the second factor to zero by making the break and the fix the same commit.
The math: one commit, or a quadratic of constraints
That last move — "drives the factor to zero" — is worth taking literally, because the difference between the two worlds is not "a little work" versus "more work." It is a change of complexity class: the same task that costs a fixed amount no matter how big the system gets, in one world, grows with the square of the system's size in the other.
Complexity class / "Big-O" — a way of describing how a cost grows as the input grows, ignoring constant factors. $O(1)$ ("constant") means the cost doesn't grow with the input at all; $O(n^2)$ ("quadratic") means doubling the input quadruples the cost. The point is the shape of the growth, not the stopwatch.
Picture a small team where everyone's work depends on everyone else's. In one shared notebook, "getting everyone onto the same page" is one act: you turn to a page and that is the state of the whole team — there's no separate step where Ann reconciles with Bob, then Bob with Carol. Everybody reads the same page at once.
Now give each person their own notebook that cross-references the others by edition number — "uses Bob's notebook, 3rd edition." To stay consistent, every pair who depend on each other must agree on which editions go together. Three people is three handshakes; six people is fifteen; the handshakes grow much faster than the people. Worse, finding a set of editions where all the handshakes hold at once is a genuine puzzle, not a lookup — the kind a computer can choke on. The one notebook didn't just save effort; it made the whole puzzle disappear.
Made precise: take $n$ mutually-coupled projects — a cluster where, in the limit, each can call into any other.
One repository. A consistent state of the whole cluster is a commit. Moving it from one consistent state to the next — however many projects the change touches — is a single atomic write. The number of synchronization actions is $1$, independent of $n$: that is $O(1)$, constant. There is no "sync A against B" step, because the snapshot is global by construction; one commit is one consistent read across all $n$ at once.
$n$ repositories. Consistency is no longer given; it is a property you must reconstruct from agreements between pairs. Each dependency edge carries a version constraint, and a fully-coupled cluster has up to $\binom{n}{2}$ of them — so the number of constraints that must hold at the same time grows as $O(n^2)$, quadratically. A single breaking change no longer lands once; it propagates along its out-edges as a release-and-bump cascade, and pulling the cluster back to a coherent state means re-satisfying that entire web. The atomic write has become a quadratic coordination problem.
$\binom{n}{2}$ ("n choose 2") — the number of distinct pairs you can form from $n$ things, equal to $\frac{n(n-1)}{2}$. It's why "every pair must agree" costs grow quadratically: $3$ projects → $3$ pairs, $6$ → $15$, $10$ → $45$. The pairs outrun the projects.
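The pair count is quick to tabulate. The sketch below computes the upper bound for a fully-coupled cluster, not a measurement of any real dependency graph; the cluster sizes are arbitrary and `max_constraints` is a hypothetical name.

```python
from math import comb


def max_constraints(projects: int) -> int:
    """Upper bound on pairwise version constraints in a fully-coupled cluster."""
    return comb(projects, 2)  # n choose 2 = n * (n - 1) / 2


for n in (3, 6, 12, 24, 48):
    print(f"{n:>3} projects -> up to {max_constraints(n):>5} constraints")
```

Each doubling of the project count roughly quadruples the bound, while the one-repository count of synchronization actions stays at one.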
You can watch it happen. Below is a small cluster — a shared core feeding three libraries feeding two apps — and the controls run release time: each "Cut a release" advances one team's published version. In Monorepo mode the constraint count stays pinned at zero — every edge resolves to the current commit, no matter how many releases you cut. Flip to Split and cut a few: consumers fall behind the providers they depend on, their edges go loose (dashed), and where an app depends on two libraries that have drifted apart, it can't satisfy both at once — that edge breaks. Keep going and the unsatisfied count climbs past the handful of nodes you started with: that's the web filling in, one release at a time.
Figure: a shared core (D) feeds three libraries, which feed two apps — two stacked diamonds. Each release advances one team's version; in Split mode, consumers lag and edges drift from in-sync to loose (dashed) to broken (a diamond conflict, where an app can't satisfy two lagging constraints at once). Watch the unsatisfied count climb. Monorepo mode holds every edge in sync — it never leaves zero.
The pairs outrunning the projects is exactly the curve people underestimate until they watch it. (Interactive figure: a slider over the number of projects, showing the number of pairs.)
And $O(n^2)$ is only the coordination floor — it counts the constraints, not the cost of solving them. Actually selecting a set of versions that satisfies all of them at once is, in general, NP-complete [9] — with the scope Cox himself draws: the hardness assumes each package may be installed at one version. Ecosystems that permit duplicates (npm's nested trees) make finding a solution trivial — "it just might not be the smallest possible combination (that's still NP-complete)" — and Go's Minimal Version Selection (MVS) ducks the class entirely by restricting the constraint language. The split's tax is either solver-hardness or duplication; you pick one.
How MVS ducks it. NP-completeness needs the freedom to say "not that version" — upper bounds, exclusions, "if X then not Y." MVS forbids exactly those: a module may only state lower bounds ("I need at least v1.2"). With no way to exclude, selection stops being a search and becomes arithmetic — take the maximum of the stated minimums. In Horn terms, dropping negative implications keeps the constraints Horn-satisfiable, where SAT is linear, not NP-complete — "a few hundred lines of Go … recursive graph traversals," in Cox's words. The cost isn't free: you lose the ability to say "anything but the version with that CVE" — you bump the floor instead.
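The "maximum of the stated minimums" idea can be sketched in a few lines. This is a simplified model with versions as integer tuples and an in-memory requirement graph; it is not Go's implementation, and the module names and the `minimal_version_selection` function are hypothetical.

```python
Version = tuple[int, int, int]
Node = tuple[str, Version]


def minimal_version_selection(
    root: Node,
    requirements: dict[Node, dict[str, Version]],
) -> dict[str, Version]:
    """Walk every reachable (module, version) and keep the highest minimum seen."""
    selected: dict[str, Version] = {}
    seen: set[Node] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        name, version = node
        # Only lower bounds exist, so selection is a running maximum, not a search.
        if name not in selected or version > selected[name]:
            selected[name] = version
        for dep_name, minimum in requirements.get(node, {}).items():
            stack.append((dep_name, minimum))
    return selected


# Hypothetical diamond: b and c state different minimums for d.
graph = {
    ("app", (1, 0, 0)): {"b": (1, 2, 0), "c": (1, 2, 0)},
    ("b", (1, 2, 0)): {"d": (1, 3, 0)},
    ("c", (1, 2, 0)): {"d": (1, 4, 0)},
    ("d", (1, 3, 0)): {},
    ("d", (1, 4, 0)): {},
}
print(minimal_version_selection(("app", (1, 0, 0)), graph))
```

The traversal visits each node once and never backtracks, which is the practical meaning of trading the search for arithmetic. A newer d would not be picked unless some module raised its stated minimum.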
Resolution in the single-version world is as hard as the hardest problems we know of that have no known fast solution. So the split trades an atomic write for an $O(n^2)$ pile of constraints handed to a worst-case-intractable solver.
The complexity zoo — P, NP, NP-complete, and the wider menagerie of classes this argument walks past (NP-complete itself is pinned down in the margin of the registry section above). For the ten-minute intuition, hackerdashery's P vs. NP and the Computational Complexity Zoo is the canonical tour.
The monorepo's "one commit" was never a mere convenience — it was a complexity collapse, folding a quadratic-and-then-some problem down to a single write.
A plugin ecosystem is the case that proves it
I had to make this exact decision building Genkit, a framework with a plugin ecosystem: a small core and dozens of plugins — model providers, vector stores, evaluators — that each depend on that core. The whole value of the framework is that the plugins work together, against one consistent version of the core, so a developer can compose them without thinking about it. That goal is, exactly, the One-Version Rule applied to an SDK.
So the Python tree is one uv workspace whose pyproject.toml declares members = ["packages/*", "plugins/*"], and every plugin depends on the core by genkit = { workspace = true }, resolved from the tree rather than the index [18][19]. Choosing uv was choosing that workspace primitive on purpose: when I change the core's interface, I change every plugin that uses it in the same commit, and CI builds the whole ecosystem against that one change. A breaking change is caught the instant it's made, because there is no plugin living at an old version to be broken later. The plugin authors never chase an API; the core and its plugins are always one consistent read.
A plugin ecosystem is a One-Version Rule with extra steps. The plugins are worthless if they don't agree on the core — which is precisely the global snapshot a single repository gives you for free.
Now picture splitting that core and each plugin into its own repository. Nothing about the code improves — the plugins still depend on the core exactly as before. What changes is when a developer learns the framework is broken. Today, a core change that would break the Anthropic plugin fails in CI, in the same pull request, and never ships. After the split, the core publishes a release, and the plugin discovers the break at its next version bump — or worse, an end user discovers it, having installed a core and a plugin that resolved to incompatible versions. Every plugin becomes a diamond corner: an app pulls the core directly and through three plugins, and now four version ranges have to agree. The framework's central promise — "the plugins compose" — quietly becomes the user's problem to satisfy, by hand, against an NP-complete resolver.
This is the same trade the whole essay describes, but it's worth seeing from the ecosystem's end. The maintainers feel the split as relief: smaller repos, independent release cadence, cleaner ownership. The users feel it as version skew: the plugin that lagged the core, the combination that doesn't resolve, the ImportError that only shows up in production. The repository boundary the team draws for its own convenience becomes a compatibility matrix the user has to navigate. Keeping the ecosystem in one tree is not the maintainers indulging a monolith; it is the maintainers absorbing the version problem so their users never have to, which is the entire reason the plugins felt effortless to begin with. The clear loser in the trade, the one who never sat in the meeting where the repo was split, is the end user.
The lockfile is the tell
If the content hash were worthless, the ecosystem would have discarded it. It did the opposite: it added the hash back, by hand, in every lockfile.
A pnpm-lock.yaml, package-lock.json, or Cargo.lock does not store a version range. It stores an integrity hash — npm's integrity field is a Subresource Integrity string, usually sha512 for registry tarballs; git dependencies pin a commit SHA, and Cargo.lock pins an exact version plus a SHA-256. The lockfile exists precisely to record which exact bytes a fuzzy range resolved to, so the install is reproducible. pnpm's content-addressed store goes further and keys every package by its integrity hash on disk — a content-addressed database rebuilt on top of the registry. (The web makes the same move with the same hash: a Subresource Integrity attribute pins a <script> to its bytes so a compromised CDN can't swap them.)
The lockfile re-introduces, per artifact, the content-addressing the monorepo had for free across the whole tree. The hash was the valuable thing all along.
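To see how little machinery the integrity field needs, here is a minimal sketch of the Subresource Integrity string format: the algorithm name, a dash, and the base64 of the raw digest. The byte string standing in for a tarball is hypothetical, and a real installer does more (multiple hashes, algorithm negotiation); this shows only the check a locked hash makes possible.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Build an SRI string: 'sha512-' followed by base64 of the raw SHA-512 digest."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def verify(data: bytes, integrity: str) -> bool:
    """Check fetched bytes against the integrity string recorded at lock time."""
    return sri_sha512(data) == integrity


tarball = b"pretend these are the bytes of a package tarball"  # hypothetical stand-in
locked = sri_sha512(tarball)  # what a lockfile entry would record

print(verify(tarball, locked))          # True: same bytes, same key
print(verify(tarball + b"!", locked))   # False: one changed byte, different key
```

The version string never enters the check. Once the bytes are pinned, the range that produced them is irrelevant to reproducibility.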
The hash reappears one level up, too — in how a monorepo decides what to rebuild. Uber's Go monorepo, on Bazel, figures out which targets a change affects with what it calls the Changed Targets Calculation: it "creates a Merkle-style tree, where each node (representing each Go package) is computed from all of its source files and inputs," then diffs the tree before and after a change to see which packages moved [15]. That is the same content-addressing the lockfile re-adds, applied to the build graph instead of the dependency graph: the build doesn't ask "what version changed?", it asks "what content changed?" — a Merkle hash of inputs, exactly Git's own move. Bazel and Nx key their action caches the same way. Up and down the stack, the moment a system needs to know whether two things are really the same, it reaches for the content hash, not the version string.
Merkle tree: a tree where each node's hash is computed from its children's hashes, so one root hash fingerprints the whole structure and any change bubbles up to it. Git's commit graph is one; so is Uber's build-target tree.
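A Merkle-style changed-targets calculation fits in a few lines. This is a sketch of the idea only, not Uber's or Bazel's implementation: the package names, file contents, and dependency map are hypothetical, and it assumes the dependency graph is acyclic. Each package's hash folds in its own files and the hashes of the packages it depends on, so a change in the core moves every dependent and nothing else.

```python
import hashlib


def package_hashes(sources: dict[str, dict[str, bytes]],
                   deps: dict[str, list[str]]) -> dict[str, str]:
    """Hash each package from its files plus the hashes of its dependencies."""
    cache: dict[str, str] = {}

    def node(pkg: str) -> str:
        if pkg not in cache:
            h = hashlib.sha256()
            for path in sorted(sources[pkg]):
                h.update(path.encode() + b"\0")
                h.update(hashlib.sha256(sources[pkg][path]).digest())
            for dep in sorted(deps.get(pkg, [])):
                h.update(node(dep).encode())  # child hash feeds the parent hash
            cache[pkg] = h.hexdigest()
        return cache[pkg]

    return {pkg: node(pkg) for pkg in sources}


def changed(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Packages whose hash moved between two snapshots."""
    return {pkg for pkg in after if before.get(pkg) != after[pkg]}


deps = {"plugin": ["core"]}  # hypothetical: plugin depends on core, docs on nothing
before_src = {
    "core": {"core/api.py": b"def run(): ..."},
    "plugin": {"plugin/main.py": b"import core"},
    "docs": {"docs/index.md": b"# Docs"},
}
after_src = {**before_src, "core": {"core/api.py": b"def run(x): ..."}}

affected = changed(package_hashes(before_src, deps), package_hashes(after_src, deps))
print(sorted(affected))  # ['core', 'plugin']
```

No version number appears anywhere in the calculation; the question "what do I rebuild?" is answered entirely by content.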
Be honest about what the lockfile does and doesn't buy back: it restores reproducibility of one resolution, computed when the lock was last updated. It does not restore the global atomic snapshot — there is still no commit that reads A and B together. It fixes the lossy key; it cannot fix the lost transaction.
The split is also an attack surface
Everything so far has been about correctness and cost. But there is a second ledger, and it is the one most teams forget to price: security. A split doesn't only change how you find breakage — it opens a class of attacks that an in-tree dependency does not have, because every one of them is an attack on the name and version a registry resolves, and an in-tree dependency has neither. It is resolved by content, at a commit, from a tree you control.
A content hash is a strictly stronger versioning contract than SemVer or any other scheme: it pins the exact content and guarantees the integrity of your supply chain in one move. There's no name to mimic, no registry to race, no version to re-point.
Every supply-chain attack vector is, at bottom, a tax on the version string the split introduces. Walk them, and notice each one needs a property the split creates that a content reference never had:
- Typosquatting needs a public namespace to mimic. Publish reqeusts next to requests and wait for a typo in someone's manifest. In a workspace there is no name to squat: a plugin imports the core by genkit = { workspace = true }, resolved to a path in the tree, and a typo is a build error in your own repo, not a silent install of someone else's code.
- Dependency confusion needs a resolver choosing between a private and a public source by version. Alex Birsan's 2021 research weaponized exactly this: publish a package with the same internal name and an absurd version like 9000.0.0 to a public index, and the resolver, told only to take the "higher version number," pulls the attacker's code. He hit Apple, Microsoft, PayPal, Netflix, and more than thirty others this way [20]. The attack is the article's thesis turned hostile: it exploits the resolver's trust in a version string over content identity. An in-tree dependency offers no such choice: there is one source, the tree.
- Shadow (transitive) dependencies are the packages you never named, pulled in by the packages you did, through their ^ ranges, often dozens deep. Each is a separately-published, separately-owned, separately-mutable artifact in your build, and you reviewed none of them. In a monorepo the entire dependency graph of your coupled code lives in the tree, where it is read, diffed, and committed like any other code. Splitting turns a reviewed subtree into an unreviewed transitive closure, the very thing a per-dependency review habit exists to catch.
- Version mutability is the quiet one. A content hash is the bytes; a version string is a pointer to bytes that can move. The left-pad incident (2016) showed the benign face: one unpublished 11-line package 404'd builds at Facebook, PayPal, Netflix, and Spotify [21]. To npm's credit the registry is immutable per version (a published 1.5.0 can never be reused, even after unpublish [22]), so the malicious face works one level up, on the pointers that remain mutable: publish a backdoored new version and every open range admits it, move the latest dist-tag, or compromise the maintainer who does. (Registries and mirrors that lack the immutability rule lose even the per-version guarantee.) A git commit cannot be re-pointed at new bytes; that is the whole meaning of content-addressing.
Dependency confusion: substituting a malicious public package for a private one by exploiting that the resolver picks the highest version across all configured registries. The fix is to pin the source per package, which is, in effect, re-adding the identity the version string lost.
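The mechanics of dependency confusion are small enough to sketch. Everything here is hypothetical: the package name acme-billing, the two toy indexes, and both resolver functions are illustrations, not the behaviour of any specific package manager. The first resolver pools candidates from every index and takes the highest version; the second looks a pinned package up only in its declared source.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.5.0' into (1, 5, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical indexes: package name -> published versions.
INDEXES = {
    "private": {"acme-billing": ["1.4.2", "1.5.0"]},
    "public": {"acme-billing": ["9000.0.0"]},  # the attacker's upload
}


def resolve_highest(name: str) -> tuple[str, str]:
    """Naive resolver: pool every index and trust the biggest version number."""
    candidates = [
        (parse(v), index_name, v)
        for index_name, index in INDEXES.items()
        for v in index.get(name, [])
    ]
    _, index_name, version = max(candidates)
    return index_name, version


SOURCE_PIN = {"acme-billing": "private"}  # per-package source pin


def resolve_pinned(name: str) -> tuple[str, str]:
    """Pinned resolver: a package with a declared source is only looked up there."""
    index_name = SOURCE_PIN.get(name, "public")
    version = max(INDEXES[index_name][name], key=parse)
    return index_name, version


print(resolve_highest("acme-billing"))  # ('public', '9000.0.0')
print(resolve_pinned("acme-billing"))   # ('private', '1.5.0')
```

The pin works by restoring an identity (which source this name belongs to) that the bare version string never carried.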
None of this is an argument that registries are unsafe to use — you depend on external packages no matter what, and the ecosystem has built real defenses (scoped names, lockfile integrity hashes, provenance attestation). It's a narrower, sharper point:
Splitting your own coupled code across the registry boundary moves it from the side of the line with none of the supply-chain risks listed above to the side with all of them. You are not just paying a coordination tax; you are enrolling code you wrote, and used to resolve by content, into the threat model of the public supply chain. The in-tree dependency was never in that threat model at all.
Submodules are a foreign key without the transaction
There is a tempting middle road that looks like it should give you the best of both: keep the dependency in its own repo, but vendor it back in as a git submodule, so your repo pins an exact commit of it. No registry, no version string — just a pin. It is worth seeing why this is usually the worst option, because it falls out of the database model immediately.
A submodule is not a copy of the other repo; it is a gitlink — a single tree entry, with the special mode 160000, whose contents are one commit SHA of the foreign repo [12]. Read through the database lens, that is exactly a foreign key: your tree stores a pointer (the SHA) into another database's object store, not the rows themselves.
Gitlink: a tree entry of mode 160000 that records a commit SHA instead of a blob or subtree. It is how a superproject names "which commit of the submodule I point at."
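You can see the foreign key directly in a tree listing. The sketch below parses text in the format git ls-tree prints (mode, type, object name, then a tab and the path) and picks out the gitlinks. The sample listing, the paths, and the placeholder SHAs are made up for illustration; the helper names are mine, not part of git.

```python
from typing import NamedTuple

GITLINK_MODE = "160000"


class TreeEntry(NamedTuple):
    mode: str
    kind: str
    sha: str
    path: str


def parse_ls_tree(output: str) -> list[TreeEntry]:
    """Parse lines shaped like '<mode> <type> <object><TAB><path>'."""
    entries = []
    for line in output.splitlines():
        if not line:
            continue
        meta, path = line.split("\t", 1)
        mode, kind, sha = meta.split(" ")
        entries.append(TreeEntry(mode, kind, sha, path))
    return entries


def gitlinks(entries: list[TreeEntry]) -> dict[str, str]:
    """Map each submodule path to the foreign commit SHA the superproject pins."""
    return {e.path: e.sha for e in entries if e.mode == GITLINK_MODE}


# Hypothetical listing with placeholder object names.
SAMPLE = "\n".join([
    "100644 blob " + "a" * 40 + "\tREADME.md",
    "040000 tree " + "b" * 40 + "\tsrc",
    "160000 commit " + "c" * 40 + "\tvendor/core",
])

print(gitlinks(parse_ls_tree(SAMPLE)))
```

The blob and the tree are rows in your own object store. The 160000 entry is only a pointer: the commit it names lives in another repository.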
And that is the whole trouble: a foreign key buys you referential pinning without the transaction that makes a key safe. You get the exact-bytes guarantee — the lockfile's one virtue — but none of the atomic snapshot, and you pay for it in operational friction the monorepo never charges:
- No atomic cross-repo commit. Updating the dependency is two commits in two repos: commit inside the submodule, then commit the moved pointer in the parent. Between them the parent references a commit nobody else can fetch yet — push them out of order and a collaborator's checkout points at a SHA that doesn't exist for them. The split's lost transaction, re-enacted by hand on every update.
- Detached HEAD by default. git submodule update checks the submodule out to a bare commit, not a branch, so work done inside it lands on no branch and is easy to lose on the next update.
- The pin is silent and opt-in to follow. A plain git clone leaves the submodule empty; a plain git checkout across branches doesn't move it unless you remember --recurse-submodules. The pointer is real but the tooling makes it trivially easy to build, test, and ship against a stale one without any error: the version-skew failure of the split, now with no resolver to even warn you.
- Pin-versus-tip splits teams, and one side breaks CI. Two camps form. One pins: the gitlink SHA is law, and CI builds exactly what the superproject commit records. That is reproducible, but the dependency goes stale until someone bumps it by hand. The other floats: git submodule update --remote "instead of using the superproject's recorded SHA-1… use[s] the status of the submodule's remote-tracking branch," fetching the tip at build time. The float is the one that breaks: the build is now a function of when it runs, not of the superproject commit. You test against the tip at noon; CI runs at three against a tip that a merge moved in between, so CI builds something you never tested, and the SHA the superproject records no longer matches what shipped. Re-run the same commit tomorrow and you get different bytes. It is the essay's whole thesis in miniature: the gitlink was a content address, and --remote throws it away to chase a mutable branch name, reintroducing exactly the "the name can point at different bytes later" failure the content hash existed to prevent.
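The pin-versus-float difference reduces to which key the build reads. Here is a toy model, under loud assumptions: the object store is a plain dict keyed by a SHA-1 of the raw content (real git hashes an object header too), and the ref name and contents are hypothetical. The pinned build is a pure function of the recorded SHA; the floating build also depends on when it runs.

```python
import hashlib

objects: dict[str, bytes] = {}  # toy content-addressed store


def store(content: bytes) -> str:
    """Save content under a key derived from the content itself."""
    key = hashlib.sha1(content).hexdigest()
    objects[key] = content
    return key


noon = store(b"core as of noon")
refs = {"origin/main": noon}   # mutable name -> SHA
superproject_pin = noon        # what the gitlink records


def build_pinned() -> bytes:
    """Build from the SHA the superproject commit recorded."""
    return objects[superproject_pin]


def build_floating() -> bytes:
    """Build from whatever the branch name points at right now."""
    return objects[refs["origin/main"]]


tested = build_floating()                              # noon: tip equals the pin
refs["origin/main"] = store(b"core after a 2pm merge")  # someone merges upstream
shipped = build_floating()                             # three o'clock: same superproject commit

print(tested == shipped)          # False: the name moved
print(build_pinned() == tested)   # True: the address did not
```

Nothing in the superproject changed between the two floating builds, which is exactly why the failure is invisible in its history.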
So a submodule is federation with the costs doubled: you took on the split's lost global snapshot and its version-skew risk, and in exchange you got a manual, foot-gun-laden pin instead of a registry and a lockfile that at least resolve and verify for you. If the code is genuinely yours, keep it in the tree and let one commit be one consistent read. If it is genuinely someone else's, depend on it as a published, lockfile-pinned package. The submodule sits in the unhappy valley between: the coordination cost of two repos with neither the atomicity of one nor the tooling of a real registry.
What about git subtree? It is the same vendoring instinct executed honestly. Where the submodule stores a foreign key, git subtree denormalizes: it merges the other repo's actual trees into a subdirectory of yours (the subtree merge), so the bytes are rows in your database again — the global snapshot is back, one commit is one consistent read, and a plain git clone is complete, with no --recurse-submodules to forget. The costs move to the write-back path, where they belong: contributing upstream means re-deriving the foreign projection with git subtree split / push, and the imported history either interleaves with yours or is flattened by --squash [13]. Notice what the trade concedes: git's own escape hatch from submodule pain works by moving the bytes back into one tree. And if you never push back, it is simply vendoring — a monorepo decision made one dependency at a time.
"But we're not Google"
This is the reflexive objection, and it deserves a direct answer: none of the argument above mentioned Google's scale. The complexity result (one atomic commit versus a quadratic web of version constraints handed to an NP-complete solver) is arithmetic, and arithmetic does not depend on who is doing the adding; it depends on what you are adding. The cost of encoding coupling as version strings depends on the coupling and the count, not on the logo on your badge. A three-person team that shards four tightly-coupled packages pays the same shape of cost a thousand-person team does: fewer constraints in absolute terms, but the same quadratic curve, the same post-release failure mode, the same NP-complete resolver. Smallness doesn't exempt you from the math; it just lowers the constant in front of it.
"We're not Google" answers a question nobody asked. The split's cost isn't a function of your size; it's a function of your coupling, regardless of who's counting.
And here is the part that inverts the intuition entirely: the fewer engineers you have, the more this costs you, not less. The instinct runs the other way — "we're small, we can't afford monorepo discipline" — but look at which kind of cost each side is. The monorepo's costs are capital: one-time infrastructure (the tooling the honest-limits section names) that you mostly don't even pay until you're huge. The split's costs are operating: recurring labor — a release-and-bump cascade, a compatibility test, a skew investigation — spent out of the same budget, on every cross-cutting change, forever. A large team has slack to absorb recurring labor; a three-person team spends it instead of shipping. The split asks the resource-constrained team for precisely the resource it has least.
Worse, that labor doesn't arrive as planned work. Version skew surfaces post-release — in CI on a Friday, in a user's ImportError, in the bug report about the combination that won't resolve — so it lands as interrupts, the most expensive form of work for a team with no one to spare. And because a small team can't keep up with the reconciliation, the drift compounds: dependencies fall "months or even years behind" [24], each lag making the next upgrade harder, until the cost of catching up exceeds the cost of the feature you were trying to ship. The team that most needs every hour to go to product is the team that can least afford to spend it negotiating versions with itself.
Constraint doesn't make the split cheaper; it makes you less able to pay. The monorepo's cost is capital you mostly defer; the split's is labor you spend on every change — and it compounds fastest on the team with the fewest hands.
So the "we're not Google" instinct points the opposite way from where it's aimed. Google is the one organization with the dedicated source-control engineering to make a giant monorepo painless and the scale that occasionally forces a shard; you have neither. So the honest move is not to invoke scale as a reason to split — it's to measure, because the odds are you're splitting when you don't need to.
The measurement is concrete. Take the cluster you're tempted to break apart and ask:
- How coupled is it, really? Count the import edges between the subprojects. If they import each other freely — the diamond shape from the earlier figure — you are looking at exactly the constraint web a split would force you to maintain by hand. If they barely touch, they were already separable and the split is cheap (and this whole essay doesn't apply to them).
- How often does a change cross a subproject boundary? Run git log over the cluster and count commits that touch more than one subproject. Each such commit is one that, after a split, becomes a release-and-bump cascade across repos. If that number is anything but tiny, you are about to convert your most common kind of change into your most expensive kind.
- Is the VCS actually suffering? Not "the repo feels big": is git status slow, is clone painful, is one subproject's history dwarfing the rest? That is the only mechanical reason to shard, and it has a measurable threshold.
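The second measurement, commits that cross a subproject boundary, can be scripted. This is one possible sketch, not a finished tool: the "@@" marker, the function names, and the example subproject roots are my own choices, and it counts a commit by the subproject directories its paths fall under. Merge commits, which list no paths by default, simply count as touching nothing.

```python
import subprocess
from typing import Optional

MARK = "@@"  # prefix that makes commit lines unambiguous in the log output


def git_log_names(repo: str = ".") -> str:
    """Return git log output: a marked hash line per commit, then the paths it touched."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"--pretty=format:{MARK}%H", "--name-only"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def subproject_of(path: str, roots: tuple[str, ...]) -> Optional[str]:
    """Map a file path to the subproject root that contains it, if any."""
    for root in roots:
        if path.startswith(root.rstrip("/") + "/"):
            return root
    return None


def cross_boundary(log_text: str, roots: tuple[str, ...]) -> tuple[int, int]:
    """Return (commits touching two or more subprojects, commits touching at least one)."""
    per_commit: list[set[str]] = []
    for line in log_text.splitlines():
        if line.startswith(MARK):
            per_commit.append(set())          # a new commit begins
        elif line.strip() and per_commit:
            root = subproject_of(line.strip(), roots)
            if root is not None:
                per_commit[-1].add(root)
    touching = sum(1 for touched in per_commit if touched)
    crossing = sum(1 for touched in per_commit if len(touched) > 1)
    return crossing, touching


# Hypothetical log text in the same shape git_log_names() returns.
SAMPLE_LOG = """@@aaa
packages/core/api.py
plugins/anthropic/client.py

@@bbb
packages/core/util.py

@@ccc
README.md
"""

print(cross_boundary(SAMPLE_LOG, ("packages/core", "plugins/anthropic")))  # (1, 2)
```

On a real repository you would call cross_boundary(git_log_names("."), roots) with your own subproject roots. The ratio of the two numbers is the share of your changes a split would turn into a release-and-bump cascade.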
Most teams who run those three numbers discover the split was solving a problem they didn't have, at a cost they hadn't priced. But the numbers can come back the other way — and when they genuinely do, only one reason survives the analysis. The case that looks like a second reason, an external consumer boundary, turns out to demand honest versioning rather than a move; it's worth working through precisely because it's the one people most often mistake for a mandate to shard.
The repo outgrew its storage engine (the celebrity shard)
The one purely mechanical reason: the repository has grown so large that the VCS itself suffers, and its size is dominated by one subproject. When the database outgrows its storage engine, you shard the part that's bursting.
The giants all hit this wall — and notice it's a different wall from coupling. Microsoft's Windows repo on Git is "approximately 3.5M files" and "about 300GB," with operations degrading as engineers "crawl across the code base and touch more and more stuff" (they named it "over hydration"). Brian Harry's writeup frames the wins in percentiles — clone at the 80th percentile dropping to 127 seconds — not the headline minutes-to-seconds you sometimes see quoted [25]. Meta's Sapling and EdenFS set the explicit goal — per the Sapling README — that operations should "scale with the number of files in use by a developer, and not with the size of the repository itself." Google's monorepo — over 80 terabytes and around two billion lines — runs on Piper, the successor to Google's old Perforce setup, built on Bigtable (now Spanner) behind a virtual filesystem [26]. (One caveat for this essay's own thesis: Piper's internal storage model isn't publicly documented as content-addressed — a Piper workspace is described only as "comparable to… a client in Perforce" — so its scale is agnostic evidence for content-addressing-at-scale. The cleaner case for content addressing is made below, by build systems that demonstrably key on content hashes.)
All three giants solved the same scaling problem the same way — a lazy virtual filesystem (Microsoft's VFS for Git, Meta's EdenFS, Google's Clients-in-the-Cloud) that materializes only the files you touch. The fix for scale is virtualization, not fragmentation. None of this is free, though: each is years of dedicated source-control engineering. At monorepo scale the content hash stays valuable, but it stops being free — you pay for it in infrastructure.
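The core idea of a lazy virtual filesystem fits in a toy class. To be clear about what this is: a sketch of "materialize only what you touch," with a dict standing in for the server-side store and hypothetical paths. It is not how VFS for Git, EdenFS, or Google's system is implemented; those involve filesystem drivers, prefetching, and caching that this ignores.

```python
class LazyTree:
    """Toy working tree: every path is visible, contents are fetched on first read."""

    def __init__(self, remote: dict[str, bytes]) -> None:
        self._remote = remote              # stands in for the server-side object store
        self._local: dict[str, bytes] = {}  # what has actually been materialized

    def listdir(self) -> list[str]:
        """Listing is metadata only; no file contents are fetched."""
        return sorted(self._remote)

    def read(self, path: str) -> bytes:
        """Fetch a file the first time it is read, then serve it locally."""
        if path not in self._local:
            self._local[path] = self._remote[path]
        return self._local[path]

    @property
    def materialized(self) -> int:
        return len(self._local)


# Hypothetical repository of 100,000 files.
remote = {f"src/module_{i}/main.py": b"..." for i in range(100_000)}
tree = LazyTree(remote)

tree.read("src/module_7/main.py")
tree.read("src/module_42/main.py")
tree.read("src/module_7/main.py")  # second read is served locally

print(len(tree.listdir()), tree.materialized)  # 100000 2
```

The local cost tracks the two files touched, not the hundred thousand that exist, which is the property the Sapling goal quoted above describes.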
The tell is the distribution. If one subproject dwarfs the rest and the VCS is genuinely hurting, that one earns its own shard. A balanced repo that's merely biggish does not.
Team ownership is access control, not a repository
The most common reason of all is "each team should own its code" — and it is the one that most cleanly dissolves once you ask what "own" means. Ownership is a bundle of three rights: who approves changes here, who is responsible when it breaks, and who decides its direction. Every one of those is a question about people and permissions, and none of them is a question about where the bytes live or how they're versioned. Splitting the repo to get ownership is answering a governance question with a storage decision — and paying for it in the consistency currency this whole essay is about.
A repository is a coarse, expensive access-control list. Team ownership is an ACL question; you don't need a second database to answer it.
The boundary you actually want is within one repo, and the tooling for it is mature. A CODEOWNERS file makes a directory a team's domain: changes that touch packages/checkout/ automatically request the checkout team, and branch protection or a ruleset can require approval from a listed owner before the change lands [14]. Bazel visibility rules go further and make a package's API a compile-time contract — code outside the allowed list cannot even depend on you, enforced by the build, not by a code-review convention. You get the hard boundary — approval rights, a clear responsible team, an enforced API surface — with the global snapshot fully intact. One commit still reads the whole tree consistently; it just can't be authored across team lines without the right approvals.
CODEOWNERS: a file mapping path patterns to the teams that must review changes under them. It encodes ownership as review policy over a subtree, which is what ownership actually is, not a separate repository.
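Ownership-as-path-policy is simple to model. The sketch below is a deliberate simplification: real CODEOWNERS patterns follow gitignore-style matching, while this handles only a catch-all and directory prefixes. It does keep the documented precedence, where the last matching rule wins. The team handles and the fraud subdirectory are hypothetical.

```python
# Hypothetical rules, in file order: (pattern, owners).
RULES = [
    ("*", ["@org/platform"]),
    ("/packages/checkout/", ["@org/checkout"]),
    ("/packages/checkout/fraud/", ["@org/risk"]),
]


def matches(pattern: str, path: str) -> bool:
    """Simplified matching: '*' matches everything, otherwise a directory prefix."""
    if pattern == "*":
        return True
    return ("/" + path).startswith(pattern)


def owners_for(path: str, rules: list[tuple[str, list[str]]] = RULES) -> list[str]:
    """Return the owners from the last rule that matches the path."""
    owners: list[str] = []
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners  # later rules override earlier ones
    return owners


print(owners_for("packages/checkout/cart.py"))         # ['@org/checkout']
print(owners_for("packages/checkout/fraud/score.py"))  # ['@org/risk']
print(owners_for("README.md"))                         # ['@org/platform']
```

The same lookup is what a scanner or pager needs in order to route a finding from a path to a team, which is why a reorg here is an edit to the rule list rather than a re-sharding.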
This is the Conway's-Law point the utils essay opened on, read the right way round. Your module boundaries should mirror your team boundaries — but a module boundary is a directory with an owner and an enforced API, not a separate repository with its own version history. The split doesn't give you the team boundary; CODEOWNERS already did. What the split adds, on top of the boundary you wanted, is the version dance you didn't — because now every cross-team change is also a cross-repo change, negotiated through published versions instead of a single reviewed commit. Ownership asked for a fence; the split builds a fence and then floods the land between the fenced plots.
There's a real failure mode hiding in the conflation, too: split-for-ownership tends to ossify the org chart into the build graph. When the team boundary is a CODEOWNERS line, a reorg is a one-line edit; when it's a repository split, the boundary is welded into every downstream's dependency manifest, and moving it means re-sharding and re-versioning. You wanted teams to own their code; you ended up letting last year's org chart own your dependency graph.
Be fair about where this gets hard, though, because it's the honest catch. CODEOWNERS cleanly solves review routing, and Bazel visibility solves API enforcement — but a lot of the surrounding platform still treats the repository as the atomic unit of ownership. Out-of-the-box CI dashboards, vulnerability scanners, SLO/alerting, deploy pipelines, and cost attribution often key on "which repo?" and can't, by default, map a finding to packages/checkout/ and page the checkout team. Inside a monorepo you have to teach them the sub-tree boundaries — and that is genuine platform-engineering work, not a free lunch. The answer is to make ownership a first-class artifact in the tree rather than an implicit property of the repo: a service-catalog manifest (Backstage's catalog-info.yaml, one per component, anywhere in the tree), or a code-ownership index, that maps directories to owning teams so the scanner, the dashboard, and the pager can all resolve "who owns this finding?" from the path. Note the shape of this cost: it is one-time, in-tree infrastructure you build once and every team inherits — the same capital cost the honest-limits section describes, not the per-change, per-release operating cost the split would charge you forever.
You don't own the call sites
The next consideration looks like another reason to split, and mostly isn't. If a package has genuine external consumers — teams whose call sites you do not control — then the monorepo's atomic-fix superpower buys you nothing for them, because you cannot land the fixes you don't own. You will pay the SemVer tax on that boundary no matter where the code lives. But notice what that obligates: it obligates you to version honestly and treat compatibility as a real contract. It does not, by itself, obligate you to move the code.
That's the slip worth catching. "You owe external consumers a versioned contract" and "you must split the repository" are different claims, and the first does not imply the second. A package can keep living in the monorepo — single source of truth, atomic internal refactors intact — and still publish honest, range-resolvable versions to the outside: tag a release and run the publish step from the monorepo, or push a read-only mirror of that one subtree and cut versions from there. That mirror is a solved problem — Google open-sourced Copybara, the tool it built "for transforming and moving code between repositories," whose headline case is exactly this: keeping a confidential internal repository and a public one in sync [17]. It moves and rewrites paths, excludes internal-only files, and carries commit history forward — so the source of truth stays put while external consumers get a clean repo to depend on. External consumers resolve a version string either way; they never needed a separate writable history to do it.
So a true repo split is forced by the boundary only when something outside the versioning contract demands a separate home: you're open-sourcing a component and want its issues, governance, and license to live apart; a downstream legally cannot pull from your tree; an acquirer or partner takes ownership. And even most of those are satisfied by the mirror — a separate, public, read-only projection — rather than by tearing the subtree out of the source repo. The fork of writable history is the last resort, not the default the word "external" seems to invite.
This sharpens the usual advice. The rule is not "tightly-coupled code should never be split," it is not "team ownership means each team needs its own repo," and it is not "external consumers mean you must split." It's narrower:
There is really only one reason to split: one subproject so dominant the VCS itself is suffering. Team ownership is a CODEOWNERS line, not a repository. External consumers oblige you to version honestly — publishable from the monorepo or a read-only mirror — not to move the code. Fork writable history only when something outside the contract (open-sourcing, legal separation, a change of ownership) demands a separate home. Never split for tidiness — tidiness is what the denormalization costs you.
The honest limits
"A repository is a database" is a sharp lens, but it is a lens, and it warps at the edges. Four places to be careful.
Git is a key-value store with a custom query surface, not a relational engine. It has no general query language and no secondary indexes by default — to ask "what depends on X?" you traverse, or you build your own index (which is exactly what the commit-graph and pack-index files are). Calling it "a database" is fair only if you mean a content-addressed object store, not Postgres.
There is no cross-push ACID transaction. The snapshot guarantee is local: one repository, one revision. Network partition between clones is the normal state. But that's the whole point of this essay — the split is precisely what trades the local global-snapshot for distributed eventual consistency; the lack of cross-repo atomicity is the disease, not a flaw in the diagnosis.
And don't over-read individual mappings: a content hash is a content-derived natural key that doubles as an integrity check, not an arbitrary surrogate primary key; the append-only commit graph is MVCC-like but exists for provenance, not concurrency control. The analogy earns its keep at the level of consistency and resolution — which is exactly where the split does its damage.
MVCC (multi-version concurrency control): instead of overwriting data in place, a database keeps old versions of each row so readers see a consistent snapshot while writers make new ones. Git also keeps every version, but as history, not as transaction isolation.
Finally — and this is the cost the analogy must own — the single-repo virtues are not free at scale. The build graph the One-Version Rule makes legible also makes broad changes expensive to land: in Uber's Go monorepo a change touching a widely-used library can block every other change in the merge queue while its dependents revalidate — head-of-line blocking the queue, an O(1) commit serialized into an O(everything) wait. Even this, though, is a bottleneck the tooling has already largely engineered away: a modern merge queue tests pull requests speculatively in parallel — each batched against the base plus the changes ahead of it, so they validate concurrently rather than one-at-a-time, and a failure simply ejects the offending change and revalidates the rest. The serial wait becomes a parallel one. And stock Git won't carry the load until someone builds the storage engine for it.
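Speculative merge-queue validation can be simulated in a few lines. This is a sketch of the general technique as described above, not any particular vendor's algorithm: the change names are hypothetical, passes() stands in for a CI run on the base plus a list of changes, and the speculative builds that a real queue would run concurrently are evaluated here in a plain loop.

```python
from typing import Callable


def run_queue(queue: list[str],
              passes: Callable[[list[str]], bool]) -> tuple[list[str], list[str]]:
    """Return (merged, ejected) after speculative validation of the whole queue."""
    merged: list[str] = []
    ejected: list[str] = []
    pending = list(queue)
    while pending:
        # Candidate i is tested on everything merged plus pending[: i + 1],
        # i.e. the base and all the changes ahead of it in line.
        results = [passes(merged + pending[: i + 1]) for i in range(len(pending))]
        if all(results):
            merged += pending
            break
        first_bad = results.index(False)
        merged += pending[:first_bad]        # changes ahead of the failure land
        ejected.append(pending[first_bad])   # the offending change leaves the queue
        pending = pending[first_bad + 1:]    # the rest is revalidated on the new base
    return merged, ejected


def passes(changes: list[str]) -> bool:
    """Hypothetical CI verdict: any build that includes change B fails."""
    return "B" not in changes


print(run_queue(["A", "B", "C", "D"], passes))  # (['A', 'C', 'D'], ['B'])
```

One failing change costs one extra round of validation for the changes behind it, instead of holding each of them in a serial line.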
But note the shape of these costs, because it is the opposite of the split's. They are engineering problems with known, shipped answers (cone-mode sparse-checkout for the quadratic pattern-match, VFS for Git and Meta's Sapling/EdenFS for the working tree, merge-queue scheduling for head-of-line blocking): one-time, amortized infrastructure. Google has run its shared codebase as a monorepo this way for over two decades, and that codebase had already reached two billion lines and 86 TB by January 2015, per its own CACM paper. The in-tree experience there is not a brave experiment; it is a solved problem, and engineers who have worked in it know the cross-repo version dance simply does not arise. The split's cost is the mirror image: not a one-time engineering bill but a recurring, structural one (the constraints, the NP-complete resolution, the post-release breakage), paid again on every change, forever, with no infrastructure that makes it go away, because it is inherent in encoding coupling as version strings.
So the honest reckoning is not "monorepos are free." It is that the two cost curves point in opposite directions. The monorepo's costs are real, finite, and already paid down by the people who hit them; the split's costs compound, and they land in the worst place a cost can land — after release. Some make the contrarian case that at scale a monorepo must solve every problem a polyrepo does anyway (Matt Klein's "Monorepos: please don't!" is the sharpest version) [23]; it has real points about tooling, but it argues from ergonomics where this essay argues from the math, and the math is what decides it. For code that changes together, the database is not merely the nicer model — it is the asymptotically cheaper one.
Head-of-line blocking: when the first item in a queue stalls everything behind it, however ready those are. A merge queue revalidating a sweeping change holds up the small independent changes lined up after it.
The other half: friction, not just cost
Everything up to here has been the technical case: hashes, snapshots, the constraint web. But I've managed engineering teams, and the cost that actually shows up in a standup isn't on that graph. It's friction, and a split manufactures it.
Watch what a post-release break does to a team's communication.
In a monorepo, a breaking change and its fixes are one commit with one author and one reviewer: the conversation is "here's the change, here are the call sites I updated, PTAL." After the split, the same break becomes a mystery. The app is down; nobody's commit obviously did it; so the work starts — a search across the subgraph of related repos to find which release introduced it, a hunt for the owner of the package that changed, a thread that slides from "what broke?" into "who broke the contract?", and finally a coordinated dance of releases and version bumps across repositories just to get back to a state that, in one tree, would have been a single green commit.
None of that is engineering. It is coordination overhead the repository boundary created, and it lands as the worst kind of team time: interrupt-driven, cross-team, and faintly accusatory.
A repository boundary is also a communication boundary. Split a coupled team across repos and you convert "here's my diff" into "whose release broke us?" — you've replaced a code review with an investigation.
That is the part of the job the math misses. A manager's actual role is to minimize internal friction — to make it cheap for people to change each other's code and easy to see when they've broken something. A repository split does the opposite: it builds a wall, hands each side its own release process, and then asks them to negotiate across it. You don't improve a team's throughput by giving it more boundaries to coordinate over; you improve it by removing the ones that weren't buying anything.
And the single tree pays organizational dividends the cost graph never shows:
- A new component is nearly free. Adding a subproject to a monorepo is making a folder — it inherits the CI, the linters, the release process, the dependency graph, the review gates, all of it, on day one. Standing up the same component as its own repo is a project: a new pipeline, new secrets, new branch-protection rules, a new place for the scanner and the dashboard to know about — DevOps in the loop before a line of product ships. The monorepo's marginal cost per component trends toward zero; the polyrepo's is a fixed setup tax paid every single time.
- Shared tooling, once. One formatter, one linter, one test runner, one CI config — configured a single time and inherited by every project, instead of drifting independently across N repos until "fix it in all of them" is its own migration.
- Consistency by default. The same code style, the same commit conventions, the same review gates apply everywhere, because there is only one everywhere. New code looks like old code without anyone enforcing it cross-repo.
- One CI verdict. "Is main green?" has a single answer for the whole coupled system, not a matrix of per-repo statuses you have to mentally join to know whether the product actually works.
- Learning by osmosis. An engineer who touches one corner can read, grep, and step through the rest — the whole system is right there. Splitting hides the other half behind a package boundary and a separate checkout, and institutional knowledge fragments along the same lines the repos do.
None of these are exotic. They are the quiet, compounding benefits of everyone working in the same place — and they evaporate the moment "the same place" becomes several places that talk through a registry.
There's a temperament behind which way a team leans, worth naming because it drives the decision more than any analysis does. Splitting is an operational fix: it feels like progress — you spin up a repo, wire a pipeline, ship — and the cost it incurs is deferred and diffuse. Keeping the cluster coherent is a structural fix: it asks where in the tree the thing belongs, who owns the directory, what the dependency edge should be — questions with no dashboard and no quick win. A monorepo rewards the team willing to make structural decisions and frustrates the one that only reaches for operational ones, because in a monorepo the structure is right there, unavoidable, and yours to get right or wrong. Reaching for the split is often less a considered architecture choice than a preference for the fix that doesn't make you think about structure — which is exactly the fix that lets the structure rot.
It's worth asking who recommends the split, because the incentives rarely point the same way as the costs. Ask a barber whether you need a haircut and you know the answer; ask the platform or release-engineering function whether you should split the repo and you'll often get the same reflexive yes — not from bad faith, but because a new repository is a new artifact to own, automate, and justify headcount around. The party that recommends the split gets a tidy new thing to manage; the engineering team gets handed the NP-complete version-resolution problem, the post-release breakage, and the coordination tax — forever. When the decider doesn't bear the cost, the cost gets chosen. Notice where the bill lands before you take the advice.
And sometimes the reason isn't operational at all — it's territorial. "I don't want to share my code with the other team." "Why should the frontend team's changes break my backend?" A repository wall feels like sovereignty: my repo, my rules, my green build, insulated from your mess. But the instinct mistakes the boundary of the org chart for the boundary of the product. The user doesn't hold your backend; they hold the whole thing, frontend and backend together, and it works or it doesn't as a unit. Two teams whose code ships in one product are coupled whether or not they admit it — the wall doesn't decouple them, it just hides the coupling until it surfaces as a cross-repo integration failure nobody owns. Wanting a wall so the other team's changes can't touch yours is wanting the coupling to be someone else's problem; it never is, because the user's problem is the product, and the product is one thing. The job was never to protect your repo. It was to ship a thing that works in someone's hands — and that thing does not respect your team boundaries.
So there are three cases against splitting a coupled codebase, and they reinforce each other. There is the technical one — content hashes, the global snapshot, the versioning tax — which most of this essay has made. There is the people one — friction, blame, lost osmosis, walls where you wanted bridges — which any engineering manager has felt even without the vocabulary for it. And there is the business one, which is just the first two read off the cost curves: the monorepo's cost is capital — a one-time investment that amortizes, so the curve flattens and the ceiling stays low as you grow. The split's cost is operating — a per-change, per-team coordination tax that compounds with headcount and with time, so the curve keeps climbing. A cost that scales sub-linearly with the company and one that scales super-linearly aren't two prices; they're two trajectories, and the gap between them is the whole question a business should be asking. The split feels cheaper in the quarter you do it and is more expensive in every quarter after — exactly the cost profile a growing company can least afford, because it's buying a drag that gets heavier precisely as it scales.
The business case is the cost curve, not the cost. A monorepo's ceiling stays low as you grow; the split's climbs with every team and every release. You aren't choosing a price — you're choosing a trajectory.
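To make the two trajectories concrete, here is a toy model, not a measurement. Every number in it is invented for illustration: `capital`, `upkeep`, `per_pair`, and the one-new-team-per-quarter growth are all assumptions, and the split's tax is charged at the worst case of one coordination edge per pair of teams. The only thing it is meant to show is the shape: a one-time cost being overtaken by a recurring cost that grows with the number of pairs.

```python
def pairs(teams: int) -> int:
    """Worst-case number of team pairs that may need to coordinate."""
    return teams * (teams - 1) // 2


def cumulative_costs(quarters: int, capital: float = 40.0, upkeep: float = 1.0,
                     per_pair: float = 1.0, start_teams: int = 2):
    """Return (quarter, teams, monorepo_total, split_total) rows.

    All parameters are hypothetical units chosen for illustration only.
    """
    mono = capital   # one-time investment, paid up front
    split = 0.0      # no up-front cost, but a tax every quarter
    rows = []
    for quarter in range(1, quarters + 1):
        teams = start_teams + (quarter - 1)   # assumed growth: one team per quarter
        mono += upkeep                        # small, flat running cost
        split += per_pair * pairs(teams)      # coordination tax grows with team pairs
        rows.append((quarter, teams, mono, split))
    return rows


for quarter, teams, mono, split in cumulative_costs(8):
    print(f"Q{quarter}: teams={teams} monorepo={mono:.0f} split={split:.0f}")
```

With these made-up parameters the split is cheaper through quarter 5 and costlier from quarter 6 onward. Different numbers move the crossover; they do not change which curve keeps climbing.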
The reason the split keeps happening anyway is that the technical case is invisible unless you do the math, the people case is easy to mistake for "good fences make good neighbors," and the business case shows up only later — as the slow, unattributable erosion of velocity nobody connects back to the org-chart-shaped split from two years ago. On all three counts, the split is the wrong call for code that changes together. Most people don't do the math; the ones who do, and who have also watched a release-coordination thread eat a sprint, stop reaching for the split.
Lessons
- The unit is coupling, not count: this is about keeping a coherent set of tightly-coupled projects in one repo, not about a Google-scale whole-org "monorepo." Run many repositories — just don't shard things that change together. One repo per unit of code that changes together.
- A repository is a database: the content hash is a primary key, a commit is a versioned snapshot, the history is an append-only log, and one commit is one consistent read across every subproject.
- Inside a repo, a dependency is resolved by content at a snapshot; publishing converts it to a version string resolved at a registry. The split is that conversion, made permanent.
- The split is federation (structure) + loss of the global snapshot (consistency) + denormalization (mechanism). All three, not one.
- Staying in sync changes complexity class: atomic commits in one repo vs. up to one version constraint per pair of packages once split — fed to a solver whose general problem is NP-complete. "One commit" was a complexity collapse, not a convenience.
- SemVer is a lossy three-bucket hash of a diff the VCS already had losslessly; Hyrum's Law makes its compatibility promise undecidable at scale.
- Every versioning scheme is a different lossy projection of the same diff: CalVer keeps time, ZeroVer opts out of the promise, a Go pseudo-version reaches back for the commit hash itself. The closer a scheme drifts to "embed the hash," the more it admits the VCS already had the answer.
- The split moves breakage from a compile-time error (atomic fix, One-Version Policy) to a post-release error (diamond dependency, version skew). The Linux kernel runs this at OS scale: in-tree drivers stay buildable through interface churn because every change fixes its callers atomically — there is deliberately no stable in-kernel ABI to version against.
- A plugin ecosystem is a One-Version Rule with extra steps: the plugins are worthless if they don't agree on the core. Keeping core and plugins in one workspace (Genkit's `uv` tree, `genkit = { workspace = true }`) lets the maintainers absorb the version problem in CI so users never meet it; splitting hands that problem to the user as a compatibility matrix.
- The lockfile re-adds the content hash the split discarded — proof the hash was the valuable thing.
- The split is also an attack surface. Typosquatting, dependency confusion, shadow transitive deps, and backdoored new versions riding open ranges and dist-tags are all attacks on the name and version a registry resolves — which an in-tree, content-resolved dependency doesn't have. Splitting enrolls code you wrote into the public supply chain's threat model.
- A submodule is a foreign key without the transaction: a gitlink pins an exact commit but gives you neither the atomic snapshot nor a registry's tooling — federation's costs doubled, by hand. And `--remote` ("track the tip") throws even the pin away for a mutable branch, so CI builds what's current, not what you tested. Vendor in the tree (`git subtree` is that vendoring done honestly — denormalization with the snapshot restored, not a foreign key) or depend on a published package; don't live in the valley between.
- Team ownership is access control, not a repository: a `CODEOWNERS` line (or Bazel visibility) gives a team approval rights and an enforced API over its subtree with the global snapshot intact. Splitting to get ownership answers a governance question with a storage decision — and welds last year's org chart into every downstream's dependency manifest. The honest catch: much surrounding tooling (CI, scanners, alerting) still keys on repo, so per-directory ownership needs an in-tree ownership manifest (a service catalog) to wire up — one-time capital, not the split's recurring tax.
- Split for exactly one reason: a celebrity shard the VCS can no longer carry. Team ownership is a `CODEOWNERS` line, not a repo. A consumer boundary you don't control obliges honest versioning, not a move — publish from the monorepo or a read-only mirror; fork writable history only when something outside the contract (open-sourcing, legal separation) demands a separate home. Never split for tidiness.
- The monorepo isn't free at scale — but its costs and the split's point in opposite directions. The monorepo's are one-time, solved infrastructure (VFS for Git, Sapling, merge-queue scheduling); Google has run its monorepo this way, at some two billion lines, for over two decades — a solved problem, not a brave experiment. The split's are recurring and structural: the pairwise constraints, NP-complete resolution, and post-release breakage, paid on every change, forever. For code that changes together, the database is the asymptotically cheaper model.
- There are three cases, and they reinforce each other. The technical one is the math above. The people one is friction: a repo boundary is a communication boundary, turning "here's my diff" into "whose release broke us?" — a cross-repo, owner-chasing, release-coordinating investigation. The business one is the cost curve: the monorepo's cost is capital that amortizes (ceiling stays low as you grow), the split's is an operating tax that compounds with headcount and time (curve keeps climbing) — you're choosing a trajectory, not a price. A monorepo also banks the quiet dividends a split forfeits: a new component is nearly free (a folder that inherits everything, vs. a from-scratch repo with DevOps in the loop), shared tooling configured once, one consistent format and CI verdict, and learning by osmosis. A manager's job is to reduce internal friction, not wall it in. The split keeps happening because the technical cost is invisible without the math, the people cost gets mistaken for "good fences," the business cost shows up only later — and because splitting is an operational fix that feels like progress, where keeping the cluster coherent is a structural one that asks you to think. Most people don't do the math.
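The "content hash versus version string" lesson above can be sketched in a few lines. `git_blob_id` reproduces how Git names a blob in a SHA-1 repository (a `blob <size>` header, a NUL byte, then the file's bytes). `semver_bucket` is a hypothetical, deliberately crude stand-in for the human judgment behind a version bump, and the two `retry` snippets are invented examples, not code from any real project.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git gives a blob in a SHA-1 repository."""
    header = b"blob %d\x00" % len(content)  # "blob <size>\0" precedes the bytes
    return hashlib.sha1(header + content).hexdigest()


def semver_bucket(removed_symbols: int, added_symbols: int) -> str:
    """Hypothetical classifier: squash a diff into one of SemVer's three buckets."""
    if removed_symbols:
        return "major"
    if added_symbols:
        return "minor"
    return "patch"


# Two different bug fixes to the same (invented) function; the public API is unchanged.
fix_a = b"def retry(n):\n    return min(n, 3)\n"
fix_b = b"def retry(n):\n    return min(n, 5)\n"

# The content hash tells the two fixes apart.
print(git_blob_id(fix_a) == git_blob_id(fix_b))  # False

# The version bucket cannot: neither fix adds or removes a symbol, so both are "patch".
print(semver_bucket(0, 0), semver_bucket(0, 0))
```

The hash is a function of the bytes, so any change yields a new identity. The bucket is a function of someone's summary of the change, so distinct changes routinely share one label.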
This piece reuses the cost-of-change lens of two companions in the series — The Build Is Proportional to the Change and Utils Is Where Modularity Goes to Die — lifted from file granularity to repo granularity. The sources behind the claims above are collected below.
References
- Scott Chacon, Ben Straub. “Pro Git — Git Internals: Git Objects.” Apress, 2014.
- Derrick Stolee. “Git's Database Internals: Packed Object Store.” GitHub Engineering, 2022.
- Tom Preston-Werner. “Semantic Versioning 2.0.0.” semver.org.
- Hyrum Wright. “Hyrum's Law.” hyrumslaw.com.
- The Go Authors. “Go Modules Reference: Pseudo-versions.” go.dev.
- Juho Snellman. “A monorepo misconception — atomic cross-project commits.” 2021. — the canonical atomic-commit-is-not-atomic-deploy rebuttal, taken on above
- Google. “The One Version Rule.” Google Open Source.
- Hyrum Wright, Titus Winters, Tom Manshreck (eds.). “Software Engineering at Google (Ch. 16: Version Control and Branch Management).” O'Reilly, 2020.
- Russ Cox. “Version SAT.” research.swtch.com, 2016.
- Mancinelli et al. “Managing the Complexity of Large Free and Open Source Package-Based Software Distributions.” EDOS / ASE, 2006.
- Greg Kroah-Hartman. “The Linux Kernel Driver Interface (stable API nonsense).” kernel.org.
- Git. “gitsubmodules.” git-scm.com.
- Git. “git-subtree — Merge subtrees together and split repository into subtrees.” git contrib. — the vendoring answer done as denormalization: the bytes come back into the tree; `split` / `push` re-derive the upstream projection
- GitHub. “About code owners (CODEOWNERS).” GitHub Docs.
- Uber Engineering. “How We Halved Go Monorepo CI Build Time.” Uber, 2022.
- Airbnb Engineering. “Migrating Airbnb's JVM Monorepo to Bazel.” Airbnb.
- Google. “Copybara: moving and transforming code between repositories.” GitHub.
- “Genkit — py/pyproject.toml @ 2c105d21.” GitHub. — pinned to the commit: the Python tree is one `uv` workspace (`members = ["packages/*", "plugins/*"]`), each plugin depending on the core by `genkit = { workspace = true }`
- Astral. “uv: Workspaces.” Astral. — the workspace primitive — a member resolves a sibling from the tree, not the index
- Alex Birsan. “Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies.” Medium, 2021. — the attack that weaponizes "highest version wins" across public and private registries
- “npm left-pad incident.” Wikipedia, 2016. — one unpublished 11-line package 404'd builds across the industry — the name resolved to nothing, which a commit hash never does
- npm, Inc. “npm Unpublish Policy.” npm Docs. — "registry data is immutable" and a used package@version can never be reused — the per-version guarantee; dist-tags and ranges stay mutable
- Matt Klein. “Monorepos: Please Don't!” Medium, 2019.
- Block Engineering. “From Polyrepo Fragmentation to Monorepo Leverage.” Block.
- Brian Harry. “The Largest Git Repo on the Planet.” Microsoft DevBlogs, 2017.
- Rachel Potvin, Josh Levenberg. “Why Google Stores Billions of Lines of Code in a Single Repository.” Communications of the ACM 59(7), 2016.
