The prevailing assumption behind multi-agent AI is additive: more agents should mean more capability. The evidence says otherwise.
What follows reads the new coordination corpus as a single converging result, then draws the architecture decision it implies.
The curse of coordination is now measured, not asserted
The sharpest result comes from CooperBench, which assigns two coding agents separate features on the same repository — logically compatible, but spatially overlapping, meaning the features touch the same regions of code and have to be reconciled to combine — and measures whether the merged result passes both features’ tests. GPT-5- and Claude Sonnet 4.5-based agents reach roughly 25% success when they must cooperate, against roughly 48% when a single agent does both features alone: about a 50% relative drop for the same total workload (Khatua et al., 2026, CooperBench: Why Coding Agents Cannot be Your Teammates Yet, arXiv:2601.13295v2, preprint). The authors name this the curse of coordination, and it does not relent with scale — success falls monotonically from 68.6% with two agents to 46.5% with three and 30.0% with four. Pooled across five models, only 59% of solo capability survives the move to cooperation.
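The arithmetic behind those figures is easy to check. The sketch below uses only the rounded numbers quoted above; the variable names are ours, not CooperBench's. The team-size series is kept separate from the headline pair, because its two-agent figure differs from the headline 25% and the text does not say how the two samples relate.

```python
# Rounded figures as quoted in the text above; names are illustrative only.
solo_success = 0.48   # one agent implements both features alone
coop_success = 0.25   # two agents, one feature each, merged afterwards

# Relative drop: the share of solo success lost when the work is split.
relative_drop = 1 - coop_success / solo_success
print(f"relative drop: {relative_drop:.1%}")  # relative drop: 47.9%

# Success rate by team size, treated as its own series (see note above).
success_by_team_size = {2: 0.686, 3: 0.465, 4: 0.300}
rates = [success_by_team_size[n] for n in sorted(success_by_team_size)]

# "Monotonic" here means every added agent lowers the success rate.
is_monotone_decline = all(a > b for a, b in zip(rates, rates[1:]))
print(f"monotone decline with team size: {is_monotone_decline}")
```

On the rounded inputs the drop comes out just under one half, which is what "about a 50% relative drop" summarizes.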
The effect is not confined to code. The Collaboration Gap evaluates 32 open- and closed-source models on a collaborative maze task, splitting the map so that two agents must combine partial views to solve it. The finding is blunt: “virtually all studied models experience a significant performance drop when moving from a solo to a collaborative setting” (Davidson et al., 2025, The Collaboration Gap, arXiv:2511.02687v1, preprint). Crucially, the stronger agent in a pairing tends to cap joint performance while the weaker one fails to set a floor — collaboration can underperform either participant alone. That single observation should unsettle any architecture that routes work to a mix of large and small models on the assumption that the strong one will carry the team.
Capability does not predict coordination
If coordination were simply a harder form of capability, the best solo models would coordinate best. They do not. CooperBench reports that its weakest individual coder retains the most capability under cooperation (retention 0.68) while a mid-tier coder retains the least (0.46); coding skill provides no protection against coordination overhead (Khatua et al., 2026). Silo-Bench, accepted to ACL 2026, makes the same point from the opposite direction: across a battery of distributed-information tasks, agents exchange information competently and then fail to integrate it into a correct answer, a gap between information held and answers reached that its authors name the Communication-Reasoning Gap (Zhang et al., 2026, Silo-Bench, arXiv:2603.01045v2, ACL 2026, preprint). Post 2 takes that result up in detail; here it is the third of three independent constructions (a coding benchmark, a maze, and a battery of communication-complexity tasks) that converge on one conclusion. The binding constraint is the coordination, not the coder.
Talk is not the missing ingredient
The intuitive fix is to let agents communicate more. The evidence forecloses it: across these benchmarks agents already communicate, and the communication does not close the gap — it reshapes where agents work without changing whether their work fits together. Post 2 takes up that dissociation directly. What matters here is the consequence for the diagnosis: because the deficit is not a communication shortfall, it is not something more conversation, or a more articulate model, will supply.
The production view points the same way
The most instructive part of this picture is that the leading practitioner account of multi-agent systems already operates on these terms. Anthropic’s engineering write-up on its Research system reports that a multi-agent configuration — a Claude Opus 4 lead delegating to Claude Sonnet 4 subagents — outperformed a single Opus 4 agent by 90.2% on its internal research evaluation (Hadfield et al., 2025, How we built our multi-agent research system, Anthropic Engineering). That is a real, large gain, and it is worth understanding precisely why it does not contradict the benchmarks above. The gain is specific to breadth-first work — independent directions explored in parallel, each in its own context window, with results compressed back to a lead. It is the regime where the subtasks genuinely do not need to coordinate.
The same account is candid about the boundary. It states that “most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time,” and that domains requiring shared context or many inter-agent dependencies “are not a good fit for multi-agent systems today” (Hadfield et al., 2025). Read alongside CooperBench and Silo-Bench, this is not a hedge; it is the same boundary, drawn from production rather than from a benchmark. Where work decomposes into independent parts, agent teams scale; where it requires coordinating over shared, interdependent state, they regress. The research corpus does not dispute the production guidance; it measures the line the production guidance draws.
Why this is an architecture problem
If the deficit were a capability ceiling, the rational response would be to wait for better models. The evidence forecloses that move. Capability does not predict coordination; the strongest coder is not the best teammate; the failure localizes to integration, not to reasoning or communication in isolation. What changes outcomes, in every one of these studies, is structure. CooperBench’s rare successful runs are the ones where agents convert vague intentions into specific, verifiable commitments; the Collaboration Gap finds that ordering the interaction so the stronger agent seeds the work recovers much of the lost performance; Silo-Bench finds that the tasks which survive scale are those with a clean aggregate-then-reduce structure.
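One way to picture a "specific, verifiable commitment" is as a record that can be checked mechanically before either agent writes code. The sketch below is our illustration, not a mechanism described in CooperBench: the `Commitment` fields, the `find_conflicts` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Commitment:
    """A hypothetical, checkable statement of what one agent will touch."""
    agent: str
    path: str        # file the agent intends to edit
    start_line: int  # first line of the claimed region
    end_line: int    # last line of the claimed region (inclusive)
    exposes: str     # interface the agent promises to provide


def overlapping(a: Commitment, b: Commitment) -> bool:
    """True when different agents claim intersecting lines of the same file."""
    return (
        a.agent != b.agent
        and a.path == b.path
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )


def find_conflicts(plan: List[Commitment]) -> List[Tuple[Commitment, Commitment]]:
    """Return every pair of commitments that would need reconciling."""
    return [
        (a, b)
        for i, a in enumerate(plan)
        for b in plan[i + 1:]
        if overlapping(a, b)
    ]


# Illustrative plan: two agents claim overlapping lines of one file.
plan = [
    Commitment("agent_a", "app/cache.py", 10, 40, "get(key) -> bytes | None"),
    Commitment("agent_b", "app/cache.py", 30, 60, "evict(key) -> None"),
    Commitment("agent_b", "app/metrics.py", 1, 20, "record(event) -> None"),
]
conflicts = find_conflicts(plan)
print(len(conflicts))  # 1
```

The point of the design is that a vague intention ("I'll handle caching") cannot be checked, while a claimed region and a promised interface can be compared against a teammate's before the merge rather than after it.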
This is the premise of Coordination by Construction, the organizing frame for this series: that reliable agent teamwork is achieved by engineering coordination into the structure of the system — verifiable shared state, explicit integration contracts, partitioned work — rather than hoping it emerges from model capability. The frame is not drawn from any single source; it is our analytical contribution, grounded in the convergent findings across this corpus and aligned with the direction the leading production accounts are already taking. Anthropic’s engineering independently arrives at the same move: its Research subagents write outputs to a shared filesystem and pass lightweight references back rather than routing everything through the coordinator, making the shared work an artifact rather than a behavior the model must be smart enough to produce (Hadfield et al., 2025). The rest of this series follows that thread — through the architectures that make integration verifiable, the governance that makes cooperation observable, and the open question of whether training can close what structure cannot.
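The shared-artifact pattern described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not Anthropic's implementation: `ArtifactRef`, `write_artifact`, and `read_artifact` are hypothetical names, and the content hash is our addition to show how a reference can make shared state verifiable.

```python
import hashlib
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ArtifactRef:
    """Lightweight reference a subagent hands back instead of its full output."""
    path: str
    sha256: str
    summary: str


def write_artifact(shared_dir: Path, name: str, payload: dict, summary: str) -> ArtifactRef:
    """Subagent side: persist the output, return only a reference to it."""
    data = json.dumps(payload, sort_keys=True).encode("utf-8")
    target = shared_dir / f"{name}.json"
    target.write_bytes(data)
    return ArtifactRef(str(target), hashlib.sha256(data).hexdigest(), summary)


def read_artifact(ref: ArtifactRef) -> dict:
    """Lead side: load the artifact only if it still matches the reference."""
    data = Path(ref.path).read_bytes()
    if hashlib.sha256(data).hexdigest() != ref.sha256:
        raise ValueError(f"artifact changed since it was referenced: {ref.path}")
    return json.loads(data)


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)
    ref = write_artifact(shared, "subagent_1", {"findings": ["a", "b"]}, "two findings")
    loaded = read_artifact(ref)  # the lead resolves the reference on demand

    # Simulate another writer overwriting the file behind the lead's back.
    Path(ref.path).write_text("{}")
    try:
        read_artifact(ref)
        tampered_detected = False
    except ValueError:
        tampered_detected = True
```

The lead's context carries only the short `summary` and the reference; the full output stays on disk, and any mismatch between what was promised and what is there surfaces as an error rather than as a silent integration failure.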
The takeaway for an architecture decision today is narrow and firm. Multi-agent deployment is justified where the work is genuinely separable and each agent can operate in its own context. Where the work is shared and interdependent — most real coding among it — a single capable agent is the current baseline to beat, and the burden of proof sits with the multi-agent design. That is not a limitation to wait out. It is a specification to build against.
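Stated as a rule, the decision looks roughly like the sketch below. The function name, its parameters, and the idea of comparing two measured scores on the same task are our framing of the argument, not a procedure taken from any of the cited sources.

```python
from typing import Optional


def choose_architecture(
    separable: bool,
    isolated_context: bool,
    multi_agent_score: Optional[float] = None,
    single_agent_score: Optional[float] = None,
) -> str:
    """Hypothetical decision rule for one specific task.

    separable: the work splits into parts that need no reconciliation.
    isolated_context: each agent can work within its own context.
    *_score: measured success on this task, if an evaluation has been run.
    """
    # Separable work in isolated contexts is the regime where teams scale.
    if separable and isolated_context:
        return "multi-agent"

    # Shared, interdependent work: the burden of proof sits with the
    # multi-agent design, so it must beat the single agent in measurement.
    if (
        multi_agent_score is not None
        and single_agent_score is not None
        and multi_agent_score > single_agent_score
    ):
        return "multi-agent"

    # No evidence, or evidence against: the single agent is the baseline.
    return "single-agent"


print(choose_architecture(separable=True, isolated_context=True))    # multi-agent
print(choose_architecture(separable=False, isolated_context=False))  # single-agent
```

The default branch matters most: in the absence of a measurement on the task at hand, the rule falls back to the single agent rather than assuming the team will win.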
The multi-agent coordination gap is an architecture problem with an architecture answer. Capability gains will not close it, because the gap is not a measure of capability. The evidence is consistent across independent benchmarks and aligned with the most credible production guidance.
Treat coordination as something engineered — verifiable shared state, explicit commitments, partitioned work — and hold multi-agent designs to the standard of beating a single capable agent on the specific task at hand. Where they cannot, the single agent is the right architecture.
