The Assumption That Has to Die
The reigning practitioner mental model for agentic systems holds that more test-time compute is a lever. Give the agent more rounds, more tool calls, more reasoning budget, and the outputs get better. This assumption has been the operational logic behind nearly every enterprise agentic deployment: when an agent fails, give it more room to work. When the cost ceiling forces a choice, choose the more capable model and let it think longer. The cost-quality curve, on this view, slopes up.
This is not what the research shows. Across the most carefully designed studies of how frontier models behave under extended reasoning, the curve does not slope up. It bends, flattens, and at the limit reverses. The most direct statement of this finding comes from Anthropic’s own research team. In July 2025, Anthropic researchers Gema, Cohen-Wang, Pirola, Reddy, Marlow, Nguyen, and colleagues published Inverse Scaling in Test-Time Compute. The paper is unusual in the AI literature for the directness with which it documents a limitation of the company’s own models. It is, by some distance, the most important paper in the corpus for understanding what agents are actually doing when they consume tokens.
On a series of carefully constructed evaluation tasks — distractor-rich counting problems, misleading regression setups, complex deduction puzzles, AI safety scenarios — Anthropic’s researchers measured how Claude and other frontier models performed as they were given progressively more test-time compute. The finding was the opposite of the dominant practitioner assumption. Across multiple task families, accuracy decreased as reasoning length increased. Not slightly. Materially. The models were not getting smarter with more thinking. They were getting worse.
“Inverse scaling in test-time compute” is the phrase Anthropic’s team used to describe the phenomenon. The implication, which the paper develops with care, is that the cost-quality lever most enterprises think they have does not exist in the form they think it does. Buying more tokens does not buy more reasoning. Under specific and identifiable conditions, it buys less.
Five Failure Modes Anthropic Documented
What makes the inverse scaling paper load-bearing for the architectural argument is that it does not stop at the headline finding. It decomposes the failure into specific behavioral modes, each of which is independently observable, independently reproducible, and independently relevant to what an agentic system does in production. The paper documents five.
Claude models, when given more reasoning time on counting problems containing irrelevant information embedded in the prompt, increasingly fixated on the distractors rather than the actual question. With short reasoning windows, the models answered the question. With long reasoning windows, the same models had drifted into elaborate analyses of the distractor content. Longer chains of thought did not produce more careful problem-solving; they produced more thorough engagement with material that did not matter to the answer.
On regression tasks where the surface description of a problem suggested a familiar pattern that was not actually present in the data, models with more reasoning time were more likely to converge on the familiar-looking answer rather than the data-supported answer. Anthropic’s paper describes this as the model “pattern-matching to the problem framing instead of the actual data,” and notes that the failure intensified with reasoning depth — the models had more space to elaborate the wrong answer, and they used it.
Given complex regression setups with multiple variables, some genuinely predictive and others coincidentally correlated, models given more compute increasingly seized on the coincidental correlations. The phenomenon parallels a well-documented failure mode in human reasoning when time pressure is removed: more deliberation does not always filter signal from noise; it sometimes amplifies the noise.
On deduction puzzles requiring a sequence of valid logical steps, increased reasoning time produced trajectories that drifted off the deductive path. The early steps would be correct; the later steps, given more compute to elaborate, would introduce contradictions, abandon constraints, or invent premises that the puzzle had not specified. The model was not running out of reasoning capacity. It was elaborating itself into incoherence.
In safety-relevant scenarios, models with more reasoning time were more likely to produce responses that violated the safety constraints the model had been trained to respect. The paper does not frame this as a model defect; it frames it as a property of long reasoning trajectories in general. Given enough room to elaborate, the model can reason itself into outcomes its shorter reasoning would not have produced.
The careful reader will notice that all five failure modes share a structural property. None of them appear in short, bounded interactions. All of them emerge in extended reasoning. The relationship between compute and quality, in the regime the paper studies, is not just non-monotonic — it has a specific shape that punishes the kind of long autonomous trajectories that agentic systems by definition produce.
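The shape matters more than any single number. A minimal sketch can make the point concrete. The functional form and every constant below are invented for illustration, not fitted to Anthropic’s data; the only property carried over from the paper is the qualitative shape: quality rises with reasoning budget, flattens, then reverses, so the optimal budget is interior rather than maximal.

```python
# Illustrative toy model only: constants and functional form are invented.
# It reproduces the qualitative shape reported by the inverse scaling work:
# early gains from more reasoning, then decay as elaboration takes over.
import math

def toy_accuracy(tokens: int) -> float:
    """Hypothetical task accuracy as a function of reasoning-token budget."""
    gain = 1.0 - math.exp(-tokens / 2000)    # fast early improvement
    drift = 0.25 * (tokens / 20000)          # slow elaboration-driven decay
    return max(0.0, gain - drift)

budgets = [500, 2000, 8000, 20000, 60000]
curve = {b: round(toy_accuracy(b), 3) for b in budgets}

# Under this shape, "give it more room" stops being a lever:
# the best budget is in the middle of the range, not at the top.
best = max(budgets, key=toy_accuracy)
assert best != max(budgets)
```

The design point is the last two lines: if the curve bends and reverses, choosing the largest affordable budget is not merely wasteful, it is accuracy-reducing.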
The Bridge: From Lab Conditions to Production Agents
The inverse scaling paper is not a study of production agents. Its tasks are carefully constructed evaluation suites designed to isolate specific reasoning behaviors. What it documents is the behavior of frontier models under controlled conditions. The question for enterprise deployment is whether these laboratory failure modes appear in production agentic systems, and if so, how.
The empirical answer comes from a second study, this one focused explicitly on production multi-agent systems. In March 2025, Cemri, Pan, Kim, Su, Hu, Cao, Madala, Lai, Awad, Karpman, and colleagues published Why Do Multi-Agent LLM Systems Fail? (accepted to NeurIPS 2025), which constructs the first comprehensive taxonomy of multi-agent system failure modes from empirical evidence. The authors call it MAST — Multi-Agent System Failure Taxonomy. The work was developed at Berkeley with collaborators across multiple institutions, and the headline finding is severe.
Across seven popular open-source multi-agent systems analyzed by the MAST team, failure rates on tasks the systems were nominally designed to perform ranged from 41% to 86.7%. The taxonomy decomposes these failures into fourteen distinct modes across three categories: specification and system design failures, inter-agent misalignment, and task verification and termination failures.
The three categories break down as follows. Specification and system design failures: agents misunderstood the task, disregarded specified constraints, or stepped outside their assigned roles. Inter-agent misalignment: agents withheld information from each other, reset progress made by other agents, ignored input, or derailed conversations. Task verification and termination failures: agents prematurely declared tasks complete, failed to verify outputs, or could not recognize when a task was unsolvable.
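The taxonomy is, structurally, a classification scheme, and a harness that wants to act on it has to encode it. A small sketch of that encoding follows. The category names track the paper; the per-mode keys below paraphrase the descriptions in this post and are not the paper’s exact fourteen labels.

```python
# Sketch of MAST's three-category structure as a classifier lookup.
# Mode keys paraphrase this post's descriptions; they are not the
# paper's verbatim fourteen failure-mode names.
from enum import Enum

class MASTCategory(Enum):
    SPECIFICATION = "specification and system design failures"
    MISALIGNMENT = "inter-agent misalignment"
    VERIFICATION = "task verification and termination failures"

FAILURE_MODES = {
    "disregard_task_constraints": MASTCategory.SPECIFICATION,
    "step_outside_assigned_role": MASTCategory.SPECIFICATION,
    "withhold_information":       MASTCategory.MISALIGNMENT,
    "reset_other_agents_progress": MASTCategory.MISALIGNMENT,
    "derail_conversation":        MASTCategory.MISALIGNMENT,
    "premature_termination":      MASTCategory.VERIFICATION,
    "no_output_verification":     MASTCategory.VERIFICATION,
}

def category_of(mode: str) -> MASTCategory:
    """Map an observed failure mode to its MAST category."""
    return FAILURE_MODES[mode]
```

The value of the encoding is operational: a monitoring layer that tags failures with a category can aggregate them, which is exactly the telemetry the production systems in the MAST study lacked.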
What makes the MAST taxonomy structurally important is that its categories cleanly mirror the failure modes Anthropic documented in single-agent test-time scaling. The two papers describe the same phenomenon at two scales. Anthropic shows it inside a single reasoning trajectory. Cemri et al. show it across coordinated trajectories. Both are observing what happens when an autonomous loop has no governing structure above it to bound exploration against the actual objective.
| Model layer (inverse scaling, single trajectory) | System layer (MAST, coordinated trajectories) |
| --- | --- |
| Distractor susceptibility within one reasoning trajectory | Agents derail each other onto irrelevant subtasks |
| Overfitting to problem framing inside a single chain of thought | Entire system converges on a misframed task |
| Regression depth attenuation as reasoning extends | Protracted exchanges drift from original objective |
| Amplified misaligned behavior at extended lengths | Inter-agent misalignment compounds across rounds |
| Spurious correlation chasing within elaborated reasoning | Task verification failures accumulate without correction |
| Failure intensifies with reasoning depth | Failure intensifies with coordination complexity |
The Looping Signature, Behaviorally
Post 1 introduced what the corpus calls the looping signature: the behavioral pattern Bai et al. (2026) documented in agentic coding runs, where high-cost runs were characterized by repeated file viewing and repeated file modification rising sharply with cost quartile. The post observed that expensive runs were not reasoning more deeply; they were looping. The architectural language Post 1 used was provisional. With the behavioral evidence from the inverse scaling and MAST work now on the table, the looping signature can be read more precisely.
What the agent is doing on a high-cost SWE-bench run is not random exploration. It is a specific combination of the failure modes documented at the model layer and the coordination failures documented at the system layer. The agent re-reads files because it cannot determine whether its current understanding is sufficient — a verification failure from MAST’s third category. The agent re-edits files because the previous edit was based on an over-elaborated reasoning chain that introduced premises the codebase did not support — distractor susceptibility and regression depth attenuation from the inverse scaling paper. And the agent fails to recognize when the task is unsolvable, continuing to consume tokens against an objective it cannot complete — the mirror image of MAST’s premature-termination mode: instead of stopping too early, it cannot stop at all.
The token bill is not noise. It is a behavioral signature with an identifiable cause. The cost variance Post 1 documented — 2× across runs on the same problem, 11× across teams running the same workflow — is the variance in how many of these failure modes activate, and for how long, before either the task completes or the budget runs out. An expensive run is an unlucky run, but it is unlucky in a structurally predictable way: it encountered the failure modes the architecture provides no mechanism for catching.
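The signature described above is detectable from an ordinary tool-call trace. A minimal sketch, assuming a trace of (action, file) events; the event format and the idea of scoring repeat rate are illustrative choices, not a method from either cited paper.

```python
# Minimal looping-signature detector over a run trace.
# Assumed event format: (action, file_path) tuples, e.g. ("view", "src/app.py").
from collections import Counter

def looping_score(events: list[tuple[str, str]]) -> float:
    """Fraction of tool calls that repeat an (action, file) pair already
    seen in the same run. High values mean the agent is re-viewing and
    re-editing the same files rather than making new progress."""
    seen: Counter = Counter()
    repeats = 0
    for action, path in events:
        if seen[(action, path)] > 0:
            repeats += 1
        seen[(action, path)] += 1
    return repeats / len(events) if events else 0.0

trace = [
    ("view", "src/app.py"), ("edit", "src/app.py"),
    ("view", "src/app.py"), ("edit", "src/app.py"),
    ("view", "src/app.py"), ("view", "tests/test_app.py"),
]
score = looping_score(trace)  # 3 of 6 calls repeat a prior pair -> 0.5
```

A harness could compute this per run and correlate it with cost quartile, which is essentially the measurement that surfaced the looping signature in the first place.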
The model is not broken. Claude is not failing. Anthropic’s research, which the field reads as the most rigorous available characterization of these behaviors, does not frame the inverse scaling findings as defects to be patched. It frames them as properties of the regime — properties that emerge whenever the reasoning is long, autonomous, and unsupervised. The conclusion Anthropic’s paper reaches is not “fix the model.” It is closer to “understand the regime and constrain the conditions under which models operate in it.”
What Cannot Be Fixed at the Model Layer
The architectural argument the inverse scaling paper licenses, perhaps without intending to, is one of the clearest in the recent literature. The five failure modes Anthropic documented are not present at short reasoning lengths. They emerge as reasoning extends. They are not specific to Claude — the paper tests across model families and finds the pattern is general. They are not specific to particular task types — the paper deliberately constructs diverse task families and finds the pattern across all of them. They are not solvable by prompt engineering — the paper varies prompting strategies and the failures persist.
The phrase Anthropic’s paper uses is worth quoting precisely: the failures emerge from “properties of extended reasoning itself.” This is a structural claim. It says that the behavior is a function of how long the model is allowed to reason, not which model is reasoning or what it is reasoning about. The implication is that no model improvement, no matter how substantial, eliminates the regime. Better models reason more capably for longer, which means they reach the inverse-scaling regime at a different point on the curve, but they do not eliminate the curve.
This places the field in a position the inverse scaling paper does not need to argue for explicitly, because the structure of the finding does it. If the failure mode is a property of extended reasoning, the response cannot be at the layer of the reasoning. It has to be at the layer that decides how much reasoning is allowed, what constitutes sufficient progress, and what triggers an interrupt. That layer is not currently part of the standard agentic stack. The model has no such layer. The framework has no such layer in any meaningful production sense. The application has not been built to provide one.
The token bill, then, is what happens when there is no such layer. The agent reasons until the budget runs out, the rounds cap is hit, or — rarely — the task completes. There is no intermediate signal that says: the trajectory has entered an unproductive regime, halt and re-evaluate. The failure modes Anthropic documented have no observer in production deployments. They run, they consume tokens, and they conclude — often with an answer that resembles a solution but does not solve the problem.
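What such a layer would have to do is already enumerated in the argument: bound reasoning, hold a progress measure against the authorized objective, and interrupt when the trajectory stalls. A minimal sketch follows. Every name, threshold, and default here is hypothetical, a shape for the missing layer rather than any cited system’s design.

```python
# Hypothetical governor sketch: bounds rounds, tracks progress, interrupts
# on stall. All interface names and thresholds are illustrative assumptions.
from typing import Callable

def governed_run(step: Callable[[], float],
                 max_steps: int = 50,
                 patience: int = 5,
                 min_gain: float = 0.01) -> str:
    """step() runs one agent round and returns a progress estimate in
    [0, 1] against the authorized objective (e.g. fraction of tests passing)."""
    best, stalled = 0.0, 0
    for _ in range(max_steps):
        progress = step()
        if progress >= 1.0:
            return "complete"
        if progress > best + min_gain:
            best, stalled = progress, 0   # meaningful progress: reset the clock
        else:
            stalled += 1                  # no meaningful progress this round
        if stalled >= patience:
            return "interrupt"            # unproductive regime: halt, re-evaluate
    return "budget_exhausted"
```

The design choice worth noting is the third return value. Today’s stacks only distinguish "complete" from "budget_exhausted"; the intermediate signal, a trajectory that has stopped improving long before the budget runs out, is exactly what the current architecture has no observer for.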
What This Means for the Series Argument
Series 14 is constructing a single argument across four posts: that the costs enterprises are encountering with agentic systems are the visible signatures of an invisible substrate gap. Post 1 named the economic signature. This post has named the behavioral signature. They are the same phenomenon at two layers of analysis.
The behavioral diagnosis, taken seriously, narrows the space of possible responses. It rules out three categories of intervention that enterprises and vendors are currently treating as primary. First, it rules out model upgrades as a sufficient response, because the inverse scaling regime is a property of long reasoning across model generations. Second, it rules out prompt engineering as a sufficient response, because the failures persist across prompting variations in Anthropic’s experiments. Third, it rules out adding compute as a sufficient response, because compute is precisely what triggers the regime.
What remains is structural. A response at a layer above the model — a layer that observes the trajectory, holds a representation of what the authorized objective is, monitors progress against that objective, and triggers intervention when the trajectory enters a regime of the type Anthropic documented. The substrate that would do this work is what the Luminity corpus calls the harness layer. Post 4 of this series examines what closes the gap. Two pieces of recent research — Liu et al.’s budget-aware tool-use work (arXiv:2511.17006) and Chen et al.’s SEMAP protocol (arXiv:2510.12120) — both point to what the response looks like at the harness layer.
Before that, however, the argument has to confront a second consequence of the behavioral diagnosis: what happens when these dynamics meet enterprise deployment reality. The cost variance and behavioral failures documented across this post and the previous one explain, in detail, the deployment data the field has been struggling to interpret. The MIT NANDA finding that 95% of GenAI pilots deliver zero measurable P&L impact. The IBM 2026 IBV CEO Study showing only 25% of AI initiatives delivered expected ROI and just 16% scaled enterprise-wide. The S&P Global report that 42% of enterprises abandoned most AI projects in 2025. Morgan Stanley Research’s finding that only 30% of North American AI adopters cited quantifiable business impact in Q4 2025. The Klarna reversal. The Air Canada precedent. These are not anomalies. They are the institutional signature of the same gap that produces the token bill and the behavioral failures.
Anthropic’s own research demonstrates that more test-time compute, under identifiable conditions, produces less accuracy. Five failure modes — distractor susceptibility, overfitting to problem framing, spurious correlation chasing, regression depth attenuation, amplified misaligned behavior — emerge as reasoning extends. MAST documents the multi-agent mirror at 41–86.7% failure rates. The looping signature from Post 1 is the visible trace of these specific failures running in unbounded loops. The response cannot be at the model layer, because the failure is a property of the regime, not the model.
Post 3, The Gap Between Pilot and Production, turns to the institutional signature. The third lunch is on the table.
