The Assumption That Has to Die
The reigning practitioner mental model for agentic systems holds that more test-time compute is a lever. Give the agent more rounds, more tool calls, more reasoning budget, and the outputs get better. This assumption has been the operational logic behind nearly every enterprise agentic deployment: when an agent fails, give it more room to work. When the cost ceiling forces a choice, choose the more capable model and let it think longer. The cost-quality curve, on this view, slopes up.
This is not what the research shows. Across the most carefully designed studies of how frontier models behave under extended reasoning, the curve does not slope up. It bends, flattens, and at the limit reverses. The most direct statement of this finding comes from Anthropic’s own research team. In July 2025, Anthropic researchers Gema, Cohen-Wang, Pirola, Reddy, Marlow, Nguyen, and colleagues published Inverse Scaling in Test-Time Compute. The paper is unusual in the AI literature for the directness with which it documents a limitation of the company’s own models. It is, by some distance, the most important paper in the corpus for understanding what agents are actually doing when they consume tokens.
On a series of carefully constructed evaluation tasks — distractor-rich counting problems, misleading regression setups, complex deduction puzzles, AI safety scenarios — Anthropic’s researchers measured how Claude and other frontier models performed as they were given progressively more test-time compute. The finding was the opposite of the dominant practitioner assumption. Across multiple task families, accuracy decreased as reasoning length increased. Not slightly. Materially. The models were not getting smarter with more thinking. They were getting worse.
“Inverse scaling in test-time compute” is the phrase Anthropic’s team used to describe the phenomenon. The implication, which the paper develops with care, is that the cost-quality lever most enterprises think they have does not exist in the form they think it does. Buying more tokens does not buy more reasoning. Under specific and identifiable conditions, it buys less.
Five Failure Modes Anthropic Documented
What makes the inverse scaling paper load-bearing for the architectural argument is that it does not stop at the headline finding. It decomposes the failure into specific behavioral modes, each of which is independently observable, independently reproducible, and independently relevant to what an agentic system does in production. The paper documents five.
Claude models, when given more reasoning time on counting problems containing irrelevant information embedded in the prompt, increasingly fixated on the distractors rather than the actual question. With short reasoning windows, the models answered the question. With long reasoning windows, the same models had drifted into elaborate analyses of the distractor content. Longer chains of thought did not produce more careful problem-solving; they produced more thorough engagement with material that did not matter to the answer.
On regression tasks where the surface description of a problem suggested a familiar pattern that was not actually present in the data, models with more reasoning time were more likely to converge on the familiar-looking answer rather than the data-supported answer. Anthropic’s paper describes this as the model “pattern-matching to the problem framing instead of the actual data,” and notes that the failure intensified with reasoning depth — the models had more space to elaborate the wrong answer, and they used it.
Given complex regression setups with multiple variables, some genuinely predictive and others coincidentally correlated, models given more compute increasingly seized on the coincidental correlations. The phenomenon parallels a well-documented failure mode in human reasoning when time pressure is removed: more deliberation does not always filter signal from noise; it sometimes amplifies the noise.
On deduction puzzles requiring a sequence of valid logical steps, increased reasoning time produced trajectories that drifted off the deductive path. The early steps would be correct; the later steps, given more compute to elaborate, would introduce contradictions, abandon constraints, or invent premises that the puzzle had not specified. The model was not running out of reasoning capacity. It was elaborating itself into incoherence.
In safety-relevant scenarios, models with more reasoning time were more likely to produce responses that violated the safety constraints the model had been trained to respect. The paper does not frame this as a model defect; it frames it as a property of long reasoning trajectories in general. Given enough room to elaborate, the model can reason itself into outcomes its shorter reasoning would not have produced.
The careful reader will notice that all five failure modes share a structural property. None of them appear in short, bounded interactions. All of them emerge in extended reasoning. The relationship between compute and quality, in the regime the paper studies, is not just non-monotonic — it has a specific shape that punishes the kind of long autonomous trajectories that agentic systems by definition produce.
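The shape matters more than any single number. A minimal sketch can make the point concrete. The functional form and every constant below are invented for illustration, not fitted to Anthropic’s data; the only property carried over from the paper is the qualitative shape: quality rises with reasoning budget, flattens, then reverses, so the optimal budget is interior rather than maximal.

```python
# Illustrative toy model only: constants and functional form are invented.
# It reproduces the qualitative shape reported by the inverse scaling work:
# early gains from more reasoning, then decay as elaboration takes over.
import math

def toy_accuracy(tokens: int) -> float:
    """Hypothetical task accuracy as a function of reasoning-token budget."""
    gain = 1.0 - math.exp(-tokens / 2000)    # fast early improvement
    drift = 0.25 * (tokens / 20000)          # slow elaboration-driven decay
    return max(0.0, gain - drift)

budgets = [500, 2000, 8000, 20000, 60000]
curve = {b: round(toy_accuracy(b), 3) for b in budgets}

# Under this shape, "give it more room" stops being a lever:
# the best budget is in the middle of the range, not at the top.
best = max(budgets, key=toy_accuracy)
assert best != max(budgets)
```

The design point is the last two lines: if the curve bends and reverses, choosing the largest affordable budget is not merely wasteful, it is accuracy-reducing.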
The Bridge: From Lab Conditions to Production Agents
The inverse scaling paper is not a study of production agents. Its tasks are carefully constructed evaluation suites designed to isolate specific reasoning behaviors. What it documents is the behavior of frontier models under controlled conditions. The question for enterprise deployment is whether these laboratory failure modes appear in production agentic systems, and if so, how.
The empirical answer comes from a second study, this one focused explicitly on production multi-agent systems. In March 2025, Cemri, Pan, Kim, Su, Hu, Cao, Madala, Lai, Awad, Karpman, and colleagues published Why Do Multi-Agent LLM Systems Fail? (accepted to NeurIPS 2025), which constructs the first comprehensive taxonomy of multi-agent system failure modes from empirical evidence. The authors call it MAST — Multi-Agent System Failure Taxonomy. The work was developed at Berkeley with collaborators across multiple institutions, and the headline finding is severe.
Across seven popular open-source multi-agent systems analyzed by the MAST team, failure rates on tasks the systems were nominally designed to perform ranged from 41% to 86.7%. The taxonomy decomposes these failures into fourteen distinct modes across three categories: specification and system design failures, inter-agent misalignment, and task verification and termination failures.
The three categories break down as follows. Specification and system design failures: agents misunderstood the task, disregarded specified constraints, or stepped outside their assigned roles. Inter-agent misalignment: agents withheld information from each other, reset progress made by other agents, ignored input, or derailed conversations. Task verification and termination failures: agents prematurely declared tasks complete, failed to verify outputs, or could not recognize when a task was unsolvable.
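The taxonomy is, structurally, a classification scheme, and a harness that wants to act on it has to encode it. A small sketch of that encoding follows. The category names track the paper; the per-mode keys below paraphrase the descriptions in this post and are not the paper’s exact fourteen labels.

```python
# Sketch of MAST's three-category structure as a classifier lookup.
# Mode keys paraphrase this post's descriptions; they are not the
# paper's verbatim fourteen failure-mode names.
from enum import Enum

class MASTCategory(Enum):
    SPECIFICATION = "specification and system design failures"
    MISALIGNMENT = "inter-agent misalignment"
    VERIFICATION = "task verification and termination failures"

FAILURE_MODES = {
    "disregard_task_constraints": MASTCategory.SPECIFICATION,
    "step_outside_assigned_role": MASTCategory.SPECIFICATION,
    "withhold_information":       MASTCategory.MISALIGNMENT,
    "reset_other_agents_progress": MASTCategory.MISALIGNMENT,
    "derail_conversation":        MASTCategory.MISALIGNMENT,
    "premature_termination":      MASTCategory.VERIFICATION,
    "no_output_verification":     MASTCategory.VERIFICATION,
}

def category_of(mode: str) -> MASTCategory:
    """Map an observed failure mode to its MAST category."""
    return FAILURE_MODES[mode]
```

The value of the encoding is operational: a monitoring layer that tags failures with a category can aggregate them, which is exactly the telemetry the production systems in the MAST study lacked.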
What makes the MAST taxonomy structurally important is that its categories cleanly mirror the failure modes Anthropic documented in single-agent test-time scaling. The two papers describe the same phenomenon at two scales. Anthropic shows it inside a single reasoning trajectory. Cemri et al. show it across coordinated trajectories. Both are observing what happens when an autonomous loop has no governing structure above it to bound exploration against the actual objective.
| Model layer (inverse scaling, single trajectory) | System layer (MAST, coordinated trajectories) |
| --- | --- |
| Distractor susceptibility within one reasoning trajectory | Agents derail each other onto irrelevant subtasks |
| Overfitting to problem framing inside a single chain of thought | Entire system converges on a misframed task |
| Regression depth attenuation as reasoning extends | Protracted exchanges drift from original objective |
| Amplified misaligned behavior at extended lengths | Inter-agent misalignment compounds across rounds |
| Spurious correlation chasing within elaborated reasoning | Task verification failures accumulate without correction |
| Failure intensifies with reasoning depth | Failure intensifies with coordination complexity |
The Looping Signature, Behaviorally
Post 1 introduced what the corpus calls the looping signature: the behavioral pattern Bai et al. (2026) documented in agentic coding runs, where high-cost runs were characterized by repeated file viewing and repeated file modification rising sharply with cost quartile. The post observed that expensive runs were not reasoning more deeply; they were looping. The architectural language Post 1 used was provisional. With the behavioral evidence from the inverse scaling and MAST work now on the table, the looping signature can be read more precisely.
What the agent is doing on a high-cost SWE-bench run is not random exploration. It is a specific combination of the failure modes documented at the model layer and the coordination failures documented at the system layer. The agent re-reads files because it cannot determine whether its current understanding is sufficient — a verification failure from MAST’s third category. The agent re-edits files because the previous edit was based on an over-elaborated reasoning chain that introduced premises the codebase did not support — distractor susceptibility and regression depth attenuation from the inverse scaling paper. And the agent fails to recognize when the task is unsolvable, continuing to consume tokens against an objective it cannot complete — the mirror image of MAST’s premature-termination mode: instead of stopping too early, it cannot stop at all.
The token bill is not noise. It is a behavioral signature with an identifiable cause. The cost variance Post 1 documented — 2× across runs on the same problem, 11× across teams running the same workflow — is the variance in how many of these failure modes activate, and for how long, before either the task completes or the budget runs out. An expensive run is an unlucky run, but it is unlucky in a structurally predictable way: it encountered the failure modes the architecture provides no mechanism for catching.
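The signature described above is detectable from an ordinary tool-call trace. A minimal sketch, assuming a trace of (action, file) events; the event format and the idea of scoring repeat rate are illustrative choices, not a method from either cited paper.

```python
# Minimal looping-signature detector over a run trace.
# Assumed event format: (action, file_path) tuples, e.g. ("view", "src/app.py").
from collections import Counter

def looping_score(events: list[tuple[str, str]]) -> float:
    """Fraction of tool calls that repeat an (action, file) pair already
    seen in the same run. High values mean the agent is re-viewing and
    re-editing the same files rather than making new progress."""
    seen: Counter = Counter()
    repeats = 0
    for action, path in events:
        if seen[(action, path)] > 0:
            repeats += 1
        seen[(action, path)] += 1
    return repeats / len(events) if events else 0.0

trace = [
    ("view", "src/app.py"), ("edit", "src/app.py"),
    ("view", "src/app.py"), ("edit", "src/app.py"),
    ("view", "src/app.py"), ("view", "tests/test_app.py"),
]
score = looping_score(trace)  # 3 of 6 calls repeat a prior pair -> 0.5
```

A harness could compute this per run and correlate it with cost quartile, which is essentially the measurement that surfaced the looping signature in the first place.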
The model is not broken. Claude is not failing. Anthropic’s research, which the field reads as the most rigorous available characterization of these behaviors, does not frame the inverse scaling findings as defects to be patched. It frames them as properties of the regime — properties that emerge whenever the reasoning is long, autonomous, and unsupervised. The conclusion Anthropic’s paper reaches is not “fix the model.” It is closer to “understand the regime and constrain the conditions under which models operate in it.”
What Cannot Be Fixed at the Model Layer
The architectural argument the inverse scaling paper licenses, perhaps without intending to, is one of the clearest in the recent literature. The five failure modes Anthropic documented are not present at short reasoning lengths. They emerge as reasoning extends. They are not specific to Claude — the paper tests across model families and finds the pattern is general. They are not specific to particular task types — the paper deliberately constructs diverse task families and finds the pattern across all of them. They are not solvable by prompt engineering — the paper varies prompting strategies and the failures persist.
The phrase Anthropic’s paper uses is worth quoting precisely: the failures emerge from “properties of extended reasoning itself.” This is a structural claim. It says that the behavior is a function of how long the model is allowed to reason, not which model is reasoning or what it is reasoning about. The implication is that no model improvement, no matter how substantial, eliminates the regime. Better models reason more capably for longer, which means they reach the inverse-scaling regime at a different point on the curve, but they do not eliminate the curve.
This places the field in a position the inverse scaling paper does not need to argue for explicitly, because the structure of the finding does it. If the failure mode is a property of extended reasoning, the response cannot be at the layer of the reasoning. It has to be at the layer that decides how much reasoning is allowed, what constitutes sufficient progress, and what triggers an interrupt. That layer is not currently part of the standard agentic stack. The model has no such layer. The framework has no such layer in any meaningful production sense. The application has not been built to provide one.
The token bill, then, is what happens when there is no such layer. The agent reasons until the budget runs out, the rounds cap is hit, or — rarely — the task completes. There is no intermediate signal that says: the trajectory has entered an unproductive regime, halt and re-evaluate. The failure modes Anthropic documented have no observer in production deployments. They run, they consume tokens, and they conclude — often with an answer that resembles a solution but does not solve the problem.
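What such a layer would have to do is already enumerated in the argument: bound reasoning, hold a progress measure against the authorized objective, and interrupt when the trajectory stalls. A minimal sketch follows. Every name, threshold, and default here is hypothetical, a shape for the missing layer rather than any cited system’s design.

```python
# Hypothetical governor sketch: bounds rounds, tracks progress, interrupts
# on stall. All interface names and thresholds are illustrative assumptions.
from typing import Callable

def governed_run(step: Callable[[], float],
                 max_steps: int = 50,
                 patience: int = 5,
                 min_gain: float = 0.01) -> str:
    """step() runs one agent round and returns a progress estimate in
    [0, 1] against the authorized objective (e.g. fraction of tests passing)."""
    best, stalled = 0.0, 0
    for _ in range(max_steps):
        progress = step()
        if progress >= 1.0:
            return "complete"
        if progress > best + min_gain:
            best, stalled = progress, 0   # meaningful progress: reset the clock
        else:
            stalled += 1                  # no meaningful progress this round
        if stalled >= patience:
            return "interrupt"            # unproductive regime: halt, re-evaluate
    return "budget_exhausted"
```

The design choice worth noting is the third return value. Today’s stacks only distinguish "complete" from "budget_exhausted"; the intermediate signal, a trajectory that has stopped improving long before the budget runs out, is exactly what the current architecture has no observer for.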
What This Means for the Series Argument
Series 14 is constructing a single argument across four posts: that the costs enterprises are encountering with agentic systems are the visible signatures of an invisible substrate gap. Post 1 named the economic signature. This post has named the behavioral signature. They are the same phenomenon at two layers of analysis.
The behavioral diagnosis, taken seriously, narrows the space of possible responses. It rules out three categories of intervention that enterprises and vendors are currently treating as primary. First, it rules out model upgrades as a sufficient response, because the inverse scaling regime is a property of long reasoning across model generations. Second, it rules out prompt engineering as a sufficient response, because the failures persist across prompting variations in Anthropic’s experiments. Third, it rules out adding compute as a sufficient response, because compute is precisely what triggers the regime.
What remains is structural. A response at a layer above the model — a layer that observes the trajectory, holds a representation of what the authorized objective is, monitors progress against that objective, and triggers intervention when the trajectory enters a regime of the type Anthropic documented. The substrate that would do this work is what the Luminity corpus calls the harness layer. Post 4 of this series examines what closes the gap. Two pieces of recent research — Liu et al.’s budget-aware tool-use work (arXiv:2511.17006) and Chen et al.’s SEMAP protocol (arXiv:2510.12120) — both point to what the response looks like at the harness layer.
Before that, however, the argument has to confront a second consequence of the behavioral diagnosis: what happens when these dynamics meet enterprise deployment reality. The cost variance and behavioral failures documented across this post and the previous one explain, in detail, the deployment data the field has been struggling to interpret. The MIT NANDA finding that 95% of GenAI pilots deliver zero measurable P&L impact. The IBM 2026 IBV CEO Study showing only 25% of AI initiatives delivered expected ROI and just 16% scaled enterprise-wide. The S&P Global report that 42% of enterprises abandoned most AI projects in 2025. Morgan Stanley Research’s finding that only 30% of North American AI adopters cited quantifiable business impact in Q4 2025. The Klarna reversal. The Air Canada precedent. These are not anomalies. They are the institutional signature of the same gap that produces the token bill and the behavioral failures.
Anthropic’s own research demonstrates that more test-time compute, under identifiable conditions, produces less accuracy. Five failure modes — distractor susceptibility, overfitting to problem framing, spurious correlation chasing, regression depth attenuation, amplified misaligned behavior — emerge as reasoning extends. MAST documents the multi-agent mirror at 41–86.7% failure rates. The looping signature from Post 1 is the visible trace of these specific failures running in unbounded loops. The response cannot be at the model layer, because the failure is a property of the regime, not the model.
Post 3, The Gap Between Pilot and Production, turns to the institutional signature. The third lunch is on the table.
