There is a number worth sitting with before any architectural argument begins. A frontier coding agent solving a single real-world software issue consumes, on average, 4.17 million tokens. The same model, asked the same kind of question without agentic scaffolding — multi-round chat, code reasoning — consumes between 1,200 and 3,400 tokens. The agentic version is not 10× more expensive, or 100× more expensive. It is roughly 1,000× more expensive, and the gap is structural.
The number comes from a recent empirical study by Bai et al. (arXiv:2604.22750), which traced token consumption across eight frontier LLMs running on SWE-bench Verified through the OpenHands agent framework. The study is the first systematic accounting of where agentic tokens go. Its findings are not a critique of any particular model. They describe a property of the architecture: agents accumulate context, and accumulated context is what costs money.
This post is the first in a four-part series on what that property means for enterprises trying to operate agentic systems at scale. The argument across the series is that the economic behavior we are about to examine — heavy-tailed variance, runaway cost, prediction failure — is not a sign that agents are broken. It is a sign that the substrate underneath them is missing. The harness that would bound this behavior does not yet exist in most deployments. The cost is what its absence looks like on an invoice.
The 1,000× Gap Is Not an Outlier — It Is the Architecture
Bai’s headline comparison sets the frame. Across the three task categories examined:

- Code reasoning (single-turn problem solving without tool interaction): 1,190 tokens on average, at $0.016 per task.
- Code chat (multi-turn dialogue about a coding problem): 3,390 tokens, at $0.023.
- Agentic coding (autonomous problem solving with tool use, file access, test execution): 4.17 million tokens, at $1.857.

The cost difference is large. The structural difference behind it is larger.
The structural difference shows up in the input-to-output token ratio. For every token an agent generates in agentic coding, it consumes 153 tokens of context, compared to 1.33 for code chat and 0.16 for code reasoning. This is the architectural signature: agents do not pay to think; they pay to re-read what they already saw.
This is not an inefficiency that will be optimized away in the next model generation. It is what the architecture does. An agent working on a software issue reads files. It runs tests. It reads the output. It re-reads files it already viewed. It accumulates this material into its context window and feeds the entire accumulated history back into the model at every subsequent step. The same content gets processed dozens of times across the trajectory. The cost is the cost of that re-processing, multiplied across the rounds the agent takes to finish.
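To see the arithmetic, consider a minimal sketch of that accumulation loop. Every number in it is hypothetical (a 60-round trajectory, 2,000 tokens of new material per round, 450 output tokens per round); the point is the shape, not the values. Re-sent history grows quadratically with rounds while output grows linearly, which is how the input-to-output ratio ends up in the same territory as the study’s 153:1.

```python
# Sketch: why re-processed context dominates agentic token cost.
# All parameters are illustrative, not taken from the Bai et al. dataset.

ROUNDS = 60          # hypothetical trajectory length
NEW_CONTEXT = 2_000  # tokens of new material per round (file views, test output)
OUTPUT = 450         # tokens generated per round

input_total = 0
context = 0
for _ in range(ROUNDS):
    context += NEW_CONTEXT   # accumulated history grows monotonically
    input_total += context   # the ENTIRE history is re-sent every round

output_total = ROUNDS * OUTPUT
print(f"input tokens:  {input_total:,}")   # 3,660,000: quadratic in rounds
print(f"output tokens: {output_total:,}")  # 27,000: linear in rounds
print(f"ratio: {input_total / output_total:.0f}:1 input to output")
```

Nothing in that loop is wasteful by accident. The re-send is the mechanism; shortening it requires deciding what history the next round does not need, which is precisely the decision current deployments leave unmade.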
A second study published the same year, Salim et al.’s Tokenomics (arXiv:2601.14470, MSR ’26), found the same pattern in a different framework. Analyzing ChatDev — a multi-agent system rather than a single-agent harness — Salim et al. found input tokens averaged 53.9% of consumption, and that token usage concentrated in the iterative review phase, which accounted for 59.4% of total tokens. Different framework, different task set, same architectural diagnosis: agentic systems are economically discontinuous from chat, and the discontinuity is driven by the cost of moving context around, not the cost of generating answers.
A third paper, Wang et al.’s AgentTaxo (ICLR 2025 Workshop), named this pattern explicitly. Inter-agent communication in multi-agent systems creates what the authors call a “communication tax” — duplicated tokens reused across agent calls for validation and verification. Three independent studies, three different architectures, the same finding: the cost of an agent is the cost of the context it carries forward. Models do not pay this cost when used in isolation. Agents pay it by construction.
The Variance Is Not Noise
If 4.17 million tokens were the typical cost, enterprises could budget against it. The harder finding in the Bai study is that the typical cost does not exist — the distribution is heavy-tailed and stochastic to a degree that makes upfront pricing structurally difficult.
Across 500 problem instances and four independent runs per problem per model, Bai et al. found that the most expensive problem in the dataset cost approximately 7 million more tokens than the cheapest, on average. High-cost problems also exhibited the largest cross-run variance — the harder the task, the more unstable the agent’s behavior across attempts. On the same problem, with the same agent, the most expensive run cost roughly 2× the least expensive run across all eight models tested.
A practitioner report by Shakti Mishra, published in May 2026, captured the same phenomenon at the deployment layer. Mishra documents two enterprise teams running structurally identical multi-agent workflows. One team’s cost per run: $0.12. The other team’s cost per run: $1.40. The 11× spread came not from task difficulty, model selection, or scale — it came from architectural choices in how the agents were assembled.
For an enterprise running a thousand such workflows per day, that spread is the difference between an annual operating cost of $44,000 and $511,000. The model is not what determined the bill. The architecture above the model was.
This is not measurement error. It is a property of how agents explore. Two attempts at the same task can produce trajectories that diverge dramatically — one finds the relevant file in three calls, another spends fifteen rounds re-reading the same files before converging on the same answer. The user pays for both trajectories whether or not either succeeds.
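The budgeting consequence of heavy tails can be made concrete with a simulation. The sketch below is not fitted to the Bai data: the lognormal shape and its parameters are assumptions chosen only to show why a mean-based budget understates tail exposure. It also annualizes the Mishra spread quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed heavy-tailed per-run cost: median near $0.30 with a long right tail.
# The lognormal choice is illustrative, not a fitted distribution.
costs = rng.lognormal(mean=np.log(0.30), sigma=1.0, size=100_000)

print(f"median run: ${np.median(costs):.2f}")
print(f"mean run:   ${np.mean(costs):.2f}")  # mean > median: the tail pulls it up
print(f"p95 run:    ${np.percentile(costs, 95):.2f}")
print(f"p99 run:    ${np.percentile(costs, 99):.2f}")

# Annualized at 1,000 runs/day, the per-run spread becomes the Mishra gap:
for per_run in (0.12, 1.40):
    print(f"${per_run:.2f}/run -> ${per_run * 1_000 * 365:,.0f}/year")
```

Under these assumptions, roughly three runs in ten breach a budget set at the mean, and a budget set at p95 is about five times the median. Neither number is visible if all you track is the average.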
More Tokens Do Not Buy More Accuracy
The third Bai finding is the one most likely to surprise a practitioner working under the assumption that more compute means better outcomes. It does not, and the data is consistent across every model tested.
At the problem level, tasks that consumed more input tokens had overall lower accuracy across all eight frontier models. Higher cost did not mean harder reasoning being applied to harder problems — it meant the agent struggling. At the run level, when the same agent was run on the same problem four times, the runs were ranked by token cost into four buckets — MinCost, LowerCost, UpperCost, MaxCost. Accuracy increased modestly from MinCost to LowerCost, then saturated and slightly degraded at higher cost levels. Spending more tokens on the same problem did not produce a better answer. It produced the same answer through a longer detour, or no answer through a longer detour, or a worse answer through the longest detour.
Laid side by side, the assumptions practitioners tend to carry against what the data actually shows:

| The assumption | What the data shows |
| --- | --- |
| More compute means more thinking | Token cost decouples from reasoning depth at scale |
| Higher-cost runs represent harder reasoning | High-cost runs are dominated by repeated file access |
| Allocating more budget improves accuracy | Accuracy saturates and slightly degrades at higher cost |
| Token cost is a proxy for problem difficulty | Token cost is a proxy for unbounded exploration |
| Failed expensive runs are unfortunate but rare | Failed expensive runs are a structural feature of the architecture |
| The cost-quality curve has a useful slope | The lever enterprises think they have does not exist |
The behavioral signature of high-cost failure, as Bai documents, is striking: repeated file viewing and repeated file modification both rose sharply with cost quartile. Expensive runs were not reasoning more deeply. They were looping. The agent re-read files it had already read, re-edited files it had already edited, and accumulated context that did not contain new information. The token bill grew. The progress did not.
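The bucketing method is simple enough to reproduce against your own traces. A minimal sketch, run on synthetic records (the problem IDs, token counts, and pass/fail flags below are invented; only the ranking procedure mirrors the study):

```python
from collections import defaultdict

# Synthetic run records: four attempts per problem, (tokens consumed, solved?).
runs = {
    "p1": [(0.9e6, True), (1.4e6, True), (3.1e6, True), (8.2e6, False)],
    "p2": [(1.1e6, False), (2.0e6, True), (5.6e6, False), (14.0e6, False)],
}

BUCKETS = ["MinCost", "LowerCost", "UpperCost", "MaxCost"]
tally = defaultdict(lambda: [0, 0])  # bucket -> [solved count, total count]

for attempts in runs.values():
    # Rank each problem's four runs by token cost, cheapest first.
    for bucket, (_, solved) in zip(BUCKETS, sorted(attempts)):
        tally[bucket][0] += solved
        tally[bucket][1] += 1

for bucket in BUCKETS:
    solved, total = tally[bucket]
    print(f"{bucket:>9}: {solved}/{total} solved")
```

On real trajectories, the same tabulation is what surfaces the saturate-then-degrade curve: if your MaxCost bucket is not outperforming your LowerCost bucket, the extra spend is buying loops, not answers.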
This is the architectural diagnosis at the economic surface. The runaway cost is not the agent reasoning harder. It is the agent without a stopping mechanism, exploring without a budget signal, accumulating context without an interrupt. Post 2 of this series will examine the behavioral diagnosis in detail, including Anthropic’s own research showing that extended reasoning can degrade rather than improve performance on carefully designed tasks. For now, the point at the economic layer is simpler: the relationship between cost and value, in agentic systems as currently deployed, is not monotonic.
Models Cannot Predict Their Own Cost
The fourth Bai finding closes the economic case. If agents could anticipate their own token consumption before executing a task, enterprises could at least make informed decisions — set budget caps, choose cheaper models for cheaper tasks, route high-cost work through tighter controls. The study tested whether frontier models can do this. They cannot.
Bai et al. asked each of the eight models tested to predict its own input and output token consumption for a task before executing it. The agent was given full access to the repository, full tool-calling capability, and the freedom to inspect the environment before committing to an estimate. The agent was instructed not to solve the task, only to predict what solving it would cost.
The best correlation between predicted and actual token usage across all models, all conditions: Pearson r = 0.39. That is the ceiling. Most models came in between 0.05 and 0.34 for input tokens, 0.04 and 0.39 for output tokens. All models systematically underestimated — predicted values stayed compressed in the low millions while actual values stretched into the tens of millions. The underestimation persists with or without an in-context example.
Input-token prediction was consistently harder than output-token prediction, because input growth depends on paths the agent has not yet taken — files not yet read, tests not yet run, context not yet accumulated. The agent cannot predict what it has not yet explored.
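The measurement itself transfers directly to a deployment: log a pre-run estimate and the post-run actual, then correlate. The arrays below are hypothetical placeholders, not study data; the pattern they encode (compressed predictions, heavy-tailed actuals) is the one the study describes.

```python
import numpy as np

# Hypothetical paired observations: what the agent said a task would cost,
# and what it actually consumed. Placeholder values, not study data.
predicted = np.array([1.2e6, 2.0e6, 1.5e6, 3.1e6, 2.4e6, 1.8e6])
actual    = np.array([0.9e6, 28.0e6, 2.2e6, 4.1e6, 6.3e6, 1.1e6])

r = np.corrcoef(predicted, actual)[0, 1]
bias = (predicted - actual).mean()
print(f"Pearson r: {r:.2f}")             # the study's best case across models: 0.39
print(f"mean bias: {bias:,.0f} tokens")  # negative means systematic underestimation
```

If the correlation in your own logs sits below 0.4, you are at the frontier ceiling, and no routing or budgeting policy built on the agent’s self-estimates will hold.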
This matters at the deployment layer for a reason that is not immediately obvious. Enterprise pricing models — outcomes-based pricing, fixed-fee agent contracts, per-task billing — all assume that someone, somewhere, can estimate what a task will cost before committing to it. If the agent cannot estimate, and the model providers cannot estimate, and the deployment platforms cannot estimate, then the entity bearing the variance is whoever is at the end of the contract. Currently that is the customer.
The Phase-Level Picture: Cache Reads Dominate
A final piece of the economic picture comes from Bai’s phase-level analysis, which decomposed Claude Sonnet 4.5 trajectories into five problem-solving phases: Setup, Explore, Fix, Validate, Closeout. In every phase, the largest cost category was cache-read input tokens: the cumulative reuse of previously seen context. Cache reads dominated even though they are individually priced at roughly 1/80th the cost of output tokens. The sheer volume of accumulated context being re-processed at the cached rate was large enough that cheap-per-token cache reads still outweighed expensive-per-token output in aggregate cost.
Caching is currently the load-bearing optimization that makes agentic deployment economically viable. Without it, the costs Bai documents would be roughly an order of magnitude higher. With it, the costs are still 1,000× chat. Caching is not solving the problem — it is making the unsolved problem barely tolerable. What caching cannot fix is the upstream cause: the agent’s lack of a mechanism for deciding what context belongs in the next round at all. Cache reads stay cheap as long as the context window keeps growing in a predictable way. They stop being cheap the moment the agent decides to introduce new content — a new file view, a fresh test execution, a tool call returning a large payload.
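The arithmetic behind that domination is worth working once. In the sketch below, only the 1/80 cache-to-output price ratio comes from the text; the dollar rates and token volumes are assumptions, loosely scaled to the per-task averages quoted earlier.

```python
# Hypothetical prices; only the 1/80 cache/output ratio comes from the study.
OUTPUT_PRICE = 15.00 / 1e6            # $/token for generated output (assumed)
CACHE_READ_PRICE = OUTPUT_PRICE / 80  # cache reads at ~1/80th the output rate
INPUT_PRICE = OUTPUT_PRICE / 5        # assumed uncached input rate

cache_read_tokens = 4.0e6  # accumulated context replayed across rounds
output_tokens = 27_000     # tokens actually generated

print(f"cache reads: ${cache_read_tokens * CACHE_READ_PRICE:.2f}")  # $0.75
print(f"output:      ${output_tokens * OUTPUT_PRICE:.2f}")          # $0.41
print(f"if uncached: ${cache_read_tokens * INPUT_PRICE:.2f}")       # $12.00
```

Even at a 1/80 unit price, the replayed context outspends generation, and the uncached counterfactual lands roughly an order of magnitude above the cached total, which is exactly the gap described above.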
The agent has no governance over its own context construction. The substrate it runs on assumes context will be managed by the application layer above it. The application layer assumes the model knows how to manage its own context. Neither is true. The cost is what falls into that gap.
What the Corpus Has Not Yet Said
This post has examined what agents cost. It has not examined three questions the next three posts address.
First: why does the cost behave this way? The behavioral diagnosis — looping, redundant exploration, inability to recognize unsolvable tasks — is well-documented in the literature, including in research published by Anthropic itself on inverse scaling in test-time compute. Post 2 examines that diagnosis directly.
Second: what happens when this cost behavior meets enterprise reality? The deployment data describes the consequence at the institutional level:

- MIT NANDA’s finding that 95% of GenAI pilots deliver zero measurable P&L impact.
- The IBM 2026 IBV CEO Study, which found only 25% of AI initiatives delivered expected ROI and just 16% scaled enterprise-wide.
- S&P Global’s report that 42% of enterprises abandoned most AI projects in 2025.
- Morgan Stanley Research’s finding that only 30% of North American AI adopters cited quantifiable business impact in Q4 2025.

Post 3 examines what the POC Wall looks like when the cost variance documented above hits a procurement process designed for predictable software.
Third: what closes the gap? The substrate response is not theoretical. Liu et al.’s budget-aware agent scaling work (arXiv:2511.17006) and Chen et al.’s SEMAP protocol (arXiv:2510.12120) — which demonstrated a 69.6% reduction in function-level failures through protocol-layer intervention — both point in the same direction. The response is at the harness layer, not the model layer. Post 4 examines that response and where Luminity’s harness IP sits in that landscape.
The Frame This Series Operates Under
The argument across these four posts is not that agents are too expensive. It is that the cost is the visible signature of an invisible substrate gap, and the gap is structural. No prompt-engineering pattern, no model upgrade, no procurement clause closes it. Closing it requires building infrastructure that does not currently exist in most enterprise stacks: a harness layer that bounds the agent’s exploration, governs its context construction, audits its trajectory against the authorized objective, and makes its cost behavior predictable enough to price.
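A minimal sketch of where those controls sit in the loop may help. Everything here is hypothetical: `call_model`, `made_progress`, and `select_context` are stand-in hooks rather than any real framework API, and the thresholds are placeholders. The structure, not the numbers, is the claim: a hard budget check, a governed context policy, and a stall interrupt, all sitting outside the model.

```python
# Sketch of a budget-bounded harness loop. All names and thresholds are
# hypothetical stand-ins, not a real framework API.

MAX_TOKENS = 2_000_000  # hard budget: the loop cannot outspend it
STALL_LIMIT = 3         # consecutive rounds of spend without progress

def run_bounded(task, call_model, made_progress, select_context):
    history, spent, stalled = [], 0, 0
    while spent < MAX_TOKENS:
        context = select_context(history)  # governed context construction:
        step = call_model(task, context)   # only what the next round needs
        history.append(step)
        spent += step.tokens_used
        if step.done:
            return step.result
        stalled = 0 if made_progress(history) else stalled + 1
        if stalled >= STALL_LIMIT:
            raise RuntimeError(f"interrupted: no progress after {spent:,} tokens")
    raise RuntimeError(f"budget exhausted at {spent:,} tokens")
```

None of this is sophisticated. That is the point: the behaviors documented above persist not because bounding them is hard, but because nothing in the default stack is assigned the job.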
Enterprises that operate agentic systems without that substrate are not running cheap agents. They are running unbounded ones, on infrastructure that was never built to bound them. The bill is one of the ways the gap announces itself. It is not the most expensive way. The behavioral failures, the deployment ROI gap, and the legal exposure documented in the next three posts are larger costs by orders of magnitude.
Agentic systems consume tokens at a rate roughly 1,000× chat. The variance is heavy-tailed: same task, same agent, 2× cost spread; similar workflows across teams, 11× cost spread. Higher cost does not buy higher accuracy — at scale, it buys looping. Frontier models cannot predict their own consumption with correlations above 0.39, and they systematically underestimate. Caching is making the unsolved problem barely tolerable, not solving it.
There are no free lunches. The first one is the one just unpacked. The next is already on the table.
