The Number That Should Not Be Survivable
The most cited single statistic in enterprise AI in 2025 came not from a vendor, not from an analyst firm, and not from a model provider. It came from a research project at MIT’s NANDA initiative — the academic working group studying the deployment of decentralized AI systems. The headline finding, published in mid-2025 under the title The GenAI Divide, is the kind of number that should have triggered a category-wide reset on the day it was released.
Across 350 enterprise GenAI deployments studied, 95% delivered zero measurable P&L impact, despite an estimated $30–40 billion in cumulative spend. The number is not survivable as a baseline outcome in any other category of enterprise software investment.
A 5% success rate on a discretionary capital deployment would, in the ordinary course, produce a board-level review, a vendor consolidation, a procurement audit, and a sharp realignment of capital allocation toward the surviving 5%. The fact that this has not happened in the agentic AI category — that the spending has continued, that the budgets have grown, that the procurement processes have remained substantively unchanged — is itself a signal worth pausing on. It tells us that either the data is not believed, the data is not understood, or the institutional incentives are misaligned in a specific way. This post argues for the second possibility: the data is not yet understood, because the field has been treating it as a series of separate failures rather than as a single structural signature.
The MIT NANDA finding is the headline, but it is not isolated. It is one of at least four independent institutional studies that converged in 2025 on substantively the same conclusion. The convergence is what makes the signal structural.
Four Independent Studies Converge
Four institutions, four methodologies, four populations. One diagnosis.
The second study is IBM’s Institute for Business Value 2026 CEO Study, drawn from interviews with several thousand senior executives at firms above a defined revenue threshold. It surfaces two numbers worth holding next to each other: only 25% of the AI initiatives the CEOs reported delivered their expected ROI, and only 16% reached enterprise-wide scale. The remaining majority — by IBM’s own classification — were either in indefinite pilot status, had been quietly deprioritized, or had been terminated.
The third study comes from S&P Global Market Intelligence, whose 2025 enterprise AI deployment report drew on a different methodology — annual survey panel data from over 1,500 enterprise IT decision-makers across multiple verticals — and arrived at a third independent number. 42% of S&P’s respondents reported having abandoned the majority of their AI projects in 2025, up substantially from the equivalent figure the prior year. The report characterized the trend as a maturation of expectations: enterprises that had committed capital under broad strategic mandates in 2023 and 2024 were now conducting cold-eyed reviews of which deployments justified continued investment.
The fourth study comes from Morgan Stanley Research’s Q4 2025 analysis of North American AI adopter business impact, drawn from earnings disclosures and analyst coverage of enterprises that had publicly identified themselves as material AI deployers. Morgan Stanley’s methodology examined which of those adopters could substantiate, in their own disclosures and against their own pre-deployment baselines, a quantifiable business impact attributable to AI deployment in Q4 2025. Only 30% of the North American AI adopters in Morgan Stanley’s sample cited a quantifiable business impact, up from 16% in Q4 2024 — a meaningful year-over-year improvement that nonetheless leaves 70% of self-identified adopters unable to substantiate a measurable impact for the deployments they had publicly committed to.
These four studies — MIT NANDA, IBM 2026 IBV CEO Study, S&P Global, Morgan Stanley Research — used four different methodologies, surveyed four different populations, and worked from four different definitions of failure. They converged on substantively the same finding: enterprise agentic AI deployment is failing at rates that, in any other category of enterprise software investment, would already have produced a category-wide reset. The numbers vary. The signal is the same, and the signal is structural.
The POC Wall, At The Institutional Layer
The convergence of those four studies is the institutional signature of the same dynamic Post 1 and Post 2 documented at the technical and behavioral layers. The looping signature from Post 1 — repeated file viewing, repeated tool calls, runaway token consumption with heavy-tailed variance — is what an agent is doing inside a single trajectory when the substrate underneath it cannot bound exploration. The inverse scaling regime from Post 2 is what a model is doing inside an extended reasoning chain when nothing above the model decides when reasoning is sufficient. Both are signatures of the same gap. The institutional signature is what the same gap produces inside an enterprise procurement, deployment, and operations cycle.
The POC Wall is the structural break between a pilot deployment that demonstrates capability in a controlled environment and a production deployment that operates against an actual enterprise process under actual enterprise constraints. The term was introduced in the Harness Imperative series and developed across that work. Most agentic systems do not cross the wall. The pilot succeeds. The production deployment fails. The four studies cited above are measuring different facets of that same wall.
The POC Wall is not a procurement problem. It is not solved by tighter vendor selection, longer pilots, more rigorous evaluation criteria, or better contracting. Those interventions address the symptoms — they do not address the cause. The cause is that the substrate the pilot is running on is sufficient for demonstration and insufficient for production.
Inside the pilot, the gap stays invisible by construction:

- Cost variance does not matter — budget is exploratory
- Looping signature produces no visible consequence
- Inverse scaling regime rarely reached — tasks are simple, bounded
- Behavioral failures absorbed by demonstrator’s tolerance for retries
- Success criteria designed around the demonstration, not the operation
- Failure has no counterparty

In production, every one of those conditions inverts (see the sketch after this list):

- Cost variance matters — budget is fixed against measurable benefit
- Looping signature produces visible operational consequences
- Inverse scaling regime regularly reached — tasks are complex, unbounded
- Behavioral failures absorbed by customers, counterparties, regulators
- Success criteria measured by lagging operational outcomes
- Failure has named counterparties — and balance-sheet consequences
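The difference between the two lists is not model capability. It is whether anything above the model enforces the production-side constraints before the lagging indicators arrive. A minimal sketch of what that enforcement could look like, in illustrative Python (the guard, its thresholds, and every name in it are assumptions for this post, not any vendor's API):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TrajectoryGuard:
    """Hypothetical harness-layer guard: bounds cost and catches loops."""
    token_budget: int                 # fixed before the run, per the list above
    loop_threshold: int = 3           # identical calls before we call it a loop
    tokens_used: int = 0
    calls: Counter = field(default_factory=Counter)

    def check(self, tool: str, args_key: str, tokens: int) -> str:
        """Return 'proceed', 'loop', or 'over_budget' for the next step."""
        self.tokens_used += tokens
        if self.tokens_used > self.token_budget:
            return "over_budget"      # cost variance now has a hard bound
        self.calls[(tool, args_key)] += 1
        if self.calls[(tool, args_key)] >= self.loop_threshold:
            return "loop"             # Post 1's looping signature, caught early
        return "proceed"

guard = TrajectoryGuard(token_budget=50_000)
for step in [("read_file", "report.md", 900)] * 4:   # agent re-reads same file
    verdict = guard.check(*step)
    if verdict != "proceed":
        print(f"halt and escalate: {verdict}")       # do not silently retry
        break
```

Nothing in this sketch is sophisticated. The point is where it lives: above the model, in the layer the pilot never needed and the production deployment always does.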
The POC Wall is what the substrate gap looks like from the outside. The 95% failure rate, the 25% ROI ceiling, the 42% abandonment rate, and the 30% quantifiable-impact number are what failures to cross that wall look like at scale. Two examples make the dynamic concrete.
When the Failure Becomes Public — Two Worked Examples
The first is Klarna. In late 2023 and early 2024, Klarna publicly announced that its AI customer-service agent, deployed via a partnership with OpenAI, was performing the work of approximately 700 full-time customer service roles. The announcement was widely covered as an early operational proof point for agentic deployment at enterprise scale. Through 2024, Klarna continued to position the deployment as a category-defining example. In early 2026, the company publicly reversed course. The reversal was reported on the basis of two specific findings disclosed by Klarna’s own leadership: a customer satisfaction metric decline of approximately 22% from pre-deployment baselines, and a revenue attribution concern in which Klarna’s leadership could not isolate the AI agent’s contribution to customer outcomes from the contribution of remaining human staff. The company announced it would be rehiring customer service roles and re-routing meaningful portions of customer interactions back to human handling.
The Klarna reversal is not a story about Klarna making a mistake. It is a story about what happens when an agentic deployment crosses the POC Wall and meets the dynamics Posts 1 and 2 documented. The agent was capable of performing customer service work in the limited sense that mattered for the pilot. It was not capable of performing customer service work in the unbounded sense that mattered for the production deployment, because the substrate underneath it provided no mechanism for bounding the inverse scaling regime when conversations extended, no mechanism for recognizing when the agent had drifted off the authorized objective, and no mechanism for the firm to detect those failures except by waiting for them to show up in lagging customer satisfaction data.
The 22% CSAT drop is the institutional reading of the looping signature.
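What the substrate could have supplied is not exotic. A sketch, with hypothetical thresholds and a deliberately crude drift proxy (nothing here reflects Klarna's actual stack), of turning the lagging CSAT signal into a leading, per-conversation one:

```python
# Illustrative only: per-conversation bounds that surface failure as a
# leading indicator instead of waiting for lagging CSAT. Thresholds and
# the drift proxy are assumptions, not Klarna's implementation.

MAX_TURNS = 12        # assumed bound before the inverse scaling regime bites
DRIFT_LIMIT = 0.35    # assumed ceiling on off-objective content per reply

def off_objective_score(reply: str, authorized_vocab: set[str]) -> float:
    """Toy drift proxy: share of reply words outside the authorized scope."""
    words = reply.lower().split()
    return sum(w not in authorized_vocab for w in words) / len(words) if words else 0.0

def should_escalate(turn: int, reply: str, vocab: set[str]) -> bool:
    # Hand off to a human when the conversation extends past the bound or
    # the agent drifts off the authorized objective, before CSAT records it.
    return turn > MAX_TURNS or off_objective_score(reply, vocab) > DRIFT_LIMIT

vocab = {"refund", "order", "delivery", "return", "invoice", "your"}
print(should_escalate(turn=13, reply="your refund order", vocab=vocab))  # True
```

A real deployment would use a far better drift signal than vocabulary overlap. The structural point survives the crudeness: both checks run per conversation, in real time, above the model.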
The second example is Moffatt v. Air Canada, 2024 BCCRT 149 — the British Columbia Civil Resolution Tribunal decision issued in February 2024. The facts of the case are public and worth holding precisely. A passenger booked a flight on Air Canada in late 2022 following the death of a grandparent. The passenger interacted with Air Canada’s customer service chatbot, which provided information about Air Canada’s bereavement fare policy. The chatbot’s stated policy contradicted Air Canada’s actual published bereavement fare policy on a material point — specifically, on whether a customer could apply for the bereavement discount after the fact rather than at the time of booking. The passenger relied on the chatbot’s stated policy. Air Canada subsequently refused to honor the discount on the basis that the customer had been told incorrect information by what Air Canada’s lawyers characterized in the tribunal proceedings as “a separate legal entity.” The tribunal disagreed. The decision held that Air Canada was responsible for all information provided on its website — including by the chatbot — and ordered the airline to pay the difference between the bereavement fare and the regular fare the passenger had paid, plus tribunal fees.
The Air Canada decision is the first significant common-law precedent establishing direct corporate liability for the outputs of an autonomous customer-facing AI agent. The decision did not turn on the chatbot’s technology, the model behind it, or the vendor relationship. It turned on a structural finding: the company deployed the system, the system produced output to a customer, the output bound the company.
Whether the chatbot was running on a frontier model, a fine-tuned model, a retrieval-augmented system, or a deterministic decision tree was not material to the tribunal’s reasoning. What was material was that the output had been provided in the company’s name, and the company had no mechanism in place to constrain what the system could promise.
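The structural lesson generalizes: a customer-facing agent needs a constraint layer between its output and the customer. A minimal sketch of one such check follows; the policy keys, the matching rule, and the assumption that claims arrive already structured are all simplifications for illustration:

```python
# Sketch of an output constraint on customer-facing promises. Extracting
# structured claims from free text is the hard part this sketch elides.

PUBLISHED_POLICY = {
    "bereavement_discount_after_travel": False,   # the actual published rule
    "bereavement_discount_at_booking": True,
}

def constrain_reply(claims: dict[str, bool]) -> tuple[bool, str]:
    """Release a reply only if every policy claim matches the published policy."""
    for policy, promised in claims.items():
        if PUBLISHED_POLICY.get(policy) != promised:
            # The agent promised something the company does not offer:
            # block the reply and route the conversation to a human.
            return False, f"blocked: unverified claim on '{policy}'"
    return True, "released"

# The claim the Air Canada chatbot made, expressed as input this guard rejects:
print(constrain_reply({"bereavement_discount_after_travel": True}))
```

Had anything of this shape sat between the chatbot and the passenger, there would have been no contradicting promise to litigate.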
The two examples — Klarna and Air Canada — describe the two flavors of consequence the POC Wall produces. Klarna is the operational consequence: customer outcomes degrade, revenue attribution fails, and the firm reverses a public commitment under cost pressure. Air Canada is the legal consequence: courts and tribunals establish that the firm is responsible for the outputs of systems it deploys, regardless of vendor relationships, regardless of model provenance, regardless of the firm’s own characterization of the system as a “separate legal entity.” Both consequences flow from the same structural absence: nothing in the deployment stack bounded what the agent was allowed to do on the firm’s behalf.
Capital Markets Are Beginning to Price the Gap
The convergence at the operational and legal layers is now showing up at a third institutional layer: capital markets. Two findings published in 2025 are worth holding next to each other.
The first comes from Citi’s credit research group, which published an analysis in Q3 2025 examining credit spreads on debt issued by enterprises with material disclosed AI deployment programs. The Citi analysis controlled for sector, debt levels, scale, and conventional credit metrics, and isolated a residual spread differential that the researchers attributed to AI deployment exposure.
The residual was approximately 30 basis points of additional credit spread on debt issued by firms with significant AI deployment commitments, relative to comparable issuers without such commitments. Thirty basis points is not, on its own, category-defining — but it is the first time, on the available public record, that credit researchers have isolated an AI-specific spread component at all. The fact that it is measurable is what matters.
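A back-of-envelope translation into dollars makes the "measurable" point concrete. The issuance sizes below are hypothetical; only the 30bps figure comes from the Citi analysis:

```python
# Annual incremental interest implied by a 30bp spread at issuance scale.
spread_bps = 30
for issuance in (500e6, 2e9, 10e9):          # hypothetical: $0.5B, $2B, $10B
    extra = issuance * spread_bps / 10_000   # 1 bp = 1/10,000 of notional
    print(f"${issuance / 1e9:.1f}B issued -> ${extra / 1e6:.1f}M per year")
```

At a $10B debt stack, the spread is $30M a year of carrying cost attributable, in Citi's decomposition, to AI deployment exposure alone.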
The second finding comes from Morgan Stanley Research’s same Q4 2025 analysis cited above. Beyond the headline 30% quantifiable-impact number, the analysis surfaced a structural observation: enterprises whose 2024 and 2025 AI deployments could not be substantiated against pre-deployment baselines were subsequently penalized in the equity-research models that feed institutional investor positioning. The penalty was specifically associated with a firm’s inability to articulate a credible operational basis for distinguishing its future AI deployments from the unsubstantiated ones. The penalty was not for having tried and failed. It was for being unable to explain why the next attempt would be different.
Both findings point in the same direction. The deployment gap is now visible to the institutional actors that price corporate risk. The implication for enterprise treasurers, investor relations functions, and procurement processes is structural. Until firms can articulate a credible operational basis for distinguishing the failed deployments from the deployments they are now undertaking, the credit and equity markets are positioned to apply a discount. The discount is small in 2025. There is no structural reason to assume it will remain small.
The Same Gap, A Third Time
The argument across Series 14 has built one finding across three layers. Post 1 showed the economic surface: agentic workloads consume on the order of 1,000× the tokens of chat, the variance is heavy-tailed, more compute does not buy more accuracy, and models cannot predict their own cost. Post 2 showed the behavioral diagnosis: Anthropic’s own research documents five failure modes that emerge specifically in extended reasoning, MAST documents the multi-agent mirror at 41–86.7% failure rates, and the failures are properties of the regime rather than properties of the models. This post has shown the institutional signature: the 95% pilot failure rate at MIT NANDA, the 25% ROI ceiling and 16% scale ceiling in the IBM 2026 IBV CEO Study, the 42% abandonment rate at S&P Global, the 30% quantifiable-impact rate at Morgan Stanley Research, the 30 basis points of credit spread at Citi, the Klarna operational reversal, and the Air Canada legal precedent.
Three layers. One gap.
The three layers are not independent failure categories. They are three measurements of the same phenomenon at three different scales of observation. The token bill is what the gap looks like inside a single trajectory. The behavioral signature is what the gap looks like inside a reasoning chain. The deployment failure rate is what the gap looks like inside an enterprise procurement and operations cycle. The Klarna reversal is what the gap looks like inside a customer service organization at the moment the lagging indicators arrive. The Air Canada decision is what the gap looks like inside a courtroom. The Citi spread is what the gap looks like inside a credit committee.
None of these failures can be addressed at the layer where they appear. The deployment failures are not solved by better procurement. The legal precedents are not addressed by tighter contracts. The credit spreads are not closed by better investor communication. The behavioral failures are not patched by better models. The token bills are not bounded by better caching. All six are signatures of the same absent layer.
Four institutional studies converged in 2025 on substantively the same finding: enterprise agentic AI deployment is failing at rates that would have produced a category-wide reset in any other software category. The POC Wall is the structural break the studies are measuring. Klarna is the operational consequence. Air Canada is the legal consequence. Citi’s 30bps is the capital-markets consequence. The same substrate gap shows up at all three layers — and at the trajectory layer (Post 1) and the reasoning layer (Post 2).
Post 4 — What Actually Closes the Gap examines the structural response. Liu et al.’s budget-aware tool-use work (arXiv:2511.17006) and Chen et al.’s SEMAP protocol (arXiv:2510.12120) — which demonstrated a 69.6% reduction in function-level failures through protocol-layer intervention — point to the same architectural conclusion: the response is at the harness layer, not the model layer, not the framework layer, and not the procurement layer.
The fourth lunch is on the table.
