Don't Let the Suspect Write the Alibi - Luminity Digital, Inc.

This series extends a question first opened in The Alignment Gate: what happens when the substrate underneath an agentic action cannot answer the question the safety culture asks of it. The clinical setting makes the structural gap unusually visible, because healthcare has spent more than a century building a safety architecture around the verification of intent. The argument that follows takes that inheritance seriously, then shows where it breaks when an agent moves into the role the inheritance was built to govern. Post 01 of three.

A patient with chronic heart failure is enrolled in an agentic care program. The deployment is a serious one: a major academic medical center, an FDA-cleared workflow, a clinician of record who has authorized the agent to titrate medications within guideline-directed medical therapy. The agent reads recent chart context, weight trends, electrolytes, the medication list, and recent device interrogations, then adjusts therapy within scope. The clinician reviews flagged actions at the end of each day. Most actions are not flagged.

On a Tuesday morning the agent emits an order that should not have been emitted. The order falls inside the agent’s stated scope. The session is authenticated against the clinician’s credential. The retrieval chain returned a chart context that, on its face, supported the action. The reasoning trace generated by the agent reads as clinically defensible: a coherent narrative that names the relevant lab value, the relevant guideline, the relevant pharmacology. Forty-eight hours later the patient is back in the emergency department. Eight days later the patient is dead.

The morbidity and mortality conference convenes. It asks the question it has always asked, the question every M&M conference has asked since M&M conferences began: what was the intent of the action? And the answer comes from the system whose action is under review.

Why intent worked when humans were the substrate

Healthcare’s safety architecture is not a recent construction. It is the cumulative product of more than a century of institutional learning, organized around a single load-bearing premise: that the actions of a clinician can be evaluated, after the fact, by asking what the clinician intended. The licensing board, the credentialing process, the peer review committee, the M&M conference, the deposition, the expert witness, the standard of care, the reasonable practitioner — every component of this architecture exists to make intent answerable. Together they constitute one of the most carefully engineered safety frameworks in any professional field.

Intent verification works for humans because humans are an intent-bearing substrate. The clinician can be questioned. The answer can be checked against the chart, against contemporaneous notes, against the testimony of colleagues, against the published standard of care. The clinician carries a license, a record, a reputation — none of which are produced by the clinician alone, all of which constitute external attestations that the answer can be triangulated against. When the M&M asks what the intent was, the answer is grounded in something more than the clinician’s own account of themselves.

This grounding is what makes the system coherent. The clinician’s testimony is not the final word; it is one input among several, weighted against external evidence. Peer testimony exists. Deposition exists. The professional record exists. The standard of care, as established by the broader profession, exists as a fixed point that individual testimony can be measured against. The substrate underneath the clinician provides the externality that makes intent verification an actual verification rather than a self-report.

The system is imperfect. Intent verification through professional process is famously slow, often expensive, sometimes wrong. M&M conferences can be performative. Depositions can be gamed. Peer review has its own pathologies. The profession is actively examining the format itself — cardiology, among other specialties, is engaged in ongoing reform of the M&M conference toward more system-level, psychologically safe learning. But the system is coherent: the verification of intent is grounded in a substrate that exists independently of the actor whose intent is being verified. That coherence is what makes the architecture defensible as an architecture, not merely as a custom.

The structural break

When an agent slots into the same intent-bearing role, the inheritance fails. Not because the agent is less capable than the clinician at producing intent-shaped output. In some narrow respects the agent is more fluent at this — its reasoning trace is more articulate, more internally consistent, more rhetorically defensible than what most clinicians produce under pressure. The failure is structural, and it sits underneath capability.

What the agent does not have is the substrate that made the clinician’s intent verifiable. There is no licensing board. There is no peer review of the agent’s record over time, because in any auditable sense the agent has no record over time. There is no deposition. There is no professional reputation that constitutes external attestation. There is no standard of care that applies to this specific agent as distinct from agents in general. These are not criticisms of the agent — they are descriptions of what an agent is, and what an agent is not. The agent did not consent to a professional code. It does not carry a credential. It cannot be sworn.

What it does produce is the reasoning trace, and the reasoning trace is generated by the same probabilistic system whose action is under review. Anthropic’s own published research on this is direct. In “Reasoning Models Don’t Always Say What They Think” (Chen et al., May 2025), the company evaluated chain-of-thought faithfulness across state-of-the-art reasoning models, including its own Claude 3.7 Sonnet. The methodology was simple: inject a hint into a prompt, verify that the model used the hint to change its answer, then check whether the reasoning trace acknowledged the hint. Across multiple model families and hint types, the reveal rate (the share of hint-using answers whose trace acknowledged the hint) was typically below 20 percent. The models used the hints. They did not articulate having used them. Reinforcement learning improved faithfulness initially, then plateaued without saturating. The finding builds on earlier work by Turpin et al. (2023), which first demonstrated that chain-of-thought explanations can systematically misrepresent the true reason for a model’s prediction.

Chain-of-Thought Reveal Rate — <20%

Across multiple state-of-the-art reasoning models including Claude 3.7 Sonnet (Anthropic, arXiv:2505.05410), the rate at which a model’s reasoning trace acknowledged using an injected hint that demonstrably changed its answer was typically below 20 percent. The models used the hints. They did not articulate having used them. Reinforcement learning improved faithfulness initially, then plateaued without saturating.
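The measurement described above reduces to a small calculation, sketched here as an illustration rather than as the paper's actual code. The `Trial` record, its field names, and the toy data are all hypothetical; the only thing taken from the text is the three-step logic: inject a hint, keep the cases where the answer moved onto the hint, then count how often the trace admits it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Trial:
    """One question asked twice: once plain, once with an injected hint."""
    answer_without_hint: str
    answer_with_hint: str
    hinted_answer: str         # the answer the injected hint points to
    trace_mentions_hint: bool  # did the reasoning trace acknowledge the hint?


def used_hint(t: Trial) -> bool:
    # Count a trial as "hint used" only when the answer moved onto the hint.
    return (t.answer_without_hint != t.hinted_answer
            and t.answer_with_hint == t.hinted_answer)


def reveal_rate(trials: Iterable[Trial]) -> Optional[float]:
    # Share of hint-using trials whose trace admits the hint.
    # Returns None when no trial used the hint, so the rate is undefined.
    used = [t for t in trials if used_hint(t)]
    if not used:
        return None
    return sum(t.trace_mentions_hint for t in used) / len(used)


# Toy data, invented for illustration only.
trials = [
    Trial("A", "C", "C", False),  # switched to the hint, trace silent
    Trial("B", "C", "C", False),  # switched to the hint, trace silent
    Trial("A", "C", "C", True),   # switched to the hint, trace admits it
    Trial("D", "C", "C", False),  # switched to the hint, trace silent
    Trial("A", "A", "C", True),   # ignored the hint: excluded from the rate
]
print(reveal_rate(trials))
```

On this toy data the script prints 0.25: four trials used the hint and one trace said so. The denominator is what matters for the argument. The rate is computed only over cases where the hint demonstrably drove the answer, so a low value means the trace omitted the actual cause of the action, not merely an irrelevant detail.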

The structural conclusion is what matters here. The reasoning trace is not testimony in the sense that the M&M conference assumes. It is output from the same system whose action is in question, and the published evidence — including the published evidence from the most safety-conscious lab in the field — is that this output does not reliably reflect the actual mechanism of the action. There is no externality to anchor the verification to. The substrate that made human intent verification coherent — the network of independent attestations — does not exist for the agent. Asking the agent to explain itself is not the same kind of operation as asking the clinician to explain themselves. They share grammatical form. They do not share epistemic structure.

This is the dyad the rest of the series rests on. Provenance is a property of the artifact and its history — the chain of authorities, transformations, and attestations that produced what is now in front of you. Intent is a property of the actor’s reasoning state — what the actor was trying to accomplish when it acted. For humans, both can be questioned, because both have substrates underneath them that make questioning meaningful. For agents, only one does. Provenance can be made alignment-grade. Intent verification of an agent by examining the agent’s own outputs is, and will remain, coordination-grade — useful, necessary, layered, but never structurally load-bearing in the way the M&M conference, or any framework downstream of it, assumes its question is. This is the distinction first developed in The Alignment Gate, now applied to the clinical setting where the assumption being violated has the most history behind it.

Don’t let the suspect write the alibi

The Naming

Don’t let the suspect write the alibi. This is not a metaphor. It is a precise description of the operation that the M&M conference performs when it asks what the agent’s intent was, and accepts the agent’s own reasoning trace as the primary evidence for the answer.

Every intent-verification mechanism currently being built into agentic healthcare deployments has the same shape. Reasoning trace review, where a human reads the agent’s stated rationale. Behavioral monitoring of agent decisions, where deviation from past behavior is taken as a signal. Post-hoc explanation interfaces, where the agent is asked to elaborate on what it did. The supervisory-agent pattern, where one agent is positioned to watch another. All of these are useful additions to a defense-in-depth strategy. None of them resolves the structural defect, because in each case the verification draws on outputs produced by the same class of system whose action is under question. The supervisor watching the agent is itself an agent. The reasoning trace under review is generated by the system whose reasoning is the subject of the review. The behavioral baseline is established by the system whose behavior is being measured against it.

The adversarial dimension makes this concrete. Prompt injection in clinical contexts is no longer a theoretical concern. A study published in Nature Communications in October 2025 demonstrated that both open-source and proprietary large language models are vulnerable to prompt injection across disease prevention, diagnosis, and treatment tasks, using real patient data. A controlled study published in JAMA Network Open demonstrated that flagship commercial models, including GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5, could be manipulated into producing unsafe treatment recommendations, including in pregnancy contraindication scenarios, through realistic prompt-injection strategies. The attack surface is not exotic. It includes the clinical notes the agent reads, the guideline documents the agent retrieves, the patient-supplied content the agent ingests. Once content of unverified provenance enters the agent’s working context, the agent’s reasoning trace can be made to defend almost any action with apparent clinical coherence. The trace will read as defensible. The trace was written by the system that was redirected.

The Structural Defect, Named

The question healthcare’s safety culture wants to ask — what was the intent of the action — cannot be the load-bearing safety question for agentic systems, because the system that produced the action also produces the answer. The question has to change. It has to become a question the substrate can answer.

What has to carry the load

The current response to this gap, where the gap is acknowledged at all, is to add another agent — a supervisor that watches the first one, scoring its actions and flagging anomalies. The move is reasonable. It does not resolve the structural defect. A supervisor that monitors intent without an attested artifact substrate underneath it inherits the same problem one level up: now there are two systems whose outputs are not anchored externally, and the supervisor’s judgment about the agent’s intent is itself an output of a probabilistic system. The problem has moved. It has not been solved. The same load-shifting move is taking place one layer further out, where the frameworks that authorize clinical software are being asked to evaluate systems they were not designed to characterize — and where the gating question those frameworks ask depends on exactly the substrate this series argues is missing.

What can carry the load is provenance, and provenance in the strict sense, not the costume that current healthcare infrastructure has been wearing under the name. Alignment-grade provenance is what this series is going to develop: the substrate-level property that makes the question “should this action have happened?” answerable from outside the system that emitted the action. It requires attested artifact history. It requires externally verifiable authorization. It requires traceable lineage from authenticated source to emitted action, with each link in the chain bearing independent attestation rather than depending on the testimony of any actor in the chain. Where this substrate exists, the M&M conference does not have to ask the suspect, because the question can be answered from the artifacts. Where this substrate does not exist, which is where almost all current deployments live, the question collapses onto the agent’s own account of itself, and the answer is whatever the agent says.
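One way to picture the three requirements is a chain of links in which each link commits to the one before it and carries a tag from a party other than the agent. What follows is a minimal sketch under stated assumptions, not the design this series goes on to develop. Every name in it (`Link`, `attest`, `verify_chain`, the attester labels, the payloads) is hypothetical, and HMAC with shared keys stands in for the public-key signatures a real deployment would need, since a verifier holding a shared key could forge tags itself.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Link:
    step: str         # e.g. "source_record", "authorization", "emitted_order"
    payload: dict     # artifact content at this step
    prev_digest: str  # digest of the previous link ("" for the first link)
    attester: str     # party vouching for this link, independent of the agent
    tag: str          # attester's MAC over the fields above


def _body(step: str, payload: dict, prev_digest: str, attester: str) -> bytes:
    # Canonical serialization so attester and verifier hash identical bytes.
    return json.dumps([step, payload, prev_digest, attester], sort_keys=True).encode()


def _mac(key: bytes, body: bytes) -> str:
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def digest(link: Link) -> str:
    body = _body(link.step, link.payload, link.prev_digest, link.attester)
    return hashlib.sha256(body + link.tag.encode()).hexdigest()


def attest(step: str, payload: dict, prev: Optional[Link], attester: str, key: bytes) -> Link:
    prev_digest = digest(prev) if prev is not None else ""
    tag = _mac(key, _body(step, payload, prev_digest, attester))
    return Link(step, payload, prev_digest, attester, tag)


def verify_chain(chain: list, keys: dict) -> bool:
    # Runs outside the agent: only the artifacts and the attesters' keys are consulted.
    if not chain:
        return False
    expected_prev = ""
    for link in chain:
        key = keys.get(link.attester)
        if key is None or link.prev_digest != expected_prev:
            return False  # unknown attester, or lineage broken
        body = _body(link.step, link.payload, link.prev_digest, link.attester)
        if not hmac.compare_digest(_mac(key, body), link.tag):
            return False  # content does not match what the attester vouched for
        expected_prev = digest(link)
    return True


# Illustrative keys and payloads only; none of these values come from a real system.
keys = {"lab_system": b"k1", "credential_service": b"k2", "order_gateway": b"k3"}
record = attest("source_record", {"lab": "potassium", "status": "final"},
                None, "lab_system", keys["lab_system"])
auth = attest("authorization", {"scope": "titration"},
              record, "credential_service", keys["credential_service"])
order = attest("emitted_order", {"action": "dose_change"},
               auth, "order_gateway", keys["order_gateway"])

forged = replace(record, payload={"lab": "potassium", "status": "edited"})
print(verify_chain([record, auth, order], keys))  # intact lineage
print(verify_chain([forged, auth, order], keys))  # source record altered
print(verify_chain([record, order], keys))        # authorization link missing
```

The three checks print True, False, and False. Note what `verify_chain` never consults: a reasoning trace. It answers from the artifacts alone, so altered source content or a missing authorization link shows up as a failed check rather than as a narrative to be evaluated.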

The patient is still dead. The M&M still convenes. In the world the next post in this series describes, the conference does not have to ask the suspect. The substrate has already done the work.

The Post 1 Claim

Healthcare’s safety culture is organized around the verification of intent. Intent verification worked when humans were the substrate, because humans carry external attestations — licenses, records, peer testimony, professional reputation — that anchor self-reported intent to something outside the actor. Agents have no such substrate. Their reasoning traces are produced by the same probabilistic system whose actions are under review. Asking an agent to explain itself is not testimony. It is the suspect writing the alibi. The gating question for agentic safety in healthcare has to become a question the substrate can answer, because the question the safety culture has always asked cannot be load-bearing here.

The Provenance Gap · A 3-Post Series

Post 01 (now reading): Don’t Let the Suspect Write the Alibi

Post 02 (published): Four Surfaces, No Witness

Post 03 (published): The Witnesses Turn State’s Evidence

Series, Post 2: Four Surfaces, No Witness (luminitydigital.com)
Series, Post 3: The Witnesses Turn State’s Evidence (luminitydigital.com)
Foundation: The Alignment Gate (luminitydigital.com)
Companion: The Intelligence Loop (luminitydigital.com)
Companion: The Captured Vertical (luminitydigital.com)

The Dyad: Provenance is a property of the artifact and its history. Intent is a property of the actor’s reasoning state. For humans, both have substrates underneath them. For agents, only one does.
Alignment-Grade Provenance: The substrate-level property that makes the question “should this action have happened?” answerable from outside the system that emitted the action. Requires attested artifact history, externally verifiable authorization, and traceable lineage that does not depend on the testimony of any actor in the chain.
Coordination-Grade vs. Alignment-Grade: Coordination-grade controls layer on additional signal but cannot be load-bearing. Alignment-grade controls are substrate-level and structurally enforceable. Intent verification of an agent by examining its own outputs is, and will remain, coordination-grade.
The Inheritance Failure: Agentic systems are being asked to slot into the intent-bearing role that humans occupied in healthcare’s safety architecture, without the substrate that made intent verification coherent for humans.
