When you work on AI in a regulated environment, you will hear the same question again and again: can the model say the same thing twice? The question is not wrong. It just aims too low. The real standard for regulated AI is whether the workflow can prove what happened.

Consider a finance team. They do not need a chatbot that can answer a question about revenue. They need to know which records the system could see, which permissions applied, which prompt version ran, which model generated the answer, which retrieval results shaped it, who approved it, and what became the official record.

A health insurer faces the same challenge. They do not need an algorithm that can recommend a care decision. They need to prove that a qualified human reviewed the evidence before the denial, delay, or modification of care. California’s SB 1120, the Physicians Make Decisions Act, now puts that principle into law for health plans that use AI or algorithms in medical necessity decisions.

In my experience, the conversation about AI governance keeps circling the wrong center. We keep asking whether the model is deterministic when we should be asking whether the workflow is reconstructable. Can the organization show what happened, why it was allowed, who had authority, and what became official? That is the Reconstructability Standard. The goal is to build workflows that make AI variation visible, bounded, testable, reviewable, and reconstructable, not to make probabilistic models pretend to be calculators. Regulated organizations will not trust AI because it is intelligent. They will trust AI when the workflow can prove what happened.

Why lowering the temperature is not enough

A common answer to AI unpredictability is: lower the temperature. That answer works for demos. It is too small for regulated work.

Lower randomness can help. Fixed model versions, fixed seeds, controlled infrastructure, deterministic tools, and careful sampling settings can make a system easier to test and debug. In some environments, they can make model behavior highly repeatable. But repeatability is not accountability.
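
To be concrete, here is a minimal sketch of what those controls usually amount to in code. The configuration values and the `call_model` stub are illustrative stand-ins, not a specific vendor's API, and not every provider honors a seed parameter.

```python
# Minimal sketch of "lower the randomness": pin every sampling knob.
# The values and the call_model stub are illustrative, not a vendor API.

PINNED = {
    "model": "example-llm-2024-06-01",  # fixed model version, never "latest"
    "temperature": 0.0,
    "top_p": 1.0,
    "seed": 42,  # some providers honor a seed; others do not guarantee it
}

def call_model(prompt: str, **config) -> str:
    raise NotImplementedError("swap in the team's actual model client")

def generate(prompt: str) -> str:
    # Same prompt + same pinned settings -> far more repeatable output.
    # Nothing here records permissions, evidence, review, or approval,
    # which is why repeatability alone is not accountability.
    return call_model(prompt, **PINNED)
```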

Even if the model gives the same answer twice, the workflow around it may not. The source document may have changed. The user may have different permissions. The retrieval layer may return different evidence. A prompt or tool schema may have been updated. A human reviewer may edit the draft before approval. Production adds integration, security review, compliance checks, drift, broken tool contracts, downstream edits, and audit requests. A demo never has to survive that burden.

This is why AI pilots often stall at the handoff to production. The model may be capable, but the workflow cannot answer basic questions:

  • Which facts did the AI rely on?
  • Which version of the prompt shaped the output?
  • Which tool call changed the result?
  • Which human approved the action?
  • Which policy allowed it?
  • Which final record became authoritative?

If the organization cannot answer those questions, the system is not production-ready for regulated use. This is one of the most common failure modes I see: a capable model embedded in a workflow that cannot answer basic audit questions.

The three kinds of determinism, and which one matters

There are three different ideas that often get collapsed into one word. The first is model-level determinism: the model gives the same output for the same input. For LLMs, that is useful where teams can achieve it, but it should not be the assumption the whole workflow rests on.

The second is computational determinism: the same calculation produces the same result. In finance, actuarial work, optimization, and official business records, this still matters. If two users ask for the same approved revenue number, the answer cannot depend on location, timing, or prompt phrasing. In those cases, deterministic code or clear-box systems should handle the exact computation.

The third is workflow reconstructability: the organization can rebuild the decision context after the fact. It can show what inputs were available, what the AI produced, what controls ran, what humans decided, and what became the official record.

Regulated AI needs the third idea as a baseline. In some workflows, it also needs the second. The mistake is treating the model as the place where all accountability must live.

I want to be explicit about something that matters for audits. A model-generated explanation is not an audit trail. Chain-of-thought text is generated by the same probabilistic system as the answer. It may be useful as a draft explanation, but you should not treat it as verified evidence of what actually happened. The audit record has to live outside the model, in tamper-resistant systems the organization can inspect and defend.

It lives in the input packet, the evidence set, the prompt registry, the model configuration, the tool logs, the policy checks, the review decision, the approval record, and the final artifact. That is the artifact layer. It is where AI accountability becomes possible.
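
As a sketch, the artifact layer can be as plain as a record with one field per question an auditor will ask. The field names below are invented for illustration; they are not drawn from any standard or product.

```python
# Illustrative artifact-layer record; field names are invented, not a standard.
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionArtifact:
    workflow_run_id: str
    input_packet: dict         # documents, instructions, permissions, retrieval boundaries
    evidence_set: list[dict]   # retrieved sources, citations, confidence, validation results
    prompt_version: str        # from the prompt and policy registry
    policy_version: str
    model_config: dict         # model id, sampling settings, tool and schema versions
    tool_log: list[dict]       # tool calls, validations, blocked operations
    policy_checks: list[dict]  # which controls ran and what they returned
    review: dict               # reviewer identity, edits, approve/reject/escalate/override
    final_record_ref: str      # pointer to the official record in the business system
```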

Why replaying the prompt is not enough

Imagine a compliance reviewer investigating an AI-generated recommendation that changed between Monday and Friday. Replaying the prompt is not enough.

On Monday, the user may have had access to one set of documents. By Friday, a source record may have changed. The retrieval layer may have returned different evidence. A policy prompt may have been updated. A tool schema may have failed validation. A model version may have changed. A human reviewer may have edited the draft before approval.

If the team only stored the prompt and response, the investigation stalls. What the organization needs is the whole workflow path: the original input and user instruction, the records and documents available at that moment, the permissions and retrieval boundaries in force, the prompt and policy versions, the model and configuration, the retrieved evidence, the tool calls and validations, the draft output before review, the human edits and approval decision, and the final record that entered the business system.

That is the Reconstructability Standard in practice. Not that the model will repeat the same words tomorrow, but that the organization can reconstruct why this outcome happened today.

Regulation is moving toward the artifact layer

This is no longer only an engineering preference. The EU AI Act requires high-risk AI systems to provide transparency for deployers, support human oversight, and maintain logs. NIST’s AI 600-1 Generative AI Profile names risks such as hallucination, data provenance, synthetic content, and human oversight as governance concerns for generative AI. Banking model risk expectations under SR 11-7 already push institutions toward conceptual soundness review, independent validation, ongoing monitoring, change management, and documentation.

Insurance is making the separation between AI output and human decision especially explicit. California’s SB 1120 requires physician or qualified health care provider oversight for medical necessity determinations involving AI or algorithms. New York’s Department of Financial Services has told insurers using AI and external data that they need governance, documentation, testing, and controls that explain how models function, what inputs they use, and how outputs affect decisions.

The direction is clear: regulators care less about whether an AI system sounds confident and more about whether the organization can explain, review, and control the path from input to outcome.

The UnitedHealthcare nH Predict controversy shows the risk. A Senate Permanent Subcommittee on Investigations report found that UnitedHealthcare’s post-acute care prior authorization denial rate rose sharply from 2020 to 2022 as the company expanded automation initiatives. Litigation has alleged that algorithmic predictions helped drive denials that were later reversed on appeal. Whatever the final legal outcome, the controversy illustrates the danger of insufficient transparency, weak oversight, and unclear authority when AI touches clinical coverage. In consequential workflows, the algorithm’s prediction should not be treated as the decision itself.

Keep the LLM on one side of the line

A probabilistic model can generate useful artifacts. It can draft a memo, summarize a claim file, extract fields from intake documents, propose inventory adjustments, identify missing evidence, or produce a structured recommendation.

But the workflow must decide what is true, safe, allowed, and final. That separation matters most when the output triggers a consequential action: a payment, denial, system-of-record update, legal filing, clinical coverage decision, credit decision, or regulated communication.

Consider an insurance claims intake workflow. A safe design does not ask an AI agent to approve or deny complex claims on its own. It asks the system to ingest documents, extract facts, flag exceptions, cite evidence, record confidence, route uncertain cases, and present a reviewable packet to a human decision-maker. The final disposition belongs to the authorized reviewer, not the model.
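
A short sketch of that routing step makes the boundary visible. The thresholds and field names are invented for illustration; real intake logic would be governed by the plan's own policies.

```python
# Sketch of an intake routing step; thresholds and field names are invented.

def route_claim(packet: dict) -> str:
    """Decide where an AI-prepared claim packet goes next. The AI never
    approves or denies; it prepares evidence and flags uncertainty."""
    if packet["exceptions"]:                    # missing docs, conflicting facts
        return "route_to_specialist_review"
    if packet["extraction_confidence"] < 0.9:   # illustrative threshold
        return "route_to_human_review"
    # Even "clean" packets go to an authorized reviewer; the model's output
    # is a reviewable recommendation, not a disposition.
    return "route_to_authorized_reviewer"
```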

The same pattern applies to finance. An AI assistant can help a finance team find records, summarize variance, or explain a metric. But if the question asks for an official number, the LLM should not invent the answer through token generation. A deterministic calculation layer should compute the number from governed data. The AI can explain the result, but it should not become the source of truth.
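
One way to express that boundary, as a sketch: the official number comes from a deterministic function over governed data, and the model only narrates it. Both stubs below are hypothetical stand-ins.

```python
# Sketch: the number comes from a deterministic calculation layer, never from
# token generation. Both functions are hypothetical stand-ins.

def governed_revenue(period: str) -> float:
    raise NotImplementedError("deterministic query over governed, approved data")

def explain_with_llm(prompt: str) -> str:
    raise NotImplementedError("model call; its text is commentary, not the number")

def answer_official_revenue(period: str) -> dict:
    official = governed_revenue(period)          # exact, reproducible
    return {
        "value": official,
        "source": "governed_calculation",        # the source of truth is not the model
        "explanation": explain_with_llm(
            f"Explain this approved revenue figure for {period}: {official}"
        ),
    }
```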

The pattern also applies to law. An AI tool can draft a brief section or summarize precedent. But the filing becomes an official legal record only after citation validation, source checking, attorney review, and approval. Court sanctions for hallucinated legal citations show the danger of letting generation pass into the record without a validation layer.
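
A validation gate can be a very small piece of code, sketched below with a hypothetical citation checker; the point is that unresolved citations block the filing path rather than travel with it.

```python
# Sketch of a citation gate before anything enters a filing.
# resolve_citation is hypothetical; in practice it would query an
# authoritative citation service or an approved source index.

def resolve_citation(citation: str) -> bool:
    raise NotImplementedError("look up the citation in an authoritative source")

def gate_draft_for_filing(citations: list[str]) -> dict:
    unresolved = [c for c in citations if not resolve_citation(c)]
    return {
        "ready_for_attorney_review": not unresolved,
        "unresolved_citations": unresolved,   # blockers, not warnings
    }
```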

Human review alone is not enough if it means a rushed rubber stamp. Meaningful review needs evidence, authority, time, training, workload design, and a recorded decision. The reviewer must be able to approve, reject, edit, escalate, or override the AI output under a defined policy. The artifact layer does not guarantee good judgment. It makes judgment visible enough to test, challenge, and improve.

Measure variation before you deny it

The goal is not to deny AI variation. The goal is to measure it before the workflow reaches production.

A regulated team should test the same workflow across repeated runs, realistic inputs, changed retrieval states, edge cases, permission differences, tool failures, and model updates. If a target field changes across runs when it should not, the workflow is not ready for that use case.

This is not only a model benchmark problem. Benchmarks can show capability. They do not prove the system will perform delegated work consistently inside the enterprise. A tool-calling agent can fail because the schema changed. A retrieval workflow can fail because the source boundary is too loose. A clinical support workflow can fail because the human approval step happens after the harm, not before. A finance workflow can fail because the model generated a plausible number instead of calling the governed calculation layer.

Variance testing should answer a practical question: is the variation acceptable for this workflow’s consequence level? For low-risk drafting, more variation is tolerable. For official records, financial outputs, legal filings, clinical coverage, and system updates, variation must be tightly controlled or moved out of the LLM.
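
A variance test can stay simple, as in the sketch below: run the same workflow on the same input many times and count how many distinct values the sensitive field takes. The tier names and tolerances are invented for illustration.

```python
# Sketch of a variance test; tier names and tolerances are invented.

TOLERATED_DISTINCT_VALUES = {"drafting": 5, "approval": 2, "consequential": 1}

def variance_test(run_workflow, fixed_input: dict, target_field: str,
                  tier: str, runs: int = 20) -> dict:
    # Count distinct values of a field that should be stable across runs.
    values = {str(run_workflow(fixed_input)[target_field]) for _ in range(runs)}
    allowed = TOLERATED_DISTINCT_VALUES[tier]
    return {
        "distinct_values": len(values),
        "allowed_for_tier": allowed,
        "acceptable": len(values) <= allowed,
    }
```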

The five questions a regulated workflow must answer

A regulated AI workflow is accountable enough when it can answer five questions without heroic investigation.

1. What shaped the output?

The workflow should preserve the input packet: documents, records, data, user instructions, permissions, system state, retrieval boundaries, prompt versions, policy versions, model configuration, tools, schemas, and runtime controls. If two users receive different answers, the input packet should show whether they had different permissions, different source access, different timing, or different instructions.

2. What evidence did the AI use?

The workflow should preserve the evidence set: retrieved documents, citations, records, facts, confidence scores, source boundaries, and validation results. For regulated work, sources are not background material. They are audit objects.

3. What happened during execution?

The workflow should preserve the execution path: tool calls, validations, policy checks, fallback flows, memory writes, system actions, blocked operations, and delegation chains. Agentic workflows need more than prompt-response logs. They need records of actions, state changes, failed operations, and the authority each agent or tool used.
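
As a sketch, the execution path can be captured as an append-only event stream written at the moment each action happens. The field names are illustrative, not a logging standard.

```python
# Sketch: record each tool call, the authority it ran under, and the outcome,
# at the moment it happens. Field names are illustrative.
import json
import time

def log_tool_call(log_path: str, run_id: str, tool: str, actor: str,
                  authority: str, args: dict, outcome: str) -> None:
    event = {
        "ts": time.time(),
        "run_id": run_id,
        "tool": tool,
        "actor": actor,          # human user, service account, or agent id
        "authority": authority,  # the permission or policy that allowed the call
        "args": args,
        "outcome": outcome,      # "ok", "validation_failed", "blocked", ...
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")   # append-only event stream
```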

4. Who reviewed and approved it?

The workflow should preserve the review path: the generated artifact, later edits, reviewer identity, approval decision, rejection, escalation, override, and governing policy. The record should show who initiated the action, whether the actor was human or AI, and why the action proceeded.

5. What became official?

The workflow should preserve the final business record: the approved artifact, decision, communication, update, filing, or system-of-record change. The final record should link back to the draft, evidence, controls, and approval path. A workflow that cannot be reconstructed is difficult to defend in a dispute, audit, incident review, or regulatory examination.

You do not need to store everything forever. You need to define the minimum evidence required to defend the workflow at the level of risk it creates.

Not every workflow needs the same controls

Not every AI workflow needs the same controls. A drafting assistant for internal brainstorming does not need the same artifact stack as an AI-assisted claims decision. A customer support summarizer does not need the same assurance as a system that updates a bank’s official ledger. A legal research assistant does not need deterministic computation for every sentence, but it does need citation validation before anything enters a filing. In my experience, a simple three-tier framework is usually enough.

Tier 1: AI-assisted drafting

The AI generates candidate text, summaries, or options. A human owns the final artifact. The main controls are source capture, draft comparison, review, and final approval.

Tier 2: AI-generated output with human approval

The AI recommends an action that affects a customer, patient, employee, counterparty, or regulated record. The main controls are evidence packets, policy checks, reviewer authority, approval records, escalation paths, and audit trails.

Tier 3: AI-linked consequential action

The workflow can trigger financial, clinical, legal, operational, or system-of-record changes. The main controls are deterministic computation where exactness matters, runtime enforcement, strong permissioning, mandatory pre-approval, rollback, monitoring, and incident response.

The framework does not make every workflow heavy. It helps teams stop applying light controls to heavy-risk work. That is the mistake I see most often: a team treats a Tier 3 workflow like a Tier 1 prototype because nobody named the difference.
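
Naming the difference can be as simple as a shared mapping that every workflow must declare against before launch. The control names below are illustrative labels, not references to specific products or standards.

```python
# Illustrative tier-to-controls mapping; control names are invented labels.

REQUIRED_CONTROLS = {
    "tier_1_drafting": [
        "source_capture", "draft_comparison", "human_final_approval",
    ],
    "tier_2_human_approved": [
        "evidence_packet", "policy_checks", "reviewer_authority",
        "approval_record", "escalation_path", "audit_trail",
    ],
    "tier_3_consequential": [
        "deterministic_computation", "runtime_enforcement", "strong_permissioning",
        "mandatory_pre_approval", "rollback", "monitoring", "incident_response",
    ],
}

def missing_controls(declared: set[str], tier: str) -> set[str]:
    # Declaring the tier makes the gap visible before launch, not after.
    return set(REQUIRED_CONTROLS[tier]) - declared
```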

Start with the artifact layer, not the governance framework

A governance checklist fails if nobody owns the records. Platform teams often own model access, logging, gateways, and tool permissions. Application teams own workflow design, user experience, and business logic. Compliance teams own policy interpretation, documentation, testing evidence, and regulatory response. Business owners own the risk of the final process. A production AI workflow needs all four.

The first move is not to build a massive new governance platform. Start by picking one workflow and defining its artifact layer:

  • The input packet
  • The evidence set
  • The prompt and policy registry
  • The model, tool, and schema versions
  • The execution and validation log
  • The review and approval record
  • The final business record
  • The variance tests required before release

That package can start as a sidecar log, a decision artifact store, a prompt registry, or an event stream tied to the existing application. The architecture matters less than the discipline: the workflow must create records while work happens, not after someone asks for an audit.
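
If records are written while the work happens, reconstruction becomes a query rather than a forensic project. A minimal sketch, assuming the sidecar log is a JSON-lines event stream keyed by a workflow run id:

```python
# Sketch: rebuild one workflow run from an append-only JSON-lines event log.
import json

def reconstruct_run(log_path: str, run_id: str) -> list[dict]:
    with open(log_path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f]
    # Every event for the run, in the order it was recorded.
    return [e for e in events if e.get("run_id") == run_id]
```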

The platform can record the model call, but it may not know whether the output became an official record. The application can show the user journey, but it may not know whether a retrieved source was permitted. Compliance can define review standards, but it cannot manually inspect every AI output. The business owner can approve the use case, but only if the system shows the evidence path.

This is why AI governance cannot live only in a policy document. It has to live in the workflow.

The standard is controlled accountability, not perfect sameness

Before trusting the output, can we show which inputs shaped it? Before accepting the answer, can we show which evidence supports it? Before letting the AI act, can we show which authority it used? Before treating it as official, can we show the final record and its source trail? Before scaling it, can we show how it behaves across repeated runs? Those are the gateways. If the answer is no, the problem is that the workflow still cannot prove what happened. Intelligence is not the missing piece.