When you are responsible for putting AI agents into production, you will face a governance question that sounds like security but turns out to be about operating standards. A policy tells a team what an AI agent is allowed to do. It does not tell the team whether the agent deserves to keep doing it. That distinction matters more as agents move from demos into production, where a customer-support agent can resolve cases, issue refunds, update records, and escalate when customers show frustration; a software-development agent can open pull requests, change dependencies, and trigger deployment steps; and a finance or insurance agent can gather documents, apply rules, and recommend decisions that affect real people.

In each of these cases, the governance question is not only “Is this agent allowed?” The better question is “What evidence would make this agent worth trusting in production?”

Many organizations already have AI policies, committees, intake forms, and approval gates. In my experience, the gap is not in having a policy. The gap is operational: teams approve or deploy agents without defining what the agent must improve, what it must never do, what evidence will prove it is working, and when humans must intervene.

This does not mean permission controls are secondary. Security reviews, access limits, legal review, vendor risk checks, and compliance policies all matter. Granular access control, context-aware permissions, least privilege, and just-in-time authority can prevent real harm. The problem is that permission controls define what an agent can attempt. They do not define whether the agent is improving the workflow, interpreting policy correctly, escalating the right exceptions, or still deserves the autonomy it has been given.

Agents need governance that starts with a provable obligation: value created, behavior bounded, risk constrained, actions visible, controls tested, and ownership assigned. Call that the production autonomy standard: the risk-scaled evidence an agent must meet to earn, keep, expand, or lose the right to act.

Why permissions are not enough

Most organizations first encounter agent governance as a permission problem: securing system access, gating customer data, limiting tool use, approving external communications, or deciding whether the agent can execute transactions. Those boundaries are necessary. They are not sufficient.

Consider a support agent with permission to issue modest discounts. It stays inside that permission. It does not break into a system, exceed its refund limit, or access data it should not see. But it starts offering discounts in cases where customer history, fraud risk, pricing rules, or policy ambiguity should have triggered review. A better permission model might bind the discount to customer status, fraud checks, policy flags, or approval thresholds. The broader problem remains: the organization needs an operating standard that tests whether the agent’s decisions remain appropriate, not only whether its permissions were respected.

The same pattern appears in public examples. Klarna said in 2024 that its AI assistant was doing the work of 700 customer-service agents. By 2025, the company was again hiring humans for customer support after its CEO acknowledged that an overemphasis on cost had lowered quality. Public reporting does not prove that a different governance standard would have prevented that outcome. It does show the danger of treating efficiency as a sufficient production standard.

Air Canada offers the legal version of the same lesson. Its chatbot gave a passenger incorrect information about bereavement fare refunds. In 2024, a Canadian tribunal held the airline responsible for the chatbot’s statement. The agent did not need exotic permissions to create liability. It needed a policy-alignment standard, a failure definition, and a handoff rule.

This is where agent programs drift. They measure task volume, response speed, cost reduction, or hours saved, but they do not measure whether the workflow actually improved. They celebrate activity because activity is easy to count. Production trust requires a different standard.

For a software-development agent, the right metric is whether it shortens merge-request-to-deployment time, lowers defect and rollback rates, raises deployment frequency, reduces review burden, and improves security quality, not lines of code generated. For a customer-support agent, the right metric is whether it resolves the right issues, escalates the right cases, protects user trust, and stays inside policy, not messages sent. For a regulated workflow, the right metric is whether the agent can operate inside legal, audit, data, and human-review constraints, not automation rate.

Business outcome matters. But it must sit inside a constraint envelope. In healthcare, financial services, insurance, employment, and other regulated contexts, an agent cannot optimize for speed or revenue if that optimization compromises rights, statutory duties, auditability, or required human judgment. The first useful governance move is therefore simple: define what improvement means, and define what the agent must not do to get there.

What makes agent governance different

Traditional software can still fail, but its behavior is usually specified through deterministic logic. Given the same input and state, the system should produce the same result. Governance can often focus on access, change control, testing, and policy compliance.

Agents change the standard because non-determinism combines with tool access, memory, retrieval, delegation, and action across systems. Their behavior can vary. Their mistakes can compound. A single bad interpretation can lead to several downstream actions before a person notices. That is why governance has to start before architecture hardens.

A team needs to define acceptable behavior. It needs thresholds for autonomy, risk boundaries, escalation rules, and observability from day one. It needs an answer to the question “What will we do when the agent behaves plausibly but incorrectly?” Agentic systems raise risks that ordinary software governance does not fully cover. OWASP’s Top 10 for Agentic Applications names risks such as goal hijacking, tool misuse, identity and privilege abuse, memory poisoning, and cascading failures. These risks do not make agents unusable. They make vague approval dangerous.

A high-authority agent in a complex environment needs stronger constraints than a low-authority agent summarizing internal documents. An agent that can send external communications needs different controls than one that drafts but cannot send. An agent that can modify production systems needs least privilege, reversible actions, circuit breakers, and explicit approval thresholds. Governance starts by making those distinctions visible.

The production autonomy standard

Before an agent moves from pilot to production, teams need a shared artifact that translates governance into operating decisions. The point is not to create another document for a review folder but to force the right conversations before the agent has power.

A production autonomy standard is the risk-scaled evidence an agent must meet to earn, keep, expand, or lose the right to act. A low-authority internal research assistant does not need the same depth of review as an agent that can move money, alter customer records, or modify production systems. But every agent needs explicit answers to the same basic questions: what it is for, what it can do, what it must not do, how success will be measured, how failure will be seen, and who can stop it.

The lightweight version might be a one-page standard for an internal research assistant: outcome definition, allowed data sources, uncertainty disclosure, one human escalation path, and a quarterly review. The full version belongs on customer-facing, financial, regulated, security-sensitive, or production-changing agents: documented failure thresholds, traceability, sampled audits, human override, incident response, and named shutdown authority.

Proportionality matters. Without it, governance becomes a launch-blocking ritual. With it, governance gives low-risk agents room to learn and high-risk agents the controls they need. A practical standard should cover eight areas.

Purpose and scope

Define the workflow, user, task, autonomy level, tools, systems, data, memory, and external channels. State explicitly what is out of scope. The minimum viable criterion is that the team can state in plain language what the agent does, where it acts, what it can access, and exactly what it is prohibited from doing.

Business outcome

Define the measurable workflow result the agent must improve. Include the baseline, target KPI, operating cost, quality measure, risk measure, and user or customer outcome. The minimum viable criterion is that the team can name the outcome that proves the agent is worth operating and the result that would make it not worth continuing.

Permission boundaries

Define what systems, tools, data, actions, and transactions the agent can access or execute. Apply least privilege. Separate read, recommend, draft, approve, and execute authority. Bind the agent’s authority to the user, task, session, and risk level wherever possible. The minimum viable criterion is that the team can distinguish what the agent can see, suggest, change, approve, and execute without human approval.
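To make the separation of authority concrete, here is a minimal sketch in Python. It is illustrative only: the resource names, the $50 cap, and the risk labels are assumptions rather than recommendations, and a real system would enforce these checks in the tool or gateway layer, not in the agent's prompt.

```python
from dataclasses import dataclass
from enum import Enum

class Authority(Enum):
    READ = "read"
    RECOMMEND = "recommend"
    DRAFT = "draft"
    APPROVE = "approve"
    EXECUTE = "execute"

@dataclass(frozen=True)
class PermissionGrant:
    """One least-privilege grant: a resource, an authority level, and the
    conditions that bind that authority to transaction size and risk level."""
    resource: str                            # e.g. "billing.refunds" (hypothetical)
    authority: Authority
    max_transaction: float | None = None     # hard cap on execute authority
    allowed_risk_levels: frozenset = frozenset({"low"})

def is_permitted(grant: PermissionGrant, action: Authority,
                 amount: float | None, risk_level: str) -> bool:
    """Deny by default; allow only when every bound condition holds."""
    if action is not grant.authority:
        return False
    if risk_level not in grant.allowed_risk_levels:
        return False
    if amount is not None and grant.max_transaction is not None \
            and amount > grant.max_transaction:
        return False
    return True

# The agent may execute small refunds on low-risk sessions, but may only
# recommend refunds in every other case.
refund_execute = PermissionGrant("billing.refunds", Authority.EXECUTE,
                                 max_transaction=50.0)
refund_recommend = PermissionGrant(
    "billing.refunds", Authority.RECOMMEND,
    allowed_risk_levels=frozenset({"low", "medium", "high"}))
```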

Task performance

Define what good performance looks like. Include accuracy, completeness, usefulness, repeatability, evidence handling, policy adherence, citation quality, and acceptable human correction. The minimum viable criterion is that the team can evaluate outputs against a documented standard, not a general impression that the agent seems useful.

Failure criteria

Define unacceptable failure modes, error thresholds, policy violations, user harms, operational risks, and evidence that the agent is no longer safe or useful enough to continue. The minimum viable criterion is that the team can answer, before launch, what evidence would force restriction, suspension, redesign, or retirement.

Escalation criteria

Define when the agent must hand off to a human. Include confidence, complexity, ambiguity, customer sentiment, exception type, regulated decision points, financial exposure, and risk thresholds. The minimum viable criterion is that the team can identify which situations require human review, who receives the escalation, and how quickly they must respond.

Human control model

Define ownership, override authority, pause and deactivation rights, user appeal paths, exception accountability, and manual review requirements. The minimum viable criterion is that the team can name the human accountable for the agent in production and the humans who can override, restrict, or shut it down.

Monitoring and lifecycle decisions

Define the traces, logs, samples, evals, audit trails, review datasets, governance metrics, and decision cadence required from day one. Specify what evidence justifies launch, expansion, restriction, retraining, redesign, suspension, or retirement. The minimum viable criterion is that the team can show how production evidence will change the agent’s autonomy over time.
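One way to keep these eight areas from living only in a slide deck is to hold the standard as a structured, versioned artifact that reviews and tooling can read. The sketch below is a minimal illustration, not a prescribed schema: the field names are assumptions, and the example instance is the lightweight internal-research-assistant version described earlier.

```python
from dataclasses import dataclass

@dataclass
class ProductionAutonomyStandard:
    """One record per agent, versioned and reviewed on a fixed cadence.
    Field names are illustrative; the eight areas come from the text above."""
    purpose_and_scope: str
    business_outcome: str
    permission_boundaries: list[str]
    task_performance: list[str]
    failure_criteria: list[str]
    escalation_criteria: list[str]
    human_control_model: dict[str, str]
    monitoring_and_lifecycle: list[str]

# The lightweight, one-page version for an internal research assistant.
research_assistant = ProductionAutonomyStandard(
    purpose_and_scope="Summarize approved internal documents; no external channels.",
    business_outcome="Reduce time analysts spend locating prior research.",
    permission_boundaries=["read-only access to the approved document store"],
    task_performance=["cites the source document for every claim",
                      "discloses uncertainty instead of guessing"],
    failure_criteria=["fabricated citations found in sampled outputs"],
    escalation_criteria=["route unanswerable or out-of-scope questions to a named analyst"],
    human_control_model={"owner": "research-ops lead", "shutdown": "research-ops lead"},
    monitoring_and_lifecycle=["sampled output review each quarter",
                              "expand scope only after a clean review"],
)
```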

Why visibility is not optional

A success criterion you cannot observe in time to inform a decision is an aspiration, not a control. That is why success criteria become architectural requirements.

If the team says the agent must cite sources, the system needs to preserve source references. Escalation based on low confidence requires confidence signals, routing logic, and records of escalations. Avoiding sensitive data exposure requires access boundaries, monitoring, and logs that show what data the agent touched. Policy compliance requires samples, evals, and audit trails that show whether the policy held in real use.

Teams need traces for agent actions and model calls. They need conversation-level trace IDs, replayable interactions, real-user query datasets, and automated checks. They need deterministic tests for rules the agent must always follow. They need review of real usage, not only lab prompts. Without visibility, human monitoring becomes a ritual.
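As a sketch of what deterministic tests for always-follow rules can look like, the checks below run against logged traces rather than live prompts. The trace fields, tool names, and thresholds are assumptions for illustration; the point is that the rules are asserted mechanically against sampled production traffic, not judged by impression.

```python
REFUND_THRESHOLD = 50.0  # assumed execute-authority cap from the permission model

def rule_violations(trace: dict) -> list[str]:
    """Return every always-follow rule this logged interaction breaks."""
    violations = []
    for action in trace.get("actions", []):
        if (action["tool"] == "billing.refund"
                and action.get("amount", 0.0) > REFUND_THRESHOLD
                and not action.get("human_approval_id")):
            violations.append("refund above threshold executed without approval")
        if action["tool"].startswith("external.") and not action.get("policy_check_id"):
            violations.append("external action taken without a policy check on record")
    if trace.get("confidence", 1.0) < 0.7 and not trace.get("escalated_to"):
        violations.append("low-confidence answer was not escalated")
    return violations

def check_sampled_traces(sampled_traces: list[dict]) -> None:
    """Run on a schedule against sampled production traces; fail loudly on any violation."""
    failures = {t["trace_id"]: v for t in sampled_traces if (v := rule_violations(t))}
    assert not failures, f"rule violations found: {failures}"
```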

Let me be direct about why this matters. A person asked to approve an opaque agent decision is not exercising control. They are signing off without evidence. In regulated workflows, that “human in the loop” model is especially weak. If the reviewer cannot see why the agent acted, what data it used, what policy thresholds applied, and what authority it exercised, the review is not meaningful enough for consequential decisions.

Decision-level provenance matters most when the agent’s action has consequences. Logs that show timestamps, tool calls, and actors are useful. But they are not enough if they do not show the rule version, evidence evaluated, approval authority, and threshold applied at the time of action. Without that record, teams reconstruct compliance from fragments after the fact. An agent cannot be wrong and invisible at the same time.
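A minimal sketch of what a decision-level provenance record could capture, written at the time of action rather than reconstructed afterward. The schema and the example values are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    """One record per consequential agent action, written at the time of action."""
    trace_id: str
    timestamp: datetime
    actor: str                   # the agent identity that acted
    tool_call: str               # e.g. "billing.refund" (hypothetical)
    rule_id: str                 # which policy rule was applied
    rule_version: str            # the version in force at the time of action
    evidence: tuple[str, ...]    # references to the inputs the agent evaluated
    threshold_applied: str       # e.g. "refund <= $50"
    approval_authority: str      # "autonomous" or the approving human or role
    reversible: bool

record = DecisionRecord(
    trace_id="c-1842",
    timestamp=datetime.now(timezone.utc),
    actor="billing-support-agent",
    tool_call="billing.refund",
    rule_id="refund-policy",
    rule_version="2025-03",
    evidence=("order reference", "customer tier", "fraud check result"),
    threshold_applied="refund <= $50",
    approval_authority="autonomous",
    reversible=True,
)
```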

Human in the loop is not a plan

“Keep a human in the loop” sounds responsible. In practice, it is often too vague to be useful. You need to specify which human, at what point, with what information, under what threshold, with what authority, within what response time, and for which actions.

A support agent might be allowed to answer routine questions on its own, but it must escalate when sentiment worsens, confidence drops, the customer asks for a refund above a threshold, or the case involves legal risk. A financial services agent might require a high confidence threshold before recommending certain actions, with lower-confidence cases routed to a trained reviewer. A general operations agent might use a target escalation range so routine work proceeds autonomously while high-risk exceptions receive review. The exact thresholds vary by use case. The important point is that the thresholds exist.
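As an illustration, the support-agent case above can be expressed as explicit rules, each naming the trigger, the human who receives the handoff, and the expected response time. The thresholds and routing targets here are assumptions, not recommendations.

```python
# Explicit escalation rules for a support agent; values are illustrative.
ESCALATION_RULES = [
    {"name": "negative_sentiment", "route_to": "tier-2 support", "respond_within_min": 15},
    {"name": "low_confidence",     "route_to": "tier-2 support", "respond_within_min": 60},
    {"name": "refund_over_limit",  "route_to": "billing lead",   "respond_within_min": 60},
    {"name": "legal_risk",         "route_to": "legal on-call",  "respond_within_min": 30},
]

def should_escalate(case: dict) -> list[dict]:
    """Return every rule the case trips; an empty list means the agent may proceed."""
    tripped = []
    if case.get("sentiment_score", 0.0) < -0.5:
        tripped.append(ESCALATION_RULES[0])
    if case.get("confidence", 1.0) < 0.7:
        tripped.append(ESCALATION_RULES[1])
    if case.get("requested_refund", 0.0) > 50.0:
        tripped.append(ESCALATION_RULES[2])
    if case.get("legal_flag"):
        tripped.append(ESCALATION_RULES[3])
    return tripped
```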

Human control also needs suspension logic. Escalation sends a case to a person. Suspension pauses the agent’s ability to act until someone clears it. Those are different controls. An agent that gives one uncertain answer might need escalation. An agent that starts sending confidential data into an unsecured workflow needs suspension. An agent that takes several unexpected actions across tools needs a circuit breaker. An agent operating in a regulated domain might need mandatory manual review for entire classes of decisions, regardless of confidence.
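The distinction between escalation and suspension can be made mechanical. The sketch below is a minimal circuit breaker, with limits and signal names assumed for illustration: it removes the agent's ability to act and stays tripped until a named human clears it.

```python
class AgentCircuitBreaker:
    """Pauses the agent's ability to act; only a named human may clear it."""

    def __init__(self, max_unexpected_actions: int = 3):
        self.max_unexpected_actions = max_unexpected_actions
        self.unexpected_actions = 0
        self.suspended = False
        self.suspension_reason: str | None = None

    def record(self, event: dict) -> None:
        """Evaluate one monitored event; trip on a hard violation or repeated anomalies."""
        if event.get("confidential_data_leaving_boundary"):
            self._trip("confidential data sent outside an approved workflow")
        elif event.get("unexpected_action"):
            self.unexpected_actions += 1
            if self.unexpected_actions >= self.max_unexpected_actions:
                self._trip("too many unexpected actions across tools")

    def _trip(self, reason: str) -> None:
        self.suspended = True
        self.suspension_reason = reason
        # In a real system: revoke execute authority and page the named owner here.

    def clear(self, cleared_by: str) -> None:
        """Requires an accountable human; the agent does not clear itself."""
        self.suspended = False
        self.suspension_reason = None
        self.unexpected_actions = 0
```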

The organization also needs an owner. Not a committee in the abstract. A named accountable team or executive who owns the agent’s production behavior, approves scope expansion, accepts residual risk, and has authority to restrict or retire the agent.

This is not about pretending agents must be flawless where humans are fallible. Humans also misread policy, miss context, and make inconsistent judgments. The difference is scale and speed. An autonomous agent can repeat the same bad interpretation across thousands of interactions before the organization understands the pattern. Governance often fails because no one is answerable when the criteria fail. If no one can say who can turn the agent off, no one has governed the agent.

From approval to autonomy decision

The standard works because it turns approval into an autonomy decision. Imagine a SaaS company testing a billing-support agent. The agent can answer billing questions and issue refunds below a defined threshold. During a limited launch, it improves first-contact resolution and keeps refund errors below the sampled audit threshold. On routine cases, it performs well.

But the review shows a pattern. Escalations spike when customer sentiment turns negative. Audit samples show inconsistent policy interpretation on annual-plan proration and grandfathered pricing. The agent is not dangerous enough to shut down, nor ready for expanded refund authority.

A permission-only review might say the agent stayed inside its access rights. A production autonomy standard enables a better decision: keep the agent live for routine billing inquiries, restrict refund autonomy for churn-risk or high-value accounts, require human review above the refund threshold, tighten escalation rules, and review again after a defined operating window. Binary approval is not the goal. The decision is launch, restrict, monitor, expand, redesign, or stop, based on evidence.

What the standard changes about how you govern

The standard changes the order of governance. Instead of asking a committee to approve a vague agent concept, the team must define the operating evidence first. Instead of debating policy in the abstract, security can inspect tool access and privilege, legal can identify the constraint envelope, product can define workflow outcomes, operations can define escalation and incident response, compliance can require audit trails and documentation, and engineering can build traces, evals, circuit breakers, and reversible actions into the system.

This does not make governance lighter. It makes governance usable. It also exposes weak use cases early. If a team cannot name the workflow outcome, the agent is not ready. If the agent needs broad access to several systems but no one can justify the permission set, it is not ready. If humans are expected to supervise but cannot see why the agent acted, it is not ready. If a vendor-managed agent cannot provide the traces, controls, or contractual assurances required for the workflow risk, it is not ready for that level of autonomy.

That sounds restrictive. In practice, it helps teams move faster on the right work. A team that defines success criteria before launch can build the right observability from the start, test the right risks, avoid automating a broken process, give the agent enough autonomy for routine work while protecting high-risk decisions, and scale only after the evidence supports scaling. The alternative is familiar: a promising pilot, a vague approval, a production incident, then a scramble to reconstruct what happened.

Governance as a lifecycle, not a gate

The production autonomy standard should act as an operating license, not another shelf document. Before launch, it tells the team whether the agent is defined well enough to build. During testing, it tells the team what to measure. At production review, it tells approvers what evidence to inspect. After launch, it tells operators when to expand, restrict, pause, or retire the agent.

That lifecycle matters because agents do not stay still. Usage changes. Models change. Tools change. Policies change. Users discover edge cases. Attackers find new openings. Vendors update systems. Workflows drift. A governance model that only approves launch will age quickly. The better standard is continuous: the agent must keep proving it deserves production autonomy.

When you put permission on one axis and production governance on the other, four patterns emerge. Low permission and low governance: a sandbox assistant where neither matters much. Low permission but high governance: an overcontrolled low-risk tool. High permission but low governance: dangerous autonomy where the agent has power without controls. High permission and high governance: earned production autonomy, which is the only sustainable target for agents with real authority.

The riskiest agent is not always the one with the broadest permission. The harder risk to spot is the agent with meaningful permission and weak operating governance, because it can keep producing plausible, consequential errors without triggering an access-control alarm.

Agent governance is the operating standard an agent must meet to remain in production, and it starts with a question every team should answer before launch: who can turn it off, and under what evidence would they do it?