The easier it gets to add AI, the more valuable it becomes to know where AI does not belong.

Most teams no longer struggle to access models that can summarize, draft, classify, route, recommend, answer, and act well enough to produce impressive demos. They struggle to turn that capability into something users trust, repeat, and pay for. That abundance creates a new failure mode. Teams start asking “where can we add AI?” when the better question is “where does AI improve the work?”

The last mile is everything between model capability and realized user value: the workflow where the model appears, the context it receives, the control the user keeps, the failures the product catches, and the outcome the business measures. As model capability becomes broadly available, durable advantage moves to the product layer: the harness, workflow, data context, trust boundary, and operating loop that turn model output into user value. AI strategy is won not by having AI, but by knowing where AI should create value, where it should stay out of the way, and how to prove that it works better than the baseline. A wrapper shows that a model can do something. A product proves that the thing matters inside a real workflow.

The P&L gap

One of the most telling AI statistics is not a model benchmark. MIT Project NANDA reported that 95% of enterprise generative AI deployments produced zero measurable P&L impact. Treat the figure as directional, not as a universal law. The sample, definitions, and maturity of the deployments matter. Many AI projects are early experiments. Some were never designed to move a financial metric. But the pattern names a problem many leaders recognize: weak data, poor workflow fit, no clear business outcome, and no production measurement plan.

A CIO quoted in the research put it plainly: “We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.” The point is not that enterprise AI does not work. The point is that demos do not compound into value unless the product changes a real workflow and measures a real outcome.

Model access is becoming a weaker moat

For many horizontal use cases, model access is no longer the scarce asset it was two years ago. Inference costs have fallen. Frontier models have converged enough that the same base capability can produce very different user experiences depending on the product around it. One product makes the model feel like a natural extension of the work. Another makes the same class of model feel slow, vague, and risky.

Model choice still matters in regulated domains where an accuracy gap can decide whether a product is safe to use. Domain-specific models, fine-tuning, routing, and cost-performance judgment still matter when the cost of a wrong answer is high. But for many teams building AI into horizontal products, the harder question has moved up the stack: what job should the model do, in what workflow, at what quality threshold, with what human control?

Workflow fit beats feature presence

Imagine a product review meeting where every proposed roadmap item has “AI” in the name. One idea is technically impressive. It uses an agent to scan customer accounts, infer next steps, and generate a full action plan. In the demo, it looks powerful. In the actual workflow, it requires reps to leave their CRM, open a separate dashboard, review uncertain recommendations, and copy the useful parts back into the system where their manager measures activity.

Another idea looks boring. It cleans messy account notes inside the CRM, drafts a next-best follow-up, flags uncertainty, and asks the rep to approve before anything reaches the customer. The second idea wins, not because it uses a better model or has a more ambitious vision. It wins because it changes the user’s day without forcing the user into a new one.

This is where many AI products break. They assume users want intelligence as a destination. Most users want less friction in work they already have to do. GitHub Copilot’s success appears to come in part from tight workflow embedding: it lives inside the IDE, helps at the point of composition, and does not ask the developer to move the work somewhere else.

AI creates value when it meets the user where the work already happens. That does not mean every AI product must live inside an existing surface. Some new AI-native workflows will replace old ones. But even then, the product has to earn the behavior change. A standalone AI workspace has to be better enough to justify switching costs, training, governance, and new habits. Workflow fit is strategy, not polish.

The automation boundary is a product decision

Klarna’s customer service AI is a useful public case because it shows both the promise and the risk. Klarna said its AI assistant handled millions of customer conversations and automated roughly two-thirds of customer service chats. For simple questions, such as order status and payment schedules, that kind of automation can work well.

Then the story became more complicated. Klarna faced public scrutiny over the limits of automation and later described a need to rebalance human support in areas where customer needs are more complex. The lesson is that automation boundaries matter, not that customer service AI failed. In disputes, fraud claims, hardship cases, and sensitive financial conversations, confident but wrong answers about fees, policy, or payment terms are not merely annoying. They create compliance and trust problems.

AI can answer simple transactional questions. AI can draft responses. AI can route cases. AI can summarize context for a human agent. But not every customer problem should move through the same automation path. Good AI strategy requires deciding whether AI should suggest, draft, decide, act, escalate, or stay silent. That decision depends on user trust, risk, reversibility, cost of error, and the emotional weight of the moment. A model that works for password resets may be the wrong product choice for a fraud dispute. A system that can draft a clinical note may not be allowed to make a clinical judgment. An agent that can update a record may need approval before it changes anything that affects money, rights, or safety.

The boundary is also a learning decision. When AI acts without review, the product may lose the correction signal that teaches the system what good looks like. In some cases, the most defensible product will automate less at first, because human review creates the feedback loop the product needs to improve. These are product decisions, not just policy decisions. In AI products, product sense includes confidence, security, cost, auditability, training, and production readiness.
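
One way to keep that boundary explicit is to write it down as a policy rather than leave it implicit in prompts and habits. The sketch below is a minimal illustration, not any particular product's implementation: the action levels, thresholds, and inputs (confidence, reversibility, cost of error, sensitivity of the moment) are assumed names chosen to mirror the factors above.

```python
from enum import Enum

class Action(Enum):
    STAY_SILENT = "stay_silent"
    SUGGEST = "suggest"                        # offer an option, user does the work
    DRAFT = "draft"                            # produce output, user edits and sends
    ACT_WITH_APPROVAL = "act_with_approval"    # prepare the change, human approves it
    ACT = "act"                                # act directly, logged and reversible
    ESCALATE = "escalate"                      # route straight to a human

def automation_level(confidence: float, reversible: bool,
                     high_cost_of_error: bool, sensitive_moment: bool) -> Action:
    """Illustrative policy: pick the automation level from risk, not from capability."""
    if sensitive_moment or high_cost_of_error:
        return Action.ESCALATE                 # fraud disputes, hardship cases, clinical judgment
    if confidence < 0.5:
        return Action.STAY_SILENT              # silence beats a confidently wrong answer
    if not reversible:
        return Action.ACT_WITH_APPROVAL        # money, rights, safety: approval before the change
    if confidence < 0.65:
        return Action.SUGGEST                  # show an option, do not prefill the work
    if confidence < 0.85:
        return Action.DRAFT                    # human review doubles as the correction signal
    return Action.ACT

# The same model, two very different product decisions:
print(automation_level(0.92, reversible=True, high_cost_of_error=False, sensitive_moment=False))
print(automation_level(0.92, reversible=False, high_cost_of_error=True, sensitive_moment=True))
```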

Almost right is still expensive

A model can win a benchmark and still lose the user. Benchmarks test capability under controlled conditions. Products succeed or fail under messy conditions: bad inputs, unclear user intent, slow response times, edge cases, partial trust, broken data, and users who are trying to finish work while being measured by someone else.

A traditional feature usually behaves the same way each time. AI features vary by prompt, context, user, retrieved data, model update, and confidence threshold. They can fail silently. They can be almost right. They can create support burdens that do not appear in a demo. The Stack Overflow developer survey data points to this trust problem: AI tool adoption has risen, while many developers remain skeptical of output accuracy and report frustration with code that is close enough to look useful but wrong enough to require careful review.

That is the danger zone for AI products. Obviously wrong output is easier to reject. Almost-right output creates hidden review costs. Those costs compound inside organizations. Every uncertain output shifts ambiguity onto the next human in the chain. The user has to inspect it, correct it, explain it, defend it, or absorb the risk of acting on it. If the product saves five minutes of drafting but adds ten minutes of verification, the metric that moved was activity, not value. The best AI teams treat launch as the start of measurement, not the end of implementation.

Product judgment shapes the moat

Proprietary data, feedback loops, distribution, bundling, business model, and regulatory positioning often look like separate advantages. In many cases, they are. Distribution in particular can function independently of product quality. Salesforce, Microsoft, ServiceNow, and other incumbents can push AI features into products customers already use. That is a real advantage.

But distribution answers a different question. Distribution determines whether AI reaches users. Product judgment determines whether it earns repeated use, useful data, and trust. Data moats are also real, but simply having data does not guarantee a useful AI product. Teams still have to decide what data matters, how to collect it, how to refresh it, where to surface it, and how to turn it into a better decision for the user. When implementation gets easier, weak product judgment becomes easier to see.

The last-mile AI strategy test

Before building an AI feature, teams need a gate that is harder to pass than “the model can do it.” Use this test before committing roadmap, budget, or organizational attention.

1. User problem

What specific user pain, decision, or workflow does this improve? If the answer is “users want AI,” the idea is not ready. The problem should exist without the AI feature. The user should already spend time, money, attention, or risk on it.

2. Baseline

What does the user do today, how will you measure AI against that baseline in production, and who owns that measurement after launch? Do not compare the AI to a demo. Compare it to the actual workaround, spreadsheet, colleague, search process, support queue, or manual review path the user relies on now.

For a customer support drafting feature, that could mean measuring current handle time, escalation rate, correction rate, customer satisfaction, and policy errors before AI enters the workflow. The baseline does not have to be perfect. It has to be honest enough to prevent the demo from becoming the control group.
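
To make that concrete, here is a minimal sketch of what an honest baseline comparison for the support drafting example might look like. The metric names and numbers are illustrative assumptions, not data from the article; the point is that the comparison runs against production measurements, not against the demo.

```python
from dataclasses import dataclass

@dataclass
class SupportMetrics:
    """Measurements of the same workflow, before and after AI enters it."""
    handle_time_min: float       # average minutes per ticket
    escalation_rate: float       # share of tickets escalated
    correction_rate: float       # share of agent replies later corrected
    csat: float                  # customer satisfaction, 0 to 1
    policy_errors_per_1k: float  # compliance slips per 1,000 tickets

def beats_baseline(baseline: SupportMetrics, assisted: SupportMetrics) -> bool:
    """The feature only 'works' if time falls without quality or compliance slipping."""
    return (
        assisted.handle_time_min < baseline.handle_time_min
        and assisted.escalation_rate <= baseline.escalation_rate
        and assisted.correction_rate <= baseline.correction_rate
        and assisted.csat >= baseline.csat
        and assisted.policy_errors_per_1k <= baseline.policy_errors_per_1k
    )

before = SupportMetrics(11.5, 0.18, 0.07, 0.82, 3.1)
after = SupportMetrics(9.0, 0.17, 0.09, 0.82, 3.0)   # faster drafting, more corrections
print(beats_baseline(before, after))  # False: verification cost ate the drafting win
```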

3. Workflow fit

Where does this enter the user’s existing work without creating new friction? The best AI products often feel less like a new destination and more like a better moment inside the current workflow. If the feature requires users to leave the system where work is measured, expect adoption to weaken. If the product is creating a new AI-native workflow, raise the bar. The new behavior has to beat the old one by enough to justify migration, training, and governance.

4. Automation boundary

Should AI suggest, draft, decide, act, escalate, or stay silent? For agentic AI, add a harder question: what does the human review before the agent acts, what can the user override, and what audit trail exists after the action?
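
As a rough illustration of those three questions, here is a sketch of an approval gate that keeps a human checkpoint and an audit record in front of the agent's action. The function and store names (propose_action, AUDIT_LOG) are hypothetical, not from any specific agent framework.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []   # stand-in for an append-only audit store

def propose_action(agent_action: dict, reviewer_decides) -> bool:
    """The agent proposes, a human decides, and every decision leaves a trail."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "proposed": agent_action,              # the exact change the agent wants to make
        "approved": None,
        "reviewer": None,
    }
    decision = reviewer_decides(agent_action)  # human sees the change before anything happens
    record["approved"] = decision["approved"]
    record["reviewer"] = decision["reviewer"]
    AUDIT_LOG.append(record)
    return decision["approved"]

# Example: the agent wants to change a refund amount, which affects money, so it waits.
approved = propose_action(
    {"entity": "order/1234", "field": "refund_amount", "from": 0.00, "to": 49.00},
    reviewer_decides=lambda action: {"approved": True, "reviewer": "support_lead"},
)
print(approved, json.dumps(AUDIT_LOG[-1], indent=2))
```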

5. Quality threshold

What reliability level is required before users trust it? Name the dangerous failure mode. In some contexts, a visible error is tolerable. In others, a silent or confidently wrong answer is unacceptable. The threshold should match the cost of being wrong.

A low-risk summarization tool can tolerate a different error profile than a fraud, clinical, legal, or payment workflow. Start with the cost of the error, compare against the current human or process baseline, then decide what level of review the system needs before the output matters.
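
The idea that the threshold should match the cost of being wrong reduces to simple expected-cost arithmetic. The numbers below are invented for illustration; only the structure of the comparison is the point, and the sketch assumes human review reliably catches errors.

```python
def expected_cost_per_item(accuracy: float, cost_of_error: float,
                           review_cost: float, reviewed: bool) -> float:
    """Expected cost of one output, assuming review catches errors when it happens."""
    if reviewed:
        return review_cost                      # humans pay the review cost every time
    return (1 - accuracy) * cost_of_error       # unreviewed: pay only when the model is wrong

# Illustrative numbers: a summary slip costs ~$2 to fix; a wrong payment-terms answer ~$500.
for label, err_cost in [("summarization", 2.0), ("payment dispute", 500.0)]:
    unreviewed = expected_cost_per_item(0.95, err_cost, review_cost=4.0, reviewed=False)
    reviewed = expected_cost_per_item(0.95, err_cost, review_cost=4.0, reviewed=True)
    print(f"{label}: unreviewed ${unreviewed:.2f}/item vs reviewed ${reviewed:.2f}/item")
# At 95% accuracy, skipping review is cheap for summaries and reckless for disputes.
```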

6. Trust mechanism

How will the product show uncertainty, allow correction, preserve control, and recover from failure? Trust is a product behavior, not a message in onboarding. Users trust systems that make limits visible and give them control when the system is unsure.
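
A minimal sketch of what that can look like in the product's data shape rather than in onboarding copy: uncertainty is surfaced, low-confidence output asks for review instead of asserting, and the user's correction is captured for later improvement. The field names and the threshold are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AssistantOutput:
    """Output shape that makes limits visible instead of hiding them."""
    text: str
    confidence: float                            # surfaced to the UI, not just logged
    sources: list = field(default_factory=list)  # what the answer was grounded on
    needs_review: bool = False                   # below threshold: ask, do not assert
    user_correction: Optional[str] = None        # captured edit becomes a learning signal

def present(output: AssistantOutput, threshold: float = 0.75) -> str:
    """Admit uncertainty and hand control back to the user when the system is unsure."""
    if output.confidence < threshold:
        output.needs_review = True
        return "Low confidence, please review before sending: " + output.text
    return output.text

draft = AssistantOutput("Your refund was issued on March 3.", confidence=0.62)
print(present(draft))
```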

7. Defensibility

What gets stronger over time? The answer might be proprietary data, workflow context, feedback loops, distribution, habit, or ownership of a measurable business outcome. If nothing compounds, the feature is likely a thin layer over model access.

8. Measurement commitment

What P&L or user outcome metric will this be measured against, who owns it, and what happens if the metric does not move within a defined window? This is where many AI projects reveal their weakness. They measure prompts sent, seats activated, or demos completed. Useful AI should move a real outcome: lower handling time with stable satisfaction, faster cycle time with fewer errors, higher conversion with lower manual effort, better compliance with less review burden. If the metric does not move, the team needs permission to change the workflow, narrow the use case, lower the automation level, or kill the feature.
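
One way to make the defined window enforceable is to write the gate down before launch as something the team can run when the window closes. The function, metric, and numbers below are hypothetical, and the sketch assumes a metric where lower is better.

```python
from datetime import date

def measurement_gate(metric_name: str, baseline: float, observed: float,
                     required_change: float, launched: date, window_days: int,
                     today: date) -> str:
    """Decide what happens when the agreed measurement window closes."""
    if (today - launched).days < window_days:
        return "keep measuring"                      # window still open
    improvement = (baseline - observed) / baseline   # lower-is-better metric
    if improvement >= required_change:
        return f"keep: {metric_name} improved {improvement:.0%}"
    # the team agreed up front what happens when the metric does not move
    return (f"change the workflow, narrow the use case, lower automation, or kill: "
            f"{metric_name} moved {improvement:.0%}")

print(measurement_gate("avg handle time (min)", baseline=11.5, observed=11.1,
                       required_change=0.10, launched=date(2025, 1, 15),
                       window_days=90, today=date(2025, 5, 1)))
```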

How to use the test

The test works best before building, not after launch. Take a proposed AI feature and force it through the questions in sequence. Do not let the team skip from “user problem” to “defensibility.” Most weak ideas try to do that: they start with a vague pain, assume AI will solve it, then claim a data flywheel will appear later.

The stronger sequence is narrower. Start with one painful workflow. Define the baseline. Decide where AI enters. Set the automation boundary. Name the required quality threshold. Design the trust mechanism. Then ask what compounds if the product works.

This sequence changes the roadmap conversation. A team ranking AI use cases by executive excitement will choose visible, demo-friendly ideas. A team ranking by customer impact, technical feasibility, risk posture, business value, and measurement clarity will choose smaller use cases that can actually reach production. That difference matters. Many useful AI products begin with unglamorous work: cleaning inputs, drafting recommendations, routing cases, summarizing context, flagging uncertainty, or reducing review time. The value comes from making the work better, not making the demo louder.

Before adding AI to a surface, ask whether the user problem would matter without AI. Before choosing a model, ask what baseline the product must beat. Before building a standalone experience, ask whether the AI belongs inside a workflow users already trust. Before automating a step, ask what the human must review, control, or override. Before launching, ask what failure mode would damage trust fastest.

If the model gets ten times better, does this product become obsolete, or does the thing you own become more valuable?