What "production-ready" actually means for an AI agent.

The demo always works. The hard part is the next forty things — eval frameworks, fallback paths, human escalation, observability, cost ceilings. Here's the checklist we run.

An agent demo and an agent in production are two different products that happen to share a codebase.

The demo runs against a curated set of inputs, in front of an audience that's already inclined to be impressed, with a human in the loop who can quietly intervene when things drift. Production runs against whatever a real user sends at 2am on a public holiday, with no curator, against a model that may have been silently updated by the provider last week, with consequences that show up in tickets, refunds, or regulatory filings.

Bridging that gap is the work. Most of the AI programmes we're invited into are stuck somewhere on this bridge — a brilliant pilot that's been "two months from launch" for nine months, because every time the team thinks they're done, a new failure mode shows up that nobody had budgeted for.

Here's the readiness checklist we run with clients. It's not exhaustive — production is never finished — but it covers the categories where we see the most expensive surprises.

1. Evaluation, not vibes

If the only test for whether your agent is working is "the team tried it and it felt good," you don't have a system, you have a hope.

Production-ready agents have evaluation suites that run automatically on every change — to the prompts, the model, the tools, the orchestration logic. The suite has at least three layers:

  • Reference tasks with deterministic graders. Things you can score programmatically: did the agent produce JSON in the right schema? Did it call the correct tool? Did the SQL it wrote actually run?
  • Reference tasks with rubric-based graders. Things you score with a second LLM acting as judge against a written rubric. Less precise, but covers the open-ended outputs deterministic checks miss.
  • Adversarial tests. Inputs designed to make the agent misbehave — jailbreaks, prompt injections, ambiguous instructions, edge cases drawn from past production incidents.

Without this, every change is a guess. With it, you can ship with confidence and detect regressions before customers do.
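To make the first layer concrete, here is a minimal sketch of three deterministic graders matching the questions above. The function names and the choice of SQLite as the target database are our assumptions for illustration, not a prescribed framework.

```python
import json
import sqlite3


def grade_json_schema(output: str, required: dict[str, type]) -> bool:
    """Pass if the output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def grade_tool_call(called: str, expected: str) -> bool:
    """Pass if the agent called the tool the reference task expects."""
    return called == expected


def grade_sql_runs(sql: str, schema_ddl: str) -> bool:
    """Pass if the SQL executes against an empty in-memory database with this schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # build the (hypothetical) schema
        conn.execute(sql)               # run the agent's single statement
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

Each grader returns a plain boolean, so a suite is just a list of reference tasks and a pass rate you can compare between changes.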

2. The control loop, bounded

An agent that can call tools recursively can also loop indefinitely. We've seen agents in pilot environments rack up four-figure inference bills in a single session because a tool returned an error the model interpreted as "try again," and so it did, three thousand times.

"An unbounded agent is just a very expensive way to have a bug."

Production agents have multiple safety nets:

  • Hard turn limits. No agentic loop runs more than N turns without escalating to a human or terminating with a clear failure state.
  • Token budgets per session. A cost ceiling, enforced at the orchestration layer, not by trusting the model.
  • Time budgets. If a session is taking longer than expected, kill it and surface the failure rather than letting it stretch to infinity.
  • Repetition detection. If the agent calls the same tool with similar arguments three times in a row, that's almost always a loop. Detect and break.
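The four limits above can live in one small loop at the orchestration layer. The sketch below is illustrative only: `Budget`, `Step`, and `run_bounded` are hypothetical names, the default limits are placeholders, and repetition is detected by exact match on arguments, which is a simplification of "similar arguments".

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_turns: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    max_repeats: int = 3


@dataclass
class Step:
    tool: str         # tool the agent wants to call; "" means it has finished
    args: str         # serialised arguments
    tokens_used: int  # tokens consumed producing this step


def run_bounded(next_step: Callable[[], Step], budget: Budget) -> str:
    """Drive the agent until it finishes or a limit trips; return the terminal state."""
    started = time.monotonic()
    tokens = 0
    last_call = None
    repeats = 0
    for _ in range(budget.max_turns):
        if time.monotonic() - started > budget.max_seconds:
            return "time_budget_exceeded"
        step = next_step()
        tokens += step.tokens_used
        if tokens > budget.max_tokens:
            return "token_budget_exceeded"
        if not step.tool:
            return "completed"
        call = (step.tool, step.args)
        repeats = repeats + 1 if call == last_call else 1
        last_call = call
        if repeats >= budget.max_repeats:
            return "loop_detected"
    return "turn_limit_reached"
```

Every exit is a named state, so the caller can decide whether to escalate to a human or fail cleanly. None of the limits depend on the model noticing that it is stuck.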

3. Tool access, scoped tightly

The most powerful agents are the ones with the most tools. They're also the ones with the largest blast radius when something goes wrong.

The principle we apply is the same one that's served security engineering for decades: least privilege. The agent should have access to exactly the tools it needs for the task it's been given, and no more. That means:

  • Different agents (or different agent roles) get different tool sets, scoped at runtime.
  • Write actions get separate handling from read actions — usually with a confirmation step or human approval.
  • Tool inputs are validated at the orchestration layer before they're executed, not just trusted because the model produced them.
  • Sensitive operations — payments, deletions, external sends — go through a deterministic policy layer that the agent can't bypass even if it's been jailbroken.
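As a rough sketch of how those four rules combine, the check below runs in ordinary code before any tool executes. The roles, tool names, and refund limit are invented for illustration; a real policy layer would load them from configuration.

```python
from dataclasses import dataclass

# Hypothetical role-to-tool scoping; names are illustrative only.
ROLE_TOOLS = {
    "support_reader": {"lookup_order"},
    "support_agent": {"lookup_order", "issue_refund"},
}
WRITE_TOOLS = {"issue_refund"}
REFUND_LIMIT = 100.00  # assumed policy threshold


@dataclass
class Decision:
    allowed: bool
    reason: str


def authorise(role: str, tool: str, args: dict, human_approved: bool = False) -> Decision:
    """Deterministic gate: scope, input validation, policy limit, then approval."""
    if tool not in ROLE_TOOLS.get(role, set()):
        return Decision(False, "tool not in scope for role")
    if tool == "issue_refund":
        amount = args.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount <= 0:
            return Decision(False, "invalid amount")
        if amount > REFUND_LIMIT:
            return Decision(False, "exceeds policy limit")
    if tool in WRITE_TOOLS and not human_approved:
        return Decision(False, "write action needs approval")
    return Decision(True, "ok")
```

Because the gate never consults the model, a jailbroken agent can ask for an out-of-policy refund but cannot make one happen.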

4. Observability that tells you why, not just what

When an agent does something unexpected in production, the question is never "what did it do?" — you can see that. The question is always "why did it do that?" And answering that requires observability that traditional APM tools weren't built for.

What you need to capture, end to end:

  • The full prompt sent to the model, including system prompt, tools, and message history.
  • The full response, including any reasoning traces.
  • Every tool call, with arguments and results.
  • Latency and token usage at each step.
  • The version of every component — model, prompt template, tool spec — that participated.

Then a way to slice all of that by user, by session, by failure type, by date. The first time you have to debug a weird production behaviour without this, you'll wish you had it. The second time, you'll build it.

A pattern that works: treat every agent session as a structured trace, not a log stream. Tools like LangSmith, Langfuse, Arize Phoenix, or a well-instrumented OpenTelemetry setup all work; what matters is that you can reconstruct a session in 30 seconds, not 30 minutes.
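If you are rolling your own before adopting one of those tools, the shape of a session trace can be as simple as the sketch below. `Span` and `SessionTrace` are hypothetical names and the fields are our assumption of a reasonable minimum, not the schema of any of the products mentioned.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    kind: str          # "model_call" or "tool_call"
    name: str
    input: dict        # full prompt payload, or tool arguments
    output: dict       # full response, or tool result
    latency_ms: float
    tokens: int = 0


@dataclass
class SessionTrace:
    user_id: str
    versions: dict     # model, prompt template, tool spec versions for this session
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, span: Span) -> None:
        """Append one step of the session in the order it happened."""
        self.spans.append(span)

    def total_tokens(self) -> int:
        return sum(span.tokens for span in self.spans)

    def to_json(self) -> str:
        """Serialise the whole session as one document for storage and querying."""
        return json.dumps(asdict(self))
```

One JSON document per session is what makes slicing by user, session, or failure type a query rather than a log search.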

5. Human escalation as a first-class path

The most important question to answer about your agent isn't "can it handle the task?" It's "what happens when it can't?"

The pilot version usually doesn't answer this: it either tries until it succeeds or fails silently. Production agents need an explicit escalation path, a clear point at which the agent stops trying, hands off to a human with full context, and tells the user what's happening.

Designing this well is more product work than ML work. What's the threshold for handoff — confidence score, turn count, specific error types? What does the human see when they pick up the conversation? What's the SLA for response? How does the human's resolution feed back into your eval set so the agent gets better?

Get this right and your agent appears reliable even when individual sessions fail, because failure has a graceful path. Get it wrong and every failure looks like the system being broken.
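The handoff threshold itself is usually a few lines of deterministic code. The sketch below assumes three triggers (specific error types, turn count, a confidence score between 0 and 1); the names, thresholds, and user-facing message are placeholders for whatever your product decisions turn out to be.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed error types that should always go straight to a human.
ESCALATE_ERRORS = {"auth_failed", "policy_blocked"}


@dataclass
class SessionState:
    turns: int
    confidence: float              # assumed score in [0, 1]
    last_error: Optional[str] = None


@dataclass
class Handoff:
    reason: str        # recorded in the trace and shown to the human agent
    user_message: str  # what the end user is told


def check_escalation(
    state: SessionState, max_turns: int = 8, min_confidence: float = 0.6
) -> Optional[Handoff]:
    """Return a Handoff when the agent should stop trying, otherwise None."""
    if state.last_error in ESCALATE_ERRORS:
        reason = f"error:{state.last_error}"
    elif state.turns >= max_turns:
        reason = "turn_limit"
    elif state.confidence < min_confidence:
        reason = "low_confidence"
    else:
        return None
    return Handoff(reason, "I'm passing this to a colleague, who will see our full conversation.")
```

Recording the reason alongside the session gives you a ready-made label when the human's resolution is added back into the eval set.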

6. Versioning everything

An agent in production is a composition of components, any of which can change: the model itself (provider-side updates), prompts, tool definitions, retrieval indexes, the orchestration code. A bug introduced in any one of these can manifest as misbehaviour in any other.

The discipline is straightforward but rarely applied early enough: everything that affects the agent's behaviour gets versioned, and the version that ran each session is captured in the trace. When something breaks in production, you need to be able to answer "what changed?" in minutes, not days.
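One lightweight way to do this is a version manifest built at session start and stored with the trace. In the sketch below, prompts and tool specs are fingerprinted by content hash so an unannounced edit still shows up; the function names and manifest keys are our own illustration.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """Short content hash, so any edit to the text changes the version."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def build_manifest(
    model_id: str,
    prompt_template: str,
    tool_spec: dict,
    index_version: str,
    code_version: str,
) -> dict:
    """Collect the version of every component that can change agent behaviour."""
    return {
        "model": model_id,
        "prompt": fingerprint(prompt_template),
        "tools": fingerprint(json.dumps(tool_spec, sort_keys=True)),
        "index": index_version,
        "code": code_version,
    }


def diff_manifests(old: dict, new: dict) -> dict:
    """Answer 'what changed?' between two sessions as {component: (old, new)}."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Comparing the manifest of the last good session with the first bad one narrows "what changed?" to a named component before anyone opens a debugger.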

7. Cost as a product feature

Cost in an agentic system isn't just a finance concern; it's a product constraint. A workflow that costs $0.20 per execution, run 10,000 times a day, comes to $2,000 a day, or $730,000 a year. Tripling the number of tool calls to make the agent slightly more thorough might double the bill without doubling the value.
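The arithmetic is worth keeping as a function you can rerun whenever the design changes. This is a back-of-envelope sketch using the figures above; the doubled per-run cost in the usage lines is the hypothetical from the text, not a measurement.

```python
def annual_cost(cost_per_run: float, runs_per_day: int, days: int = 365) -> float:
    """Yearly spend for a workflow at a steady daily volume."""
    return cost_per_run * runs_per_day * days


baseline = annual_cost(0.20, 10_000)   # the workflow as described
thorough = annual_cost(0.40, 10_000)   # same volume, per-run cost doubled
extra_spend = thorough - baseline      # what the extra thoroughness has to be worth
```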

Production-ready agents have:

  • Cost per session as a tracked metric, alongside latency and accuracy.
  • A tiered model strategy — using cheaper, faster models for routine sub-tasks and reserving the expensive ones for the steps that need them.
  • Caching where possible, especially for tool outputs that don't change often.
  • A monthly review of cost-per-outcome, with engineering owning the optimisation work.
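A tiered strategy and a cost-per-session metric can start very small. In the sketch below, the tier names, task types, and per-token prices are all made-up placeholders; substitute your provider's actual price list.

```python
# Assumed prices in dollars per 1,000 tokens; placeholders, not real rates.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "large": 0.01}

# Sub-tasks we assume a cheaper model handles well enough.
ROUTINE_TASKS = {"classify", "extract", "summarise"}


def pick_tier(task_type: str) -> str:
    """Route routine sub-tasks to the cheap tier, everything else to the expensive one."""
    return "small" if task_type in ROUTINE_TASKS else "large"


def step_cost(tier: str, tokens: int) -> float:
    return PRICE_PER_1K_TOKENS[tier] * tokens / 1000


def session_cost(steps: list[tuple[str, int]]) -> float:
    """Total cost of a session given (task_type, tokens) for each step."""
    return sum(step_cost(pick_tier(task), tokens) for task, tokens in steps)
```

Emitting `session_cost` next to latency and accuracy for every session is what turns the monthly cost-per-outcome review into a report rather than an investigation.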

The honest version

Building an agent that demos well is, increasingly, not very hard. The frameworks are good. The models are capable. A motivated engineer can ship a working prototype in a week.

Building an agent that runs in production for two years without becoming a liability is a substantially harder problem, and it's not the one most teams are budgeting for. The cost is in the operational scaffolding — eval suites, observability, escalation paths, governance — and that work isn't visible to anyone except the team doing it.

If you're scoping an agentic AI programme right now and the plan jumps straight from "pilot works" to "full rollout," there's a missing chapter in there. We'd be happy to help you write it.