A mid-market ecommerce retailer in 2026 has a customer-service problem its founders did not have ten years ago. Contact volumes climb with every channel added — chat, WhatsApp, email, Instagram DM, the help-centre form, the order tracking page — and the cost per contact has not dropped meaningfully. Queues get long, response times slip, CSAT softens, and the team ends up either staffing up against contracting margins or quietly degrading service.
AI agents are the obvious lever, and after three years of unfocused hype the technology has actually matured. The question is no longer “does this work?” — it works, sometimes embarrassingly well — but “how do we build it without shipping something that hallucinates refunds and quietly burns through customer trust?”
This guide is the version of that build I would write for a CTO or head of product at a UK mid-market retailer who is evaluating their options. Operator detail. No vendor pitch. Build-it-yourself bias where it earns its keep.
Why Now: The 2026 State of Agents
Three things changed between the bot era of 2023 and where we sit today.
First, the model floor moved up. A Haiku-class or GPT-4.1-mini-class model in 2026 is roughly where a frontier model sat in early 2024 — good enough for the vast majority of customer-service reasoning at a fraction of the cost. You no longer pay frontier prices to get production-grade behaviour on bounded tasks.
Second, the agent stacks matured. Three distinct architectures now ship in production:
- Scripted-LLM hybrid: a deterministic state machine where each state is an LLM call. Predictable, easy to debug, and the right starting point for a lot of merchants.
- Tool-using agents: an LLM with access to a set of functions (order lookup, product search, refund initiation), choosing which to call based on conversation context. This is the dominant shape in 2026.
- Autonomous task agents: multi-step planners that can spawn sub-tasks, recover from errors, and reason about goals across a session. Powerful, harder to control, mostly overkill for customer service today — but emerging in returns workflows and multi-issue conversations.
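Of the three, the scripted-LLM hybrid is the easiest to picture in code. A minimal sketch, with a stubbed `call_llm` standing in for a real model API; the state names and prompts are illustrative, not from any particular framework:

```python
# Scripted-LLM hybrid: a deterministic state machine where each state
# is one LLM call. call_llm is a stub in place of a real model API.

def call_llm(prompt: str, user_message: str) -> str:
    # Stub: keyword routing stands in for a real model call.
    if "Classify" in prompt:
        if "order" in user_message:
            return "order_status"
        if "return" in user_message:
            return "returns"
        return "other"
    return f"[reply to: {user_message}]"

# Each state: (system prompt, function mapping the reply to the next state)
STATES = {
    "classify": (
        "Classify this contact as one of: order_status, returns, other.",
        lambda reply: reply if reply in ("order_status", "returns") else "fallback",
    ),
    "order_status": ("Answer the order-status question.", lambda reply: "done"),
    "returns": ("Walk the customer through the returns policy.", lambda reply: "done"),
    "fallback": ("Apologise and offer a human handoff.", lambda reply: "done"),
}

def run_turn(user_message: str) -> list[str]:
    """Run one contact through the machine; returns the states visited."""
    state, visited = "classify", []
    while state != "done":
        visited.append(state)
        prompt, next_state = STATES[state]
        reply = call_llm(prompt, user_message)
        state = next_state(reply)
    return visited
```

The deterministic transition table is the point: every path through the conversation is enumerable, which is what makes this shape easy to debug and the right starting point for many merchants.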
Third, the evaluation tooling caught up. Frameworks like Promptfoo, Langfuse, Braintrust, and Inspect made it possible for product teams without ML backgrounds to run serious evaluations — a topic I cover in detail in LLM evaluation for non-ML teams.
These three shifts mean that a competent product engineering team can ship a working agent in a quarter without specialist ML hires.
When an AI Agent Earns Its Keep
Not every customer-service load justifies an agent. The honest decision framework comes down to four variables:
- Volume. Below a few thousand contacts per month, the engineering and ops cost dominates the labour saved. Above ten thousand, the unit economics get attractive fast.
- Predictability. If 60-80% of contacts cluster into a handful of intents — order status, returns, product questions, account issues — an agent earns its keep. If contacts are dominated by one-off, idiosyncratic queries, the gain is smaller.
- Escalation latency tolerance. If your customers will wait 30 seconds for an agent to fail and hand to a human, fine. If they expect sub-five-second response on every contact, the architecture has to be tighter and the escalation logic has to be near-instant.
- Brand-risk profile. If you sell luxury, medical, financial, or legal products — or if your brand voice is part of the product — the cost of an agent saying something slightly wrong is high. Some retailers are simply not the right fit, and saying so is part of the consulting job.
If three of those four point in the right direction, an agent is worth the build. If two or fewer do, save the budget.
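For teams that want the framework as a literal checklist, the four variables reduce to a go/no-go count. A sketch using the thresholds above; the exact cut-offs are judgement calls, not hard rules:

```python
def agent_worth_building(monthly_contacts: int,
                         top_intents_share: float,
                         tolerates_handoff_latency: bool,
                         brand_risk_low: bool) -> bool:
    """True when at least three of the four variables point the right way."""
    signals = [
        monthly_contacts >= 10_000,    # volume: unit economics get attractive
        top_intents_share >= 0.6,      # predictability: 60-80% in a few intents
        tolerates_handoff_latency,     # customers accept a ~30s fail-and-handoff
        brand_risk_low,                # a slightly-wrong answer is survivable
    ]
    return sum(signals) >= 3
```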
Architecture Choices: Managed, Framework, or Build-From-Primitives
Three roads lead to a working agent. Each makes a different trade-off between speed-to-market, control, and unit economics.
Managed platforms
Intercom Fin, Zendesk AI, Salesforce Einstein, Gorgias AI, Ada — all ship an LLM-powered agent that plugs into the helpdesk you already have. Setup takes days to weeks. Pricing is per-resolution or per-seat. Quality is good on common intents and varies on edge cases.
The right choice when: your support stack is already on the platform, contact volumes are moderate, your brand-risk profile tolerates the platform’s default behaviour, and your knowledge base lives mostly inside the helpdesk.
The wrong choice when: contact volumes are high enough that per-resolution pricing breaks the unit economics, you need bespoke tool calls into your commerce platform, or your brand voice has to be tuned tightly enough that the platform’s defaults feel off.
Frameworks
LangGraph, the OpenAI Agents SDK, CrewAI, AutoGen, the Anthropic SDK with tool use — these are the toolkits a product engineering team uses to build the agent on their own infrastructure.
The right choice when: you want control over the prompt, the retrieval, the tool calls, and the unit economics; your knowledge base lives outside the helpdesk; or your contact mix has bespoke patterns the managed platforms handle poorly. This is increasingly the default for mid-market retailers with engineering capacity.
The trade-off: you own the engineering. Maintenance, evaluation, observability, model upgrades — all of it lives on your team. Budget for this honestly. A working agent on a framework typically takes one senior engineer half-time for the build phase and a quarter-time after launch.
Build-from-primitives
Direct API calls to the model, your own retrieval layer, your own state management, your own tool-call routing. This makes sense at the high end of contact volume — eight figures of contacts a year — where every layer of abstraction costs measurable money, and where you have an engineering team that can carry the maintenance burden.
For most mid-market retailers, this is over-engineered. A framework gives you 90% of the control with a fraction of the build cost.
I have no horse in this race — Rogue ships AI and data services across all three patterns depending on what fits the merchant. The honest framing is that “managed vs framework vs primitives” is a real trade-off, not a vendor war, and the answer is usually somewhere in the middle.
The Knowledge Base: RAG-Powered or Fine-Tuned?
Almost every production customer-service agent in 2026 is RAG-powered, not fine-tuned. The reason is operational: knowledge bases change. Policies update, products launch, shipping lanes shift. A fine-tuned model is a snapshot; retrieval is a live wire.
The longer version of this decision lives in RAG vs fine-tuning. The short version: use RAG unless you have a tone or behaviour problem that retrieval cannot fix.
The interesting question is not RAG-or-fine-tuning. It is retrieval design.
Index per intent, not per document
The mistake teams make is dumping every help-centre article into one giant vector index and praying that semantic search figures it out. It does not. Past a few hundred documents, retrieval quality collapses without scoping.
The pattern that works: index per intent. Order issues, returns and refunds, product information, account and login, shipping and delivery, payments. Each intent gets its own namespace. The agent classifies the contact intent first (a cheap classifier or the same LLM with a small prompt), then queries the right namespace.
Cross-cutting indexes for the full product catalogue and the full policy library live alongside, but they are queried selectively rather than as the default surface area.
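The classify-then-scope pattern is small in code. A sketch in which both the classifier and the vector-index namespaces are stubbed with keyword matching; the namespace names mirror the intents above and are illustrative:

```python
# Intent-scoped retrieval: classify the contact first, then search only
# that intent's namespace instead of one giant index. The stores below are
# keyword-matched dicts standing in for real vector-index namespaces.

NAMESPACES = {
    "returns_refunds": {"returns policy": "Items can be returned within 30 days."},
    "shipping_delivery": {"delivery times": "UK standard delivery takes 2-4 days."},
}

def classify_intent(message: str) -> str:
    # Stub: in production, a cheap classifier or the same LLM with a small prompt.
    text = message.lower()
    if "return" in text:
        return "returns_refunds"
    if "delivery" in text or "shipping" in text:
        return "shipping_delivery"
    return "other"

def retrieve(message: str) -> list[str]:
    """Query only the namespace for the classified intent."""
    store = NAMESPACES.get(classify_intent(message), {})
    return [text for key, text in store.items()
            if any(word in key for word in message.lower().split())]
```

The key design property: a query about returns can never be answered from a shipping document, because the shipping namespace was never searched.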
Document hygiene matters more than the embedding model
The team that obsesses over choosing OpenAI embeddings vs Cohere vs Voyage vs the latest open-source model is solving the wrong problem. Document quality dominates retrieval quality.
Concretely: every document in the knowledge base needs an owner, a last-reviewed date, a clear scope statement, and chunking that respects semantic boundaries. A 200-document knowledge base where every document is current and well-scoped beats a 2,000-document index where most are stale.
This is the unglamorous work that determines whether the agent actually answers questions correctly. We cover the wider preparation work in the AI readiness checklist for ecommerce.
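The hygiene requirements above can be enforced mechanically before a document reaches the index. A sketch with illustrative field names and an assumed six-month review window:

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=180)  # assumption: documents go stale after six months

def is_index_ready(doc: dict, today: date) -> list[str]:
    """Return the list of hygiene problems; an empty list means index-ready."""
    problems = []
    if not doc.get("owner"):
        problems.append("missing owner")
    if not doc.get("scope"):
        problems.append("missing scope statement")
    reviewed = doc.get("last_reviewed")
    if reviewed is None:
        problems.append("never reviewed")
    elif today - reviewed > REVIEW_WINDOW:
        problems.append("stale: last reviewed over 180 days ago")
    return problems
```

Run this as a gate on the ingestion pipeline and the 2,000-document index of mostly-stale articles never gets built in the first place.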
Tool calls fetch authoritative data
Anything that can be looked up in real time should be — order status, inventory, account state. Putting these into the knowledge base is a recipe for stale answers. The agent should call a function, get the live state, and answer from that. Retrieval is for policies and product information; tools are for live data.
This split is what separates an agent that quietly drifts out of date from one that keeps working. It is also where composable commerce architectures pay off — clean, well-documented APIs make tool-calling cheap, while monolithic platforms make every integration a project.
Escalation Logic
The single highest-leverage piece of an agent’s design — more than the model, more than the prompt, more than the retrieval — is the escalation logic. It is also the piece most teams under-invest in.
The principle: escalate early and gracefully. The cost of a customer rage-quitting a broken agent is higher than the cost of a slightly-too-eager handoff to a human.
A working escalation policy covers:
- Confidence-based escalation. If the agent’s confidence in its answer is below a threshold, hand off. Confidence can be a calibrated model output, an LLM-as-judge sanity check, or a heuristic on retrieval quality.
- Topic-based escalation. Some intents — refund disputes over a threshold, complaints, accessibility issues, anything legal or medical-adjacent — go straight to a human regardless of the agent’s confidence.
- Customer-signal escalation. If the customer types “human”, “agent”, “speak to someone”, or any frustration signal, hand off immediately. No retry loops.
- Loop detection. If the same customer asks the same thing twice and the agent’s answer has not satisfied them, hand off.
When the agent escalates, it has to pass full context to the human — the conversation, the retrieved documents, the tool calls made, the agent’s reasoning. The number-one complaint about poorly-built agents is “I had to repeat everything to the human.” Solve that in the architecture.
Logging is non-negotiable. Every escalation goes into a queue that a human reviews periodically — not just to handle the contact, but to learn what the agent should have done differently.
The Evaluation Harness
Without an evaluation harness, you are shipping based on vibes. The pattern that works has three layers:
- Offline evals against a curated test set. Fifty to two hundred representative contacts, each with a known good answer or a defined acceptable-answer rubric. Runs in CI and gates any merge that touches the prompt, the retrieval system, or the model.
- Online evals on sampled production traffic. A small percentage of live conversations scored on the fly against the same rubric. Catches drift, edge cases, and distribution shifts the offline test set misses.
- Human review of a 5% sample. Random sample of conversations reviewed weekly by a domain expert. Catches the failure modes the automated graders miss — tone problems, subtle hallucinations, brand-voice drift.
This is the same three-layer pattern I describe in detail in LLM evaluation for non-ML teams. It is not specific to customer service, but customer service is one of the contexts where skipping it is most expensive.
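The offline layer is the easiest of the three to sketch: a rubric-driven runner over a curated test set. Here `agent` is stubbed, and the rubric is a plain keyword check standing in for an LLM-as-judge grader:

```python
# Offline eval: run the agent against a curated test set and gate on pass
# rate. agent() is a stub; grade() is a keyword rubric in place of a judge.

TEST_SET = [
    {"contact": "Where is order A1001?", "must_mention": ["dispatched"]},
    {"contact": "Can I return sale items?", "must_mention": ["30 days"]},
]

def agent(contact: str) -> str:
    # Stub: a real run calls the full agent loop end to end.
    if "A1001" in contact:
        return "Order A1001 was dispatched yesterday."
    return "Sale items can be returned within 30 days."

def grade(answer: str, must_mention: list[str]) -> bool:
    return all(term.lower() in answer.lower() for term in must_mention)

def run_offline_evals(threshold: float = 0.95) -> bool:
    """True when the pass rate clears the threshold; gate CI merges on this."""
    passed = sum(grade(agent(case["contact"]), case["must_mention"])
                 for case in TEST_SET)
    return passed / len(TEST_SET) >= threshold
```

Wire `run_offline_evals` into CI so a prompt or retrieval change that drops the pass rate below the threshold cannot merge; the online and human-review layers then catch what the curated set cannot.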
Observability and Cost
You cannot debug what you cannot see. Every production agent in 2026 is wired up to a tracing platform — Langfuse, Helicone, Braintrust, OpenAI traces, Datadog LLM Observability, or in-house equivalents. The shopping list:
- Per-conversation traces with the full prompt, retrieval results, tool calls, model outputs, and final response
- Per-conversation cost tracking broken down by token usage, retrieval calls, and tool invocations
- Latency monitoring at every stage of the agent loop
- Quality scores from the online eval pipeline
- Alerting thresholds on cost spikes, latency degradation, quality drops, and escalation rate changes
Cost is the one most teams under-monitor. A well-built agent runs at a few pence per conversation. A badly-built agent — too many tool calls, oversized retrieval results, frontier model where a smaller one would do — can run at 30-50p per conversation. The factor-of-ten difference matters at scale, and the only way to manage it is to instrument it from day one. This is also where platform engineering discipline pays — observability is a platform problem, not an AI problem, and treating it that way is what keeps the agent operable a year after launch.
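Per-conversation cost is simple arithmetic once token counts are traced. A sketch with placeholder per-million-token prices in GBP; real rates vary by model and provider, and the per-tool-call figure is an assumption:

```python
# Per-conversation cost from traced token counts. Prices are illustrative
# placeholders in GBP per million tokens, not real provider rates.

PRICE_PER_M_INPUT = 0.20    # £ per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 0.80   # £ per 1M output tokens (assumed)

def conversation_cost_pence(input_tokens: int, output_tokens: int,
                            tool_calls: int,
                            pence_per_tool_call: float = 0.05) -> float:
    """Total cost in pence for one conversation, for the per-trace dashboard."""
    model_cost_gbp = (input_tokens * PRICE_PER_M_INPUT
                      + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    return round(model_cost_gbp * 100 + tool_calls * pence_per_tool_call, 2)
```

Attach this figure to every trace and the factor-of-ten gap between a lean agent and a bloated one shows up on a dashboard within a day, not on an invoice within a month.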
The same observability work also feeds the evaluation harness. The traces become training data for the next iteration. Skipping observability does not just hide problems — it kills the iteration loop. The wider production discipline applies here exactly the same way it does in any RAG pipeline — log everything, sample what you review, automate the rest.
The Seven Build Mistakes I See Most Often
A short, opinionated list. Every one of these has cost a team months of rebuild.
- Skipping intent-level retrieval scoping. One giant index, terrible recall, agent answers from the wrong document. The fix is unglamorous and time-consuming, and there is no shortcut.
- Trusting model knowledge for live data. Asking the LLM about an order status instead of calling the order API. Always wrong, often confidently wrong.
- No evaluation harness before launch. Shipping based on demo-quality testing. Reality bites in week three.
- Late-binding the escalation logic. Treating handoff as an afterthought. Customers churn through an agent that cannot recognise its own limits.
- Picking the most powerful model first. Starting with a frontier model, building a working agent on it, and then discovering the unit economics break. Build on the cheaper model first; upgrade where evals demand it.
- No human review sample. Trusting automated evals end-to-end. There are failure modes only humans catch.
- Treating the agent as a one-time build. Shipping it and walking away. Knowledge bases drift, customer behaviour shifts, models change. The agent needs an owner.
A 90-Day Implementation Arc
A pragmatic week-by-week plan from kickoff to first 5% of conversations live.
- Weeks 1-3 — Discovery. Contact taxonomy. Audit the knowledge base. Define escalation policy. Pick the architecture (managed, framework, primitives). Write the contract for what “good” looks like — three to five concrete success criteria.
- Weeks 4-6 — First loop. Build retrieval per intent. Wire up the first set of tool calls (order status, product lookup). Stand up the agent loop. Build the evaluation harness — fifty curated contacts is enough to start.
- Weeks 7-9 — Hardening. Iterate against the offline evals until quality crosses the threshold. Build observability and per-conversation cost tracking. Implement the escalation logic and the human handoff context-passing.
- Weeks 10-11 — Shadow phase. Run the agent against live traffic without showing customers the output. Compare to what humans actually said. Catch the embarrassing failures here.
- Weeks 12-13 — Live on 5%. Route a sampled percentage of contacts to the agent. Daily human review of the sample. Fix what breaks. Earn the right to scale up.
Teams that try to compress this into 30 days ship something they later have to rebuild. The 90-day arc is not slow — it is the right speed for a system this important.
When Not To Ship One
The honest closing thought. AI agents are the right answer for a lot of mid-market ecommerce customer-service loads. They are the wrong answer for some. Specifically:
- Small contact volumes. A single human handling the queue comfortably is cheaper and better than the engineering and ops cost of a working agent. Threshold is roughly a few thousand contacts a month.
- High-stakes categories. Medical, legal, financial advice, accessibility, complaints with regulatory implications — these belong with humans.
- Brand-voice-as-product. If your brand is the reason customers buy and the cost of getting tone slightly wrong is higher than the labour saved, the trade is not favourable. Hospitality, luxury, and high-end DTC often sit here.
If you are in one of those categories, the right move is to invest in better tooling for human agents — better unified inboxes, better knowledge-base search, better routing — rather than replacing them.
For everyone else, the build is real, the technology is mature, and the teams that get there first carry a meaningful operating advantage for as long as their competitors are still arguing about whether it works. It works. The question is whether you build it well.