A product team ships an AI feature. The demo works. The first week of production traffic looks fine. Six months later, support tickets are quietly accumulating because the LLM produces wrong answers 18% of the time on a specific class of query that the team never tested. By the time someone notices, trust in the feature is already gone and the rebuild takes longer than the original build.

This pattern repeats because LLM evaluation is widely treated as an ML research discipline that product teams cannot do. That framing is wrong in 2026. Evaluation is mostly a product and engineering problem: curating test cases, defining grading rubrics, and running results systematically. This guide covers the practical version of that work for teams without ML specialists.

Why Evaluation Separates Demo AI from Production AI

LLMs behave differently across inputs. The same prompt that produces a polished, accurate response on a cherry-picked example can produce a confident hallucination on an adjacent one. The gap between “it worked on my five test inputs” and “it works on the full distribution of production traffic” is where most AI projects quietly fail.

Evaluation is how product teams measure that gap and manage it. Without evaluation:

  • You cannot compare prompt versions except by guessing whether a change improved or regressed the system
  • You cannot upgrade models without risking silent quality drops
  • You cannot explain to stakeholders how confident you are in a specific feature’s output
  • You cannot reproduce or diagnose production failures because you have no ground truth to compare against

Evaluation is not glamorous. It is the discipline that turns demo AI into production AI, and the teams that invest in it are the teams whose AI features are still live two years after launch.

The Three Layers of Evaluation

A useful evaluation pipeline has three layers. Each catches different failure modes, and all three are needed for serious production systems.

Layer 1: offline evals against a curated test set

Offline evals are the workhorse of the system. You build a test set of representative inputs, you define a grading rubric, and you run the model against the test set any time something changes — the prompt, the model version, the retrieval system, the context window.

The output is a numeric score (and usually a breakdown by failure mode) that the team tracks over time. If the score drops, the change is a regression; if it rises, an improvement. Offline evals run in CI and gate merges, the same way unit tests do for traditional software.

What makes an offline eval useful is the test set. A good one contains fifty to two hundred inputs covering:

  • Happy path: common, straightforward queries the system should handle well
  • Edge cases: unusual phrasings, ambiguous queries, missing context
  • Known failure modes: queries you have previously seen the model get wrong
  • Adversarial inputs: attempts to break the system — off-topic requests, prompt injection, harmful queries
  • Distribution samples: real queries pulled from production logs and labelled

The test set is the single most important artefact in an evaluation pipeline. Spend time on it.
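A test set with explicit category tags makes coverage gaps visible. A minimal sketch, assuming a JSON-lines file where each case carries an `input`, an `expected` answer, and a `category` tag (the field names are illustrative, not a standard):

```python
import json
from collections import Counter

# Category tags mirroring the five buckets above; names are illustrative.
CATEGORIES = {"happy_path", "edge_case", "known_failure",
              "adversarial", "production_sample"}

def load_test_set(path):
    """Load one JSON test case per line and validate its category tag."""
    cases = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            if case["category"] not in CATEGORIES:
                raise ValueError(f"unknown category: {case['category']}")
            cases.append(case)
    return cases

def coverage_report(cases):
    """Count cases per category so under-tested buckets stand out."""
    return Counter(case["category"] for case in cases)
```

A coverage report like this is worth printing on every eval run: a test set with forty happy-path cases and two adversarial ones is telling you where the next hour of curation should go.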

Layer 2: online evals on sampled production traffic

Offline evals use a fixed test set, which means they cannot catch drift — real-world queries evolve, user behaviour shifts, new edge cases emerge. Online evals close that gap.

The pattern: a small percentage of production traffic (typically 1-10%) is sent through an evaluation pipeline that scores the LLM output on the fly. The scores are logged, aggregated, and monitored. When scores drop, alerts fire. When new failure patterns emerge, they get added to the offline test set for future regression testing.

Online evals are implemented in 2026 through frameworks like Langfuse, Braintrust, Helicone, and Datadog LLM Observability. The engineering work is modest — instrumenting the LLM call site to log inputs, outputs, and context, then scoring asynchronously against a rubric.

Layer 3: human review of a random sample

Automated evaluation has blind spots. Grading rubrics miss subtleties; LLM-as-judge graders have their own biases; metrics can look fine while the actual output feels wrong to a human reader. A small random sample of production output reviewed weekly by a domain expert catches failures the automated layers miss.

Review takes 15-30 minutes for a sample of 20-50 queries. The reviewer flags output as “good”, “borderline”, or “bad” with a short note on each flagged case. The flagged cases become new test cases in the offline evaluation set, closing the loop.

Skipping the human layer is a common failure mode. Automated evaluation alone tends to overfit to whatever the rubric explicitly measures, missing everything else. Human review is the check against that drift.
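The review loop itself is simple enough to sketch: draw a random sample from the production log, collect the reviewer's verdicts, and promote flagged cases into the offline test set. A minimal version, assuming each review record carries a `verdict` and a `note` (hypothetical field names):

```python
import random

def sample_for_review(production_log, n=30, seed=None):
    """Draw a random sample of logged query/response records for human review."""
    rng = random.Random(seed)  # seedable so a week's sample is reproducible
    return rng.sample(production_log, min(n, len(production_log)))

def promote_flagged(reviews, test_set):
    """Append every flagged case to the offline test set, closing the loop."""
    for review in reviews:
        if review["verdict"] in ("borderline", "bad"):
            test_set.append({"input": review["query"],
                             "category": "known_failure",
                             "note": review["note"]})
    return test_set
```

The point of the sketch is the second function: a review that does not feed the offline test set is a report, not a loop.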

Grading Rubrics That Actually Work

The hardest part of an evaluation pipeline is defining what “good” means. A useful rubric is specific, testable, and grounded in the business outcome rather than abstract AI qualities.

For a customer service AI, a rubric might include:

  • Correctness: does the response contain accurate information according to the knowledge base?
  • Completeness: does it address the customer’s actual question or leave gaps?
  • Tone: does it match the brand voice as defined in the style guide?
  • Format: does it follow the expected structure (greeting, body, sign-off)?
  • Safety: does it avoid making commitments, promises, or legal claims that the business cannot back?
  • Escalation: does it correctly recognise when a human should take over?

Scoring each criterion on a short scale (binary pass/fail, or three-point) produces a multi-dimensional quality signal. The aggregate score is useful; the breakdown is more useful — it tells you whether a regression is about correctness or about tone, which changes what fixes the problem.
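Aggregating those per-criterion grades is straightforward. A minimal sketch, assuming each graded output is a dict of criterion name to score on a 0-2 scale (fail / borderline / pass):

```python
from collections import defaultdict

def aggregate(grades):
    """Return the overall mean score plus a per-criterion breakdown.

    grades: list of dicts, each mapping criterion -> score (0, 1, or 2).
    The breakdown is what tells you *which* dimension regressed.
    """
    totals = defaultdict(list)
    for grade in grades:
        for criterion, score in grade.items():
            totals[criterion].append(score)
    breakdown = {c: sum(s) / len(s) for c, s in totals.items()}
    overall = sum(breakdown.values()) / len(breakdown)
    return overall, breakdown
```

Report both numbers on every run; trend the breakdown, not just the aggregate.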

For a product search AI, the rubric changes: relevance, ranking, handling of zero-results cases, robustness to misspellings, behaviour on ambiguous queries. Every task has its own rubric. Generic LLM benchmarks (MMLU, HellaSwag, etc.) measure model capability in the abstract; they do not tell you whether a specific model works for your specific task.

LLM-as-Judge: Useful, Not Magic

Grading hundreds of test cases by hand every time the prompt changes is impractical. The standard workaround is LLM-as-judge: a strong model (typically GPT-4, Claude Opus, or a similar frontier model) grades the output of your production model against the rubric.

LLM-as-judge works well when:

  • The rubric is well-defined and testable
  • The task is objective enough that two humans would agree on the grade
  • The judge model is meaningfully stronger than the model being evaluated
  • The judge has been validated against human grades on a sample

It fails when:

  • The grading requires deep domain expertise the judge lacks
  • The rubric involves subtle aesthetic or contextual judgements
  • The judge is the same model being evaluated (it gives itself high marks)
  • Prompt injection or adversarial inputs manipulate the judge

The practical pattern: use LLM-as-judge as the primary automated scorer, validate it against human grades on 50-100 cases monthly, and flag any widening gap between human and LLM grades as a sign the judge rubric needs refinement.
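The monthly validation step reduces to comparing two grade lists for the same cases. A minimal sketch (raw agreement rate; a more careful version would use a chance-corrected statistic like Cohen's kappa):

```python
def agreement_rate(judge_grades, human_grades):
    """Fraction of cases where the LLM judge and the human gave the same grade."""
    assert len(judge_grades) == len(human_grades)
    matches = sum(j == h for j, h in zip(judge_grades, human_grades))
    return matches / len(judge_grades)

def needs_rubric_review(judge_grades, human_grades, floor=0.85):
    """Flag the judge rubric for refinement when agreement drops below a floor.

    The 0.85 floor is an illustrative default, not a universal standard;
    calibrate it against how often two of your own humans agree.
    """
    return agreement_rate(judge_grades, human_grades) < floor
```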

What to Evaluate Beyond Correctness

Correctness is necessary but not sufficient. Production systems need evaluation on additional dimensions:

Latency. A correct response that takes 8 seconds is often worse than an adequate response in 2. Track p50 and p99 latency alongside quality metrics. Some quality improvements (longer reasoning chains, more retrieval) trade latency for quality; track the trade-off explicitly.

Cost. Frontier models cost 10-100x more per query than mid-tier models. Track cost per query and cost per successful completion. A cheap model that produces 90% quality at 10% cost is often the right engineering choice over a premium model at 95% quality.
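Both metrics are cheap to compute from per-query records. A minimal sketch: a nearest-rank percentile for p50/p99 latency, and cost per successful completion (total spend divided by the number of queries that passed the rubric):

```python
def percentile(values, p):
    """Nearest-rank percentile (e.g. p=50 for p50, p=99 for p99)."""
    ranked = sorted(values)
    index = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[index]

def cost_per_success(records):
    """records: list of {"cost": dollars, "passed": bool} per query.

    Dividing total cost by *successful* completions is the honest number:
    a cheap model with a high failure rate is not actually cheap.
    """
    total_cost = sum(r["cost"] for r in records)
    successes = sum(1 for r in records if r["passed"])
    return total_cost / successes if successes else float("inf")
```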

Refusal rate. Safety-tuned models sometimes refuse legitimate queries. For customer-facing systems, high refusal rates are a silent quality issue — users hit dead ends and lose trust. Track refusals and sample them for review.

Format adherence. Structured outputs — JSON, specific headers, bounded length — need explicit format evaluation. “Correct answer in the wrong format” is a real production failure mode when downstream systems parse the output.

Distribution robustness. How does the system behave on queries in a different language, tone, or style than the test set? Regional variants, non-native speakers, users in a hurry, and users typing on mobile all produce different input distributions. Sample testing on realistic distribution variants catches failures the homogeneous test set misses.

Tools in 2026

The evaluation tooling landscape has matured significantly. The practical choices for a product team in 2026:

  • Promptfoo. Open-source, YAML-configured evaluation suite that integrates with CI. Good starting point for offline evals. Free and self-hosted.
  • Langfuse. Open-source LLM observability with eval and tracing support. Strong for combining offline and online evaluation in one platform.
  • Braintrust. Commercial evaluation platform with strong UX for rubric design, test case management, and team collaboration. Higher cost but significant time savings for larger teams.
  • Helicone and Phoenix (Arize). Production-focused LLM observability with built-in evaluation. Good fit for online eval and monitoring.
  • Inspect AI (UK AISI). Open-source evaluation framework from the UK AI Safety Institute — strong for safety and adversarial testing.
  • OpenAI Evals. Free framework maintained by OpenAI, tightly integrated with OpenAI models. Reasonable default if your stack is OpenAI-primary.

The choice of tool matters less than the discipline of running evaluation consistently. A team running basic Promptfoo in CI beats a team with Braintrust that forgets to run it.

Common Evaluation Failure Modes

Five patterns that undermine evaluation in practice:

Test set leakage. The prompt gets optimised against the test set until it scores perfectly there and poorly everywhere else. Fix: reserve a held-out test set never used for iteration, and rotate the iteration set over time.
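The held-out split should be deterministic, so a case never silently migrates between sets as the file is re-shuffled. One sketch: hash the case's stable ID into a bucket (the 20% fraction is an illustrative default):

```python
import hashlib

def split_case(case_id, held_out_fraction=0.2):
    """Deterministically route a test case to 'held_out' or 'iteration'.

    Hashing the stable case ID means the assignment never changes across
    runs, machines, or reorderings of the test-set file.
    """
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "held_out" if bucket < held_out_fraction * 100 else "iteration"
```

Score the held-out set only when deciding whether to ship, never while iterating on the prompt.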

Rubric drift. The grading rubric evolves informally as reviewers see edge cases, invalidating historical comparisons. Fix: version the rubric, log which version scored which run, and update the full historical test set when the rubric changes materially.

Overreliance on aggregate scores. A single number hides whether quality is dropping on a specific subpopulation while improving on average. Fix: always report aggregate scores with per-segment breakdowns.

Ignoring cost and latency. Quality optimisation that triples cost or latency is not always a win. Fix: define target operating characteristics (max p99 latency, max cost per query) and reject changes that violate them regardless of quality gains.

No production coverage. Offline evals run clean, the feature ships, real users produce queries the test set never covered, and quality is poor. Fix: route a percentage of production traffic into online evaluation and refresh the offline test set monthly from sampled production logs.

Getting Started Without an ML Team

A realistic first-month evaluation programme for a non-ML product team:

  1. Pick one AI feature already in production or near-production
  2. Build an offline test set of 50-100 inputs covering happy path, edge cases, and known failures
  3. Define a 4-6 criterion grading rubric scored on a three-point scale
  4. Pick one tool (Promptfoo is the friendliest starting point) and wire it into CI
  5. Have a domain expert review a random sample of 20-30 production outputs weekly, flag failures, and add them to the test set
  6. Expand to online evaluation on sampled production traffic in month two
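Step 4's CI wiring ultimately reduces to a gate script: run the eval, compare the aggregate score against a floor, and fail the build on regression. A minimal sketch (how the score is produced depends on your tool; the 0.80 floor is an illustrative placeholder):

```python
import sys

QUALITY_FLOOR = 0.80  # minimum acceptable aggregate score; tune to your rubric

def gate(aggregate_score, floor=QUALITY_FLOOR):
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    if aggregate_score < floor:
        print(f"eval gate FAILED: {aggregate_score:.2f} < {floor:.2f}")
        return 1
    print(f"eval gate passed: {aggregate_score:.2f}")
    return 0

if __name__ == "__main__":
    # e.g. invoked from CI as: python eval_gate.py 0.87
    sys.exit(gate(float(sys.argv[1])))
```

A nonzero exit code is all CI systems need; everything else is reporting.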

This is not glamorous work. It is the infrastructure that turns a prototype into a reliable system, and the teams that skip it are the teams whose AI features quietly stop being trustworthy over time.

Getting Help

We build evaluation pipelines as part of AI engineering engagements through our AI engineering services. For the upstream architecture question that evaluation helps answer, our RAG vs fine-tuning guide covers the decision framework. For the business framing, our AI strategy for mid-market businesses and AI readiness checklist cover how evaluation fits into a wider AI programme.

If you want a second opinion on an existing AI feature’s quality — or a structured evaluation pipeline for a feature about to ship — get in touch.