A team builds a RAG pipeline. The demo is impressive — well-chosen documents, well-chosen queries, the model produces accurate, well-cited answers. The team ships. A month later, support tickets are quietly accumulating because the production system retrieves the wrong context on a third of real-world queries, fabricates citation links, and confidently answers from documentation that was deprecated last quarter.

This is the standard arc of a first RAG pipeline. The failures are not random: they cluster around the same eight patterns, most of them on the retrieval side, all invisible until production traffic hits them. This guide is about those eight failure modes and the minimal-viable architecture that avoids them from day one.

This is not a beginner’s introduction. If you are still deciding whether to use RAG at all, our RAG vs fine-tuning decision framework covers that question. If you want the business framing, our RAG for business post covers value and use cases. This post assumes you have already decided to build, and want to ship something that works.

Why Retrieval Is the Bottleneck

Most teams spend 80% of their RAG engineering effort on the generation side: prompt design, model choice, output formatting, structured generation. That allocation is upside-down. Retrieval is the bottleneck of every production RAG system, because if the relevant chunk does not enter the context window, no prompt and no model can rescue the answer.

The mental model worth holding: a RAG system is a search engine bolted to a language model. Search quality determines answer quality. Generation matters, but it cannot recover from upstream retrieval failures — it can only conceal them behind fluent prose.

The eight failure modes below sit in the pipeline, not the model: most are upstream of generation, and the rest wrap around it. The fix in every case is engineering discipline applied to that pipeline, most of it before the language model sees a single token.

Failure Mode 1: Chunking by Token Count, Not Semantic Boundary

The default chunking strategy in every quickstart tutorial is fixed-size — split the document into 512-token windows with 50-token overlap. It produces chunks. It also produces chunks that cut sentences in half, separate headings from their content, and split tables across boundaries.

A vector search over fixed-size chunks finds the chunk that best matches the query embedding. That chunk often lacks the context the model needs to answer because the meaningful unit was split across two adjacent chunks, only one of which was retrieved. The model receives half a paragraph and confidently fills in the missing context from its training data, which is exactly the hallucination RAG was supposed to prevent.

What to do instead: chunk on semantic boundaries. Respect heading hierarchy, paragraph breaks, list items, and structured elements like code blocks or tables. For long documents, use hierarchical chunking — one set of chunks for paragraph-level retrieval and a parent-document layer for context expansion when a chunk hits. Late chunking, where the full document is embedded and sliced afterwards, is increasingly common in 2026 because it preserves long-range semantic context that early chunking destroys.
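
A minimal sketch of heading-aware chunking with a parent-document pointer, assuming markdown-style `#` headings; the `Chunk` shape and character budget are illustrative, not prescriptive:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: str  # points back to the full document for context expansion
    heading: str    # nearest enclosing heading, kept as retrieval metadata

def semantic_chunks(doc_id: str, markdown: str, max_chars: int = 2000) -> list[Chunk]:
    """Split on heading boundaries first, then pack whole paragraphs,
    so no chunk cuts a sentence or separates a heading from its body."""
    chunks: list[Chunk] = []
    # Split into sections at heading lines, keeping each heading with its body.
    for section in re.split(r"(?m)^(?=#{1,6} )", markdown):
        if not section.strip():
            continue
        lines = section.splitlines()
        heading = lines[0] if lines[0].startswith("#") else ""
        body = "\n".join(lines[1:]) if heading else section
        buf = heading
        for para in re.split(r"\n\s*\n", body):
            if not para.strip():
                continue
            # Flush before overflowing, but never split inside a paragraph.
            if buf != heading and len(buf) + len(para) > max_chars:
                chunks.append(Chunk(buf.strip(), doc_id, heading))
                buf = heading  # re-prepend the heading so each chunk is self-describing
            buf += "\n\n" + para
        if buf != heading and buf.strip():
            chunks.append(Chunk(buf.strip(), doc_id, heading))
    return chunks
```

Re-prepending the heading to every chunk is the cheap half of context expansion: even before the parent document is fetched via `parent_id`, each chunk carries its own section context.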

Failure Mode 2: One Embedding Model for Every Query

The standard pattern is to use a single embedding model — typically OpenAI’s text-embedding-3-small or a Cohere model — for both indexing and queries. It works in demos because the demo queries are similar in shape to the indexed documents. It fails in production because real users ask short, ambiguous, conversational questions that look nothing like the long, formal documents in the index.

The result is a query distribution that does not match the indexed distribution, and a vector search that retrieves on the wrong dimensions of similarity. Users phrase the same intent five different ways; only one of those ways retrieves correctly.

What to do instead: decouple query embedding from document embedding. The standard techniques are HyDE (hypothetical document embeddings), where a fast model generates a plausible answer to the query and that answer gets embedded for the search; query rewriting, where the user query is reformulated into a search-shaped query before embedding; and multi-vector indexing, where each document is represented by multiple embeddings (summary, keywords, full text) and the query searches across all of them. Pick one. The lift over a single embedding model is usually substantial.
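
A minimal HyDE sketch. The `complete`, `embed`, and `search` callables are placeholders for whatever LLM, embedding, and vector-store clients you already run; the shape of the technique is the point:

```python
from typing import Callable, Sequence

def hyde_search(
    query: str,
    complete: Callable[[str], str],           # fast LLM call
    embed: Callable[[str], Sequence[float]],  # same model used to index documents
    search: Callable[[Sequence[float], int], list],  # vector store top-k lookup
    k: int = 20,
) -> list:
    """HyDE: embed a hypothetical answer rather than the raw query, so the
    search vector lands in the same distribution as the indexed documents."""
    prompt = (
        "Write a short, factual documentation passage that would answer "
        f"the following question. Do not address the reader.\n\nQuestion: {query}"
    )
    hypothetical = complete(prompt)
    return search(embed(hypothetical), k)
```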

Failure Mode 3: No Retrieval-Only Evaluation

The team eyeballs five examples, the answers look right, the system ships. Six weeks later, a production audit shows that 30% of queries retrieve the wrong context — the model is answering from irrelevant chunks but doing so fluently enough that nobody flagged it.

This happens because most teams evaluate the end-to-end output and never the retrieval step in isolation. End-to-end evaluation is noisy; a fluent wrong answer can score higher than a halting correct one if the rubric is generic. Retrieval evaluation is much sharper: did the right chunk make it into the top-k or not?

What to do instead: build a retrieval-only eval before you touch generation. Curate 50-200 queries with their expected relevant chunks (a “golden set”). Measure recall@k, MRR, and nDCG. Run this every time the chunking strategy, embedding model, or indexing pipeline changes. Our guide to LLM evaluation covers the eval discipline at large; for RAG specifically, retrieval evaluation is the single highest-leverage investment a team can make.
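
Recall@k and MRR need no framework; a golden set and a few lines of standard-library Python are enough to put retrieval quality in CI. The `retrieve` callable is a placeholder for your own pipeline:

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for cid in ranked[:k] if cid in relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk; 0 if none was retrieved."""
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(golden_set, retrieve, k: int = 5) -> dict[str, float]:
    """golden_set: list of (query, relevant_chunk_ids) pairs.
    retrieve: your pipeline, returning ranked chunk IDs for a query."""
    recalls, ranks = [], []
    for query, relevant in golden_set:
        ranked = retrieve(query)
        recalls.append(recall_at_k(ranked, relevant, k))
        ranks.append(mrr(ranked, relevant))
    n = len(golden_set)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Run `evaluate` in CI and fail the build when the scores drop; that single gate catches most chunking and embedding regressions before users do.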

Failure Mode 4: Top-k=5 Dogma

Every RAG tutorial uses top-k=5. Most production systems also use top-k=5. There is no reason for this beyond inertia, and it is a poor policy because different query types need different retrieval depths.

A factual lookup (“what is the SLA for tier 2?”) often needs k=1 or k=2 with a high similarity threshold. A multi-hop question (“how does tier 2 SLA compare to enterprise plans for incident response?”) needs k=15-20 followed by aggressive reranking. A summarisation request needs almost everything related to the topic. Hardcoding k=5 satisfies none of these well.

What to do instead: combine relevance-thresholded retrieval with a cross-encoder reranker. Pull every chunk above a similarity score (say, 0.75 cosine), then rerank the candidates using a cross-encoder model that scores the query-chunk pair directly. The reranker is more expensive per pair than vector similarity but runs on a small candidate set, so total latency stays manageable. The output is a query-aware top-k rather than a fixed one.
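
A sketch of the pattern using the sentence-transformers cross-encoder API; the model name is one common public reranker, and `index_search` stands in for your vector store:

```python
from sentence_transformers import CrossEncoder

# One common public reranker; swap in whichever cross-encoder you prefer.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query, index_search, threshold=0.75,
                        max_candidates=50, final_k=8):
    """Pull everything above the similarity threshold, then let a
    cross-encoder produce a query-aware ranking of the survivors."""
    # index_search returns (chunk_text, cosine_similarity) pairs, best first.
    candidates = [text for text, sim in index_search(query, max_candidates)
                  if sim >= threshold]
    if not candidates:
        return []
    # The cross-encoder scores each (query, chunk) pair directly.
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:final_k]]
```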

Failure Mode 5: Citation Drift and Fabricated Sources

The model is instructed to cite its sources. The output looks well-cited. The links go to documents that don’t exist, or to real documents that say something different from what the model claims. This is citation drift — the model fabricates citation strings because the prompt asks for citations and producing plausible ones is easier than refusing.

This is the failure mode that quietly destroys trust in RAG. Users learn to ignore the citations, and once that happens, the entire value proposition of RAG (verifiable, traceable answers) is gone.

What to do instead: enforce structured output with citation IDs that map directly to the chunks retrieved for this specific query. Each retrieved chunk gets a short ID ([1], [2], etc.). The model is instructed to cite only those IDs. A post-generation validator parses the output, checks that every cited ID exists in the retrieval set, and rejects or flags responses with invalid citations. The validator is twenty lines of code; it is the difference between citations that are decorative and citations that are load-bearing.
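
The validator really is that small. A sketch, assuming bracketed-integer citation IDs like `[1]`:

```python
import re

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, list[str]]:
    """Return (ok, invalid_ids). A citation is valid only if its ID maps to a
    chunk retrieved for *this* query; anything else is treated as fabricated."""
    cited = set(re.findall(r"\[(\d+)\]", answer))
    invalid = sorted(cited - retrieved_ids)
    return (not invalid, invalid)

# The retrieval set for this query was labelled [1]..[4] in the prompt.
answer = "Tier 2 SLA is 4 hours [2], confirmed by the incident policy [7]."
ok, bad = validate_citations(answer, {"1", "2", "3", "4"})
assert not ok and bad == ["7"]  # [7] was never retrieved: reject or regenerate
```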

Failure Mode 6: Stale Index With No Refresh Strategy

The team indexes the corpus once before launch. Six months later, the documentation has been updated, the policies have changed, and the product itself has shipped two new features. The index has not. Users get answers from documents that have been deprecated, edited, or replaced — but the answers feel current because the model presents them confidently.

Stale indexes are insidious because nothing breaks. The system continues to retrieve and answer. It is just answering from a frozen snapshot of an organisation that has moved on.

What to do instead: build an indexing pipeline, not an indexing event. Every document gets a last_indexed_at timestamp and a TTL. Every update to a source document triggers re-embedding of the affected chunks. Every deletion removes the chunks from the index. Delta indexing — re-embedding only what has changed — keeps the cost manageable for daily or hourly refresh cycles. The orchestration layer (Airflow, Dagster, Temporal) is where this discipline lives, and our platform engineering services team builds these regularly because they are tedious but load-bearing infrastructure.
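
A sketch of the delta-indexing loop. The `index` interface (`get_meta`, `upsert`, `delete`, `all_doc_ids`) and `embed_chunks` are placeholders for your vector store and embedding step; the hashing-plus-TTL logic is the part that transfers:

```python
import hashlib
from datetime import datetime, timedelta, timezone

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def delta_index(sources: dict[str, str], index, embed_chunks, ttl_hours: int = 24):
    """sources maps doc_id -> current text. Re-embed only what changed,
    force a refresh when the TTL expires, and purge deleted documents."""
    now = datetime.now(timezone.utc)
    for doc_id, text in sources.items():
        meta = index.get_meta(doc_id)  # {"hash": ..., "last_indexed_at": ...} or None
        unchanged = meta is not None and meta["hash"] == content_hash(text)
        fresh = meta is not None and now - meta["last_indexed_at"] < timedelta(hours=ttl_hours)
        if unchanged and fresh:
            continue  # nothing to do until content changes or the TTL expires
        index.delete(doc_id)  # drop stale chunks before re-inserting
        index.upsert(doc_id, chunks=embed_chunks(text),
                     meta={"hash": content_hash(text), "last_indexed_at": now})
    # Source documents deleted upstream must also leave the index.
    for doc_id in set(index.all_doc_ids()) - set(sources):
        index.delete(doc_id)
```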

Failure Mode 7: No Query Intent Classifier

The system treats every query the same way. “How do I reset my password?” (procedural), “What is two-factor authentication?” (definitional), and “Is GDPR consent required for this flow?” (policy lookup) all hit the same retrieval pipeline with the same k, the same prompt template, and the same generation parameters. The results are mediocre across all three because the strategy is not matched to the question.

A query intent classifier — even a few-shot prompt to a fast model that returns one of 4-8 intent labels — lets the system route to different retrieval and generation paths. Procedural queries route to a step-extraction prompt with k=3 over how-to documents. Definitional queries route to a glossary index with k=1. Policy queries route to a policy-only index with stricter citation enforcement. Each path is simpler and better than a one-size-fits-all pipeline.

What to do instead: build a small classifier as the first step of the query pipeline. Define your intent taxonomy (4-8 categories is plenty), label 100-200 queries, and either fine-tune a small model or run a few-shot prompt against a fast model. The branching logic that follows is more flexible and easier to evaluate per-branch than a monolithic pipeline.
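
A minimal few-shot version; `complete` is a call to any fast model, and the routing table is illustrative:

```python
INTENTS = {"procedural", "definitional", "policy", "comparison", "other"}

FEW_SHOT = """Classify the query into exactly one label: \
procedural, definitional, policy, comparison, other.

Query: How do I reset my password?
Label: procedural
Query: What is two-factor authentication?
Label: definitional
Query: Is GDPR consent required for this flow?
Label: policy

Query: {query}
Label:"""

def classify_intent(query: str, complete) -> str:
    """`complete` is a call to any fast model; fail closed to `other`."""
    label = complete(FEW_SHOT.format(query=query)).strip().lower()
    return label if label in INTENTS else "other"

# Each intent gets its own retrieval depth, index, and prompt template.
ROUTES = {
    "procedural":   {"k": 3, "index": "howto",    "template": "step_extraction"},
    "definitional": {"k": 1, "index": "glossary", "template": "definition"},
    "policy":       {"k": 5, "index": "policy",   "template": "cited_answer"},
}
```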

Failure Mode 8: Prompt Drift Between Dev and Prod

The prompt that produced the demo is buried in a Python f-string, edited inline whenever someone tweaks behaviour, and never versioned. Three engineers make changes over two months. Nobody knows which prompt produced last week’s regression. The team rolls back code expecting to roll back the prompt and finds it didn’t move.

Prompts are programs. Treating them as inline strings is the equivalent of leaving SQL queries in your view layer — it works until it doesn’t, and when it doesn’t the failure is hard to diagnose.

What to do instead: build a prompt registry. Each prompt has a name, a version, and a stored history. Changes go through code review. Production reads the latest stable version; offline evaluation runs against any version on demand. CI gates regressions: if a prompt change drops eval score, the merge is blocked. Lightweight registries (a YAML directory + git history) work fine for small teams; larger teams use Langfuse, PromptLayer, or Braintrust for full prompt management with diff views and rollback.
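
A lightweight registry can be this small. A sketch assuming prompts stored as YAML files under git, with `stable` as a pinned version name; the file layout is one convention, not a requirement:

```python
from pathlib import Path

import yaml  # PyYAML

PROMPT_DIR = Path("prompts")  # lives in the repo; changes go through code review

def load_prompt(name: str, version: str = "stable") -> str:
    """Each prompt lives at prompts/<name>/<version>.yaml with a `template`
    key. Git history is the audit trail; `stable` is the pinned copy that
    production reads, while offline eval can load any version by name."""
    path = PROMPT_DIR / name / f"{version}.yaml"
    return yaml.safe_load(path.read_text())["template"]

# Production pins stable; an eval run can replay the exact version under test.
answer_prompt = load_prompt("rag_answer")        # prompts/rag_answer/stable.yaml
candidate     = load_prompt("rag_answer", "v7")  # prompts/rag_answer/v7.yaml
```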

What to Build Instead: A Minimal-Viable Production RAG

Avoiding the eight failure modes above does not require an elaborate architecture. The minimal-viable production RAG looks like this:

  1. Semantic chunking with a parent-document layer. Chunk on heading and paragraph boundaries; keep the parent document available for context expansion when retrieval hits a chunk.
  2. HyDE or query rewriting at the retrieval boundary. Reshape the user query into a search-shaped form before embedding. A few-shot prompt to a fast model is enough.
  3. Relevance-thresholded retrieval with a cross-encoder reranker. Skip top-k=5. Pull above a threshold, rerank, and return a query-aware result set.
  4. Retrieval-only evaluation in CI. A golden set of 100 query-chunk pairs, recall@k tracked on every change to chunking, embedding, or indexing.
  5. Structured output with enforced citation IDs. Citation IDs that map to retrieved chunks, validated post-generation, rejected if fabricated.
  6. Delta indexing pipeline with per-document TTL. Updates re-embed, deletions remove, freshness timestamps surface stale chunks. Runs daily at minimum.

The choice of vector database matters less than this discipline. Pinecone, Weaviate, Qdrant, and pgvector all support these patterns. Pick the one that fits your stack — if you already run Postgres, pgvector reduces operational surface; if you need managed infrastructure, Pinecone is the simplest; Weaviate and Qdrant sit between the two with strong open-source options. The team that ships these six patterns on pgvector will outperform the team that ships top-k=5 on Pinecone every time.

When RAG Is the Wrong Tool

Worth saying explicitly: RAG is not always the answer, and forcing it where it doesn’t fit produces worse outcomes than not using it at all. Three patterns where RAG is the wrong tool:

  • Behavioural problems. If the gap is tone, persona, format, or style, RAG cannot fix it. Prompt engineering or fine-tuning is the right tool. The RAG vs fine-tuning framework goes deeper.
  • Highly structured data. If the source of truth is a relational database with clean schemas, a query agent or text-to-SQL pipeline outperforms RAG by a wide margin. Retrieval over unstructured text is a poor substitute for a SELECT statement.
  • Sub-second latency budgets. Real-time conversational interfaces with strict latency targets often cannot afford the retrieval round-trip, especially with a reranker in the path. Smaller in-context knowledge or a fine-tuned model usually wins on latency-bound use cases.

If the use case fits one of the above, save yourself the engineering. RAG is powerful but it is not free, and the failure modes in this post compound when the architecture was wrong from the start.

Getting Help

We build production RAG systems as part of AI and data engineering engagements — semantic chunking, retrieval evaluation, citation enforcement, delta indexing, and the orchestration that keeps the index honest. For the upstream architectural decision, our RAG vs fine-tuning decision framework covers when RAG is the right tool. For the business framing, our RAG for business post covers value and use cases. For the evaluation discipline that catches all eight failure modes, our LLM evaluation guide covers the practical pipeline. And for the broader programme context, our AI readiness checklist and AI agents for customer service posts cover where RAG fits in a wider AI strategy.

If your first RAG pipeline is in production and quietly underperforming — or you are about to ship one and want a second pair of eyes on the architecture — get in touch.