The RAG Demo Tax
Why your proof-of-concept works, your production system doesn't, and the gap has a name.
RAG demos work because they sit on small, clean corpora with one user and a generous latency budget. Production RAG is a different engineering problem. Teams discover this around month four, when accuracy quietly collapses. Here is why, and what the teams who get it right actually build.
The RAG demo is the most reliably impressive piece of AI theatre of the past three years. Point the LLM at a folder of PDFs. Ask it questions. Watch it answer with citations. Ship to the exec team. Budget approved.
Then the real corpus goes in. Hundreds of thousands of documents. Millions. A domain-specific vocabulary. Multi-tenant isolation requirements. Multiple languages. A freshness requirement. A latency SLO. And the accuracy numbers that were so impressive at 500 documents do something quietly catastrophic at 500,000.
The teams who hit this wall almost always hit it in month four or five. The symptoms are always the same. Quality metrics plateau. Users stop trusting the system. Engineering starts layering on filters, reranking, and system-prompt patches. Nothing quite fixes it. Eventually someone asks, in a retro, whether the project was viable at this scale in the first place.
The answer is yes. But not the way it was built.
What actually changes when you scale
Three things break at the same time, reinforcing each other.
Retrieval quality degrades non-linearly with corpus size. At 500 documents, even naive chunking and off-the-shelf embeddings produce reasonable recall. At 500,000, the semantic neighbourhood around any given query fills up with near-duplicates, related-but-wrong content, and chunks whose semantic match to the query is better than the chunk that actually answers it.
User queries look nothing like documents. Your users do not phrase questions the way subject matter experts wrote the source material. A dense-vector match is an estimate of whether two pieces of text are talking about the same thing, and it turns out the query “what’s our policy on working from home” and an HR policy that never uses the phrase “working from home” are, by cosine similarity, not that close.
The LLM is not a forgiving consumer of retrieved context. Stanford’s Lost in the Middle paper (Liu et al., TACL 2024) found language models consistently perform best when the relevant information sits at the start or the end of the context window, and significantly worse when it sits in the middle. That finding has held up across long-context-tuned models. Which means your top-ranked chunks get read. Your middle-ranked chunks effectively do not.
Chroma’s Context Rot research (July 2025) takes the finding further. Even a single irrelevant distractor in the retrieved set reduces accuracy below baseline. The degradation is non-uniform. Weak semantic similarity between the question and the target chunk compounds the problem.
The architectural implication is blunt. You cannot fix poor retrieval by stuffing more context in. More context makes it worse.
The chunking problem everyone underestimates
The single highest-leverage, lowest-discussed variable in production RAG is the chunking strategy.
Chroma published the cleanest analysis we have seen (July 2024) on how much chunking matters. On identical corpora, chunking strategy alone shifted retrieval recall by up to 9 percentage points. OpenAI’s default chunker, 800 tokens with 400 overlap, produced “below-average recall and the lowest scores across all other metrics.”
The best performers, on the same corpus, with the same embeddings, retriever, and k:
- ClusterSemanticChunker at 200 tokens: 87.3 per cent recall
- LLM-based chunker with GPT-4o: 91.9 per cent recall
- Recursive character splitter at 200 tokens, no overlap: 88.1 per cent recall
Same embedding model. Same retriever. Same k. Nine points of recall on the table, entirely from the way you slice the documents before you embed them.
Most production RAG deployments we encounter are running the OpenAI defaults unchanged.
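To make the variable concrete, here is a minimal recursive splitter in plain Python. It is a sketch, not any library's implementation: lengths are counted in characters for brevity, where a production chunker would count tokens with the embedding model's tokenizer.

```python
# Simplified recursive splitter: try coarse separators first, fall back
# to finer ones, greedily merging adjacent pieces back up to max_len.
# NOTE: character lengths stand in for token counts in this sketch.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_split(text: str, max_len: int = 200, seps=None) -> list[str]:
    seps = SEPARATORS if seps is None else seps
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No separator left: hard-cut as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = seps
    parts = [p for p in text.split(head) if p.strip()]
    if len(parts) <= 1:
        return recursive_split(text, max_len, rest)
    # Greedily merge adjacent parts back together, up to max_len.
    merged, buf = [], ""
    for part in parts:
        candidate = f"{buf}{head}{part}" if buf else part
        if len(candidate) <= max_len:
            buf = candidate
        else:
            if buf:
                merged.append(buf)
            buf = part
    if buf:
        merged.append(buf)
    # Any merged piece still too long falls through to the next separator.
    out = []
    for chunk in merged:
        out.extend(recursive_split(chunk, max_len, rest) if len(chunk) > max_len else [chunk])
    return out
```

The point of the sketch is the knob, not the code: max_len and the separator hierarchy are exactly the levers that moved recall by nine points in the Chroma analysis.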
Anthropic’s contextual retrieval, and why it works
The single most important piece of production RAG research in the past eighteen months is Anthropic’s Contextual Retrieval (September 2024).
The method is unromantic. For each chunk, before embedding it, use a cheap LLM (Claude 3 Haiku in the original paper) to generate a short contextual preamble, 50-100 tokens, describing where the chunk sits in the parent document and what it is about. Prepend that to the chunk. Then embed normally. And in parallel, index the same chunks with BM25 lexical search. Combine both retrieval signals. Rerank the top candidates with a cross-encoder.
The numbers on Anthropic’s benchmarks:
- Contextual embeddings alone: reduce retrieval failure rate from 5.7 per cent to 3.7 per cent. A 35 per cent relative reduction.
- Contextual embeddings plus contextual BM25: 49 per cent relative reduction, down to 2.9 per cent failure.
- Plus reranking on top-20 candidates: 67 per cent relative reduction, down to 1.9 per cent.
That is a vendor-published benchmark on Anthropic’s own evaluation set. Discount appropriately. But the shape of the result has been independently reproduced across enough production deployments that the architectural lesson is now defensible: a single-retriever dense-embedding system is a dev-grade setup. A multi-stage hybrid retriever with contextual embeddings and a reranker is the floor for production.
Why the components matter
Contextual embeddings solve the “this chunk could be from a policy, a precedent, a press release, or a footnote” problem. A chunk that says “the threshold is 90 days” is not a useful embedding target. A chunk that says “[context: Australian APRA CPS 230 operational risk obligations] the threshold is 90 days” is.
Hybrid retrieval with BM25 is the unglamorous bit of the stack that keeps the system honest. Dense embeddings are a semantic match. Lexical match is an exact-keyword match. Any query containing an account number, a product code, a proper noun, a legal-section reference, or a specific term of art is a query that BM25 handles better than any embedding model. Hybrid retrieval combines both. Skipping BM25 is a choice to fail on the exact queries your domain experts will ask first.
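One common way to merge the two ranked lists is reciprocal rank fusion, which needs only ranks, not comparable scores. The fusion method is our choice for this sketch; Anthropic's post describes combining the signals without prescribing this exact formula.

```python
def rrf_fuse(ranked_lists, k: int = 60):
    # Score each doc id by summed 1/(k + rank) across the lists.
    # k=60 is the conventional constant from the original RRF paper
    # (Cormack et al., 2009).
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]       # dense-embedding neighbours, best first
bm25 = ["d1", "d9", "d3"]        # exact-keyword matches, best first
fused = rrf_fuse([dense, bm25])  # d1 leads: both signals agree on it
```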
Reranking handles the 20-to-5 squeeze. You retrieve twenty candidates from the hybrid retriever cheaply. A cross-encoder reranker reads each candidate against the query and scores them. You pass the top five to the LLM. Pinecone’s production guidance is explicit on why: applying a cross-encoder to a full 40M-record corpus “would take 50-plus hours per query on a V100.” Applying it to twenty pre-retrieved candidates returns in under 100 milliseconds. Rerankers are architecturally second-stage, never first-stage.
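The squeeze itself is a few lines once you have a scorer. Here `score` is a hypothetical word-overlap stand-in; in production it would be a cross-encoder call (a sentence-transformers CrossEncoder, or a hosted reranker such as Voyage's) reading query and candidate jointly.

```python
def score(query: str, passage: str) -> float:
    # Hypothetical stand-in for a cross-encoder: fraction of query
    # words present in the passage. A real scorer reads both texts
    # jointly through a transformer.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Cheap retriever casts wide (~20 candidates); the expensive scorer
    # reads each one and keeps only the best few for the LLM.
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_n]
```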
Where dense-vector retrieval structurally fails
Hybrid retrieval plus reranking fixes most RAG quality problems. It does not fix all of them.
Two classes of query defeat vector retrieval fundamentally, regardless of how clever your embeddings are.
Global sensemaking queries. “Summarise the key themes across our last three years of customer interviews.” Vector retrieval is a pinpoint tool. It finds the k chunks most semantically similar to the query. The query “themes across three years” does not have a semantic pinpoint. It has a corpus-wide aggregation.
Microsoft’s GraphRAG paper (April 2024, revised February 2025) frames this precisely. The companion blog post uses language worth stealing: “Baseline RAG struggles to connect the dots” and “Baseline RAG struggles with queries that require aggregation of information across the dataset.” The fix they propose is to pre-build an entity knowledge graph over the corpus, pre-compute community summaries, and route aggregation queries through the graph rather than the vector index.
Queries that need hierarchical context. Stanford’s RAPTOR (January 2024) addresses this with recursive summarisation, building a tree of progressively more abstracted summaries over the document set. GPT-4 on QuALITY, with RAPTOR-structured retrieval, scored 20 absolute accuracy points higher than the prior best.
The architectural lesson is the same across both: for the queries vector retrieval structurally cannot serve, you need a different index, not a better embedding model.
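The routing layer this implies can be sketched crudely. A real router is usually a small LLM classifier; the keyword heuristic below is a hypothetical stand-in that only shows the shape of the decision.

```python
# Route aggregation-style queries to a pre-built structural index
# (GraphRAG community summaries, a RAPTOR tree) and pinpoint queries
# to the hybrid vector index. Cue list is illustrative only.
AGGREGATION_CUES = ("themes", "summarise", "summarize", "across",
                    "overall", "trends")

def route(query: str) -> str:
    q = query.lower()
    if any(cue in q for cue in AGGREGATION_CUES):
        return "graph_index"   # corpus-wide aggregation path
    return "vector_index"      # hybrid dense + BM25 path
```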
The long-context fantasy
A recurring question we get from engineering leadership: “why do we need any of this? Can we not just use a two-million-token context window and stuff the whole corpus in?”
Databricks published the definitive takedown of this argument in August 2024. On their benchmark, average retrieval recall rose from 0.468 at 2,000 tokens to 0.95 at 125,000 tokens. Promising. But generation quality degraded for most models past a threshold. Llama-3.1-405B fell off a cliff past 32k. GPT-4 past 64k. Claude-3-Sonnet’s refusal rate climbed from 3.7 per cent at 16k to 49.5 per cent at 64k. DBRX started summarising instead of answering. Mixtral produced nonsense.
Only GPT-4o and Claude-3.5-Sonnet maintained consistent quality across long contexts. And even those models pay for the privilege in latency and cost.
The conclusion is blunt. Long context is not a substitute for retrieval. It is a complement. You still need to retrieve the right chunks. You just have more room to put them once you have. If your retrieval is bad, more context makes it worse, because you are now feeding the model more noise.
The cost curve nobody budgets for
Here is a number worth pausing on. Fifty million documents, embedded with OpenAI’s text-embedding-3-large at 3,072 dimensions, stored at float32, is:
50,000,000 × 3,072 × 4 bytes = 614.4 GB of raw vectors.
Before the HNSW graph overhead (typically 1.5 to 2 times raw). Before replication. Before metadata. Before any queryable payload.
At Pinecone Standard pricing (USD 0.33 per GB per month for storage), that is about USD 200 per month on storage alone, pre-index overhead. With realistic index overhead and sensible replication, USD 500 to 1,000 per month is the floor. Before a single query.
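The arithmetic above, as a worked calculation. Prices are the figures quoted in this piece; reconfirm against current rate cards.

```python
docs = 50_000_000
dims = 3_072              # text-embedding-3-large output dimensions
bytes_per_float = 4       # float32

raw_gb = docs * dims * bytes_per_float / 1e9
print(raw_gb)             # 614.4 -- GB of raw vectors

storage_usd = raw_gb * 0.33     # quoted per-GB-month storage price
print(round(storage_usd))       # 203 -- USD/month, pre-index overhead

# HNSW graph overhead of 1.5-2x raw pushes the stored footprint to
# roughly 920-1230 GB before replication or metadata.
low, high = raw_gb * 1.5, raw_gb * 2.0
```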
Reads and writes are priced separately. Pinecone Standard bills USD 4-4.50 per million write units and USD 16-18 per million read units, with a USD 50 per month floor. Enterprise tier is higher.
Embeddings are their own cost line. Voyage AI prices voyage-4-large at USD 0.12 per million tokens and voyage-context-3 at USD 0.18 per million. OpenAI’s text-embedding-3-large has been priced around USD 0.13 per million (standard) or USD 0.065 per million (batch), though OpenAI’s pricing has shifted multiple times, so reconfirm at contract.
Rerankers add a third cost line. Voyage’s rerank-2.5 is USD 0.05 per million tokens of input. At production query volumes, reranking cost usually dominates embedding cost on the steady state.
None of this is prohibitive. All of it is meaningful. The point is that RAG-at-scale cost modelling has three moving lines (storage, compute per query, embedding and reranking) that finance-grade planning treats explicitly. Most proof-of-concept budgets treat them as rounding.
Evaluation, or why “it feels better” is not a metric
We have seen more production RAG projects killed by the absence of an evaluation harness than by any other single cause. Not because the systems are bad. Because nobody can prove they are good, and nobody can tell if a change made them better or worse.
The tooling here is mature now. There is no excuse for vibes-based evaluation in 2026.
- Ragas is the most widely adopted framework. The repo is actively maintained, at v0.4.3 as of January 2026 across 89 releases. Core metrics: faithfulness (proportion of generated claims supported by retrieved context), answer relevancy (cosine similarity between question embedding and back-inferred questions from the answer), context relevancy, context recall.
- ARES (Stanford/Databricks, November 2023) evaluates context relevance, answer faithfulness, and answer relevance with prediction-powered inference. Key property: judges remain accurate under domain shift, which matters for enterprise deployments with evolving corpora.
- BEIR is the standard heterogeneous retrieval benchmark, 18 datasets across domains. Key finding worth pinning to the wall: BM25 is a “robust baseline” that dense retrievers often fail to beat out-of-domain. Which is why hybrid retrieval keeps winning.
- MTEB is the leaderboard for embedding models, covering retrieval, classification, clustering, semantic similarity, reranking, and bitext mining. The MMTEB 2025 extension adds multilingual breadth.
Databricks’ own best-practices guide for LLM-judge RAG evaluation is also directly cite-worthy: LLM judges agree with human graders over 80 per cent of the time on exact grade, over 95 per cent within one grade on a 0-3 scale. GPT-3.5 with few-shot examples gives a roughly 10x cost reduction and 3x speedup over GPT-4 as judge.
The key line from the Databricks post is worth tattooing on the engineering manager’s monitor: “Evaluation benchmarks cannot transfer between different applications.” Generic BEIR results will not tell you whether your RAG system is good at your actual task. You have to build your own eval set. A hundred hand-labelled queries with known good answers is a start. Three hundred is a working harness. A thousand is a system you can ship against.
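The minimum viable harness is shorter than most teams expect: a hand-labelled set of queries with known good chunk ids, and recall@k over whatever retriever you are testing. `retrieve` below is a hypothetical stand-in for your retrieval stack.

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    # eval_set: (query, gold_chunk_ids) pairs, hand-labelled.
    # retrieve: any callable mapping a query to ranked chunk ids.
    hits = 0
    for query, gold_ids in eval_set:
        retrieved = set(retrieve(query)[:k])
        if retrieved & set(gold_ids):   # did any gold chunk surface?
            hits += 1
    return hits / len(eval_set)

# Toy run with a stubbed retriever.
eval_set = [("wfh policy", ["c12"]), ("cps 230 threshold", ["c7", "c8"])]
stub = {"wfh policy": ["c12", "c3"], "cps 230 threshold": ["c1", "c2"]}
print(recall_at_k(eval_set, lambda q: stub[q]))   # 0.5
```

Run it on every retrieval-layer change and the "is this better or worse" question stops being a matter of opinion.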
The multi-tenant problem
For any enterprise RAG system that serves more than one customer, more than one business unit, or more than one security classification, the multi-tenant index design is the load-bearing architectural decision.
Two structural patterns, both documented by the vendors themselves.
Qdrant recommends a single collection with group_id payload isolation plus custom sharding. The explicit trade-off they call out: “global requests without the group_id filter will be slower since they will necessitate scanning all groups.” They recommend setting m=0 on the global HNSW index to force per-tenant indexes when isolation dominates.
Weaviate offers five tenant states (Active, Inactive, Offloaded, Offloading, Onloading) specifically designed for cost management on long-tail tenants. Their benchmarked operational range is 18,000-19,000 active tenants per node. A hard constraint to note: “cross-tenant references aren’t supported.” If your application requires one source of truth across customers, you are building a federation layer on top.
Both vendor docs are good engineering reference. Neither can tell you which pattern fits your regulatory perimeter. That is a design decision, not a vendor-selection decision.
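The payload-isolation pattern, reduced to its essentials in plain Python. This is a conceptual sketch, not vendor code: in Qdrant the filter is the group_id payload condition, in Weaviate a tenant-scoped collection. The point is that the tenant boundary is enforced in the index layer, before scoring.

```python
def search(points, query_vec, tenant_id, k=5):
    # Filter FIRST, then score. The tenant boundary lives in the index
    # layer, never left to the application prompt.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    candidates = [p for p in points if p["tenant"] == tenant_id]
    return sorted(candidates,
                  key=lambda p: dot(p["vec"], query_vec),
                  reverse=True)[:k]

points = [
    {"id": 1, "tenant": "acme", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "globex", "vec": [1.0, 0.0]},
]
hits = search(points, [1.0, 0.0], tenant_id="acme")
assert all(p["tenant"] == "acme" for p in hits)  # globex never surfaces
```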
Governance: the line nobody draws early enough
The final production-RAG consideration is the one that gets discovered last and bites hardest.
Retrieved context can leak. Across tenants. Across classifications. Into hallucinations the user treats as authoritative. Into system logs that get retained longer than the original document’s retention policy allowed.
OWASP’s LLM08 Vector and Embedding Weaknesses in the 2025 Top 10 names this directly: “In multi-tenant environments where multiple classes of users or applications share the same vector database, there’s a risk of context leakage between users or queries.” And notes the embedding-inversion risk: “attackers can exploit vulnerabilities to invert embeddings and recover significant amounts of source information.” Mitigation recommendation: a permission-aware vector database.
LLM02 Sensitive Information Disclosure covers the same ground from a different angle. PII, financial data, health data, credentials, and privileged legal content can all leak via retrieved context and LLM responses.
For Australian operators, the OAIC’s AI guidance, updated January 2025, makes clear that the Privacy Act 1988 and APPs apply to AI inputs and outputs, including AI-generated content containing personal data. APP 3 (collection), APP 6 (use and disclosure), APP 11 (security) all apply to the retrieved-context pipeline.
For any operator with European exposure, the EU AI Act timeline is running. Prohibited practices and AI-literacy obligations since 2 February 2025. GPAI governance obligations since 2 August 2025. Full application 2 August 2026. High-risk system traceability requirements land in Article 10 (data governance) and Article 13 (transparency).
The architecture that actually works
If the above seems like a lot: it is. Production RAG is not a weekend project. It is a sustained engineering program. The good news is the pattern is stable now.
Ingestion. Curated, versioned, with explicit retention and classification metadata. Not “drop in a bucket and point the pipeline at it.”
Chunking. Semantic, not character-based. 150-300 token chunks, contextually preambled per the Anthropic Contextual Retrieval method.
Indexing. Hybrid, both dense embeddings and BM25. Tenant-isolated by design. Versioned so embedding-model migrations are planned events, not emergencies.
Retrieval. Multi-stage. Cheap first-stage hybrid retrieval to ~20 candidates. Cross-encoder reranking to the top 5. Structural indexes (entity graph, RAPTOR tree) for aggregation queries.
Generation. With explicit handling for “no good answer in the retrieved context” as a first-class outcome, not a fallback. The model should say “I don’t know from this corpus” when that is the correct answer.
Evaluation. A domain-specific test set the engineering team owns. Automated runs on every retrieval-layer change. Faithfulness, answer relevancy, context relevancy, context recall. Ragas or ARES or a homegrown harness, but something that makes “is this system getting better or worse” a question you can actually answer.
Governance. Per-tenant ACL on retrieval. Audit log of what was retrieved for what query. PII redaction where retention policy requires it. A clear threat model covering the multi-tenant leakage surface.
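The stages above, as a pipeline skeleton. Every callable here is a hypothetical stand-in; the point is the shape: each stage is a seam you can evaluate and swap independently, and "no good answer" is a first-class return path, not an exception.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RagPipeline:
    retrieve: Callable[[str], list[str]]           # hybrid, ~20 candidates
    rerank: Callable[[str, list[str]], list[str]]  # cross-encoder, top 5
    generate: Callable[[str, list[str]], str]      # LLM over final context

    def answer(self, query: str) -> str:
        candidates = self.retrieve(query)
        context = self.rerank(query, candidates)
        if not context:
            # First-class outcome, not a fallback.
            return "No good answer in this corpus."
        return self.generate(query, context)
```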
The bottom line
Your proof-of-concept worked because the problem was small. Production is not small.
The difference between the teams who ship RAG systems that work and the teams who do not is not compute budget. It is not model choice. It is not the vector database. It is the willingness to treat retrieval quality, eval-driven development, tenant isolation, and governance as first-class engineering concerns, alongside the model-layer work.
A rough heuristic from the ones we have seen ship well: 20 per cent of the engineering effort goes on the model-layer pieces (prompting, generation, tool use, reranking). Sixty per cent goes on retrieval and evaluation. Twenty per cent goes on governance and ops. Teams that invert those ratios, spending 60 per cent on prompts and 20 per cent on retrieval, ship the demo. They do not ship the system.
The demo tax is real. It is paid the first time a user asks a question that should work and the system confidently answers something plausible and wrong, and the user quietly stops trusting the tool. You do not recover that trust by adding a bigger model.
You recover it by building the system the right way the first time.
Written by Inference AI Consulting. If this problem is one you’re in the middle of, we’d rather hear about it than write about it.