RAG Works Great Until Production Shows Up

So you built a RAG system. You watched one YouTube tutorial, copy-pasted some LangChain code, threw 3 PDFs at a local Chroma instance, and it answered "What is the refund policy?" with suspicious accuracy.

You showed it to your manager. They clapped. Someone said the word "production." And now you're here, because production is not a demo with more users. Production is a demo that has been sleep-deprived, underfed, and actively lied to by your own data pipeline.

What RAG Actually Is (We Covered This in the Last Post, Too)

The core idea is simple:

Instead of asking the LLM "hey buddy, do you remember everything?", you go: "here, read this document, now answer the question."

Think of the LLM as a brilliant intern who has read every book ever written but has the long-term memory of a goldfish. RAG is you handing that intern a printout right before the client call and saying "just talk about what's on this paper."

Two pipelines. This is where tutorials ghost you:

The Indexing Pipeline (runs in the background, like your conscience):

Eat documents → chop into chunks → turn into vectors → store → pray

The Query Pipeline (runs in real-time, like your anxiety):

User asks question → embed it → find closest chunks → stuff into LLM context → hope

The math is cosine similarity — measuring the angle between two vectors. Small angle = similar meaning. You don't need to understand it deeply. Just know that "Apple the fruit" and "Apple the company" will end up weirdly close to each other, and that will bite you someday.
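If you want to see the angle math once before forgetting it forever, here is a minimal sketch in plain Python. The three-dimensional vectors and the chunk names are made up for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Score every stored chunk against the query and keep the k closest."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors, not output from any real embedding model.
toy_chunks = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "careers_page": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

results = top_k(toy_query, toy_chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

This is the brute-force version: it scores every chunk. A vector database does the same ranking with an approximate index so it doesn't have to touch every vector.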

Chunking: Where Dreams Go to Die

Chunking sounds trivial. It is not. It is the #1 place RAG systems silently fail while wearing a smile.

The Naive Approach™: "Split every 512 tokens, 128-token overlap, ship it."

What actually happens:

Chunk 1: "The refund policy allows customers to return products within 30 days. To initiate"
Chunk 2: "a return, the customer must first log into their account and navigate to the Orders"
Chunk 3: "section. IMPORTANT: Items marked as final sale cannot be returned under any"

Congrats. You've made a jigsaw puzzle where every piece looks like sky. The retriever grabs Chunk 2 for "how do I return something?" and your LLM confidently tells the customer to "navigate to Orders," with no mention of the 30-day window and no mention of final sale. Someone writes a very angry Trustpilot review.

What actually works:

Recursive splitting: Paragraphs first, then sentences, then characters as fallback. Preserves structure.
Semantic chunking: Split where cosine similarity between adjacent sentences drops sharply — that's where the topic actually changes.
Structure-aware splitting: For code, split at function/class boundaries using AST parsing. For legal docs, split at clauses. Respect the document's own structure.
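To make the first option concrete, here is a rough sketch of recursive splitting using only the standard library. It measures size in characters rather than tokens to stay dependency-free, and the function names, the separator patterns, and the sample text are all assumptions for illustration, not any particular library's implementation.

```python
import re

# Separators from coarsest to finest, each paired with the string used to re-join pieces.
_SEPARATORS = [
    (r"\n\s*\n", "\n\n"),      # paragraph breaks
    (r"(?<=[.!?])\s+", " "),   # sentence endings
]


def _merge(pieces: list[str], joiner: str, max_chars: int) -> list[str]:
    """Greedily pack adjacent pieces into chunks no longer than max_chars."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, max_chars: int = 200) -> list[str]:
    """Split on paragraphs first, then sentences, then raw characters as a fallback."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    for pattern, joiner in _SEPARATORS:
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            pieces: list[str] = []
            for part in parts:
                pieces.extend(recursive_split(part, max_chars))
            return _merge(pieces, joiner, max_chars)
    # No structure left to respect: hard cut by character count.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# Hypothetical policy text, written for this example.
sample = (
    "Returns are accepted within 30 days. Log into your account to start one.\n\n"
    "Final sale items cannot be returned."
)
chunks = recursive_split(sample, max_chars=80)
for chunk in chunks:
    print(repr(chunk))
```

With an 80-character limit the two paragraphs come out as two chunks, so the 30-day window stays attached to the return instructions instead of being sliced mid-sentence. Note that the merge step will happily pack a short paragraph together with its neighbor when both fit, which is usually what you want.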

Also: store metadata with every chunk — source doc ID, section heading, page number, timestamp, content hash. You'll need all of it later. Future you is already grateful.
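One possible shape for that per-chunk metadata, as a sketch. The `ChunkRecord` class, the `make_chunk` helper, and the `doc_id:index` ID format are illustrative assumptions; use whatever fields your vector store's metadata payload supports.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str         # hypothetical format: "<doc_id>:<position>"
    doc_id: str           # source document ID
    section_heading: str
    page_number: int
    indexed_at: str       # ISO-8601 UTC timestamp
    content_hash: str     # SHA-256 of the chunk text
    text: str


def make_chunk(doc_id: str, position: int, text: str,
               section_heading: str, page_number: int) -> ChunkRecord:
    """Bundle a chunk's text with the metadata needed to trace it later."""
    return ChunkRecord(
        chunk_id=f"{doc_id}:{position}",
        doc_id=doc_id,
        section_heading=section_heading,
        page_number=page_number,
        indexed_at=datetime.now(timezone.utc).isoformat(),
        content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
        text=text,
    )


chunk = make_chunk(
    doc_id="refund_policy",
    position=0,
    text="Final sale items cannot be returned.",
    section_heading="Exceptions",
    page_number=2,
)
payload = asdict(chunk)  # plain dict, ready to store next to the vector
print(payload["chunk_id"], payload["content_hash"][:12])
```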

Your Embedding Model: A Commitment You're Not Ready For

Choosing an embedding model is like getting a tattoo: easy to do, hard to undo, regrettable at 2am.

Every vector in your database was made by that model. Switch models later and your new query embeddings speak a completely different geometric language than your stored vectors. It's like querying in French and getting results indexed in Mandarin. You'd have to re-embed everything.

Solid choices as of mid-2026:

text-embedding-3-large (OpenAI): Great recall, API-dependent. Hope they're not down during your demo.
bge-large-en-v1.5 (BAAI): Open-source, self-hostable, shockingly competitive. The underdog pick.
e5-mistral-7b-instruct: Instruction-tuned, great for asymmetric retrieval.

Treat it like a database choice. You wouldn't casually migrate from Postgres to MongoDB mid-project. Same energy.

The Indexing Pipeline: The Part Tutorials Skip

Your documents are not static. They get updated, deleted, deprecated, and occasionally replaced by a newer version that contradicts everything in the older one. Your pipeline needs to handle this — or your RAG system will serve 2019 information with full, unshakeable confidence.

The Chunk Identity Problem

One document = many chunks. Update that document and you can't just "edit a row." You have to find all the old chunk IDs, delete them, re-chunk the new version (which might have a different number of chunks), re-embed, and re-insert.

Vector databases don't help you with any of that. They just store vectors. So you need a document registry — a plain Postgres table mapping document IDs to chunk IDs:

CREATE TABLE doc_chunk_registry (
    doc_id          TEXT NOT NULL,
    chunk_vector_id TEXT NOT NULL,
    content_hash    TEXT NOT NULL,
    version         INTEGER NOT NULL DEFAULT 1,
    status          TEXT NOT NULL DEFAULT 'active',  -- 'active' | 'superseded'
    PRIMARY KEY (doc_id, chunk_vector_id)
);

When an update arrives: look up old chunk IDs → delete from vector store → mark as superseded → re-chunk → re-embed → register new chunks. Not glamorous. Absolutely necessary.
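Here is that update flow as a small in-memory sketch. The registry is a list of dicts shaped like the table rows above and the vector store is a plain dict; `apply_document_update`, the versioned chunk ID format, and the toy chunker and embedder are all assumptions for illustration. It also assumes `content_hash` holds the hash of the whole document.

```python
import hashlib
from typing import Callable


def apply_document_update(
    doc_id: str,
    new_text: str,
    registry: list[dict],
    vector_store: dict[str, list[float]],
    chunk_fn: Callable[[str], list[str]],
    embed_fn: Callable[[str], list[float]],
) -> list[str]:
    """Replace every chunk of doc_id with chunks built from new_text."""
    rows_for_doc = [row for row in registry if row["doc_id"] == doc_id]
    next_version = max((row["version"] for row in rows_for_doc), default=0) + 1

    # Look up old chunk IDs, delete their vectors, mark the rows superseded.
    for row in rows_for_doc:
        if row["status"] == "active":
            vector_store.pop(row["chunk_vector_id"], None)
            row["status"] = "superseded"

    # Re-chunk, re-embed, and register the new chunks.
    doc_hash = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    new_ids: list[str] = []
    for position, chunk_text in enumerate(chunk_fn(new_text)):
        # Versioned IDs never collide with superseded rows under the
        # (doc_id, chunk_vector_id) primary key.
        chunk_id = f"{doc_id}:v{next_version}:{position}"
        vector_store[chunk_id] = embed_fn(chunk_text)
        registry.append({
            "doc_id": doc_id,
            "chunk_vector_id": chunk_id,
            "content_hash": doc_hash,
            "version": next_version,
            "status": "active",
        })
        new_ids.append(chunk_id)
    return new_ids


def split_paragraphs(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p.strip()]


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]  # Stand-in for a real embedding model call


registry: list[dict] = []
vector_store: dict[str, list[float]] = {}
apply_document_update("refund_policy", "One.\n\nTwo.\n\nThree.",
                      registry, vector_store, split_paragraphs, fake_embed)
new_ids = apply_document_update("refund_policy", "New one.\n\nNew two.",
                                registry, vector_store, split_paragraphs, fake_embed)
print(new_ids, sorted(vector_store))
```

The second call shrinks the document from three chunks to two, and the old vectors are gone rather than lingering as orphans. This sketch is not crash-safe: against a real database and vector store you would want the registry changes in a transaction and a plan for what happens if the process dies between the delete and the re-insert.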

Don't Re-Embed What Didn't Change

Re-embedding 100,000 documents × 10 chunks means 1 million chunks to embed. That's money, even if you batch the API calls. The fix is embarrassingly simple: content hashing.

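import hashlib

def should_reindex(doc_id: str, new_content: str, registry_db) -> bool:
    """Return True when the document is new or its content hash has changed."""
    # Assumes content_hash stores the hash of the whole document, repeated on each chunk row.
    row = registry_db.query_one(
        "SELECT content_hash FROM doc_chunk_registry WHERE doc_id = %s AND status = 'active' LIMIT 1",
        (doc_id,)
    )
    if row is None:
        return True  # New document
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    return new_hash != row["content_hash"]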

Hash the content. If it matches, skip. Most "updates" are someone changing a document title or timestamp — the actual text didn't change. Don't pay to re-embed that.

Zero-Downtime Updates (a.k.a. Don't Cut Off Your Own Head)

Your reindexing pipeline crashes halfway through 10,000 documents. Now your index is half version N, half version N+1, and nobody knows which half is which. Your retriever doesn't know. Your LLM doesn't know. Users just get weird, inconsistent answers and blame the AI.

The fix is alias-based deployment, borrowed from Elasticsearch ops:

rag_index_2026_05_14  ← fully built, fully validated
rag_index_current     ← alias, pointing at the above after swap

Build the new index completely. Validate against benchmark queries. Atomically swap the alias. Keep the old one around for rollback — because you will need to roll back. Everyone does at least once.
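The build, validate, swap, keep-for-rollback sequence can be sketched with an in-memory catalog. `IndexCatalog` and its methods are hypothetical stand-ins, and the older index name is invented for the example; a real deployment would use whatever alias mechanism your search or vector backend actually provides.

```python
from typing import Callable


class IndexCatalog:
    """In-memory stand-in for a backend that supports index aliases."""

    def __init__(self) -> None:
        self._indexes: dict[str, dict] = {}
        self._aliases: dict[str, str] = {}
        self._previous: dict[str, str] = {}

    def register(self, name: str, index: dict) -> None:
        self._indexes[name] = index

    def target_of(self, alias: str) -> str:
        return self._aliases[alias]

    def swap(self, alias: str, new_name: str,
             validate: Callable[[dict], bool]) -> None:
        """Point alias at new_name, but only if the new index passes validation."""
        if new_name not in self._indexes:
            raise KeyError(f"unknown index: {new_name}")
        if not validate(self._indexes[new_name]):
            raise ValueError(f"index {new_name} failed validation; alias unchanged")
        old_name = self._aliases.get(alias)
        if old_name is not None:
            self._previous[alias] = old_name  # Remember it for rollback
        self._aliases[alias] = new_name       # The swap is one assignment

    def rollback(self, alias: str) -> None:
        self._aliases[alias] = self._previous.pop(alias)


def has_enough_docs(index: dict) -> bool:
    # Placeholder check; in practice, run your benchmark queries here.
    return index["doc_count"] >= 100


catalog = IndexCatalog()
catalog.register("rag_index_2026_05_07", {"doc_count": 9800})   # hypothetical older build
catalog.register("rag_index_2026_05_14", {"doc_count": 10000})
catalog.register("rag_index_broken", {"doc_count": 3})          # crashed halfway

catalog.swap("rag_index_current", "rag_index_2026_05_07", has_enough_docs)
catalog.swap("rag_index_current", "rag_index_2026_05_14", has_enough_docs)
print(catalog.target_of("rag_index_current"))
```

Queries only ever see `rag_index_current`, so a half-built index is never visible: it either passes validation and takes over in one step, or the alias stays where it was.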

Observability: So You Know Why It's Lying

Here's the cruelest joke in production RAG: the system is wrong, but it doesn't look wrong. The LLM speaks in full sentences, sounds completely confident, and the answer is eloquent garbage.

Without observability, when a user complains you have two pieces of information:

The question they asked
The wrong answer they got

That's nothing. You can't tell if the retriever grabbed the wrong document, the right document but the wrong section, or the right section that the LLM just ignored. Three completely different root causes. Identical from the outside.

The Span Architecture You Need

rag_request (root)
  ├── embedding.query          ← latency, model, tokens
  ├── retrieval.vector_search  ← results, scores, filters applied
  ├── retrieval.rerank         ← how much did rankings shift?
  ├── prompt.assembly          ← total tokens sent to LLM
  └── llm.generate             ← model, output tokens, stop reason

When someone files a ticket saying "your AI told me I could return a 3-year-old mattress," you open the trace and see:

"Retrieved 3 chunks from refund_policy_v1_deprecated_2021.pdf"

Bug found. Without the trace, you spend three hours blaming the model. With it, you fix the index in 10 minutes.
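To show the shape of that span tree in code, here is a dependency-free sketch. The `Span` class and `start_span` helper are tiny stand-ins for a real tracing library, and every attribute name and value below is a placeholder chosen for illustration; in a real system each stage would record what it actually did.

```python
import time
from contextlib import contextmanager


class Span:
    """Minimal stand-in for a tracing span: a name, attributes, and children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.attributes = {}
        self.children = []
        self.duration_ms = None
        if parent is not None:
            parent.children.append(self)

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def start_span(name, parent=None):
    """Open a span, time the enclosed block, and record the duration on exit."""
    span = Span(name, parent)
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000.0


def handle_request(question: str) -> Span:
    """Walk the five stages, attaching placeholder attributes to each span."""
    with start_span("rag_request") as root:
        root.set_attribute("request.question", question)
        with start_span("embedding.query", root) as span:
            span.set_attribute("embedding.model", "example-embedding-model")
        with start_span("retrieval.vector_search", root) as span:
            span.set_attribute(
                "retrieval.source_docs",
                ["refund_policy_v1_deprecated_2021.pdf"] * 3,
            )
            span.set_attribute("retrieval.scores", [0.91, 0.88, 0.85])
        with start_span("retrieval.rerank", root) as span:
            span.set_attribute("rerank.positions_changed", 0)
        with start_span("prompt.assembly", root) as span:
            span.set_attribute("prompt.total_tokens", 1200)
        with start_span("llm.generate", root) as span:
            span.set_attribute("llm.stop_reason", "stop")
    return root


trace = handle_request("Can I return a 3-year-old mattress?")
for child in trace.children:
    print(child.name, child.attributes)
```

The useful part is the `retrieval.vector_search` span: because it records which source documents the chunks came from, the deprecated PDF shows up in the trace instead of hiding behind a confident answer.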

Close the Feedback Loop

After your main LLM answers, send the answer + retrieved context + question to a smaller, cheaper model and ask it to score two things:

Faithfulness: Did the answer stay within what the context says, or did the LLM start improvising?
Relevance: Did it actually address the question?

Log scores with the trace ID. Now you have a queryable dataset: "show me all requests with faithfulness < 0.7 in the last 7 days." Drill into those traces. You'll find one of three patterns:

Wrong document → index or pipeline problem
Right document, wrong section → chunking boundary problem
Right chunks, ignored by LLM → generation/prompt problem

You can't distinguish these without chunk-level attribution. Without it, every bad answer looks like "the AI is bad." With it, bad answers become tickets with assignees.
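A sketch of what "queryable" and "tickets with assignees" might look like. The record fields, the `low_faithfulness` and `triage` helpers, and the document and chunk names are all assumptions; `triage` in particular assumes someone has labeled which document and chunk should have answered the question.

```python
from datetime import datetime, timedelta, timezone


def low_faithfulness(records, threshold=0.7, days=7, now=None):
    """Return scored requests below the threshold within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [
        r for r in records
        if r["scored_at"] >= cutoff and r["faithfulness"] < threshold
    ]


def triage(retrieved, expected_doc_id, expected_chunk_id):
    """Map a bad answer to one of the three root-cause buckets.

    `retrieved` is a list of (doc_id, chunk_id) pairs taken from the trace.
    """
    doc_ids = {doc_id for doc_id, _ in retrieved}
    chunk_ids = {chunk_id for _, chunk_id in retrieved}
    if expected_doc_id not in doc_ids:
        return "index_or_pipeline"      # wrong document
    if expected_chunk_id not in chunk_ids:
        return "chunking_boundary"      # right document, wrong section
    return "generation_or_prompt"       # right chunks, ignored by the LLM


# Fixed "now" so the example is deterministic; all records are made up.
now = datetime(2026, 5, 14, tzinfo=timezone.utc)
records = [
    {"trace_id": "t1", "faithfulness": 0.95, "relevance": 0.90,
     "scored_at": now - timedelta(days=1)},
    {"trace_id": "t2", "faithfulness": 0.40, "relevance": 0.80,
     "scored_at": now - timedelta(days=2)},
    {"trace_id": "t3", "faithfulness": 0.30, "relevance": 0.70,
     "scored_at": now - timedelta(days=30)},
]

flagged = low_faithfulness(records, now=now)
print([r["trace_id"] for r in flagged])

bucket = triage(
    retrieved=[("refund_policy_v1_deprecated_2021.pdf", "chunk-7")],
    expected_doc_id="refund_policy_current.pdf",   # hypothetical label
    expected_chunk_id="chunk-2",                   # hypothetical label
)
print(bucket)
```

Only `t2` is flagged: `t1` scored well and `t3` is outside the 7-day window. The mattress ticket lands in the index-or-pipeline bucket because the expected document never showed up in retrieval.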

Version Your Index in Every Trace

You update your index on Tuesday. Answer quality drops on Tuesday. Without index version in your traces, you'll spend 4 hours in a war room ruling out deployments, model updates, and Mercury being in retrograde before someone checks the index.

The fix is two lines:

span.set_attribute("retrieval.index_version", current_index_alias)
span.set_attribute("retrieval.index_updated_at", index_metadata["updated_at"])

Post-incident with this: 15 minutes. Without it: 4 hours and a lot of blame.

The TL;DR (For When Your Manager Asks)

Production RAG needs exactly three things:

Indexing Pipeline Done Right: document registry, content hashing, correct deletes, alias-based zero-downtime deployment.
Retrieval That Actually Works: hybrid search (vector + BM25), cross-encoder reranking, metadata filtering.
Observability You'll Use: chunk-level attribution per request, retrieval quality metrics over time, index version in every trace.

If your system doesn't have all three, it's not a production system. It's a demo that hasn't failed yet.

Good luck out there. Store your content hashes. Version your indexes. Trace everything. And for the love of all things good, don't use fixed-size chunking on your legal documents.
