Designing a RAG pipeline that survives production
A RAG demo is a weekend. A RAG system you can trust is a different project, and most of its problems are not in the generation step.
Where it actually breaks
- Retrieval, not generation. If the right chunk never makes it into context, no prompt can save the answer. Chunking, embeddings, and ranking are the real work.
- Grounding. The model must answer from the retrieved context, not from its priors. Without enforcement, it confidently fills gaps.
- Verification. For anything factual, a second pass that checks claims against sources is not optional.
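
To make the verification bullet concrete, here is a minimal sketch of a second pass that flags claims no source supports. Everything in it is an assumption made for illustration: the function names, the sentence-per-claim split, and the 0.8 threshold are hypothetical, and plain token overlap is a crude stand-in for what would more likely be an entailment model or a second model call.

```python
import re


def split_claims(answer: str) -> list[str]:
    """Naively treat each sentence of the answer as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(claim: str, source: str) -> float:
    """Fraction of the claim's tokens that also appear in the source."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source)) / len(claim_tokens)


def verify(answer: str, sources: list[str], threshold: float = 0.8) -> list[str]:
    """Return the claims that no single source supports above the threshold."""
    unsupported = []
    for claim in split_claims(answer):
        best = max((support_score(claim, s) for s in sources), default=0.0)
        if best < threshold:
            unsupported.append(claim)
    return unsupported


# Hypothetical retrieved context and model answer.
sources = ["The warranty covers parts and labor for 24 months from purchase."]
answer = (
    "The warranty covers parts and labor for 24 months. "
    "It also covers accidental damage."
)
unsupported = verify(answer, sources)  # second sentence has no backing source
```

The useful part is the shape, not the scoring: the check runs per claim, against the retrieved sources only, and returns something a later stage can act on.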
A shape that holds up
query → retrieve (hybrid) → rerank → ground → generate → verify → cite
                                                           │
                                                         fail → escalate
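
One way to wire that shape together, as a sketch rather than a reference design: each stage is a named function over a shared state, and a failure in any stage routes to an escalation handler instead of returning an answer. The `State`, `StageFailure`, and `run_pipeline` names are hypothetical, the toy stages stand in for real retrieval and generation, and rerank, ground, and cite are left out for brevity. The toy run assumes the failure comes from the verify stage.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    query: str
    chunks: list[str] = field(default_factory=list)
    answer: str = ""
    trace: list[str] = field(default_factory=list)  # stages run, in order


class StageFailure(Exception):
    """Raised by a stage that cannot vouch for its output."""


Stage = Callable[[State], State]


def run_pipeline(
    state: State,
    stages: list[tuple[str, Stage]],
    escalate: Callable[[State, str], State],
) -> State:
    """Run stages in order; on the first failure, hand off to escalate."""
    for name, stage in stages:
        try:
            state = stage(state)
            state.trace.append(name)
        except StageFailure as exc:
            state.trace.append(f"{name}:failed")
            return escalate(state, str(exc))
    return state


# Toy corpus and stages, for illustration only.
CORPUS = ["Invoices are due within 30 days.", "Refunds take 5 business days."]


def retrieve(state: State) -> State:
    # Keyword match standing in for hybrid retrieval.
    terms = set(state.query.lower().split())
    state.chunks = [c for c in CORPUS if terms & set(c.lower().split())]
    return state


def generate(state: State) -> State:
    # Stand-in for the model call: echo the top chunk, or nothing.
    state.answer = state.chunks[0] if state.chunks else ""
    return state


def verify(state: State) -> State:
    if not state.answer or state.answer not in state.chunks:
        raise StageFailure("answer is not backed by a retrieved chunk")
    return state


def escalate(state: State, reason: str) -> State:
    state.answer = f"ESCALATED: {reason}"
    return state


STAGES = [("retrieve", retrieve), ("generate", generate), ("verify", verify)]

ok = run_pipeline(State("refunds timeline"), STAGES, escalate)
bad = run_pipeline(State("shipping cost"), STAGES, escalate)
```

Because the trace records which stage ran or failed, an escalated request carries its own explanation of where the pipeline gave up.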
Each stage is observable and independently testable. When quality drops, you can tell which stage moved — retrieval recall, rerank precision, or grounding — instead of guessing at the prompt.
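
The per-stage numbers can be sketched with two standard information-retrieval metrics, assuming a small labeled evaluation set exists in which each query has known relevant chunk ids. The chunk ids and the choices of k below are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunk ids that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k ranked chunk ids that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


# One labeled query from a hypothetical evaluation set.
relevant = {"c2", "c7"}
retrieved = ["c9", "c2", "c4", "c7", "c1"]  # retriever output, 5 candidates
reranked = ["c2", "c7", "c9", "c4", "c1"]   # same candidates after reranking

# Retrieval stage: did the right chunks make it into the candidate set?
retrieval_recall = recall_at_k(retrieved, relevant, k=5)

# Rerank stage: did the right chunks reach the top of the list?
precision_before = precision_at_k(retrieved, relevant, k=2)
precision_after = precision_at_k(reranked, relevant, k=2)
```

Tracked separately over time, a drop in `retrieval_recall` points at chunking or embeddings, while a drop in rerank precision with recall unchanged points at the reranker.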
The lesson
Treat RAG as an information-retrieval system with a language model attached, not a language model with documents attached. The ordering of those words is the whole difference in reliability.