Designing a RAG pipeline that survives production

A RAG demo is a weekend. A RAG system you can trust is a different project, and most of its problems are not in the generation step.

Where it actually breaks

Retrieval, not generation. If the right chunk never makes it into context, no prompt can save the answer. Chunking, embeddings, and ranking are the real work.
Grounding. The model must answer from the retrieved context, not from its priors. Without enforcement, it fills gaps with confident, unsupported claims.
Verification. For anything factual, a second pass that checks claims against sources is not optional.
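
To make the verification pass concrete, here is a minimal sketch. It assumes a deliberately naive check: a sentence counts as supported when most of its words appear in a single retrieved source. The function names and the 0.6 threshold are illustrative, not from any library, and a real system would more likely use an entailment model or a second LLM call in place of token overlap.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, dropping words of two characters or fewer."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 2}


def unsupported_claims(answer: str, sources: list[str], min_overlap: float = 0.6) -> list[str]:
    """Return the answer sentences that no single source covers well enough.

    Assumption: lexical overlap is a crude stand-in for a real entailment check.
    """
    source_tokens = [_tokens(s) for s in sources]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = _tokens(sentence)
        if not claim:
            continue
        # Best coverage of this sentence by any one source (0.0 if there are no sources).
        best = max((len(claim & src) / len(claim) for src in source_tokens), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


sources = ["The refund window is 30 days from the delivery date."]
answer = "The refund window is 30 days. Refunds are processed by carrier pigeon."
print(unsupported_claims(answer, sources))
```

Any non-empty result is a signal to block the answer or hand it off, rather than ship it with a caveat.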

A shape that holds up

query → retrieve (hybrid) → rerank → ground → generate → verify → cite
                                                           │
                                                      fail → escalate
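
The flow above can be sketched as plain functions wired together. Everything here is an assumption made for illustration: the stage callables (keyword_search, vector_search, rerank, generate, verify) are hypothetical placeholders you would supply, "hybrid" is read as fusing a keyword ranking with a vector ranking via reciprocal rank fusion, and grounding is modeled simply by handing the generator nothing but the reranked context.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Search = Callable[[str], list[str]]  # query -> ranked chunk ids


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; each list adds 1 / (k + rank) to a chunk's score.

    k = 60 is a commonly used default; ties are broken by id for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))


@dataclass
class Result:
    answer: Optional[str]
    citations: list[str] = field(default_factory=list)
    escalated: bool = False


def run_pipeline(
    query: str,
    keyword_search: Search,
    vector_search: Search,
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str, list[str]], bool],
    top_k: int = 3,
) -> Result:
    # retrieve (hybrid): fuse the two rankings into one candidate list
    candidates = reciprocal_rank_fusion([keyword_search(query), vector_search(query)])
    # rerank, then keep only the best few chunks as context
    context = rerank(query, candidates)[:top_k]
    # ground + generate: the generator sees only the selected context
    answer = generate(query, context)
    # verify: on failure, escalate instead of returning an unchecked answer
    if not verify(answer, context):
        return Result(answer=None, escalated=True)
    # cite: return the chunks the answer was built from
    return Result(answer=answer, citations=context)
```

Because each stage is an ordinary callable, any one of them can be swapped for a stub in a test without touching the rest.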

Each stage is observable and independently testable. When quality drops, you can tell which stage moved — retrieval recall, rerank precision, or grounding — instead of guessing at the prompt.
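
As a rough sketch of what "which stage moved" can look like in practice, the two helpers below compute recall@k for the retriever and precision@k for the reranker against a hand-labeled set of relevant chunk ids. The chunk ids and labels are made up for the example; the metric definitions are the standard ones.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k ranked chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in ranked[:k] if chunk_id in relevant) / k


# Hypothetical labeled example: two chunks are known to answer the query.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d3", "d4"}

print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(precision_at_k(retrieved, relevant, k=4))  # 0.25
```

Tracked per stage over a fixed evaluation set, a drop in recall@k points at chunking or embeddings, while steady recall with falling precision@k points at the reranker.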

The lesson

Treat RAG as an information-retrieval system with a language model attached, not a language model with documents attached. The ordering of those words is the whole difference in reliability.