Why an Off-the-Shelf Model Is Not Enough
Large language models are trained mostly on public data up to a cutoff date. They know nothing about your contracts, your product documentation, your support tickets, or last week's policy change. Ask one a question about your business and it will either refuse or, worse, invent a confident-sounding answer.
Retrieval-Augmented Generation (RAG) fixes this without retraining the model. Instead of relying on what the model memorised, you retrieve the most relevant pieces of your own data at query time and hand them to the model as context. The model then answers using that grounded information.
What RAG Actually Does
At its core, RAG is three steps wrapped around a normal model call:
- Retrieve: find the chunks of your knowledge base most relevant to the user's question.
- Augment: insert those chunks into the prompt as supporting context.
- Generate: let the model answer using the retrieved context, ideally with citations.
The result is an assistant that can answer "What is our refund window for enterprise plans?" with your actual policy text, not a plausible guess.
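The three steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the word-overlap retriever stands in for real vector search, and `call_model` is a hypothetical placeholder for whatever model client you use.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]


def augment(question: str, context: list[str]) -> str:
    """Insert the retrieved chunks into the prompt as numbered context."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question}"


def answer(question: str, chunks: list[str],
           call_model: Callable[[str], str], k: int = 2) -> str:
    """Retrieve, augment, then generate with the supplied model function."""
    prompt = augment(question, retrieve(question, chunks, k))
    return call_model(prompt)
```

Passing the model in as a function keeps the pipeline independent of any one provider and makes it easy to test with a stub.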
The RAG Pipeline, Step by Step
Ingestion and Chunking
You start by collecting source documents (PDFs, wiki pages, tickets, code) and splitting them into chunks. Chunk size matters more than people expect. Chunks that are too large bury the relevant passage in noise; chunks that are too small lose the surrounding context.
| Chunking approach | Best for | Trade-off |
|---|---|---|
| Fixed-size with overlap | General documents | Simple, but can split mid-thought |
| Sentence or paragraph based | Articles and policies | Preserves meaning, variable sizes |
| Structure aware (headings) | Technical docs and manuals | Best relevance, more engineering effort |
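As an illustration of the first row in the table, here is a minimal sketch of fixed-size chunking with overlap. It counts characters for simplicity; real pipelines usually count tokens, and the default `size` and `overlap` values are arbitrary assumptions rather than recommendations.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once this window reaches the end, so the tail is not repeated.
        if start + size >= len(text):
            break
        start += step
    return chunks
```

For example, `chunk_fixed("abcdefghij", size=4, overlap=2)` returns `['abcd', 'cdef', 'efgh', 'ghij']`. The overlap means a sentence cut at one boundary still appears whole in a neighbouring chunk, at the cost of storing some text twice.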
Embeddings and the Vector Store
Each chunk is converted into an embedding, a numeric vector that captures meaning. These vectors live in a vector database such as pgvector, Pinecone, Qdrant, or Weaviate. When a question comes in, it is embedded with the same model, and the store returns the closest matching chunks, typically ranked by cosine similarity.
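To make the similarity lookup concrete, the sketch below computes cosine similarity by hand and does a brute-force top-k search over an in-memory dictionary. A vector database does the same job with approximate indexes at much larger scale; the tiny two-dimensional vectors and the `index` layout here are invented for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings keyed by hypothetical chunk ids.
index = {
    "refund-policy": [1.0, 0.0],
    "office-hours": [0.0, 1.0],
    "billing-faq": [1.0, 1.0],
}
best = top_k([1.0, 0.1], index, k=2)
```

Because cosine similarity compares direction rather than length, a long chunk and a short one on the same topic can still score as close neighbours.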
Retrieval and Re-ranking
Pure vector search is a strong baseline, but the best systems combine it with keyword search (hybrid retrieval) and a re-ranking step that reorders the top candidates by true relevance. Adding the re-ranking step alone often produces the biggest quality jump in a RAG system.
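One common way to merge vector and keyword results is reciprocal rank fusion (RRF), shown here as one option among several and not something this article prescribes. The sketch also includes a generic re-ranking helper; its `score` argument is a hypothetical stand-in for a real relevance model such as a cross-encoder.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Reorder candidates by a relevance score and keep the best top_n."""
    ordered = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ordered[:top_n]


vector_hits = ["a", "b", "c"]   # ids ordered by vector similarity
keyword_hits = ["c", "a", "d"]  # ids ordered by keyword match
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs rank positions, so it sidesteps the problem that vector scores and keyword scores live on different scales. The constant `k = 60` is a conventional default, not a tuned value.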
Generation with Citations
Finally, the retrieved chunks are formatted into the prompt with instructions to answer only from the provided context and to cite sources. If the context does not contain the answer, the model should say so rather than guess.
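A prompt along these lines could be assembled as follows. The instruction wording, the fallback sentence, and the `source` and `text` dictionary keys are all illustrative assumptions; adapt them to your own model and data.

```python
def build_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Format retrieved chunks as numbered sources plus grounding instructions."""
    sources = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite the sources you used by number, for example [1].\n"
        "If the sources do not contain the answer, reply exactly: "
        "\"I could not find this in the provided documents.\"\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is our refund window for enterprise plans?",
    [{"source": "refund-policy.md", "text": "Sample policy text goes here."}],
)
```

Numbering the sources gives the model something short to cite and gives a reviewer a direct path from each claim back to the document it came from.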
The goal is not to make the model sound smart. The goal is to make it honest, grounded, and traceable back to a source a human can verify.
Where RAG Projects Go Wrong
- Bad chunking that breaks tables, code, or logical sections apart.
- Embedding mismatch: using one model for documents and a different family for queries.
- No evaluation: shipping on vibes instead of measuring retrieval quality.
- Stale indexes: the knowledge base changes but nobody re-indexes it.
- Over-stuffed prompts: dumping twenty chunks in and drowning the real answer.
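Stale indexes in particular are cheap to guard against. One possible approach, sketched below under the assumption that you store a content hash next to each indexed document, is to compare hashes on a schedule and re-index only what changed. The `documents` and `indexed_hashes` structures are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str],
                 indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents with stored hashes and list the work to do."""
    changed = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or modified
    )
    removed = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return {"reindex": changed, "delete": removed}
```

Deletions matter as much as updates: a chunk from a withdrawn policy that stays in the index will keep being retrieved and cited.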
How to Know It Actually Works
Treat RAG like any other system: measure it. Track faithfulness (does the answer stick to the retrieved context), answer relevance (does it address the question), and context precision and recall (did retrieval surface the right material). Build a small evaluation set of real questions with known answers and run it on every change.
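The retrieval half of that measurement can start very simply. The sketch below uses plain set-based definitions of context precision and recall; evaluation frameworks often use rank-weighted or model-judged variants, so treat these as a baseline. The `question` and `relevant` keys and the `retrieve` callable are assumptions about how your evaluation set is stored.

```python
from typing import Callable


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunk ids that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunk ids that retrieval managed to surface."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def evaluate(eval_set: list[dict],
             retrieve: Callable[[str], list[str]]) -> dict[str, float]:
    """Average precision and recall over a labelled set of questions."""
    precisions, recalls = [], []
    for case in eval_set:
        retrieved = retrieve(case["question"])
        precisions.append(context_precision(retrieved, case["relevant"]))
        recalls.append(context_recall(retrieved, case["relevant"]))
    n = len(eval_set) or 1  # avoid dividing by zero on an empty set
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}
```

Running `evaluate` in continuous integration turns "did that chunking change help?" into a number you can compare across commits. Faithfulness and answer relevance need a judge, human or model, and are not covered by this sketch.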
Security and Governance
Enterprise RAG must respect who is allowed to see what. Bake access control into retrieval so a user only ever gets chunks they are permitted to read. Strip or mask sensitive data during ingestion, log every query and the sources returned, and define a retention policy. A RAG assistant that leaks restricted documents is worse than no assistant at all.
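Two of those controls are easy to sketch: filtering chunks by the user's groups and masking sensitive values at ingestion. Everything named here (`Chunk`, `allowed_groups`, the group names, the `[EMAIL]` placeholder) is hypothetical, and the email pattern is deliberately rough; real deployments typically rely on the vector store's metadata filters and a dedicated PII detection tool.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # groups permitted to read this chunk


def permitted(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Rough pattern for illustration; it will miss some valid addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address before indexing."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


chunks = [
    Chunk("Public pricing overview.", "pricing.md", frozenset({"everyone"})),
    Chunk("Draft acquisition memo.", "memo.docx", frozenset({"executives"})),
]
visible = permitted(chunks, user_groups={"everyone", "support"})
```

Apply the permission filter before ranking, not after. Filtering the top results afterwards can leave a user with too few chunks, and it means restricted text was already loaded into the request path.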
When Not to Use RAG
RAG is the right tool when answers must be grounded in changing, proprietary knowledge. It is the wrong tool when you need the model to adopt a consistent style, format, or behaviour; that is a job for fine-tuning. Many production systems use both: fine-tuning for behaviour, RAG for knowledge.
Conclusion
RAG turns a generic model into something that actually knows your business: safely, traceably, and without an expensive training run. The hard parts are not the model call; they are ingestion quality, retrieval tuning, and honest evaluation.
Want to build an AI assistant grounded in your own data? Our AI agents practice designs and ships production RAG systems. Book a consultation to scope yours.

