Retrieval Augmented Generation
What
RAG = Retrieve relevant documents, then generate an answer using them as context. Grounds LLM responses in actual data, addressing the core limitations of pure parametric models: hallucination, stale knowledge, and untraceable reasoning.
The core insight: LLMs store knowledge parametrically (in weights) but can’t access up-to-date or private information. RAG adds a non-parametric memory layer — an external knowledge base — that the model queries at inference time.
Why RAG
| Problem with pure LLMs | What RAG adds |
|---|---|
| Hallucinates facts | Retrieved docs provide grounded evidence |
| Knowledge cutoff (training data frozen at a past date) | Fresh documents from any date |
| Can’t access private data | Knowledge base can contain anything |
| No citation of sources | Retrieved docs enable citation |
| Expensive to update weights | Knowledge base updated without retraining |
The Three RAG Paradigms
Naive RAG (2020 original)
The basic retrieve-then-generate pipeline:
Query → Embed → Top-k retrieval → Combine with prompt → Generate
Limitations: semantic similarity doesn’t always match relevance; retrieved docs may contain noise; single retrieval pass.
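The whole naive loop fits in a few lines. A minimal sketch, using a toy bag-of-words "embedding" and cosine similarity so it stays self-contained — in practice `embed()` would call a real sentence-embedding model and generation would call an LLM:

```python
# Naive RAG: embed query, score documents, stuff top-k into a prompt.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a sentence-embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Gradient descent minimizes a loss function iteratively.",
    "Tokyo is the capital of Japan.",
    "Learning rates control gradient descent step sizes.",
]
top = retrieve("how does gradient descent work", docs)
prompt = "Answer using these documents:\n" + "\n".join(top)
```

Every stage of this loop is what Advanced RAG then improves on.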
Advanced RAG (2021-2023)
Improvements at each stage:
- Pre-retrieval: Query expansion, reformulation, query decomposition
- Post-retrieval: Reranking, context compression, selective context
- Retrieval: Hybrid search (dense + sparse), iterative retrieval
Modular RAG (2023+)
RAG becomes a toolkit of interchangeable components:
- Specialized retrievers (web search, knowledge graphs, structured data)
- Multiple retrieval passes per query (self-RAG, reactive retrieval)
- Routing between retrieval and direct generation
- Graph-based knowledge representation
The Retrieval Pipeline
Step 1: Document Ingestion
Raw documents → Chunking → Embedding → Indexing → Vector DB
Chunking strategies significantly affect retrieval quality:
| Strategy | How | Best for |
|---|---|---|
| Fixed-size (e.g., 512 tokens) | Split by token count | Simple, consistent |
| Semantic (sentence/paragraph) | Split at natural boundaries | Coherent content |
| Recursive | Hierarchical splitting | Complex documents |
| Small-to-large | Store fine-grained chunks, retrieve parent | Dense information |
Chunk overlap: overlap between consecutive chunks (e.g., 20% overlap) prevents cutting relevant context across boundaries.
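The fixed-size strategy with overlap can be sketched directly. Whitespace tokens stand in for model tokens here; a real pipeline would use the embedding model's own tokenizer:

```python
# Fixed-size chunking with fractional overlap between consecutive windows.
def chunk(text: str, size: int = 512, overlap: float = 0.2) -> list[str]:
    tokens = text.split()
    step = max(1, int(size * (1 - overlap)))  # advance by size minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):       # last window reached the end
            break
    return chunks

parts = chunk("one two three four five six seven eight nine ten",
              size=4, overlap=0.25)
# windows of 4 tokens stepping by 3: each chunk repeats one token of the previous
```

With 25% overlap, every boundary token appears in two chunks, so a sentence cut at a window edge is still retrievable whole from one of them.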
Step 2: Embedding
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Or for better quality: "BAAI/bge-large-en-v1.5" or "e5-large-v2"
doc_embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)
```

Embedding model selection matters enormously:
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Good | Very fast |
| bge-large-en-v1.5 | 1024 | Excellent | Medium |
| e5-large-v2 | 1024 | Excellent | Medium |
| OpenAI text-embedding-3-large | 3072 | SOTA | API cost |
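One practical detail regardless of model choice: most of these models are meant to be compared by cosine similarity (and e5-family models additionally expect "query: "/"passage: " prefixes on inputs). L2-normalizing vectors makes inner product equal cosine, which also lets an inner-product index serve as a cosine index. A small sketch with toy 2-D vectors:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Divide each row by its L2 norm, guarding against zero vectors
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(vecs)
cos = unit @ unit.T   # pairwise cosine similarities
```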
Step 3: Vector Indexing
```python
import faiss

# Flat index — exact search, good for <1M vectors
index = faiss.IndexFlatL2(384)

# IVF (inverted file) — approximate, scales to billions
quantizer = faiss.IndexFlatL2(384)
index = faiss.IndexIVFFlat(quantizer, 384, 100)  # nlist=100 clusters
index.train(doc_embeddings)
index.add(doc_embeddings)

# HNSW — graph-based, excellent recall/speed tradeoff
index = faiss.IndexHNSWFlat(384, 32)
index.add(doc_embeddings)
```

Production vector DBs: Pinecone, Weaviate, Qdrant, Milvus, Chroma (for prototyping). These add metadata filtering, hybrid search, horizontal scaling, and (for the hosted options) a managed service.
Step 4: Retrieval and Reranking
```python
# Simple retrieval
query = "What is gradient descent?"
query_vec = model.encode([query])
distances, indices = index.search(query_vec, k=5)
retrieved = [documents[i] for i in indices[0]]

# Reranking with cross-encoder (much more accurate but slower)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in retrieved]
scores = reranker.predict(pairs)
ranked = sorted(zip(retrieved, scores), key=lambda x: -x[1])
```

Why reranking: bi-encoder retrieval (fast, embedding-based) maximizes semantic similarity. Cross-encoder reranking (slower, roughly 50-100 ms per query) scores relevance more accurately by jointly encoding the query and document.
Step 5: Generation with Context
```python
context = "\n\n".join(doc for doc, score in ranked[:3])  # top reranked docs
prompt = f"""Answer the question based on the retrieved documents.

Question: {query}

Documents:
{context}

Answer (cite the document numbers like [1], [2]):"""
response = llm.generate(prompt)  # any LLM client
```

Hybrid Search
Combining dense (embedding-based) and sparse (keyword-based / BM25) retrieval covers both semantic and exact matches:
```python
# Dense: semantic similarity via embeddings
dense_results = embedding_model.search(query, top_k=20)

# Sparse: keyword matching via BM25
sparse_results = bm25_search(query, documents, top_k=20)

# Reciprocal Rank Fusion (RRF): combine rankings
def rrf_fusion(dense_ranks, sparse_ranks, k=60):
    scores = {}
    for rank, (doc_id, _) in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, (doc_id, _) in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])
```

Iterative and Adaptive Retrieval
Self-RAG (Asai et al., 2023)
The model learns to decide when to retrieve, using special tokens:
- [Retrieval] — decide to retrieve
- [No Retrieval] — no retrieval needed
- [Relevant] — retrieved document is relevant
- [Irrelevant] — skip retrieved document
- [Hallucination] — model is hallucinating
Adaptive RAG (2024)
Routing between strategies based on query type:
- Factual questions → retrieve
- Code generation → no retrieval (internal knowledge)
- Recent events → web search + RAG
- Local documents → RAG only
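A router like this is often just a classifier in front of the pipeline. Toy sketch below — real systems typically use an LLM or a trained classifier to label the query; the keyword rules here are only illustrative stand-ins:

```python
# Route each query to a retrieval strategy based on (toy) query-type signals.
def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("today", "latest", "this week", "news")):
        return "web_search_rag"      # recent events: web search + RAG
    if any(w in q for w in ("write a function", "implement", "code")):
        return "no_retrieval"        # code generation: internal knowledge
    return "local_rag"               # default: local documents
```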
Evaluation
RAG evaluation has three dimensions (from RAGAS, ARES, TruLens):
| Dimension | What it measures | Metrics |
|---|---|---|
| Retrieval | Are the right documents retrieved? | Precision@k, Recall@k, MRR, NDCG |
| Generation | Is the answer accurate and relevant? | Faithfulness, Answer Relevancy, Context Precision |
| End-to-end | Does RAG improve over no-RAG? | Human preference, Task-specific metrics |
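The retrieval-side metrics need no framework — given a ranked list of retrieved doc ids and a gold set of relevant ids, they are a few lines each:

```python
# Precision@k, Recall@k, and MRR over one query's ranked results.
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank          # reciprocal rank of first hit
    return 0.0

retrieved = ["d3", "d1", "d7"]       # ranked ids from the retriever
relevant = {"d1", "d9"}              # gold labels
```

(In a benchmark these are averaged over all queries.)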
```python
# RAGAS (key metrics)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
```

Production Considerations
Retrieval-augmented generation for your data
```python
# Typical stack: LangChain + Chroma + OpenAI
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

loader = PDFPlumberLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(loader.load())
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
```

Handling long documents
For documents exceeding context length, use:
- Parent-document retrieval: index small chunks, but pass the enclosing parent section to the generator
- Hierarchical retrieval: coarse first (sections), then fine (paragraphs)
- Summarization RAG: retrieve documents, summarize, retrieve from summaries
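The parent-document (small-to-large) pattern reduces to a lookup from chunk id to parent id plus deduplication. A minimal sketch, assuming the chunks have already been scored by the retriever:

```python
# Score fine-grained chunks, hand the generator the enclosing parent sections.
def retrieve_parents(scored_chunk_ids, chunk_to_parent, parents, k=2):
    """scored_chunk_ids: chunk ids already sorted by retrieval score."""
    seen, out = set(), []
    for cid in scored_chunk_ids:
        pid = chunk_to_parent[cid]
        if pid not in seen:          # dedupe chunks from the same section
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == k:
            break
    return out

chunk_to_parent = {"c1": "s1", "c2": "s1", "c3": "s2"}
parents = {"s1": "Section 1 full text", "s2": "Section 2 full text"}
ctx = retrieve_parents(["c2", "c1", "c3"], chunk_to_parent, parents)
```

This keeps retrieval precise (small chunks embed cleanly) while giving the generator enough surrounding context.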
Latency vs quality tradeoff
- Streaming generation starts after retrieval completes
- Async retrieval + generation overlap reduces latency
- For sub-100ms retrieval: use approximate nearest neighbor (HNSW/IVF) + caching
- For best quality: add a cross-encoder reranking stage after retrieval
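Caching is the cheapest of these wins: memoizing query embeddings lets repeated or popular queries skip the encoder entirely. Sketch below, where `encode_uncached` stands in for the real embedding-model call:

```python
# Memoize query embeddings so repeat queries never hit the encoder.
from functools import lru_cache

calls = 0

def encode_uncached(text: str) -> tuple:
    global calls
    calls += 1                       # count real encoder invocations
    return tuple(float(len(w)) for w in text.split())  # placeholder vector

@lru_cache(maxsize=10_000)
def encode(text: str) -> tuple:
    return encode_uncached(text)

encode("what is gradient descent")
encode("what is gradient descent")   # served from cache; encoder not re-run
```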
Key Papers
- Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023/2024) — comprehensive survey covering Naive/Advanced/Modular RAG, evaluation, and future directions · arXiv:2312.10997
- A Systematic Literature Review of Retrieval-Augmented Generation (Brown et al., 2025) — PRISMA-compliant systematic review of 128 papers through May 2025 · arXiv:2508.06401
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — adaptive retrieval via special tokens · arXiv:2310.11511
- Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022) — hypothetical document embeddings (HyDE) for zero-shot dense retrieval · arXiv:2212.10496
Links
- Embeddings — how text becomes vectors
- Language Models — the generation component
- Prompt Engineering — effective prompting for RAG
- Fine-Tuning LLMs — when to fine-tune vs RAG
- Key Papers — foundational transformer and attention papers