Retrieval Augmented Generation
What
RAG = Retrieve relevant documents, then generate an answer using them as context. Grounds LLM responses in actual data, addressing the core limitations of pure parametric models: hallucination, stale knowledge, and untraceable reasoning.
The core insight: LLMs store knowledge parametrically (in weights) but can’t access up-to-date or private information. RAG adds a non-parametric memory layer — an external knowledge base — that the model queries at inference time.
Why RAG
| Problem with pure LLMs | What RAG adds |
|---|---|
| Hallucinates facts | Retrieved docs provide grounded evidence |
| Knowledge cutoff (training data frozen at a past date) | Fresh documents from any date |
| Can’t access private data | Knowledge base can contain anything |
| No citation of sources | Retrieved docs enable citation |
| Expensive to update weights | Knowledge base updated without retraining |
The Three RAG Paradigms
Naive RAG (2020 original)
The basic retrieve-then-generate pipeline:
Query → Embed → Top-k retrieval → Combine with prompt → Generate
Limitations: semantic similarity doesn’t always match relevance; retrieved docs may contain noise; single retrieval pass.
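The whole naive loop fits in a few lines. A minimal sketch, using a toy bag-of-words "embedding" and cosine similarity so it stays self-contained — in practice `embed()` would call a real sentence-embedding model and generation would call an LLM:

```python
# Naive RAG: embed query, score documents, stuff top-k into a prompt.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a sentence-embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Gradient descent minimizes a loss function iteratively.",
    "Tokyo is the capital of Japan.",
    "Learning rates control gradient descent step sizes.",
]
top = retrieve("how does gradient descent work", docs)
prompt = "Answer using these documents:\n" + "\n".join(top)
```

Every stage of this loop is what Advanced RAG then improves on.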
Advanced RAG (2021-2023)
Improvements at each stage:
- Pre-retrieval: Query expansion, reformulation, query decomposition
- Post-retrieval: Reranking, context compression, selective context
- Retrieval: Hybrid search (dense + sparse), iterative retrieval
Modular RAG (2023+)
RAG becomes a toolkit of interchangeable components:
- Specialized retrievers (web search, knowledge graphs, structured data)
- Multiple retrieval passes per query (self-RAG, reactive retrieval)
- Routing between retrieval and direct generation
- Graph-based knowledge representation
The Retrieval Pipeline
Step 1: Document Ingestion
Raw documents → Chunking → Embedding → Indexing → Vector DB
Chunking strategies significantly affect retrieval quality:
| Strategy | How | Best for |
|---|---|---|
| Fixed-size (e.g., 512 tokens) | Split by token count | Simple, consistent |
| Semantic (sentence/paragraph) | Split at natural boundaries | Coherent content |
| Recursive | Hierarchical splitting | Complex documents |
| Small-to-large | Store fine-grained chunks, retrieve parent | Dense information |
Chunk overlap: overlap between consecutive chunks (e.g., 20% overlap) prevents cutting relevant context across boundaries.
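The fixed-size strategy with overlap can be sketched directly. Whitespace tokens stand in for model tokens here; a real pipeline would use the embedding model's own tokenizer:

```python
# Fixed-size chunking with fractional overlap between consecutive windows.
def chunk(text: str, size: int = 512, overlap: float = 0.2) -> list[str]:
    tokens = text.split()
    step = max(1, int(size * (1 - overlap)))  # advance by size minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):       # last window reached the end
            break
    return chunks

parts = chunk("one two three four five six seven eight nine ten",
              size=4, overlap=0.25)
# windows of 4 tokens stepping by 3: each chunk repeats one token of the previous
```

With 25% overlap, every boundary token appears in two chunks, so a sentence cut at a window edge is still retrievable whole from one of them.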
Step 2: Embedding
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Or for better quality: "BAAI/bge-large-en-v1.5" or "e5-large-v2"
doc_embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)
```

Embedding model selection matters enormously:
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Good | Very fast |
| bge-large-en-v1.5 | 1024 | Excellent | Medium |
| e5-large-v2 | 1024 | Excellent | Medium |
| OpenAI text-embedding-3-large | 3072 | SOTA | API cost |
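One practical detail regardless of model choice: most of these models are meant to be compared by cosine similarity (and e5-family models additionally expect "query: "/"passage: " prefixes on inputs). L2-normalizing vectors makes inner product equal cosine, which also lets an inner-product index serve as a cosine index. A small sketch with toy 2-D vectors:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Divide each row by its L2 norm, guarding against zero vectors
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(vecs)
cos = unit @ unit.T   # pairwise cosine similarities
```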
Step 3: Vector Indexing
```python
import faiss

# Flat index — exact search, good for <1M vectors
index = faiss.IndexFlatL2(384)

# IVF (inverted file) — approximate, scales to billions
quantizer = faiss.IndexFlatL2(384)
index = faiss.IndexIVFFlat(quantizer, 384, 100)  # nlist=100 clusters
index.train(doc_embeddings)
index.add(doc_embeddings)

# HNSW — graph-based, excellent recall/speed tradeoff
index = faiss.IndexHNSWFlat(384, 32)
index.add(doc_embeddings)
```

Production vector DBs: Pinecone, Weaviate, Qdrant, Milvus, Chroma (for prototyping). These add metadata filtering, hybrid search, horizontal scaling, and (for the hosted options) a managed service.
Step 4: Retrieval and Reranking
```python
# Simple retrieval
query = "What is gradient descent?"
query_vec = model.encode([query])
distances, indices = index.search(query_vec, k=5)
retrieved = [documents[i] for i in indices[0]]

# Reranking with cross-encoder (much more accurate but slower)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in retrieved]
scores = reranker.predict(pairs)
ranked = sorted(zip(retrieved, scores), key=lambda x: -x[1])
```

Why reranking: bi-encoder retrieval (fast, embedding-based) maximizes semantic similarity. Cross-encoder reranking (slower, roughly 50-100 ms per query) scores relevance more accurately by jointly encoding the query and document.
Step 5: Generation with Context
```python
context = "\n\n".join(doc for doc, score in ranked[:3])  # top reranked docs
prompt = f"""Answer the question based on the retrieved documents.

Question: {query}

Documents:
{context}

Answer (cite the document numbers like [1], [2]):"""
response = llm.generate(prompt)  # any LLM client
```

Hybrid Search
Combining dense (embedding-based) and sparse (keyword-based / BM25) retrieval covers both semantic and exact matches:
```python
# Dense: semantic similarity via embeddings
dense_results = embedding_model.search(query, top_k=20)

# Sparse: keyword matching via BM25
sparse_results = bm25_search(query, documents, top_k=20)

# Reciprocal Rank Fusion (RRF): combine rankings
def rrf_fusion(dense_ranks, sparse_ranks, k=60):
    scores = {}
    for rank, (doc_id, _) in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, (doc_id, _) in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])
```

Iterative and Adaptive Retrieval
Self-RAG (Asai et al., 2023)
The model learns to decide when to retrieve, using special tokens:
- [Retrieval] — decide to retrieve
- [No Retrieval] — no retrieval needed
- [Relevant] — retrieved document is relevant
- [Irrelevant] — skip retrieved document
- [Hallucination] — model is hallucinating
Adaptive RAG (2024)
Routing between strategies based on query type:
- Factual questions → retrieve
- Code generation → no retrieval (internal knowledge)
- Recent events → web search + RAG
- Local documents → RAG only
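A router like this is often just a classifier in front of the pipeline. Toy sketch below — real systems typically use an LLM or a trained classifier to label the query; the keyword rules here are only illustrative stand-ins:

```python
# Route each query to a retrieval strategy based on (toy) query-type signals.
def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("today", "latest", "this week", "news")):
        return "web_search_rag"      # recent events: web search + RAG
    if any(w in q for w in ("write a function", "implement", "code")):
        return "no_retrieval"        # code generation: internal knowledge
    return "local_rag"               # default: local documents
```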
Evaluation
RAG evaluation has three dimensions (from RAGAS, ARES, TruLens):
| Dimension | What it measures | Metrics |
|---|---|---|
| Retrieval | Are the right documents retrieved? | Precision@k, Recall@k, MRR, NDCG |
| Generation | Is the answer accurate and relevant? | Faithfulness, Answer Relevancy, Context Precision |
| End-to-end | Does RAG improve over no-RAG? | Human preference, Task-specific metrics |
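The retrieval-side metrics need no framework — given a ranked list of retrieved doc ids and a gold set of relevant ids, they are a few lines each:

```python
# Precision@k, Recall@k, and MRR over one query's ranked results.
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank          # reciprocal rank of first hit
    return 0.0

retrieved = ["d3", "d1", "d7"]       # ranked ids from the retriever
relevant = {"d1", "d9"}              # gold labels
```

(In a benchmark these are averaged over all queries.)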
```python
# RAGAS (key metrics)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
```

Production Considerations
Retrieval-augmented generation for your data
```python
# Typical stack: LangChain + Chroma + OpenAI
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

loader = PDFPlumberLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(loader.load())
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
```

Handling long documents
For documents exceeding context length, use:
- Parent-document retrieval: index small chunks, but pass the enclosing parent section to the generator
- Hierarchical retrieval: coarse first (sections), then fine (paragraphs)
- Summarization RAG: retrieve documents, summarize, retrieve from summaries
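The parent-document (small-to-large) pattern reduces to a lookup from chunk id to parent id plus deduplication. A minimal sketch, assuming the chunks have already been scored by the retriever:

```python
# Score fine-grained chunks, hand the generator the enclosing parent sections.
def retrieve_parents(scored_chunk_ids, chunk_to_parent, parents, k=2):
    """scored_chunk_ids: chunk ids already sorted by retrieval score."""
    seen, out = set(), []
    for cid in scored_chunk_ids:
        pid = chunk_to_parent[cid]
        if pid not in seen:          # dedupe chunks from the same section
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == k:
            break
    return out

chunk_to_parent = {"c1": "s1", "c2": "s1", "c3": "s2"}
parents = {"s1": "Section 1 full text", "s2": "Section 2 full text"}
ctx = retrieve_parents(["c2", "c1", "c3"], chunk_to_parent, parents)
```

This keeps retrieval precise (small chunks embed cleanly) while giving the generator enough surrounding context.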
Latency vs quality tradeoff
- Streaming generation starts after retrieval completes
- Async retrieval + generation overlap reduces latency
- For sub-100ms retrieval: use approximate nearest neighbor (HNSW/IVF) + caching
- For best quality: add a cross-encoder reranking stage after retrieval
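Caching is the cheapest of these wins: memoizing query embeddings lets repeated or popular queries skip the encoder entirely. Sketch below, where `encode_uncached` stands in for the real embedding-model call:

```python
# Memoize query embeddings so repeat queries never hit the encoder.
from functools import lru_cache

calls = 0

def encode_uncached(text: str) -> tuple:
    global calls
    calls += 1                       # count real encoder invocations
    return tuple(float(len(w)) for w in text.split())  # placeholder vector

@lru_cache(maxsize=10_000)
def encode(text: str) -> tuple:
    return encode_uncached(text)

encode("what is gradient descent")
encode("what is gradient descent")   # served from cache; encoder not re-run
```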
Key Papers
- Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023/2024) — comprehensive survey covering Naive/Advanced/Modular RAG, evaluation, and future directions · arXiv:2312.10997
- A Systematic Literature Review of Retrieval-Augmented Generation (Brown et al., 2025) — PRISMA-compliant systematic review of 128 papers through May 2025 · arXiv:2508.06401
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — adaptive retrieval via special tokens · arXiv:2310.11511
- Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022) — hypothetical document embeddings (HyDE) for zero-shot dense retrieval · arXiv:2212.10496
Links
- Embeddings — how text becomes vectors
- Language Models — the generation component
- Prompt Engineering — effective prompting for RAG
- Fine-Tuning LLMs — when to fine-tune vs RAG
- Key Papers — foundational transformer and attention papers