Retrieval Augmented Generation

What

RAG = Retrieve relevant documents, then generate an answer using them as context. Grounds LLM responses in actual data, addressing the core limitations of pure parametric models: hallucination, stale knowledge, and untraceable reasoning.

The core insight: LLMs store knowledge parametrically (in weights) but can’t access up-to-date or private information. RAG adds a non-parametric memory layer — an external knowledge base — that the model queries at inference time.

Why RAG

Problem with pure LLMs                                 What RAG adds
Hallucinates facts                                     Retrieved docs provide grounded evidence
Knowledge cutoff (e.g., GPT-4's original Sept 2021)    Fresh documents from any date
Can't access private data                              Knowledge base can contain anything
No citation of sources                                 Retrieved docs enable citation
Expensive to update weights                            Knowledge base updated without retraining

The Three RAG Paradigms

Naive RAG (2020 original)

The basic retrieve-then-generate pipeline:

Query → Embed → Top-k retrieval → Combine with prompt → Generate

Limitations: semantic similarity doesn’t always match relevance; retrieved docs may contain noise; single retrieval pass.
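As a toy illustration, the whole retrieve-then-generate pipeline fits in a few lines. The bag-of-words embedding and the prompt template below are purely illustrative stand-ins; a real system would use a sentence-embedding model and send the final string to an LLM:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real pipeline uses a sentence-embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_prompt(query, documents, k=2):
    # Retrieve: rank documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: -cosine(q, embed(d)))
    # Combine with prompt: the caller sends this string to any LLM.
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = ["Gradient descent minimizes a loss function.",
        "Tokyo is the capital of Japan."]
prompt = naive_rag_prompt("What is gradient descent?", docs, k=1)
```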

Advanced RAG (2021-2023)

Improvements at each stage:

  • Pre-retrieval: Query expansion, reformulation, query decomposition
  • Retrieval: Hybrid search (dense + sparse), iterative retrieval
  • Post-retrieval: Reranking, context compression, selective context

Modular RAG (2023+)

RAG becomes a toolkit of interchangeable components:

  • Specialized retrievers (web search, knowledge graphs, structured data)
  • Multiple retrieval passes per query (self-RAG, reactive retrieval)
  • Routing between retrieval and direct generation
  • Graph-based knowledge representation

The Retrieval Pipeline

Step 1: Document Ingestion

Raw documents → Chunking → Embedding → Indexing → Vector DB

Chunking strategies significantly affect retrieval quality:

Strategy                         How                                          Best for
Fixed-size (e.g., 512 tokens)    Split by token count                         Simple, consistent pipelines
Semantic (sentence/paragraph)    Split at natural boundaries                  Coherent content
Recursive                        Hierarchical splitting                       Complex documents
Small-to-large                   Store fine-grained chunks, retrieve parent   Dense information

Chunk overlap: overlap between consecutive chunks (e.g., 20% overlap) prevents cutting relevant context across boundaries.
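A fixed-size splitter with overlap is only a few lines; this sketch splits on whitespace tokens rather than model tokens:

```python
def chunk_text(text, chunk_size=512, overlap_ratio=0.2):
    """Split text into fixed-size token chunks, with consecutive
    chunks sharing overlap_ratio of their tokens."""
    tokens = text.split()  # whitespace tokens; swap in a real tokenizer as needed
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 100 tokens, chunks of 30 with 20% overlap → 4 chunks, 6 tokens shared
chunks = chunk_text(" ".join(str(i) for i in range(100)),
                    chunk_size=30, overlap_ratio=0.2)
```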

Step 2: Embedding

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("all-MiniLM-L6-v2")
# Or for better quality: "BAAI/bge-large-en-v1.5" or "intfloat/e5-large-v2"
 
doc_embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)

Embedding model selection matters enormously:

Model                            Dimensions   Quality     Speed
all-MiniLM-L6-v2                 384          Good        Very fast
bge-large-en-v1.5                1024         Excellent   Medium
e5-large-v2                      1024         Excellent   Medium
OpenAI text-embedding-3-large    3072         SOTA        API cost

Step 3: Vector Indexing

import faiss
 
# Flat index — exact search, good for <1M vectors
index = faiss.IndexFlatL2(384)
 
# IVF (inverted file) — approximate, scales to billions
quantizer = faiss.IndexFlatL2(384)
index = faiss.IndexIVFFlat(quantizer, 384, 100)  # nlist=100 coarse clusters
index.train(doc_embeddings)
index.add(doc_embeddings)
index.nprobe = 10  # clusters searched per query (recall/speed knob)
 
# HNSW — graph-based, excellent recall/speed tradeoff
index = faiss.IndexHNSWFlat(384, 32)
index.add(doc_embeddings)

Production vector DBs: Pinecone, Weaviate, Qdrant, Milvus, Chroma (for prototyping). These provide managed services, filtering, hybrid search, and horizontal scaling.

Step 4: Retrieval and Reranking

# Simple retrieval
query = "What is gradient descent?"
query_vec = model.encode([query])
distances, indices = index.search(query_vec, 5)  # top k=5
retrieved = [documents[i] for i in indices[0]]
 
# Reranking with cross-encoder (much more accurate but slower)
from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in retrieved]
scores = reranker.predict(pairs)
ranked = sorted(zip(retrieved, scores), key=lambda x: -x[1])

Why reranking: bi-encoder retrieval (fast, embedding-based) maximizes semantic similarity. Cross-encoder reranking (slower, 50-100ms per query) scores relevance more accurately by jointly encoding query + document.

Step 5: Generation with Context

context = "\n\n".join(f"[{i + 1}] {doc}" for i, (doc, _) in enumerate(ranked[:3]))
prompt = f"""Answer the question based on the retrieved documents.
 
Question: {query}
 
Documents:
{context}
 
Answer (cite the document numbers like [1], [2]):"""
 
response = llm.generate(prompt)

Combining dense (embedding-based) and sparse (keyword-based / BM25) retrieval covers both semantic and exact matches:

# Dense: semantic similarity via embeddings
dense_results = embedding_model.search(query, top_k=20)
 
# Sparse: keyword matching via BM25
sparse_results = bm25_search(query, documents, top_k=20)
 
# Reciprocal Rank Fusion (RRF): combine rankings
def rrf_fusion(dense_ranks, sparse_ranks, k=60):
    scores = {}
    for rank, (doc_id, _) in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, (doc_id, _) in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])
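The `bm25_search` helper above is left abstract; a self-contained sketch of Okapi BM25 (with the standard k1 = 1.5, b = 0.75 defaults) looks like this, though in practice a library such as rank_bm25 or the vector DB's built-in sparse index would do this job:

```python
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def bm25_search(query, documents, top_k=20, k1=1.5, b=0.75):
    """Score documents against the query with Okapi BM25;
    return the top_k results as (doc_index, score) pairs."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    def idf(term):
        # Smoothed inverse document frequency.
        df = sum(1 for d in docs if term in d)
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    scores = []
    for i, d in enumerate(docs):
        s = 0.0
        for term in tokenize(query):
            tf = d.count(term)
            # Term frequency saturates via k1; b normalizes by doc length.
            s += idf(term) * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append((i, s))
    scores.sort(key=lambda x: -x[1])
    return scores[:top_k]

results = bm25_search("cat", ["the cat sat on the mat",
                              "a dog barked",
                              "cat cat cat"], top_k=2)
```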

Iterative and Adaptive Retrieval

Self-RAG (Asai et al., 2023)

The model learns to decide when to retrieve, using special tokens:

  • [Retrieval] — decide to retrieve
  • [Relevant] — retrieved document is relevant
  • [Irrelevant] — skip retrieved document
  • [Hallucination] — model is hallucinating
  • [No Retrieval] — no retrieval needed

Adaptive RAG (2024)

Routing between strategies based on query type:

  • Factual questions → retrieve
  • Code generation → no retrieval (internal knowledge)
  • Recent events → web search + RAG
  • Local documents → RAG only
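A router can be as simple as a classifier mapping each query to one of the strategies above. The keyword heuristic below is purely illustrative; production routers typically use an LLM or a trained classifier:

```python
import re

def route_query(query):
    # Toy keyword router; each branch mirrors one strategy above.
    words = set(re.findall(r"\w+", query.lower()))
    if words & {"today", "latest", "news", "recent"}:
        return "web_search+rag"   # recent events
    if words & {"implement", "write", "code", "function"}:
        return "no_retrieval"     # code generation from internal knowledge
    if words & {"our", "internal", "company"}:
        return "local_rag"        # local documents
    return "retrieve"             # default: factual question

route_query("Implement quicksort in Python")  # → "no_retrieval"
route_query("Latest results on superconductors")  # → "web_search+rag"
```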

Evaluation

RAG evaluation has three dimensions (from RAGAS, ARES, TruLens):

Dimension     What it measures                       Metrics
Retrieval     Are the right documents retrieved?     Precision@k, Recall@k, MRR, NDCG
Generation    Is the answer accurate and relevant?   Faithfulness, Answer Relevancy, Context Precision
End-to-end    Does RAG improve over no-RAG?          Human preference, task-specific metrics

# RAGAS (key metrics)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
 
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
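The retrieval-side metrics are easy to compute directly once the set of relevant document ids is known; a minimal sketch:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved docs that are relevant.
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant docs that appear in the top-k.
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant doc (0 if none retrieved).
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output
relevant = {"d1", "d2"}                # ground-truth relevant ids
# precision_at_k(..., 4) → 0.5; recall_at_k(..., 4) → 1.0; mrr → 0.5
```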

Production Considerations

RAG over your own documents

# Typical stack: LangChain + Chroma + OpenAI
# (import paths for langchain >= 0.2; older releases used `langchain.*`)
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
 
loader = PDFPlumberLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(loader.load())
 
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

Handling long documents

For documents exceeding context length, use:

  • Parent-document retrieval: index small chunks for precise matching, but return the enclosing parent section to the generator
  • Hierarchical retrieval: coarse first (sections), then fine (paragraphs)
  • Summarization RAG: retrieve documents, summarize, retrieve from summaries
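Parent-document retrieval reduces to keeping a chunk-to-parent map: search over small chunks, hand the generator the enclosing section. The word-overlap retriever below is a stand-in for any real retriever:

```python
def build_parent_index(sections, chunk_size=50):
    """Split each section into small chunks, remembering each
    chunk's parent section id."""
    chunks, parent_of = [], {}
    for sec_id, text in sections.items():
        tokens = text.split()
        for start in range(0, len(tokens), chunk_size):
            parent_of[len(chunks)] = sec_id
            chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks, parent_of

def retrieve_parents(query, chunks, parent_of, sections, k=3):
    # Stand-in retriever: rank chunks by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(range(len(chunks)),
                    key=lambda i: -len(q & set(chunks[i].lower().split())))
    # Deduplicate so each parent section is returned at most once.
    seen, parents = set(), []
    for i in ranked[:k]:
        sec = parent_of[i]
        if sec not in seen:
            seen.add(sec)
            parents.append(sections[sec])
    return parents

sections = {"s1": "gradient descent updates model weights " * 16,
            "s2": "european capital cities travel guide " * 16}
chunks, parent_of = build_parent_index(sections, chunk_size=50)
parents = retrieve_parents("gradient descent", chunks, parent_of, sections)
```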

Latency vs quality tradeoff

  • Streaming generation starts after retrieval completes
  • Async retrieval + generation overlap reduces latency
  • For sub-100ms retrieval: use approximate nearest neighbor (HNSW/IVF) + caching
  • For best quality: use reranker + cross-encoder
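Because query distributions are heavy-tailed, caching query embeddings pays off quickly; a minimal in-process sketch, where `embed_query` is a toy placeholder for the real (and comparatively expensive) embedding-model call:

```python
from functools import lru_cache

def embed_query(query):
    # Toy stand-in for the real embedding-model call, which usually
    # dominates retrieval latency for short queries.
    return tuple(float(len(w)) for w in query.split())

@lru_cache(maxsize=10_000)
def embed_query_cached(query):
    # Repeated queries return the cached vector with no model call.
    return embed_query(query)

v1 = embed_query_cached("what is rag")  # miss: computes the embedding
v2 = embed_query_cached("what is rag")  # hit: served from the cache
```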

Key Papers

  • Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023/2024) — comprehensive survey covering Naive/Advanced/Modular RAG, evaluation, and future directions · arXiv:2312.10997
  • A Systematic Literature Review of Retrieval-Augmented Generation (Brown et al., 2025) — PRISMA-compliant systematic review of 128 papers through May 2025 · arXiv:2508.06401
  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — adaptive retrieval via special reflection tokens · arXiv:2310.11511
  • Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022) — HyDE: dense retrieval via hypothetical document embeddings · arXiv:2212.10496