RAG Architecture Patterns
Context
You're building a RAG (Retrieval-Augmented Generation) system to give LLMs access to external knowledge. Common mistakes:
- No chunking strategy (full docs → overflow, poor precision)
- Poor retrieval (cosine similarity alone → misses exact matches)
- No re-ranking (irrelevant results prioritized)
- No evaluation (can't measure or optimize quality)
- Context overflow (too many chunks → cost, latency, 'lost in middle')
This skill provides effective RAG architecture: chunking, hybrid search, re-ranking, evaluation, and complete pipeline design.
What is RAG?
RAG = Retrieval-Augmented Generation
Problem: LLMs have knowledge cutoffs and can't access private/recent data.
Solution: Retrieve relevant information, inject into prompt, generate answer.
# Without RAG:
answer = llm("What is our return policy?")
# LLM: "I don't have access to your specific return policy."
# With RAG:
relevant_docs = retrieval_system.search("return policy")
context = '\n'.join(relevant_docs)
prompt = f"Context: {context}\n\nQuestion: What is our return policy?\nAnswer:"
answer = llm(prompt)
# LLM: "Our return policy allows returns within 30 days..." (from retrieved docs)
When to use RAG:
- ✅ Private data (company docs, internal knowledge base)
- ✅ Recent data (news, updates since LLM training cutoff)
- ✅ Large knowledge base (can't fit in prompt/fine-tuning)
- ✅ Need citations (retrieval provides source documents)
- ✅ Changing information (update docs, not model)
When NOT to use RAG:
- ❌ General knowledge (already in LLM)
- ❌ Small knowledge base (< 100 docs → few-shot examples in prompt)
- ❌ Reasoning tasks (RAG provides facts, not reasoning)
RAG Architecture Overview
User Query
↓
1. Query Processing (optional: expansion, rewriting)
↓
2. Retrieval (dense + sparse hybrid search)
↓
3. Re-ranking (refine top results)
↓
4. Context Selection (top-k chunks)
↓
5. Prompt Construction (inject context)
↓
6. LLM Generation
↓
Answer (with citations)
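To make the flow concrete, here is a minimal end-to-end sketch of the pipeline as one function. The retriever, reranker, and llm arguments are stand-ins for the components described in the sections below, the optional query-processing step is omitted, and the token estimate is a rough assumption (~4 characters per token).
def answer_query(query, retriever, reranker, llm, top_k=5, max_context_tokens=4000):
    """Minimal RAG pipeline sketch: retrieve → re-rank → select context → generate."""
    # 2. Retrieval: over-retrieve candidates (hybrid retrieval recommended)
    candidates = retriever.get_relevant_documents(query)

    # 3. Re-ranking: refine candidates to the best top_k
    top_chunks = reranker(query, candidates, top_k=top_k)

    # 4. Context selection: respect a token budget
    context_parts, used_tokens = [], 0
    for i, chunk in enumerate(top_chunks):
        est_tokens = len(chunk.page_content) // 4  # rough estimate: ~4 characters per token
        if used_tokens + est_tokens > max_context_tokens:
            break
        context_parts.append(f"[{i+1}] {chunk.page_content}")
        used_tokens += est_tokens
    context = "\n\n".join(context_parts)

    # 5. Prompt construction: inject context, ask for citations
    prompt = (
        "Answer the question using only the context below. Cite sources as [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # 6. LLM generation
    return llm(prompt)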
Component 1: Document Processing & Chunking
Why Chunking?
Problem: Documents are long (10k-100k tokens), while embedding models and LLM context windows have much smaller limits.
Solution: Split documents into chunks (500-1000 tokens each).
Chunking Strategies
1. Fixed-size chunking (simple, works for most cases):
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters (roughly 250 tokens)
chunk_overlap=200, # Overlap for continuity
separators=["\n\n", "\n", ". ", " ", ""] # Try these in order
)
chunks = splitter.split_text(document)
Parameters:
- chunk_size: 500-1000 tokens typical (roughly 2000-4000 characters)
- chunk_overlap: 10-20% of chunk_size (continuity between chunks)
- separators: Try semantic boundaries first (paragraphs > sentences > words)
2. Semantic chunking (preserves meaning):
def semantic_chunking(text, max_chunk_size=1000):
# Split on semantic boundaries
sections = text.split('\n\n## ') # Markdown headers
chunks = []
current_chunk = []
current_size = 0
for section in sections:
section_size = len(section)
if current_size + section_size <= max_chunk_size:
current_chunk.append(section)
current_size += section_size
else:
# Flush current chunk
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
current_chunk = [section]
current_size = section_size
# Flush remaining
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
return chunks
Benefits: Preserves topic boundaries, more coherent chunks.
3. Recursive chunking (LangChain default):
# Try splitting on larger boundaries first, fallback to smaller
separators = [
"\n\n", # Paragraphs (try first)
"\n", # Lines
". ", # Sentences
" ", # Words
"" # Characters (last resort)
]
# For each separator:
# - If chunk fits: Done
# - If chunk too large: Try next separator
# Result: Largest semantic unit that fits in chunk_size
Best for: Mixed documents (code + prose, structured + unstructured).
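For intuition, here is a simplified sketch of that fallback logic. It is not LangChain's actual implementation and it omits chunk overlap; it only shows "split on the largest separator whose pieces fit, recurse with smaller separators otherwise."
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text on the largest separator that yields pieces within chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    pieces = text.split(sep) if sep else list(text)  # last resort: individual characters
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size and rest:
                # Piece itself too large: recurse with the next (smaller) separator
                chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks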
Chunking Best Practices
Metadata preservation:
chunks = []
for page_num, page_text in enumerate(pdf_pages):
page_chunks = splitter.split_text(page_text)
for chunk_idx, chunk in enumerate(page_chunks):
chunks.append({
'text': chunk,
'metadata': {
'source': 'document.pdf',
'page': page_num,
'chunk_id': f"{page_num}_{chunk_idx}"
}
})
# Later: Cite sources in answer
# "According to page 42 of document.pdf..."
Overlap for continuity:
# Without overlap: Sentence split across chunks (loss of context)
chunk1 = "...the process is simple. First,"
chunk2 = "you need to configure the settings..."
# With overlap (200 chars):
chunk1 = "...the process is simple. First, you need to configure"
chunk2 = "First, you need to configure the settings..."
# Overlap preserves context!
Chunk size guidelines:
| Embedding model limit | Chunk size |
|---|---|
| 512 tokens | 400 tokens (leave room for overlap) |
| 1024 tokens | 800 tokens |
| 2048 tokens | 1500 tokens |
Typical: 500-1000 tokens per chunk (balance precision vs context)
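Because these limits are in tokens rather than characters, it helps to check chunks with a real tokenizer. A small helper using tiktoken, assuming an OpenAI-style cl100k_base encoding (swap in your embedding model's tokenizer if it differs); the same count_tokens helper is reused in the context-budget example later. splitter and document come from the fixed-size chunking example above.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumption: OpenAI-style tokenizer

def count_tokens(text):
    """Count tokens so chunk sizes can be checked against embedding model limits."""
    return len(encoding.encode(text))

# Check the chunks produced by the splitter from the fixed-size example above
chunks = splitter.split_text(document)
oversized = [c for c in chunks if count_tokens(c) > 400]
print(f"{len(oversized)} of {len(chunks)} chunks exceed 400 tokens")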
Component 2: Embeddings
What are Embeddings?
Vector representation of text capturing semantic meaning.
text = "What is the return policy?"
embedding = embedding_model.encode(text)
# embedding: [0.234, -0.123, 0.891, ...] (384-1536 dimensions)
# Similar texts have similar embeddings (high cosine similarity)
query_emb = embed("return policy")
doc1_emb = embed("Returns accepted within 30 days") # High similarity
doc2_emb = embed("Product specifications") # Low similarity
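Concretely, "similar embeddings" means high cosine similarity between the vectors. A minimal check with NumPy and a local sentence-transformers model (the model choice here is just an example):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = model.encode("return policy")
doc1_emb = model.encode("Returns accepted within 30 days")
doc2_emb = model.encode("Product specifications")

print(cosine_similarity(query_emb, doc1_emb))  # expect a relatively high score
print(cosine_similarity(query_emb, doc2_emb))  # expect a lower score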
Embedding Models
Popular models:
# 1. OpenAI embeddings (API-based)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Dimensions: 1536, Cost: $0.02 per 1M tokens
# 2. Sentence Transformers (open-source, local)
from sentence_transformers import SentenceTransformer
embeddings = SentenceTransformer('all-MiniLM-L6-v2')
# Dimensions: 384, Cost: $0 (local), Fast
# 3. Domain-specific
embeddings = SentenceTransformer('allenai-specter') # Scientific papers
embeddings = SentenceTransformer('msmarco-distilbert-base-v4') # Search/QA
Selection criteria:
| Model | Dimensions | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|---|
| OpenAI text-3-small | 1536 | Medium | Very Good | $0.02/1M | General (API) |
| OpenAI text-3-large | 3072 | Slow | Excellent | $0.13/1M | High quality |
| all-MiniLM-L6-v2 | 384 | Fast | Good | $0 | General (local) |
| all-mpnet-base-v2 | 768 | Medium | Very Good | $0 | General (local) |
| msmarco-* | 768 | Medium | Excellent | $0 | Search/QA |
Evaluation:
# Test on your domain!
from sentence_transformers import util
query = "What is the return policy?"
docs = ["Returns within 30 days", "Shipping takes 5-7 days", "Product warranty"]
for model_name in ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'msmarco-distilbert-base-v4']:
model = SentenceTransformer(model_name)
query_emb = model.encode(query)
doc_embs = model.encode(docs)
similarities = util.cos_sim(query_emb, doc_embs)[0]
print(f"{model_name}: {similarities}")
# Pick model with highest similarity for relevant doc
Component 3: Vector Databases
Store and retrieve embeddings efficiently.
Popular Vector DBs:
# 1. Chroma (simple, local)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_texts(chunks, embeddings)
# 2. Pinecone (managed, scalable)
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="...", environment="...")
vectorstore = Pinecone.from_texts(chunks, embeddings, index_name="my-index")
# 3. Weaviate (open-source, scalable)
from langchain.vectorstores import Weaviate
vectorstore = Weaviate.from_texts(chunks, embeddings)
# 4. FAISS (Facebook, local, fast)
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_texts(chunks, embeddings)
Vector DB Selection:
| Database | Type | Scale | Cost | Hosting | Best For |
|---|---|---|---|---|---|
| Chroma | Local | Small (< 1M) | $0 | Self | Development |
| FAISS | Local | Medium (< 10M) | $0 | Self | Production (self-hosted) |
| Pinecone | Cloud | Large (billions) | $70+/mo | Managed | Production (managed) |
| Weaviate | Both | Large | $0-$200/mo | Both | Production (flexible) |
Similarity Search:
# Basic similarity search
query = "What is the return policy?"
results = vectorstore.similarity_search(query, k=5)
# Returns: Top 5 most similar chunks
# With scores
results = vectorstore.similarity_search_with_score(query, k=5)
# Returns: [(chunk, score), ...]
# Caution: score semantics depend on the backend. FAISS and Chroma return a
# distance (lower = more similar); use similarity_search_with_relevance_scores
# for normalized 0-1 scores where higher = more similar.
# With threshold (using normalized relevance scores)
results = vectorstore.similarity_search_with_relevance_scores(query, k=10)
filtered = [(chunk, score) for chunk, score in results if score > 0.7]
# Only keep highly similar results
Component 4: Retrieval Strategies
1. Dense Retrieval (Semantic)
Uses embeddings (what we've discussed).
query_embedding = embedding_model.encode(query)
# Find docs with embeddings most similar to query_embedding
results = vectorstore.similarity_search(query, k=10)
Pros:
- ✅ Semantic similarity (understands meaning, not just keywords)
- ✅ Handles synonyms, paraphrasing
Cons:
- ❌ Misses exact keyword matches
- ❌ Can confuse similar-sounding but different concepts
2. Sparse Retrieval (Keyword)
Classic information retrieval (BM25, TF-IDF).
from langchain.retrievers import BM25Retriever
# BM25: Keyword-based ranking
bm25_retriever = BM25Retriever.from_texts(chunks)
results = bm25_retriever.get_relevant_documents(query)
How BM25 works:
Score(query, doc) = sum over query terms of:
IDF(term) * (TF(term) * (k1 + 1)) / (TF(term) + k1 * (1 - b + b * doc_length / avg_doc_length))
Where:
- TF = term frequency (how often term appears in doc)
- IDF = inverse document frequency (rarity of term)
- k1, b = tuning parameters
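To make the formula concrete, here is a small scoring function that follows it directly. It is illustrative only: IDF smoothing varies between implementations, and BM25Retriever (or libraries such as rank_bm25) handles tokenization and indexing for you.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with BM25.

    corpus: list of tokenized documents (lists of terms), used for IDF and average length.
    """
    n_docs = len(corpus)
    avg_doc_length = sum(len(d) for d in corpus) / n_docs
    doc_length = len(doc_terms)
    tf = Counter(doc_terms)

    score = 0.0
    for term in query_terms:
        # IDF: rare terms contribute more (smoothed variant; exact form differs by implementation)
        doc_freq = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
        # Term frequency, saturated by k1 and length-normalized by b
        numerator = tf[term] * (k1 + 1)
        denominator = tf[term] + k1 * (1 - b + b * doc_length / avg_doc_length)
        score += idf * numerator / denominator
    return score

# Toy usage
corpus = [doc.lower().split() for doc in [
    "Returns accepted within 30 days of purchase",
    "Shipping takes 5 to 7 business days",
]]
query_terms = "return policy 30 days".lower().split()
print([bm25_score(query_terms, doc, corpus) for doc in corpus])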
Pros:
- ✅ Exact keyword matches (important for IDs, SKUs, technical terms)
- ✅ Fast (no neural network)
- ✅ Explainable (can see which keywords matched)
Cons:
- ❌ No semantic understanding (misses synonyms, paraphrasing)
- ❌ Sensitive to exact wording
3. Hybrid Retrieval (Dense + Sparse)
Combine both for best results!
from langchain.retrievers import EnsembleRetriever
# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={'k': 20})
# Sparse retriever (keyword)
sparse_retriever = BM25Retriever.from_texts(chunks)
# Ensemble (hybrid)
hybrid_retriever = EnsembleRetriever(
retrievers=[dense_retriever, sparse_retriever],
weights=[0.5, 0.5] # Equal weight (tune based on evaluation)
)
results = hybrid_retriever.get_relevant_documents(query)
When hybrid helps:
# Query: "What is the SKU for product ABC-123?"
# Dense only:
# - Might retrieve: "product catalog", "product specifications"
# - Misses: Exact SKU "ABC-123" (keyword)
# Sparse only:
# - Retrieves: "ABC-123" (keyword match)
# - Misses: Semantically similar products
# Hybrid:
# - Retrieves: Exact SKU + related products
# - Best of both worlds!
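If you are curious how the two ranked lists actually get merged, a common approach is weighted reciprocal rank fusion, which is roughly what EnsembleRetriever does internally (a sketch; exact behavior may differ by version). dense_retriever, sparse_retriever, and query are the objects from the example above.
def weighted_rrf(result_lists, weights, k=60, top_n=10):
    """Merge ranked result lists with weighted reciprocal rank fusion.

    result_lists: ranked lists of chunk texts (or IDs), best first.
    weights: one weight per list (e.g. [0.5, 0.5] for dense/sparse).
    k: damping constant; larger k flattens the contribution of rank.
    """
    scores = {}
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage with the texts returned by the dense and sparse retrievers above
dense_texts = [d.page_content for d in dense_retriever.get_relevant_documents(query)]
sparse_texts = [d.page_content for d in sparse_retriever.get_relevant_documents(query)]
fused = weighted_rrf([dense_texts, sparse_texts], weights=[0.5, 0.5])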
Weight tuning:
# Evaluate different weights
for dense_weight in [0.3, 0.5, 0.7]:
sparse_weight = 1 - dense_weight
retriever = EnsembleRetriever(
retrievers=[dense_retriever, sparse_retriever],
weights=[dense_weight, sparse_weight]
)
mrr = evaluate_retrieval(retriever, test_set)
print(f"Dense:{dense_weight}, Sparse:{sparse_weight} → MRR:{mrr:.3f}")
# Example output:
# Dense:0.3, Sparse:0.7 → MRR:0.65
# Dense:0.5, Sparse:0.5 → MRR:0.72 # Best!
# Dense:0.7, Sparse:0.3 → MRR:0.68
Component 5: Re-Ranking
Refine coarse retrieval ranking with cross-encoder.
Why Re-Ranking?
Retrieval (bi-encoder):
- Encodes query and docs separately
- Fast: doc embeddings precomputed offline; each query needs one encoding plus a nearest-neighbor search
- Coarse: Single similarity score
Re-ranking (cross-encoder):
- Jointly encodes query + doc
- Slow: O(n) for n docs (must process each pair)
- Precise: Sees query-doc interactions
Pipeline:
1. Retrieval: Get top 20-50 (fast, broad)
2. Re-ranking: Refine to top 5-10 (slow, precise)
Implementation:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load cross-encoder for re-ranking
model = AutoModelForSequenceClassification.from_pretrained(
'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(query, retrieved_docs, top_k=5):
# Score each doc with cross-encoder
scores = []
for doc in retrieved_docs:
inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():
score = model(**inputs).logits[0][0].item()
scores.append((doc, score))
# Sort by score (descending)
reranked = sorted(scores, key=lambda x: x[1], reverse=True)
# Return top-k
return [doc for doc, score in reranked[:top_k]]
# Usage
initial_docs = vectorstore.similarity_search(query, k=20)  # Over-retrieve
initial_texts = [d.page_content for d in initial_docs]  # rerank() expects raw strings
final_results = rerank(query, initial_texts, top_k=5)  # Re-rank
Re-Ranking Models:
| Model | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | 80MB | Fast | Good | General |
| ms-marco-MiniLM-L-12-v2 | 120MB | Medium | Very Good | Better quality |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 120MB | Medium | Very Good | Multilingual |
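If you would rather not manage the tokenizer and model by hand, the sentence-transformers CrossEncoder wrapper scores all query-document pairs in one batched call; this sketch is a batched equivalent of the rerank function above.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

def rerank_batched(query, retrieved_docs, top_k=5):
    """Score all (query, doc text) pairs in one batch and keep the top_k docs."""
    pairs = [(query, doc) for doc in retrieved_docs]
    scores = cross_encoder.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]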
Impact of Re-Ranking:
# Without re-ranking:
results = vectorstore.similarity_search(query, k=5)
mrr = 0.55 # First relevant at rank ~2
# With re-ranking:
initial = vectorstore.similarity_search(query, k=20)
results = rerank(query, initial, top_k=5)
mrr = 0.82 # First relevant at rank ~1.2
# Improvement: +0.27 absolute MRR (illustrative numbers; measure on your own test set)
Component 6: Query Processing
Query Expansion
Expand query with synonyms, related terms.
def expand_query(query, llm):
    prompt = f"""
Generate 3 alternative phrasings of this query:
Original: {query}
Alternatives (semantically similar):
1.
2.
3.
"""
    # Parse the LLM output into a list of alternative queries (strip "1." style prefixes)
    alternatives = [line.strip("0123456789. ") for line in llm(prompt).splitlines() if line.strip()]
    # Retrieve using all variants, merge results
    all_results = []
    for alt_query in [query] + alternatives:
        results = vectorstore.similarity_search(alt_query, k=10)
        all_results.extend(results)
    # Deduplicate by chunk text, then re-rank against the original query
    unique_texts = list({doc.page_content for doc in all_results})
    return rerank(query, unique_texts, top_k=5)
Query Rewriting
Simplify or decompose complex queries.
def rewrite_query(query, llm):
    # is_complex() is a placeholder heuristic (e.g. query length, multiple clauses)
    if is_complex(query):
        prompt = f"""
Break this complex query into simpler sub-queries:
Query: {query}
Sub-queries:
1.
2.
"""
        # Parse the LLM output into a list of sub-queries
        sub_queries = [line.strip("0123456789. ") for line in llm(prompt).splitlines() if line.strip()]
        # Retrieve for each sub-query
        all_results = []
        for sub_q in sub_queries:
            results = vectorstore.similarity_search(sub_q, k=5)
            all_results.extend(results)
        return all_results
    return vectorstore.similarity_search(query, k=5)
HyDE (Hypothetical Document Embeddings)
Generate hypothetical answer, retrieve similar docs.
def hyde_retrieval(query, llm, vectorstore):
# Generate hypothetical answer
prompt = f"Answer this question in detail: {query}"
hypothetical_answer = llm(prompt)
# Retrieve docs similar to hypothetical answer (not query)
results = vectorstore.similarity_search(hypothetical_answer, k=5)
return results
# Why this works:
# - Queries are short, sparse
# - Answers are longer, richer
# - Doc-to-doc similarity (answer vs docs) better than query-to-doc
Component 7: Context Management
Context Budget
max_context_tokens = 4000 # Budget for retrieved context
selected_chunks = []
total_tokens = 0
for chunk in reranked_results:
chunk_tokens = count_tokens(chunk)
if total_tokens + chunk_tokens <= max_context_tokens:
selected_chunks.append(chunk)
total_tokens += chunk_tokens
else:
break # Stop when budget exceeded
# Result: Best chunks that fit in budget
Lost in the Middle Problem
LLMs prioritize start and end of context, miss middle.
# Research finding: Place most important info at start or end
def order_for_llm(chunks):
    # Assumes chunks are sorted by relevance (most relevant first)
    if len(chunks) <= 2:
        return chunks
    # Most relevant at the start, second most relevant at the end,
    # remaining chunks in the middle
    return [chunks[0]] + chunks[2:] + [chunks[1]]
Contextual Compression
Filter retrieved chunks to most relevant sentences.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Compressor: Extract relevant sentences
compressor = LLMChainExtractor.from_llm(llm)
# Wrap retriever
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
# Retrieved chunks are automatically filtered to relevant parts
compressed_docs = compression_retriever.get_relevant_documents(query)
Component 8: Prompt Construction
Basic RAG Prompt:
context = '\n\n'.join(retrieved_chunks)
prompt = f"""
Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that."
Context:
{context}
Question: {query}
Answer:
"""
answer = llm(prompt)
With Citations:
context_with_ids = []
for i, chunk in enumerate(retrieved_chunks):
context_with_ids.append(f"[{i+1}] {chunk['text']}")
context = '\n\n'.join(context_with_ids)
prompt = f"""
Answer the question based on the context below. Cite sources using [number] format.
Context:
{context}
Question: {query}
Answer (with citations):
"""
answer = llm(prompt)
# Output: "The return policy allows returns within 30 days [1]. Shipping takes 5-7 business days [3]."
With Metadata:
context_with_metadata = []
for chunk in retrieved_chunks:
source = chunk['metadata']['source']
page = chunk['metadata']['page']
context_with_metadata.append(f"From {source} (page {page}):\n{chunk['text']}")
context = '\n\n'.join(context_with_metadata)
prompt = f"""
Answer the question and cite your sources.
Context:
{context}
Question: {query}
Answer:
"""
Evaluation Metrics
Retrieval Metrics
1. Mean Reciprocal Rank (MRR):
import numpy as np

def calculate_mrr(retrieval_results, relevant_docs):
"""
MRR = average of (1 / rank of first relevant doc)
Example:
Query 1: First relevant at rank 2 → 1/2 = 0.5
Query 2: First relevant at rank 1 → 1/1 = 1.0
Query 3: No relevant docs → 0
MRR = (0.5 + 1.0 + 0) / 3 = 0.5
"""
mrr_scores = []
for results, relevant in zip(retrieval_results, relevant_docs):
for i, result in enumerate(results):
if result in relevant:
mrr_scores.append(1 / (i + 1))
break
else:
mrr_scores.append(0) # No relevant found
return np.mean(mrr_scores)
# Interpretation:
# MRR = 1.0: First result always relevant (perfect!)
# MRR = 0.5: First relevant at rank ~2 (good)
# MRR = 0.3: First relevant at rank ~3-4 (okay)
# MRR < 0.3: Poor retrieval (needs improvement)
2. Precision@k:
def calculate_precision_at_k(retrieval_results, relevant_docs, k=5):
"""
Precision@k = (# relevant docs in top-k) / k
Example:
Top 5 results: [relevant, irrelevant, relevant, irrelevant, irrelevant]
Precision@5 = 2/5 = 0.4
"""
precision_scores = []
for results, relevant in zip(retrieval_results, relevant_docs):
top_k = results[:k]
relevant_in_topk = len([r for r in top_k if r in relevant])
precision_scores.append(relevant_in_topk / k)
return np.mean(precision_scores)
# Target: Precision@5 > 0.7 (70% of top-5 are relevant)
3. Recall@k:
def calculate_recall_at_k(retrieval_results, relevant_docs, k=5):
"""
Recall@k = (# relevant docs in top-k) / (total relevant docs)
Example:
Total relevant: 5
Found in top-5: 2
Recall@5 = 2/5 = 0.4
"""
recall_scores = []
for results, relevant in zip(retrieval_results, relevant_docs):
top_k = results[:k]
relevant_in_topk = len([r for r in top_k if r in relevant])
recall_scores.append(relevant_in_topk / len(relevant))
return np.mean(recall_scores)
# Interpretation:
# Recall@5 = 1.0: All relevant docs in top-5 (perfect!)
# Recall@5 = 0.5: Half of relevant docs in top-5
4. NDCG (Normalized Discounted Cumulative Gain):
def calculate_ndcg(true_relevance, predicted_scores, k=5):
    """
    NDCG considers position and graded relevance (0, 1, 2, 3...)
    DCG = sum of (relevance / log2(rank + 1))
    NDCG = DCG / ideal_DCG (normalized to 0-1)
    """
    from sklearn.metrics import ndcg_score
    # true_relevance: 2D array (n_queries, n_docs) of graded relevance (0-3)
    # predicted_scores: 2D array of the retriever's scores for the same docs
    return ndcg_score(true_relevance, predicted_scores, k=k)
# NDCG = 1.0: Perfect ranking
# NDCG > 0.7: Good ranking
# NDCG < 0.5: Poor ranking
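For example, with toy judgments for a single query over five retrieved documents (2D arrays because sklearn expects one row per query):
import numpy as np

# One query, five retrieved docs: human relevance grades vs. the retriever's scores
true_relevance = np.asarray([[3, 0, 2, 0, 1]])  # graded judgments (0-3)
retriever_scores = np.asarray([[0.9, 0.8, 0.7, 0.4, 0.2]])  # scores used to rank the docs

print(calculate_ndcg(true_relevance, retriever_scores, k=5))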
Generation Metrics
1. Exact Match:
def calculate_exact_match(predictions, ground_truth):
"""Percentage of predictions that exactly match ground truth."""
matches = [pred == truth for pred, truth in zip(predictions, ground_truth)]
return np.mean(matches)
2. F1 Score (token-level):
def calculate_f1(prediction, ground_truth):
"""F1 score based on token overlap."""
pred_tokens = prediction.split()
truth_tokens = ground_truth.split()
common = set(pred_tokens) & set(truth_tokens)
if len(common) == 0:
return 0.0
precision = len(common) / len(pred_tokens)
recall = len(common) / len(truth_tokens)
f1 = 2 * precision * recall / (precision + recall)
return f1
3. LLM-as-Judge:
def evaluate_with_llm(answer, ground_truth, llm):
"""Use LLM to judge answer quality."""
prompt = f"""
Rate the quality of this answer on a scale of 1-5:
1 = Completely wrong
2 = Mostly wrong
3 = Partially correct
4 = Mostly correct
5 = Completely correct
Ground truth: {ground_truth}
Answer to evaluate: {answer}
Rating (1-5):
"""
rating = llm(prompt)
return int(rating)
End-to-End Evaluation
def evaluate_rag_system(rag_system, test_set):
"""
Complete evaluation: retrieval + generation
"""
# Retrieval metrics
retrieval_results = []
relevant_docs = []
# Generation metrics
predictions = []
ground_truth = []
for test_case in test_set:
query = test_case['query']
# Retrieve
retrieved = rag_system.retrieve(query)
retrieval_results.append(retrieved)
relevant_docs.append(test_case['relevant_docs'])
# Generate
answer = rag_system.generate(query, retrieved)
predictions.append(answer)
ground_truth.append(test_case['expected_answer'])
# Calculate metrics
metrics = {
'retrieval_mrr': calculate_mrr(retrieval_results, relevant_docs),
'retrieval_precision@5': calculate_precision_at_k(retrieval_results, relevant_docs, k=5),
'generation_f1': np.mean([calculate_f1(p, t) for p, t in zip(predictions, ground_truth)]),
'generation_exact_match': calculate_exact_match(predictions, ground_truth),
}
return metrics
Complete RAG Pipeline
Basic Implementation:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
# 1. Load documents
documents = load_documents('docs/')
# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
return_source_documents=True
)
# 5. Query
result = qa_chain({"query": "What is the return policy?"})
answer = result['result']
sources = result['source_documents']
Advanced Implementation (Hybrid + Re-ranking):
from langchain.retrievers import EnsembleRetriever, BM25Retriever
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from transformers import AutoModelForSequenceClassification, AutoTokenizer
class AdvancedRAG:
def __init__(self, documents):
# Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
self.chunks = splitter.split_documents(documents)
# Embeddings
self.embeddings = OpenAIEmbeddings()
self.vectorstore = Chroma.from_documents(self.chunks, self.embeddings)
# Hybrid retrieval
dense_retriever = self.vectorstore.as_retriever(search_kwargs={'k': 20})
sparse_retriever = BM25Retriever.from_documents(self.chunks)
self.retriever = EnsembleRetriever(
retrievers=[dense_retriever, sparse_retriever],
weights=[0.5, 0.5]
)
# Re-ranker
self.rerank_model = AutoModelForSequenceClassification.from_pretrained(
'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
self.rerank_tokenizer = AutoTokenizer.from_pretrained(
'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
# LLM
self.llm = OpenAI(temperature=0)
def retrieve(self, query, k=5):
# Hybrid retrieval (over-retrieve)
initial_results = self.retriever.get_relevant_documents(query)[:20]
# Re-rank
scores = []
for doc in initial_results:
inputs = self.rerank_tokenizer(
query, doc.page_content,
return_tensors='pt',
truncation=True,
max_length=512
)
score = self.rerank_model(**inputs).logits[0][0].item()
scores.append((doc, score))
# Sort by score
reranked = sorted(scores, key=lambda x: x[1], reverse=True)
# Return top-k
return [doc for doc, score in reranked[:k]]
def generate(self, query, retrieved_docs):
# Build context
context = '\n\n'.join([f"[{i+1}] {doc.page_content}"
for i, doc in enumerate(retrieved_docs)])
# Construct prompt
prompt = f"""
Answer the question based on the context below. Cite sources using [number].
If the answer is not in the context, say "I don't have enough information."
Context:
{context}
Question: {query}
Answer:
"""
# Generate
answer = self.llm(prompt)
return answer, retrieved_docs
def query(self, query):
retrieved_docs = self.retrieve(query, k=5)
answer, sources = self.generate(query, retrieved_docs)
return {
'answer': answer,
'sources': sources
}
# Usage
rag = AdvancedRAG(documents)
result = rag.query("What is the return policy?")
print(result['answer'])
print(f"Sources: {[doc.metadata for doc in result['sources']]}")
Optimization Strategies
1. Caching
import functools
@functools.lru_cache(maxsize=1000)
def cached_retrieval(query):
"""Cache retrieval results for common queries."""
return vectorstore.similarity_search(query, k=5)
# Saves embedding + retrieval cost for repeated queries
2. Async Retrieval
import asyncio
async def async_retrieve(queries, vectorstore):
"""Retrieve for multiple queries in parallel."""
tasks = [vectorstore.asimilarity_search(q, k=5) for q in queries]
results = await asyncio.gather(*tasks)
return results
3. Metadata Filtering
# Filter by metadata before similarity search
results = vectorstore.similarity_search(
query,
k=5,
filter={"source": "product_docs"} # Only search product docs
)
# Faster (smaller search space) + more relevant (right domain)
4. Index Optimization
# FAISS index optimization
import faiss
# 1. Train index on sample (faster search)
quantizer = faiss.IndexFlatL2(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters)
index.train(sample_embeddings)
# 2. Set search parameters
index.nprobe = 10 # Trade-off: accuracy vs speed
# Result: 5-10× faster search with minimal quality loss
Common Pitfalls
Pitfall 1: No chunking
Problem: Full docs → overflow, poor precision.
Fix: Chunk to 500-1000 tokens.
Pitfall 2: Dense-only retrieval
Problem: Misses exact keyword matches.
Fix: Hybrid search (dense + sparse).
Pitfall 3: No re-ranking
Problem: Coarse ranking, wrong results prioritized.
Fix: Over-retrieve (k=20), re-rank to top-5.
Pitfall 4: Too much context
Problem: > 10k tokens → cost, latency, 'lost in middle'.
Fix: Top 5 chunks (5k tokens), optimize retrieval precision.
Pitfall 5: No evaluation
Problem: Can't measure or optimize.
Fix: Build test set, measure MRR, Precision@k.
Summary
Core principles:
- Chunk documents: 500-1000 tokens, semantic boundaries, overlap for continuity
- Hybrid retrieval: Dense (semantic) + Sparse (keyword) = best results
- Re-rank: Over-retrieve (k=20-50), refine to top-5 with cross-encoder
- Evaluate systematically: MRR, Precision@k, Recall@k, NDCG for retrieval; F1, Exact Match for generation
- Keep context focused: Top 5 chunks (~5k tokens), optimize retrieval not context size
Pipeline:
Documents → Chunk → Embed → Vector DB
Query → Hybrid Retrieval (k=20) → Re-rank (k=5) → Context → LLM → Answer
Metrics targets:
- MRR > 0.7 (first relevant result at rank ~1.4 on average)
- Precision@5 > 0.7 (70% of top-5 relevant)
- Generation F1 > 0.8 (80% token overlap)
Key insight: RAG quality depends on retrieval precision. Optimize retrieval (chunking, hybrid search, re-ranking, evaluation) before adding context or changing LLMs.