# RAG Architecture Patterns

## Context

You're building a RAG (Retrieval-Augmented Generation) system to give LLMs access to external knowledge.

Common mistakes:
- **No chunking strategy** (full docs → overflow, poor precision)
- **Poor retrieval** (cosine similarity alone → misses exact matches)
- **No re-ranking** (irrelevant results prioritized)
- **No evaluation** (can't measure or optimize quality)
- **Context overflow** (too many chunks → cost, latency, 'lost in the middle')

**This skill provides effective RAG architecture: chunking, hybrid search, re-ranking, evaluation, and complete pipeline design.**

## What is RAG?

**RAG = Retrieval-Augmented Generation**

**Problem:** LLMs have knowledge cutoffs and can't access private/recent data.

**Solution:** Retrieve relevant information, inject it into the prompt, generate the answer.

```python
# Without RAG:
answer = llm("What is our return policy?")
# LLM: "I don't have access to your specific return policy."

# With RAG:
relevant_docs = retrieval_system.search("return policy")
context = '\n'.join(relevant_docs)
prompt = f"Context: {context}\n\nQuestion: What is our return policy?\nAnswer:"
answer = llm(prompt)
# LLM: "Our return policy allows returns within 30 days..." (from retrieved docs)
```

**When to use RAG:**
- ✅ Private data (company docs, internal knowledge base)
- ✅ Recent data (news, updates since LLM training cutoff)
- ✅ Large knowledge base (can't fit in prompt/fine-tuning)
- ✅ Need citations (retrieval provides source documents)
- ✅ Changing information (update docs, not the model)

**When NOT to use RAG:**
- ❌ General knowledge (already in the LLM)
- ❌ Small knowledge base (< 100 docs → few-shot examples in prompt)
- ❌ Reasoning tasks (RAG provides facts, not reasoning)

## RAG Architecture Overview

```
User Query
    ↓
1. Query Processing (optional: expansion, rewriting)
    ↓
2. Retrieval (dense + sparse hybrid search)
    ↓
3. Re-ranking (refine top results)
    ↓
4. Context Selection (top-k chunks)
    ↓
5. Prompt Construction (inject context)
    ↓
6. LLM Generation
    ↓
Answer (with citations)
```

## Component 1: Document Processing & Chunking

### Why Chunking?

**Problem:** Documents are long (10k-100k tokens); embeddings and LLMs have input limits.

**Solution:** Split documents into chunks (500-1000 tokens each).

### Chunking Strategies

**1. Fixed-size chunking (simple, works for most cases):**

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Characters (roughly 250 tokens)
    chunk_overlap=200,  # Overlap for continuity
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_text(document)
```

**Parameters:**
- `chunk_size`: 500-1000 tokens typical (roughly 2000-4000 characters)
- `chunk_overlap`: 10-20% of chunk_size (continuity between chunks)
- `separators`: Try semantic boundaries first (paragraphs > sentences > words)
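The character-based `chunk_size` above is only an approximation of token count. If you prefer to size chunks directly in tokens, LangChain's splitter can be built from a tiktoken encoder; a minimal sketch, where the encoding name and sizes are illustrative rather than required values:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are counted in tokens here, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",              # tokenizer used by recent OpenAI models
    chunk_size=800,                           # ~800 tokens per chunk
    chunk_overlap=100,                        # ~12% overlap for continuity
    separators=["\n\n", "\n", ". ", " ", ""]  # same semantic-boundary order as above
)

chunks = token_splitter.split_text(document)
```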
**2. Semantic chunking (preserves meaning):**

```python
def semantic_chunking(text, max_chunk_size=1000):
    # Split on semantic boundaries (Markdown H2 headers)
    sections = text.split('\n\n## ')
    # Re-attach the header marker that split() strips from each section
    sections = [sections[0]] + ['## ' + s for s in sections[1:]]

    chunks = []
    current_chunk = []
    current_size = 0

    for section in sections:
        section_size = len(section)

        if current_size + section_size <= max_chunk_size:
            current_chunk.append(section)
            current_size += section_size
        else:
            # Flush current chunk
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
            current_chunk = [section]
            current_size = section_size

    # Flush remaining
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks
```

**Benefits:** Preserves topic boundaries, more coherent chunks.

**3. Recursive chunking (LangChain default):**

```python
# Try splitting on larger boundaries first, fall back to smaller ones
separators = [
    "\n\n",  # Paragraphs (try first)
    "\n",    # Lines
    ". ",    # Sentences
    " ",     # Words
    ""       # Characters (last resort)
]

# For each separator:
# - If the chunk fits: done
# - If the chunk is too large: try the next separator
# Result: largest semantic unit that fits in chunk_size
```

**Best for:** Mixed documents (code + prose, structured + unstructured).

### Chunking Best Practices

**Metadata preservation:**

```python
chunks = []
for page_num, page_text in enumerate(pdf_pages):
    page_chunks = splitter.split_text(page_text)
    for chunk_idx, chunk in enumerate(page_chunks):
        chunks.append({
            'text': chunk,
            'metadata': {
                'source': 'document.pdf',
                'page': page_num,
                'chunk_id': f"{page_num}_{chunk_idx}"
            }
        })

# Later: cite sources in the answer
# "According to page 42 of document.pdf..."
```

**Overlap for continuity:**

```python
# Without overlap: sentence split across chunks (loss of context)
chunk1 = "...the process is simple. First,"
chunk2 = "you need to configure the settings..."

# With overlap (200 chars):
chunk1 = "...the process is simple. First, you need to configure"
chunk2 = "First, you need to configure the settings..."
# Overlap preserves context!
```

**Chunk size guidelines:**

```
Embedding model limit | Chunk size
----------------------|------------
512 tokens            | 400 tokens (leave room for overlap)
1024 tokens           | 800 tokens
2048 tokens           | 1500 tokens

Typical: 500-1000 tokens per chunk (balance precision vs context)
```

## Component 2: Embeddings

### What are Embeddings?

**Vector representations of text capturing semantic meaning.**

```python
text = "What is the return policy?"
embedding = embedding_model.encode(text)
# embedding: [0.234, -0.123, 0.891, ...] (384-1536 dimensions)

# Similar texts have similar embeddings (high cosine similarity)
query_emb = embed("return policy")
doc1_emb = embed("Returns accepted within 30 days")  # High similarity
doc2_emb = embed("Product specifications")           # Low similarity
```
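To make "high cosine similarity" concrete, here is a minimal sketch that computes it directly with NumPy and a local sentence-transformers model (the same model used in the examples below); the exact scores will vary by model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|), ranges from -1 to 1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = model.encode("What is the return policy?")
for doc in ["Returns accepted within 30 days", "Product specifications"]:
    print(f"{cosine_similarity(query_emb, model.encode(doc)):.3f}  {doc}")

# The return-policy sentence typically scores clearly higher than the unrelated one
```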
### Embedding Models

**Popular models:**

```python
# 1. OpenAI embeddings (API-based)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Dimensions: 1536, Cost: $0.02 per 1M tokens

# 2. Sentence Transformers (open-source, local)
from sentence_transformers import SentenceTransformer
embeddings = SentenceTransformer('all-MiniLM-L6-v2')
# Dimensions: 384, Cost: $0 (local), Fast

# 3. Domain-specific
embeddings = SentenceTransformer('allenai-specter')             # Scientific papers
embeddings = SentenceTransformer('msmarco-distilbert-base-v4')  # Search/QA
```

**Selection criteria:**

| Model | Dimensions | Speed | Quality | Cost | Use Case |
|-------|------------|-------|---------|------|----------|
| OpenAI text-3-small | 1536 | Medium | Very Good | $0.02/1M | General (API) |
| OpenAI text-3-large | 3072 | Slow | Excellent | $0.13/1M | High quality |
| all-MiniLM-L6-v2 | 384 | Fast | Good | $0 | General (local) |
| all-mpnet-base-v2 | 768 | Medium | Very Good | $0 | General (local) |
| msmarco-* | 768 | Medium | Excellent | $0 | Search/QA |

**Evaluation:**

```python
# Test on your domain!
from sentence_transformers import SentenceTransformer, util

query = "What is the return policy?"
docs = ["Returns within 30 days", "Shipping takes 5-7 days", "Product warranty"]

for model_name in ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'msmarco-distilbert-base-v4']:
    model = SentenceTransformer(model_name)
    query_emb = model.encode(query)
    doc_embs = model.encode(docs)
    similarities = util.cos_sim(query_emb, doc_embs)[0]
    print(f"{model_name}: {similarities}")

# Pick the model with the highest similarity for the relevant doc
```

## Component 3: Vector Databases

**Store and retrieve embeddings efficiently.**

### Popular Vector DBs:

```python
# 1. Chroma (simple, local)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_texts(chunks, embeddings)

# 2. Pinecone (managed, scalable)
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="...", environment="...")
vectorstore = Pinecone.from_texts(chunks, embeddings, index_name="my-index")

# 3. Weaviate (open-source, scalable)
from langchain.vectorstores import Weaviate
vectorstore = Weaviate.from_texts(chunks, embeddings)

# 4. FAISS (Facebook, local, fast)
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_texts(chunks, embeddings)
```

### Vector DB Selection:

| Database | Type | Scale | Cost | Hosting | Best For |
|----------|------|-------|------|---------|----------|
| Chroma | Local | Small (< 1M) | $0 | Self | Development |
| FAISS | Local | Medium (< 10M) | $0 | Self | Production (self-hosted) |
| Pinecone | Cloud | Large (billions) | $70+/mo | Managed | Production (managed) |
| Weaviate | Both | Large | $0-$200/mo | Both | Production (flexible) |

### Similarity Search:

```python
# Basic similarity search
query = "What is the return policy?"
results = vectorstore.similarity_search(query, k=5)
# Returns: top 5 most similar chunks

# With scores
results = vectorstore.similarity_search_with_score(query, k=5)
# Returns: [(chunk, score), ...]
# Score convention depends on the store: some return similarity (higher = more similar),
# others return distance (lower = closer) — check your store before thresholding

# With threshold (assuming a 0-1 similarity-style score)
results = vectorstore.similarity_search_with_score(query, k=10)
filtered = [(chunk, score) for chunk, score in results if score > 0.7]
# Only keep highly similar results
```

## Component 4: Retrieval Strategies

### 1. Dense Retrieval (Semantic)

**Uses embeddings (what we've discussed).**

```python
query_embedding = embedding_model.encode(query)

# Find docs with embeddings most similar to query_embedding
results = vectorstore.similarity_search(query, k=10)
```

**Pros:**
- ✅ Semantic similarity (understands meaning, not just keywords)
- ✅ Handles synonyms, paraphrasing

**Cons:**
- ❌ Misses exact keyword matches
- ❌ Can confuse similar-sounding but different concepts
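Under the hood, `similarity_search` performs a nearest-neighbour search over embeddings (usually with an index rather than brute force). A brute-force sketch of dense top-k retrieval, assuming the same local embedding model as above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def dense_top_k(query, docs, k=3):
    # Normalized embeddings make the dot product equal to cosine similarity
    doc_embs = model.encode(docs, normalize_embeddings=True)
    query_emb = model.encode(query, normalize_embeddings=True)
    scores = doc_embs @ query_emb        # one cosine similarity per doc
    top_idx = np.argsort(-scores)[:k]    # indices of the highest scores
    return [(docs[i], float(scores[i])) for i in top_idx]

results = dense_top_k(
    "What is the return policy?",
    ["Returns accepted within 30 days", "Shipping takes 5-7 days", "Product warranty terms"],
)
```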
### 2. Sparse Retrieval (Keyword)

**Classic information retrieval (BM25, TF-IDF).**

```python
from langchain.retrievers import BM25Retriever

# BM25: keyword-based ranking
bm25_retriever = BM25Retriever.from_texts(chunks)
results = bm25_retriever.get_relevant_documents(query)
```

**How BM25 works:**

```
Score(query, doc) = sum over query terms of:

    IDF(term) * TF(term) * (k1 + 1)
    ----------------------------------------------------------
    TF(term) + k1 * (1 - b + b * doc_length / avg_doc_length)

Where:
- TF    = term frequency (how often the term appears in the doc)
- IDF   = inverse document frequency (rarity of the term across docs)
- k1, b = tuning parameters (typically k1 ≈ 1.2-2.0, b ≈ 0.75)
```

**Pros:**
- ✅ Exact keyword matches (important for IDs, SKUs, technical terms)
- ✅ Fast (no neural network)
- ✅ Explainable (can see which keywords matched)

**Cons:**
- ❌ No semantic understanding (misses synonyms, paraphrasing)
- ❌ Sensitive to exact wording

### 3. Hybrid Retrieval (Dense + Sparse)

**Combine both for best results!**

```python
from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={'k': 20})

# Sparse retriever (keyword)
sparse_retriever = BM25Retriever.from_texts(chunks)

# Ensemble (hybrid)
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.5, 0.5]  # Equal weight (tune based on evaluation)
)

results = hybrid_retriever.get_relevant_documents(query)
```

**When hybrid helps:**

```python
# Query: "What is the SKU for product ABC-123?"

# Dense only:
# - Might retrieve: "product catalog", "product specifications"
# - Misses: exact SKU "ABC-123" (keyword)

# Sparse only:
# - Retrieves: "ABC-123" (keyword match)
# - Misses: semantically similar products

# Hybrid:
# - Retrieves: exact SKU + related products
# - Best of both worlds!
```

**Weight tuning:**

```python
# Evaluate different weights
for dense_weight in [0.3, 0.5, 0.7]:
    sparse_weight = 1 - dense_weight
    retriever = EnsembleRetriever(
        retrievers=[dense_retriever, sparse_retriever],
        weights=[dense_weight, sparse_weight]
    )
    mrr = evaluate_retrieval(retriever, test_set)
    print(f"Dense:{dense_weight}, Sparse:{sparse_weight} → MRR:{mrr:.3f}")

# Example output:
# Dense:0.3, Sparse:0.7 → MRR:0.65
# Dense:0.5, Sparse:0.5 → MRR:0.72  # Best!
# Dense:0.7, Sparse:0.3 → MRR:0.68
```
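`EnsembleRetriever` merges the dense and sparse rankings for you (it is based on reciprocal rank fusion). If you want to see or customize that merge step, here is a minimal weighted RRF sketch over two ranked lists of document IDs; the constant 60 is the commonly used smoothing value:

```python
def weighted_rrf(dense_ranking, sparse_ranking, w_dense=0.5, w_sparse=0.5, c=60):
    """Merge two ranked lists of doc IDs with weighted reciprocal rank fusion."""
    scores = {}
    for weight, ranking in ((w_dense, dense_ranking), (w_sparse, sparse_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank); earlier ranks score higher
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "abc-123" ranks first in the sparse list and third in the dense list,
# so it lands near the top of the fused ranking
merged = weighted_rrf(["d7", "d2", "abc-123"], ["abc-123", "d7", "d9"])
```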
## Component 5: Re-Ranking

**Refine the coarse retrieval ranking with a cross-encoder.**

### Why Re-Ranking?

```
Retrieval (bi-encoder):
- Encodes query and docs separately
- Fast: O(1) for pre-computed doc embeddings
- Coarse: single similarity score

Re-ranking (cross-encoder):
- Jointly encodes query + doc
- Slow: O(n) for n docs (must process each pair)
- Precise: sees query-doc interactions
```

**Pipeline:**

```
1. Retrieval: get top 20-50 (fast, broad)
2. Re-ranking: refine to top 5-10 (slow, precise)
```

### Implementation:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load cross-encoder for re-ranking
model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, retrieved_docs, top_k=5):
    # Score each doc with the cross-encoder
    scores = []
    for doc in retrieved_docs:
        inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            score = model(**inputs).logits[0][0].item()
        scores.append((doc, score))

    # Sort by score (descending)
    reranked = sorted(scores, key=lambda x: x[1], reverse=True)

    # Return top-k
    return [doc for doc, score in reranked[:top_k]]

# Usage
initial_docs = vectorstore.similarity_search(query, k=20)    # Over-retrieve
initial_texts = [doc.page_content for doc in initial_docs]   # rerank() scores plain strings
final_results = rerank(query, initial_texts, top_k=5)        # Re-rank
```

### Re-Ranking Models:

| Model | Size | Speed | Quality | Use Case |
|-------|------|-------|---------|----------|
| ms-marco-MiniLM-L-6-v2 | 80MB | Fast | Good | General |
| ms-marco-MiniLM-L-12-v2 | 120MB | Medium | Very Good | Better quality |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 120MB | Medium | Very Good | Multilingual |

### Impact of Re-Ranking:

```python
# Without re-ranking:
results = vectorstore.similarity_search(query, k=5)
mrr = 0.55  # First relevant at rank ~2

# With re-ranking:
initial = vectorstore.similarity_search(query, k=20)
results = rerank(query, [doc.page_content for doc in initial], top_k=5)
mrr = 0.82  # First relevant at rank ~1.2

# Improvement: +0.27 MRR (roughly 50% better ranking)
```
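The same cross-encoder re-ranking can be written more compactly with the `CrossEncoder` wrapper from sentence-transformers, which batches all query-document pairs in one call; a sketch equivalent to the `rerank()` function above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

def rerank_ce(query, retrieved_docs, top_k=5):
    # Score all (query, doc) pairs in a single batched forward pass
    scores = reranker.predict([(query, doc) for doc in retrieved_docs])
    ranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```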
""" sub_queries = llm(prompt) # Retrieve for each sub-query all_results = [] for sub_q in sub_queries: results = vectorstore.similarity_search(sub_q, k=5) all_results.extend(results) return all_results return vectorstore.similarity_search(query, k=5) ``` ### HyDE (Hypothetical Document Embeddings) **Generate hypothetical answer, retrieve similar docs.** ```python def hyde_retrieval(query, llm, vectorstore): # Generate hypothetical answer prompt = f"Answer this question in detail: {query}" hypothetical_answer = llm(prompt) # Retrieve docs similar to hypothetical answer (not query) results = vectorstore.similarity_search(hypothetical_answer, k=5) return results # Why this works: # - Queries are short, sparse # - Answers are longer, richer # - Doc-to-doc similarity (answer vs docs) better than query-to-doc ``` ## Component 7: Context Management ### Context Budget ```python max_context_tokens = 4000 # Budget for retrieved context selected_chunks = [] total_tokens = 0 for chunk in reranked_results: chunk_tokens = count_tokens(chunk) if total_tokens + chunk_tokens <= max_context_tokens: selected_chunks.append(chunk) total_tokens += chunk_tokens else: break # Stop when budget exceeded # Result: Best chunks that fit in budget ``` ### Lost in the Middle Problem **LLMs prioritize start and end of context, miss middle.** ```python # Research finding: Place most important info at start or end def order_for_llm(chunks): # Best chunks at start and end if len(chunks) <= 2: return chunks # Put most relevant at positions 0 and -1 ordered = [chunks[0]] # Most relevant (start) ordered.extend(chunks[1:-1]) # Less relevant (middle) ordered.append(chunks[-1]) # Second most relevant (end) return ordered ``` ### Contextual Compression **Filter retrieved chunks to most relevant sentences.** ```python from langchain.retrievers import ContextualCompressionRetriever from langchain.retrievers.document_compressors import LLMChainExtractor # Compressor: Extract relevant sentences compressor = LLMChainExtractor.from_llm(llm) # Wrap retriever compression_retriever = ContextualCompressionRetriever( base_compressor=compressor, base_retriever=vectorstore.as_retriever() ) # Retrieved chunks are automatically filtered to relevant parts compressed_docs = compression_retriever.get_relevant_documents(query) ``` ## Component 8: Prompt Construction ### Basic RAG Prompt: ```python context = '\n\n'.join(retrieved_chunks) prompt = f""" Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that." Context: {context} Question: {query} Answer: """ answer = llm(prompt) ``` ### With Citations: ```python context_with_ids = [] for i, chunk in enumerate(retrieved_chunks): context_with_ids.append(f"[{i+1}] {chunk['text']}") context = '\n\n'.join(context_with_ids) prompt = f""" Answer the question based on the context below. Cite sources using [number] format. Context: {context} Question: {query} Answer (with citations): """ answer = llm(prompt) # Output: "The return policy allows returns within 30 days [1]. Shipping takes 5-7 business days [3]." ``` ### With Metadata: ```python context_with_metadata = [] for chunk in retrieved_chunks: source = chunk['metadata']['source'] page = chunk['metadata']['page'] context_with_metadata.append(f"From {source} (page {page}):\n{chunk['text']}") context = '\n\n'.join(context_with_metadata) prompt = f""" Answer the question and cite your sources. 
### With Metadata:

```python
context_with_metadata = []
for chunk in retrieved_chunks:
    source = chunk['metadata']['source']
    page = chunk['metadata']['page']
    context_with_metadata.append(f"From {source} (page {page}):\n{chunk['text']}")

context = '\n\n'.join(context_with_metadata)

prompt = f"""
Answer the question and cite your sources.

Context:
{context}

Question: {query}

Answer:
"""
```

## Evaluation Metrics

### Retrieval Metrics

**1. Mean Reciprocal Rank (MRR):**

```python
import numpy as np

def calculate_mrr(retrieval_results, relevant_docs):
    """
    MRR = average of (1 / rank of first relevant doc)

    Example:
      Query 1: first relevant at rank 2 → 1/2 = 0.5
      Query 2: first relevant at rank 1 → 1/1 = 1.0
      Query 3: no relevant docs         → 0
      MRR = (0.5 + 1.0 + 0) / 3 = 0.5
    """
    mrr_scores = []
    for results, relevant in zip(retrieval_results, relevant_docs):
        for i, result in enumerate(results):
            if result in relevant:
                mrr_scores.append(1 / (i + 1))
                break
        else:
            mrr_scores.append(0)  # No relevant doc found

    return np.mean(mrr_scores)

# Interpretation:
# MRR = 1.0: first result always relevant (perfect!)
# MRR = 0.5: first relevant at rank ~2 (good)
# MRR = 0.3: first relevant at rank ~3-4 (okay)
# MRR < 0.3: poor retrieval (needs improvement)
```

**2. Precision@k:**

```python
def calculate_precision_at_k(retrieval_results, relevant_docs, k=5):
    """
    Precision@k = (# relevant docs in top-k) / k

    Example:
      Top 5 results: [relevant, irrelevant, relevant, irrelevant, irrelevant]
      Precision@5 = 2/5 = 0.4
    """
    precision_scores = []
    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        precision_scores.append(relevant_in_topk / k)

    return np.mean(precision_scores)

# Target: Precision@5 > 0.7 (70% of top-5 are relevant)
```

**3. Recall@k:**

```python
def calculate_recall_at_k(retrieval_results, relevant_docs, k=5):
    """
    Recall@k = (# relevant docs in top-k) / (total relevant docs)

    Example:
      Total relevant: 5
      Found in top-5: 2
      Recall@5 = 2/5 = 0.4
    """
    recall_scores = []
    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        recall_scores.append(relevant_in_topk / len(relevant))

    return np.mean(recall_scores)

# Interpretation:
# Recall@5 = 1.0: all relevant docs in top-5 (perfect!)
# Recall@5 = 0.5: half of the relevant docs in top-5
```

**4. NDCG (Normalized Discounted Cumulative Gain):**

```python
def calculate_ndcg(true_relevance, retrieval_scores, k=5):
    """
    NDCG considers position and graded relevance (0, 1, 2, 3...)

    DCG  = sum of (relevance / log2(rank + 1))
    NDCG = DCG / ideal_DCG  (normalized to 0-1)
    """
    from sklearn.metrics import ndcg_score

    # true_relevance:   2D array, true relevance grade (0-3) per candidate doc
    # retrieval_scores: 2D array, the retriever's score per candidate doc
    # Higher relevance/score = more relevant
    return ndcg_score(true_relevance, retrieval_scores, k=k)

# NDCG = 1.0: perfect ranking
# NDCG > 0.7: good ranking
# NDCG < 0.5: poor ranking
```

### Generation Metrics

**1. Exact Match:**

```python
def calculate_exact_match(predictions, ground_truth):
    """Percentage of predictions that exactly match the ground truth."""
    matches = [pred == truth for pred, truth in zip(predictions, ground_truth)]
    return np.mean(matches)
```

**2. F1 Score (token-level):**

```python
def calculate_f1(prediction, ground_truth):
    """F1 score based on token overlap."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()

    common = set(pred_tokens) & set(truth_tokens)

    if len(common) == 0:
        return 0.0

    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
```
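Exact match and token-level F1 are usually computed on normalized text (lowercased, punctuation and articles removed), so trivial formatting differences don't count as errors. A common SQuAD-style normalization sketch:

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

# Use before comparing:
# calculate_f1(normalize_answer(prediction), normalize_answer(ground_truth))
```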
**3. LLM-as-Judge:**

```python
def evaluate_with_llm(answer, ground_truth, llm):
    """Use an LLM to judge answer quality."""
    prompt = f"""
Rate the quality of this answer on a scale of 1-5:
1 = Completely wrong
2 = Mostly wrong
3 = Partially correct
4 = Mostly correct
5 = Completely correct

Ground truth: {ground_truth}

Answer to evaluate: {answer}

Rating (1-5):
"""
    rating = llm(prompt)
    return int(rating.strip())
```

### End-to-End Evaluation

```python
def evaluate_rag_system(rag_system, test_set):
    """
    Complete evaluation: retrieval + generation
    """
    # Retrieval metrics
    retrieval_results = []
    relevant_docs = []

    # Generation metrics
    predictions = []
    ground_truth = []

    for test_case in test_set:
        query = test_case['query']

        # Retrieve
        retrieved = rag_system.retrieve(query)
        retrieval_results.append(retrieved)
        relevant_docs.append(test_case['relevant_docs'])

        # Generate
        answer = rag_system.generate(query, retrieved)
        predictions.append(answer)
        ground_truth.append(test_case['expected_answer'])

    # Calculate metrics
    metrics = {
        'retrieval_mrr': calculate_mrr(retrieval_results, relevant_docs),
        'retrieval_precision@5': calculate_precision_at_k(retrieval_results, relevant_docs, k=5),
        'generation_f1': np.mean([calculate_f1(p, t) for p, t in zip(predictions, ground_truth)]),
        'generation_exact_match': calculate_exact_match(predictions, ground_truth),
    }

    return metrics
```
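`evaluate_rag_system` expects a small hand-labelled test set. A sketch of the expected shape — the contents are illustrative, and `relevant_docs` must use the same representation (e.g. chunk IDs or chunk texts) that your `retrieve()` returns:

```python
test_set = [
    {
        'query': "What is the return policy?",
        'relevant_docs': ["chunk_12_3", "chunk_12_4"],  # chunks that actually answer it
        'expected_answer': "Returns are accepted within 30 days of purchase.",
    },
    {
        'query': "How long does shipping take?",
        'relevant_docs': ["chunk_30_1"],
        'expected_answer': "Standard shipping takes 5-7 business days.",
    },
]

metrics = evaluate_rag_system(rag_system, test_set)
print(metrics)  # {'retrieval_mrr': ..., 'retrieval_precision@5': ..., ...}
```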
## Complete RAG Pipeline

### Basic Implementation:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load documents
documents = load_documents('docs/')

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What is the return policy?"})
answer = result['result']
sources = result['source_documents']
```

### Advanced Implementation (Hybrid + Re-ranking):

```python
import torch
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import EnsembleRetriever, BM25Retriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class AdvancedRAG:
    def __init__(self, documents):
        # Chunk
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        self.chunks = splitter.split_documents(documents)

        # Embeddings
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma.from_documents(self.chunks, self.embeddings)

        # Hybrid retrieval
        dense_retriever = self.vectorstore.as_retriever(search_kwargs={'k': 20})
        sparse_retriever = BM25Retriever.from_documents(self.chunks)
        self.retriever = EnsembleRetriever(
            retrievers=[dense_retriever, sparse_retriever],
            weights=[0.5, 0.5]
        )

        # Re-ranker
        self.rerank_model = AutoModelForSequenceClassification.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )
        self.rerank_tokenizer = AutoTokenizer.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )

        # LLM
        self.llm = OpenAI(temperature=0)

    def retrieve(self, query, k=5):
        # Hybrid retrieval (over-retrieve)
        initial_results = self.retriever.get_relevant_documents(query)[:20]

        # Re-rank
        scores = []
        for doc in initial_results:
            inputs = self.rerank_tokenizer(
                query, doc.page_content,
                return_tensors='pt', truncation=True, max_length=512
            )
            with torch.no_grad():
                score = self.rerank_model(**inputs).logits[0][0].item()
            scores.append((doc, score))

        # Sort by score
        reranked = sorted(scores, key=lambda x: x[1], reverse=True)

        # Return top-k
        return [doc for doc, score in reranked[:k]]

    def generate(self, query, retrieved_docs):
        # Build context
        context = '\n\n'.join([f"[{i+1}] {doc.page_content}"
                               for i, doc in enumerate(retrieved_docs)])

        # Construct prompt
        prompt = f"""
Answer the question based on the context below. Cite sources using [number].
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:
"""

        # Generate
        answer = self.llm(prompt)
        return answer, retrieved_docs

    def query(self, query):
        retrieved_docs = self.retrieve(query, k=5)
        answer, sources = self.generate(query, retrieved_docs)
        return {
            'answer': answer,
            'sources': sources
        }

# Usage
rag = AdvancedRAG(documents)
result = rag.query("What is the return policy?")
print(result['answer'])
print(f"Sources: {[doc.metadata for doc in result['sources']]}")
```

## Optimization Strategies

### 1. Caching

```python
import functools

@functools.lru_cache(maxsize=1000)
def cached_retrieval(query):
    """Cache retrieval results for common queries."""
    return vectorstore.similarity_search(query, k=5)

# Saves embedding + retrieval cost for repeated queries
```

### 2. Async Retrieval

```python
import asyncio

async def async_retrieve(queries, vectorstore):
    """Retrieve for multiple queries in parallel."""
    tasks = [vectorstore.asimilarity_search(q, k=5) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
```
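A usage sketch for the async helper, assuming your vector store implements the async `asimilarity_search` method:

```python
import asyncio

queries = ["What is the return policy?", "How long does shipping take?"]
results = asyncio.run(async_retrieve(queries, vectorstore))
```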
### 3. Metadata Filtering

```python
# Filter by metadata before similarity search
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"source": "product_docs"}  # Only search product docs
)

# Faster (smaller search space) + more relevant (right domain)
```

### 4. Index Optimization

```python
# FAISS index optimization
import faiss

# 1. Train an IVF index on a sample (faster search)
quantizer = faiss.IndexFlatL2(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters)
index.train(sample_embeddings)
index.add(doc_embeddings)  # Add the document embeddings after training

# 2. Set search parameters
index.nprobe = 10  # Trade-off: accuracy vs speed

# Result: 5-10× faster search with minimal quality loss
```

## Common Pitfalls

### Pitfall 1: No chunking
**Problem:** Full docs → overflow, poor precision
**Fix:** Chunk to 500-1000 tokens

### Pitfall 2: Dense-only retrieval
**Problem:** Misses exact keyword matches
**Fix:** Hybrid search (dense + sparse)

### Pitfall 3: No re-ranking
**Problem:** Coarse ranking, wrong results prioritized
**Fix:** Over-retrieve (k=20), re-rank to top-5

### Pitfall 4: Too much context
**Problem:** > 10k tokens → cost, latency, 'lost in the middle'
**Fix:** Top 5 chunks (~5k tokens), optimize retrieval precision

### Pitfall 5: No evaluation
**Problem:** Can't measure or optimize
**Fix:** Build a test set, measure MRR and Precision@k

## Summary

**Core principles:**

1. **Chunk documents**: 500-1000 tokens, semantic boundaries, overlap for continuity
2. **Hybrid retrieval**: dense (semantic) + sparse (keyword) = best results
3. **Re-rank**: over-retrieve (k=20-50), refine to top-5 with a cross-encoder
4. **Evaluate systematically**: MRR, Precision@k, Recall@k, NDCG for retrieval; F1, Exact Match for generation
5. **Keep context focused**: top 5 chunks (~5k tokens), optimize retrieval rather than context size

**Pipeline:**

```
Documents → Chunk → Embed → Vector DB
Query → Hybrid Retrieval (k=20) → Re-rank (k=5) → Context → LLM → Answer
```

**Metric targets:**
- MRR > 0.7 (first relevant result at rank ~1.4 on average)
- Precision@5 > 0.7 (70% of top-5 relevant)
- Generation F1 > 0.8 (80% token overlap)

**Key insight:** RAG quality depends on retrieval precision. Optimize retrieval (chunking, hybrid search, re-ranking, evaluation) before adding context or changing LLMs.