---
name: rag-implementation
description: Build Retrieval-Augmented Generation (RAG) systems for LLM applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
---

# RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

## When to Use This Skill

- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation

## Core Components

### 1. Vector Databases

**Purpose**: Store and retrieve document embeddings efficiently

**Options:**
- **Pinecone**: Managed, scalable, fast queries
- **Weaviate**: Open-source, hybrid search
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use
- **Qdrant**: Fast, filtered search
- **FAISS**: Meta's library, local deployment

### 2. Embeddings

**Purpose**: Convert text to numerical vectors for similarity search

**Models:**
- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dims
- **all-MiniLM-L6-v2** (Sentence Transformers): Fast, lightweight
- **e5-large-v2**: High quality; multilingual variants available
- **Instructor**: Task-specific instructions
- **bge-large-en-v1.5**: Strong MTEB benchmark performance

### 3. Retrieval Strategies

**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents (sketched after the Quick Start below)

### 4. Reranking

**Purpose**: Improve retrieval quality by reordering results

**Methods:**
- **Cross-Encoders**: BERT-based reranking
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance

## Quick Start

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What are the main features?"})
print(result['result'])
print(result['source_documents'])
```
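The retrieval strategies list above mentions HyDE (Hypothetical Document Embeddings), which none of the patterns below demonstrate. A minimal sketch, reusing the Quick Start `vectorstore` with a hand-rolled prompt; the helper name and prompt wording are illustrative, not a library API:

```python
from langchain.llms import OpenAI

def hyde_search(question, k=4):
    """HyDE: embed a hypothetical answer instead of the raw question."""
    llm = OpenAI()
    # 1. Ask the LLM to draft a plausible (possibly imperfect) answer
    hypothetical_doc = llm(
        f"Write a short passage that answers the question:\n{question}"
    )
    # 2. Retrieve real chunks that are semantically close to that passage
    return vectorstore.similarity_search(hypothetical_doc, k=k)

docs = hyde_search("What are the main features?")
```

LangChain also ships a `HypotheticalDocumentEmbedder` chain that wraps the same idea.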
## Advanced RAG Patterns

### Pattern 1: Hybrid Search

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, embedding_retriever],
    weights=[0.3, 0.7]
)
```

### Pattern 2: Multi-Query Retrieval

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)

# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
```

### Pattern 3: Contextual Compression

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI()
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

# Returns only relevant parts of documents
compressed_docs = compression_retriever.get_relevant_documents("query")
```

### Pattern 4: Parent Document Retriever

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Store for parent documents
store = InMemoryStore()

# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
```

## Document Chunking Strategies

### Recursive Character Text Splitter

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
```

### Token-Based Splitting

```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
```

### Semantic Chunking

```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
```

### Markdown Header Splitter

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
```

## Vector Store Configurations

### Pinecone

```python
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")

vectorstore = Pinecone(index, embeddings.embed_query, "text")
```

### Weaviate

```python
import weaviate
from langchain.vectorstores import Weaviate

client = weaviate.Client("http://localhost:8080")
vectorstore = Weaviate(client, "Document", "content", embeddings)
```

### Chroma (Local)

```python
from langchain.vectorstores import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```
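### FAISS (Local)

FAISS appears in the vector database options above but has no configuration example; a minimal local sketch, assuming the `chunks` and `embeddings` objects from the Quick Start (the index path is illustrative):

```python
from langchain.vectorstores import FAISS

# Build an in-memory index from the chunks created in the Quick Start
vectorstore = FAISS.from_documents(chunks, embeddings)

# Persist to disk and reload later without re-embedding
vectorstore.save_local("./faiss_index")
vectorstore = FAISS.load_local("./faiss_index", embeddings)
```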
## Retrieval Optimization

### 1. Metadata Filtering

```python
# Add metadata during indexing
chunks_with_metadata = []
for i, chunk in enumerate(chunks):
    chunk.metadata = {
        "source": chunk.metadata.get("source"),
        "page": i,
        "category": determine_category(chunk.page_content)  # user-defined helper
    }
    chunks_with_metadata.append(chunk)

# Filter during retrieval
results = vectorstore.similarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```

### 2. Maximal Marginal Relevance

```python
# Balance relevance with diversity
results = vectorstore.max_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,      # Fetch 20, return top 5 diverse
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)
```

### 3. Reranking with Cross-Encoder

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Get initial results
query = "What are the main features?"
candidates = vectorstore.similarity_search(query, k=20)

# Rerank
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by score and take top k
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```

## Prompt Engineering for RAG

### Contextual Prompt

```python
prompt_template = """Use the following context to answer the question.
If you cannot answer based on the context, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
```

### With Citations

```python
prompt_template = """Answer the question based on the context below.
Include citations using [1], [2], etc.

Context: {context}

Question: {question}

Answer (with citations):"""
```

### With Confidence

```python
prompt_template = """Answer the question using the context.
Provide a confidence score (0-100%) for your answer.

Context: {context}

Question: {question}

Answer:
Confidence:"""
```
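To actually use one of these templates, wrap it in a `PromptTemplate` and pass it to the chain; a minimal sketch, assuming the classic LangChain `RetrievalQA` setup and the `vectorstore` from the Quick Start:

```python
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

prompt = PromptTemplate(
    template=prompt_template,                # any of the templates above
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt},    # inject the custom prompt
    return_source_documents=True
)
```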
## Evaluation Metrics

```python
def evaluate_rag_system(qa_chain, test_cases):
    # calculate_accuracy, evaluate_retrieved_docs and check_groundedness
    # are user-supplied scoring helpers
    metrics = {
        'accuracy': [],
        'retrieval_quality': [],
        'groundedness': []
    }

    for test in test_cases:
        result = qa_chain({"query": test['question']})

        # Check if answer matches expected
        accuracy = calculate_accuracy(result['result'], test['expected'])
        metrics['accuracy'].append(accuracy)

        # Check if relevant docs were retrieved
        retrieval_quality = evaluate_retrieved_docs(
            result['source_documents'],
            test['relevant_docs']
        )
        metrics['retrieval_quality'].append(retrieval_quality)

        # Check if answer is grounded in context
        groundedness = check_groundedness(
            result['result'],
            result['source_documents']
        )
        metrics['groundedness'].append(groundedness)

    return {k: sum(v)/len(v) for k, v in metrics.items()}
```

## Resources

- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks

## Best Practices

1. **Chunk Size**: Balance between context and specificity (500-1000 tokens)
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best results
5. **Reranking**: Improve top results with a cross-encoder
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics in production

## Common Issues

- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve the grounding prompt and add a verification step (see the sketch below)
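One way to add that verification step is a second LLM call that judges whether the answer is supported by the retrieved sources; a minimal sketch, assuming the `qa_chain` from the Quick Start (the judge prompt wording is illustrative, not a standard API):

```python
from langchain.llms import OpenAI

def verify_grounding(answer, source_documents):
    """Ask an LLM whether the answer is supported by the retrieved context."""
    context = "\n\n".join(doc.page_content for doc in source_documents)
    judge = OpenAI()
    verdict = judge(
        "Context:\n" + context +
        "\n\nAnswer:\n" + answer +
        "\n\nIs every claim in the answer supported by the context? Reply YES or NO."
    )
    return "YES" in verdict.upper()

result = qa_chain({"query": "What are the main features?"})
if not verify_grounding(result['result'], result['source_documents']):
    print("Warning: answer may not be grounded in the retrieved sources.")
```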