RAG Architecture Patterns

Context

You're building a RAG (Retrieval-Augmented Generation) system to give LLMs access to external knowledge. Common mistakes:

  • No chunking strategy (full docs → overflow, poor precision)
  • Poor retrieval (cosine similarity alone → misses exact matches)
  • No re-ranking (irrelevant results prioritized)
  • No evaluation (can't measure or optimize quality)
  • Context overflow (too many chunks → cost, latency, 'lost in middle')

This skill provides effective RAG architecture: chunking, hybrid search, re-ranking, evaluation, and complete pipeline design.

What is RAG?

RAG = Retrieval-Augmented Generation

Problem: LLMs have knowledge cutoffs and can't access private/recent data.

Solution: Retrieve relevant information, inject into prompt, generate answer.

# Without RAG:
answer = llm("What is our return policy?")
# LLM: "I don't have access to your specific return policy."

# With RAG:
relevant_docs = retrieval_system.search("return policy")
context = '\n'.join(relevant_docs)
prompt = f"Context: {context}\n\nQuestion: What is our return policy?\nAnswer:"
answer = llm(prompt)
# LLM: "Our return policy allows returns within 30 days..." (from retrieved docs)

When to use RAG:

  • Private data (company docs, internal knowledge base)
  • Recent data (news, updates since LLM training cutoff)
  • Large knowledge base (can't fit in prompt/fine-tuning)
  • Need citations (retrieval provides source documents)
  • Changing information (update docs, not model)

When NOT to use RAG:

  • General knowledge (already in LLM)
  • Small knowledge base (< 100 docs → few-shot examples in prompt)
  • Reasoning tasks (RAG provides facts, not reasoning)

RAG Architecture Overview

User Query
    ↓
1. Query Processing (optional: expansion, rewriting)
    ↓
2. Retrieval (dense + sparse hybrid search)
    ↓
3. Re-ranking (refine top results)
    ↓
4. Context Selection (top-k chunks)
    ↓
5. Prompt Construction (inject context)
    ↓
6. LLM Generation
    ↓
Answer (with citations)

Component 1: Document Processing & Chunking

Why Chunking?

Problem: Documents are long (10k-100k tokens), embeddings and LLMs have limits.

Solution: Split documents into chunks (500-1000 tokens each).

Chunking Strategies

1. Fixed-size chunking (simple, works for most cases):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Characters (roughly 250 tokens)
    chunk_overlap=200,  # Overlap for continuity
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_text(document)

Parameters:

  • chunk_size: 500-1000 tokens typical (roughly 2000-4000 characters)
  • chunk_overlap: 10-20% of chunk_size (continuity between chunks)
  • separators: Try semantic boundaries first (paragraphs > sentences > words)

2. Semantic chunking (preserves meaning):

def semantic_chunking(text, max_chunk_size=1000):
    # Split on semantic boundaries (Markdown H2 headers),
    # re-attaching the '## ' prefix that str.split() removes
    parts = text.split('\n\n## ')
    sections = [parts[0]] + ['## ' + p for p in parts[1:]]

    chunks = []
    current_chunk = []
    current_size = 0

    for section in sections:
        section_size = len(section)

        if current_size + section_size <= max_chunk_size:
            current_chunk.append(section)
            current_size += section_size
        else:
            # Flush current chunk
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
            current_chunk = [section]
            current_size = section_size

    # Flush remaining
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

Benefits: Preserves topic boundaries, more coherent chunks.

3. Recursive chunking (LangChain default):

# Try splitting on larger boundaries first, fallback to smaller
separators = [
    "\n\n",  # Paragraphs (try first)
    "\n",    # Lines
    ". ",    # Sentences
    " ",     # Words
    ""       # Characters (last resort)
]

# For each separator:
# - If chunk fits: Done
# - If chunk too large: Try next separator
# Result: Largest semantic unit that fits in chunk_size

Best for: Mixed documents (code + prose, structured + unstructured).

Chunking Best Practices

Metadata preservation:

chunks = []
for page_num, page_text in enumerate(pdf_pages):
    page_chunks = splitter.split_text(page_text)

    for chunk_idx, chunk in enumerate(page_chunks):
        chunks.append({
            'text': chunk,
            'metadata': {
                'source': 'document.pdf',
                'page': page_num,
                'chunk_id': f"{page_num}_{chunk_idx}"
            }
        })

# Later: Cite sources in answer
# "According to page 42 of document.pdf..."

Overlap for continuity:

# Without overlap: Sentence split across chunks (loss of context)
chunk1 = "...the process is simple. First,"
chunk2 = "you need to configure the settings..."

# With overlap (200 chars):
chunk1 = "...the process is simple. First, you need to configure"
chunk2 = "First, you need to configure the settings..."
# Overlap preserves context!

Chunk size guidelines:

Embedding model limit | Chunk size
----------------------|------------
512 tokens           | 400 tokens (leave room for overlap)
1024 tokens          | 800 tokens
2048 tokens          | 1500 tokens

Typical: 500-1000 tokens per chunk (balance precision vs context)
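
The chunk_size in the splitter example above is measured in characters. If you would rather budget in tokens directly, LangChain's splitter can count with tiktoken; a minimal sketch (the encoding name and sizes are illustrative, pick values matching your embedding model):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size / chunk_overlap are now counted in tokens, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by recent OpenAI models
    chunk_size=800,               # fits a 1024-token embedding limit with headroom
    chunk_overlap=100
)

chunks = token_splitter.split_text(document)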

Component 2: Embeddings

What are Embeddings?

Vector representation of text capturing semantic meaning.

text = "What is the return policy?"
embedding = embedding_model.encode(text)
# embedding: [0.234, -0.123, 0.891, ...] (384-1536 dimensions)

# Similar texts have similar embeddings (high cosine similarity)
query_emb = embed("return policy")
doc1_emb = embed("Returns accepted within 30 days")  # High similarity
doc2_emb = embed("Product specifications")  # Low similarity

Embedding Models

Popular models:

# 1. OpenAI embeddings (API-based)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Dimensions: 1536, Cost: $0.02 per 1M tokens

# 2. Sentence Transformers (open-source, local)
from sentence_transformers import SentenceTransformer
embeddings = SentenceTransformer('all-MiniLM-L6-v2')
# Dimensions: 384, Cost: $0 (local), Fast

# 3. Domain-specific
embeddings = SentenceTransformer('allenai-specter')  # Scientific papers
embeddings = SentenceTransformer('msmarco-distilbert-base-v4')  # Search/QA

Selection criteria:

Model               | Dimensions | Speed  | Quality   | Cost     | Use Case
--------------------|------------|--------|-----------|----------|----------------
OpenAI text-3-small | 1536       | Medium | Very Good | $0.02/1M | General (API)
OpenAI text-3-large | 3072       | Slow   | Excellent | $0.13/1M | High quality
all-MiniLM-L6-v2    | 384        | Fast   | Good      | $0       | General (local)
all-mpnet-base-v2   | 768        | Medium | Very Good | $0       | General (local)
msmarco-*           | 768        | Medium | Excellent | $0       | Search/QA

Evaluation:

# Test on your domain!
from sentence_transformers import util

query = "What is the return policy?"
docs = ["Returns within 30 days", "Shipping takes 5-7 days", "Product warranty"]

for model_name in ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'msmarco-distilbert-base-v4']:
    model = SentenceTransformer(model_name)

    query_emb = model.encode(query)
    doc_embs = model.encode(docs)

    similarities = util.cos_sim(query_emb, doc_embs)[0]
    print(f"{model_name}: {similarities}")

# Pick model with highest similarity for relevant doc

Component 3: Vector Databases

Store and retrieve embeddings efficiently.

# 1. Chroma (simple, local)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_texts(chunks, embeddings)

# 2. Pinecone (managed, scalable)
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="...", environment="...")
vectorstore = Pinecone.from_texts(chunks, embeddings, index_name="my-index")

# 3. Weaviate (open-source, scalable)
from langchain.vectorstores import Weaviate
vectorstore = Weaviate.from_texts(chunks, embeddings)

# 4. FAISS (Facebook, local, fast)
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_texts(chunks, embeddings)

Vector DB Selection:

Database | Type  | Scale            | Cost       | Hosting | Best For
---------|-------|------------------|------------|---------|-------------------------
Chroma   | Local | Small (< 1M)     | $0         | Self    | Development
FAISS    | Local | Medium (< 10M)   | $0         | Self    | Production (self-hosted)
Pinecone | Cloud | Large (billions) | $70+/mo    | Managed | Production (managed)
Weaviate | Both  | Large            | $0-$200/mo | Both    | Production (flexible)

Similarity search:

# Basic similarity search
query = "What is the return policy?"
results = vectorstore.similarity_search(query, k=5)
# Returns: Top 5 most similar chunks

# With scores
results = vectorstore.similarity_search_with_score(query, k=5)
# Returns: [(chunk, score), ...]
# Note: score semantics vary by store. Some return cosine similarity
# (higher = more similar), others return a distance (lower = more similar);
# check your vector DB's docs before thresholding.

# With threshold (assuming similarity scores, where higher = more similar)
results = vectorstore.similarity_search_with_score(query, k=10)
filtered = [(chunk, score) for chunk, score in results if score > 0.7]
# Only keep highly similar results

Component 4: Retrieval Strategies

1. Dense Retrieval (Semantic)

Uses embeddings (what we've discussed).

query_embedding = embedding_model.encode(query)
# Find docs with embeddings most similar to query_embedding
results = vectorstore.similarity_search(query, k=10)

Pros:

  • Semantic similarity (understands meaning, not just keywords)
  • Handles synonyms, paraphrasing

Cons:

  • Misses exact keyword matches
  • Can confuse similar-sounding but different concepts

2. Sparse Retrieval (Keyword)

Classic information retrieval (BM25, TF-IDF).

from langchain.retrievers import BM25Retriever

# BM25: Keyword-based ranking
bm25_retriever = BM25Retriever.from_texts(chunks)
results = bm25_retriever.get_relevant_documents(query)

How BM25 works:

Score(query, doc) = sum over query terms of:
  IDF(term) * (TF(term) * (k1 + 1)) / (TF(term) + k1 * (1 - b + b * doc_length / avg_doc_length))

Where:
- TF = term frequency (how often term appears in doc)
- IDF = inverse document frequency (rarity of term)
- k1, b = tuning parameters
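
A minimal sketch of this scoring function in plain Python (toy whitespace tokenization; k1=1.5 and b=0.75 are conventional defaults, and the IDF uses the standard smoothed form since the formula above leaves it unspecified). In practice you would use BM25Retriever or the rank_bm25 package rather than this:

import math

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    doc_terms = doc.lower().split()
    avg_doc_length = sum(len(d.split()) for d in corpus) / len(corpus)

    score = 0.0
    for term in query.lower().split():
        tf = doc_terms.count(term)                                 # term frequency in this doc
        df = sum(1 for d in corpus if term in d.lower().split())   # docs containing the term
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_length))
    return score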

Pros:

  • Exact keyword matches (important for IDs, SKUs, technical terms)
  • Fast (no neural network)
  • Explainable (can see which keywords matched)

Cons:

  • No semantic understanding (misses synonyms, paraphrasing)
  • Sensitive to exact wording

3. Hybrid Retrieval (Dense + Sparse)

Combine both for best results!

from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={'k': 20})

# Sparse retriever (keyword)
sparse_retriever = BM25Retriever.from_texts(chunks)

# Ensemble (hybrid)
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.5, 0.5]  # Equal weight (tune based on evaluation)
)

results = hybrid_retriever.get_relevant_documents(query)

When hybrid helps:

# Query: "What is the SKU for product ABC-123?"

# Dense only:
# - Might retrieve: "product catalog", "product specifications"
# - Misses: Exact SKU "ABC-123" (keyword)

# Sparse only:
# - Retrieves: "ABC-123" (keyword match)
# - Misses: Semantically similar products

# Hybrid:
# - Retrieves: Exact SKU + related products
# - Best of both worlds!

Weight tuning:

# Evaluate different weights
for dense_weight in [0.3, 0.5, 0.7]:
    sparse_weight = 1 - dense_weight

    retriever = EnsembleRetriever(
        retrievers=[dense_retriever, sparse_retriever],
        weights=[dense_weight, sparse_weight]
    )

    mrr = evaluate_retrieval(retriever, test_set)
    print(f"Dense:{dense_weight}, Sparse:{sparse_weight} → MRR:{mrr:.3f}")

# Example output:
# Dense:0.3, Sparse:0.7 → MRR:0.65
# Dense:0.5, Sparse:0.5 → MRR:0.72  # Best!
# Dense:0.7, Sparse:0.3 → MRR:0.68
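
evaluate_retrieval above is not a library function; a minimal sketch, assuming each test case holds a query plus the set of relevant chunk texts (MRR itself is covered in the Evaluation Metrics section below):

def evaluate_retrieval(retriever, test_set, k=10):
    """Mean Reciprocal Rank over a labelled test set."""
    reciprocal_ranks = []
    for case in test_set:  # each case: {'query': ..., 'relevant_docs': set of chunk texts}
        results = retriever.get_relevant_documents(case['query'])[:k]
        rr = 0.0
        for rank, doc in enumerate(results, start=1):
            if doc.page_content in case['relevant_docs']:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)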

Component 5: Re-Ranking

Refine coarse retrieval ranking with cross-encoder.

Why Re-Ranking?

Retrieval (bi-encoder):
- Encodes query and docs separately
- Fast: O(1) for pre-computed doc embeddings
- Coarse: Single similarity score

Re-ranking (cross-encoder):
- Jointly encodes query + doc
- Slow: O(n) for n docs (must process each pair)
- Precise: Sees query-doc interactions

Pipeline:

1. Retrieval: Get top 20-50 (fast, broad)
2. Re-ranking: Refine to top 5-10 (slow, precise)

Implementation:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load cross-encoder for re-ranking
model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, retrieved_docs, top_k=5):
    # Score each doc with cross-encoder
    scores = []
    for doc in retrieved_docs:
        inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            score = model(**inputs).logits[0][0].item()
        scores.append((doc, score))

    # Sort by score (descending)
    reranked = sorted(scores, key=lambda x: x[1], reverse=True)

    # Return top-k
    return [doc for doc, score in reranked[:top_k]]

# Usage
initial_results = vectorstore.similarity_search(query, k=20)  # Over-retrieve
final_results = rerank(query, initial_results, top_k=5)  # Re-rank

Re-Ranking Models:

Model                                      | Size  | Speed  | Quality   | Use Case
-------------------------------------------|-------|--------|-----------|---------------
ms-marco-MiniLM-L-6-v2                     | 80MB  | Fast   | Good      | General
ms-marco-MiniLM-L-12-v2                    | 120MB | Medium | Very Good | Better quality
cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 120MB | Medium | Very Good | Multilingual
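
If you already depend on sentence-transformers, its CrossEncoder wrapper performs the same query-document scoring as the rerank() function above in fewer lines (a sketch using the same model; rerank_ce is just an illustrative name):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

def rerank_ce(query, retrieved_docs, top_k=5):
    # Score each (query, doc) pair jointly, then keep the top-k
    scores = reranker.predict([(query, doc) for doc in retrieved_docs])
    ranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]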

Impact of Re-Ranking:

# Without re-ranking:
results = vectorstore.similarity_search(query, k=5)
mrr = 0.55  # First relevant at rank ~2

# With re-ranking:
initial = vectorstore.similarity_search(query, k=20)
results = rerank(query, initial, top_k=5)
mrr = 0.82  # First relevant at rank ~1.2

# Improvement: +0.27 MRR (first relevant result moves from ~rank 2 to ~rank 1)

Component 6: Query Processing

Query Expansion

Expand query with synonyms, related terms.

def expand_query(query, llm):
    prompt = f"""
    Generate 3 alternative phrasings of this query:

    Original: {query}

    Alternatives (semantically similar):
    1.
    2.
    3.
    """

    # Naive parsing of the LLM's numbered-list output into a list of queries
    alternatives = [line.lstrip('0123456789. ').strip()
                    for line in llm(prompt).splitlines() if line.strip()]

    # Retrieve using all variants, merge results
    all_results = []
    for alt_query in [query] + alternatives:
        results = vectorstore.similarity_search(alt_query, k=10)
        all_results.extend(results)

    # Deduplicate by text, then re-rank against the original query
    unique_results = list({doc.page_content: doc for doc in all_results}.values())
    return rerank(query, unique_results, top_k=5)

Query Rewriting

Simplify or decompose complex queries.

def rewrite_query(query, llm):
    # Complex query: decompose into simpler sub-queries
    if is_complex(query):  # is_complex: your own heuristic or an LLM-based classifier
        prompt = f"""
        Break this complex query into simpler sub-queries:

        Query: {query}

        Sub-queries:
        1.
        2.
        """
        # Naive parsing of the LLM's numbered-list output
        sub_queries = [line.lstrip('0123456789. ').strip()
                       for line in llm(prompt).splitlines() if line.strip()]

        # Retrieve for each sub-query
        all_results = []
        for sub_q in sub_queries:
            results = vectorstore.similarity_search(sub_q, k=5)
            all_results.extend(results)

        return all_results

    return vectorstore.similarity_search(query, k=5)

HyDE (Hypothetical Document Embeddings)

Generate hypothetical answer, retrieve similar docs.

def hyde_retrieval(query, llm, vectorstore):
    # Generate hypothetical answer
    prompt = f"Answer this question in detail: {query}"
    hypothetical_answer = llm(prompt)

    # Retrieve docs similar to hypothetical answer (not query)
    results = vectorstore.similarity_search(hypothetical_answer, k=5)

    return results

# Why this works:
# - Queries are short, sparse
# - Answers are longer, richer
# - Doc-to-doc similarity (answer vs docs) better than query-to-doc

Component 7: Context Management

Context Budget

max_context_tokens = 4000  # Budget for retrieved context

selected_chunks = []
total_tokens = 0

for chunk in reranked_results:
    chunk_tokens = count_tokens(chunk)

    if total_tokens + chunk_tokens <= max_context_tokens:
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens
    else:
        break  # Stop when budget exceeded

# Result: Best chunks that fit in budget
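
count_tokens above should use whatever tokenizer matches your LLM; a minimal sketch with tiktoken (the encoding name is an assumption, pick the one for your model):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

def count_tokens(text):
    return len(encoding.encode(text))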

Lost in the Middle Problem

LLMs attend most to the start and end of the context and often miss information in the middle.

# Research finding: Place most important info at start or end

def order_for_llm(chunks):
    # chunks are sorted by relevance (most relevant first)
    if len(chunks) <= 2:
        return chunks

    # Put the two most relevant chunks at positions 0 and -1,
    # the rest in the middle
    return [chunks[0]] + chunks[2:] + [chunks[1]]

Contextual Compression

Filter retrieved chunks to most relevant sentences.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor: Extract relevant sentences
compressor = LLMChainExtractor.from_llm(llm)

# Wrap retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

# Retrieved chunks are automatically filtered to relevant parts
compressed_docs = compression_retriever.get_relevant_documents(query)

Component 8: Prompt Construction

Basic RAG Prompt:

context = '\n\n'.join(retrieved_chunks)

prompt = f"""
Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that."

Context:
{context}

Question: {query}

Answer:
"""

answer = llm(prompt)

With Citations:

context_with_ids = []
for i, chunk in enumerate(retrieved_chunks):
    context_with_ids.append(f"[{i+1}] {chunk['text']}")

context = '\n\n'.join(context_with_ids)

prompt = f"""
Answer the question based on the context below. Cite sources using [number] format.

Context:
{context}

Question: {query}

Answer (with citations):
"""

answer = llm(prompt)
# Output: "The return policy allows returns within 30 days [1]. Shipping takes 5-7 business days [3]."

With Metadata:

context_with_metadata = []
for chunk in retrieved_chunks:
    source = chunk['metadata']['source']
    page = chunk['metadata']['page']
    context_with_metadata.append(f"From {source} (page {page}):\n{chunk['text']}")

context = '\n\n'.join(context_with_metadata)

prompt = f"""
Answer the question and cite your sources.

Context:
{context}

Question: {query}

Answer:
"""

Evaluation Metrics

Retrieval Metrics

1. Mean Reciprocal Rank (MRR):

import numpy as np

def calculate_mrr(retrieval_results, relevant_docs):
    """
    MRR = average of (1 / rank of first relevant doc)

    Example:
    Query 1: First relevant at rank 2 → 1/2 = 0.5
    Query 2: First relevant at rank 1 → 1/1 = 1.0
    Query 3: No relevant docs → 0
    MRR = (0.5 + 1.0 + 0) / 3 = 0.5
    """
    mrr_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        for i, result in enumerate(results):
            if result in relevant:
                mrr_scores.append(1 / (i + 1))
                break
        else:
            mrr_scores.append(0)  # No relevant found

    return np.mean(mrr_scores)

# Interpretation:
# MRR = 1.0: First result always relevant (perfect!)
# MRR = 0.5: First relevant at rank ~2 (good)
# MRR = 0.3: First relevant at rank ~3-4 (okay)
# MRR < 0.3: Poor retrieval (needs improvement)

2. Precision@k:

def calculate_precision_at_k(retrieval_results, relevant_docs, k=5):
    """
    Precision@k = (# relevant docs in top-k) / k

    Example:
    Top 5 results: [relevant, irrelevant, relevant, irrelevant, irrelevant]
    Precision@5 = 2/5 = 0.4
    """
    precision_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        precision_scores.append(relevant_in_topk / k)

    return np.mean(precision_scores)

# Target: Precision@5 > 0.7 (70% of top-5 are relevant)

3. Recall@k:

def calculate_recall_at_k(retrieval_results, relevant_docs, k=5):
    """
    Recall@k = (# relevant docs in top-k) / (total relevant docs)

    Example:
    Total relevant: 5
    Found in top-5: 2
    Recall@5 = 2/5 = 0.4
    """
    recall_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        recall_scores.append(relevant_in_topk / len(relevant))

    return np.mean(recall_scores)

# Interpretation:
# Recall@5 = 1.0: All relevant docs in top-5 (perfect!)
# Recall@5 = 0.5: Half of relevant docs in top-5

4. NDCG (Normalized Discounted Cumulative Gain):

def calculate_ndcg(true_relevance, predicted_scores, k=5):
    """
    NDCG considers position and graded relevance (0, 1, 2, 3...)

    DCG = sum of (relevance / log2(rank + 1))
    NDCG = DCG / ideal_DCG (normalized to 0-1)
    """
    from sklearn.metrics import ndcg_score

    # true_relevance: 2D array, graded relevance (0-3) for each candidate doc per query
    # predicted_scores: 2D array (same shape), the retrieval scores used to rank those docs
    # Higher relevance = more relevant

    return ndcg_score(true_relevance, predicted_scores, k=k)

# NDCG = 1.0: Perfect ranking
# NDCG > 0.7: Good ranking
# NDCG < 0.5: Poor ranking
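
For example, with one query whose graded labels and retrieval scores are known (toy values):

true_relevance = [[3, 0, 2, 1, 0]]              # graded labels for 5 candidate docs
predicted_scores = [[0.9, 0.8, 0.7, 0.4, 0.2]]  # retrieval scores for the same docs
print(calculate_ndcg(true_relevance, predicted_scores, k=5))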

Generation Metrics

1. Exact Match:

def calculate_exact_match(predictions, ground_truth):
    """Percentage of predictions that exactly match ground truth."""
    matches = [pred == truth for pred, truth in zip(predictions, ground_truth)]
    return np.mean(matches)

2. F1 Score (token-level):

def calculate_f1(prediction, ground_truth):
    """F1 score based on token overlap."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()

    common = set(pred_tokens) & set(truth_tokens)

    if len(common) == 0:
        return 0.0

    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)

    return f1

3. LLM-as-Judge:

def evaluate_with_llm(answer, ground_truth, llm):
    """Use LLM to judge answer quality."""
    prompt = f"""
    Rate the quality of this answer on a scale of 1-5:
    1 = Completely wrong
    2 = Mostly wrong
    3 = Partially correct
    4 = Mostly correct
    5 = Completely correct

    Ground truth: {ground_truth}
    Answer to evaluate: {answer}

    Rating (1-5):
    """

    rating = llm(prompt)
    return int(rating)

End-to-End Evaluation

def evaluate_rag_system(rag_system, test_set):
    """
    Complete evaluation: retrieval + generation
    """
    # Retrieval metrics
    retrieval_results = []
    relevant_docs = []

    # Generation metrics
    predictions = []
    ground_truth = []

    for test_case in test_set:
        query = test_case['query']

        # Retrieve
        retrieved = rag_system.retrieve(query)
        retrieval_results.append(retrieved)
        relevant_docs.append(test_case['relevant_docs'])

        # Generate
        answer = rag_system.generate(query, retrieved)
        predictions.append(answer)
        ground_truth.append(test_case['expected_answer'])

    # Calculate metrics
    metrics = {
        'retrieval_mrr': calculate_mrr(retrieval_results, relevant_docs),
        'retrieval_precision@5': calculate_precision_at_k(retrieval_results, relevant_docs, k=5),
        'generation_f1': np.mean([calculate_f1(p, t) for p, t in zip(predictions, ground_truth)]),
        'generation_exact_match': calculate_exact_match(predictions, ground_truth),
    }

    return metrics
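
A test set for this function is just a list of labelled cases matching the fields it reads (values here are illustrative; 50-100 cases covering your main query types is a reasonable starting point):

test_set = [
    {
        'query': 'What is the return policy?',
        'relevant_docs': ['Returns are accepted within 30 days of purchase...'],
        'expected_answer': 'Returns are accepted within 30 days.'
    },
    # ... more labelled cases
]

metrics = evaluate_rag_system(rag_system, test_set)
print(metrics)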

Complete RAG Pipeline

Basic Implementation:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load documents
documents = load_documents('docs/')

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What is the return policy?"})
answer = result['result']
sources = result['source_documents']

Advanced Implementation (Hybrid + Re-ranking):

from langchain.retrievers import EnsembleRetriever, BM25Retriever
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class AdvancedRAG:
    def __init__(self, documents):
        # Chunk
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        self.chunks = splitter.split_documents(documents)

        # Embeddings
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma.from_documents(self.chunks, self.embeddings)

        # Hybrid retrieval
        dense_retriever = self.vectorstore.as_retriever(search_kwargs={'k': 20})
        sparse_retriever = BM25Retriever.from_documents(self.chunks)

        self.retriever = EnsembleRetriever(
            retrievers=[dense_retriever, sparse_retriever],
            weights=[0.5, 0.5]
        )

        # Re-ranker
        self.rerank_model = AutoModelForSequenceClassification.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )
        self.rerank_tokenizer = AutoTokenizer.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )

        # LLM
        self.llm = OpenAI(temperature=0)

    def retrieve(self, query, k=5):
        # Hybrid retrieval (over-retrieve)
        initial_results = self.retriever.get_relevant_documents(query)[:20]

        # Re-rank
        scores = []
        for doc in initial_results:
            inputs = self.rerank_tokenizer(
                query, doc.page_content,
                return_tensors='pt',
                truncation=True,
                max_length=512
            )
            score = self.rerank_model(**inputs).logits[0][0].item()
            scores.append((doc, score))

        # Sort by score
        reranked = sorted(scores, key=lambda x: x[1], reverse=True)

        # Return top-k
        return [doc for doc, score in reranked[:k]]

    def generate(self, query, retrieved_docs):
        # Build context
        context = '\n\n'.join([f"[{i+1}] {doc.page_content}"
                               for i, doc in enumerate(retrieved_docs)])

        # Construct prompt
        prompt = f"""
        Answer the question based on the context below. Cite sources using [number].
        If the answer is not in the context, say "I don't have enough information."

        Context:
        {context}

        Question: {query}

        Answer:
        """

        # Generate
        answer = self.llm(prompt)

        return answer, retrieved_docs

    def query(self, query):
        retrieved_docs = self.retrieve(query, k=5)
        answer, sources = self.generate(query, retrieved_docs)

        return {
            'answer': answer,
            'sources': sources
        }

# Usage
rag = AdvancedRAG(documents)
result = rag.query("What is the return policy?")
print(result['answer'])
print(f"Sources: {[doc.metadata for doc in result['sources']]}")

Optimization Strategies

1. Caching

import functools

@functools.lru_cache(maxsize=1000)
def cached_retrieval(query):
    """Cache retrieval results for common queries."""
    return vectorstore.similarity_search(query, k=5)

# Saves embedding + retrieval cost for repeated queries

2. Async Retrieval

import asyncio

async def async_retrieve(queries, vectorstore):
    """Retrieve for multiple queries in parallel."""
    tasks = [vectorstore.asimilarity_search(q, k=5) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
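
Called from synchronous code (assuming your vector store implements the async asimilarity_search method):

queries = ["return policy", "shipping times", "warranty coverage"]
results = asyncio.run(async_retrieve(queries, vectorstore))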

3. Metadata Filtering

# Filter by metadata before similarity search
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"source": "product_docs"}  # Only search product docs
)

# Faster (smaller search space) + more relevant (right domain)

4. Index Optimization

# FAISS index optimization
import faiss

# 1. Train an IVF index on a sample of embeddings (approximate, much faster search)
quantizer = faiss.IndexFlatL2(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters)
index.train(sample_embeddings)   # float32 array: sample of corpus embeddings
index.add(doc_embeddings)        # add the full corpus after training

# 2. Set search parameters
index.nprobe = 10  # Clusters probed per query: trade-off accuracy vs speed

# 3. Search
distances, indices = index.search(query_embeddings, 5)  # top-5 per query

# Result: 5-10× faster search with minimal quality loss

Common Pitfalls

Pitfall 1: No chunking

Problem: Full docs → overflow, poor precision
Fix: Chunk to 500-1000 tokens

Pitfall 2: Dense-only retrieval

Problem: Misses exact keyword matches
Fix: Hybrid search (dense + sparse)

Pitfall 3: No re-ranking

Problem: Coarse ranking, wrong results prioritized
Fix: Over-retrieve (k=20), re-rank to top-5

Pitfall 4: Too much context

Problem: > 10k tokens → cost, latency, 'lost in middle'
Fix: Top 5 chunks (5k tokens), optimize retrieval precision

Pitfall 5: No evaluation

Problem: Can't measure or optimize
Fix: Build test set, measure MRR, Precision@k

Summary

Core principles:

  1. Chunk documents: 500-1000 tokens, semantic boundaries, overlap for continuity
  2. Hybrid retrieval: Dense (semantic) + Sparse (keyword) = best results
  3. Re-rank: Over-retrieve (k=20-50), refine to top-5 with cross-encoder
  4. Evaluate systematically: MRR, Precision@k, Recall@k, NDCG for retrieval; F1, Exact Match for generation
  5. Keep context focused: Top 5 chunks (~5k tokens), optimize retrieval not context size

Pipeline:

Documents → Chunk → Embed → Vector DB
Query → Hybrid Retrieval (k=20) → Re-rank (k=5) → Context → LLM → Answer

Metrics targets:

  • MRR > 0.7 (first relevant result at rank ~1.4 on average)
  • Precision@5 > 0.7 (70% of top-5 relevant)
  • Generation F1 > 0.8 (80% token overlap)

Key insight: RAG quality depends on retrieval precision. Optimize retrieval (chunking, hybrid search, re-ranking, evaluation) before adding context or changing LLMs.