# RAG Architecture Patterns

## Context

You're building a RAG (Retrieval-Augmented Generation) system to give LLMs access to external knowledge. Common mistakes:
- **No chunking strategy** (full docs → overflow, poor precision)
- **Poor retrieval** (cosine similarity alone → misses exact matches)
- **No re-ranking** (irrelevant results prioritized)
- **No evaluation** (can't measure or optimize quality)
- **Context overflow** (too many chunks → cost, latency, 'lost in middle')

**This skill provides effective RAG architecture: chunking, hybrid search, re-ranking, evaluation, and complete pipeline design.**


## What is RAG?

**RAG = Retrieval-Augmented Generation**

**Problem:** LLMs have knowledge cutoffs and can't access private/recent data.

**Solution:** Retrieve relevant information, inject into prompt, generate answer.

```python
# Without RAG:
answer = llm("What is our return policy?")
# LLM: "I don't have access to your specific return policy."

# With RAG:
relevant_docs = retrieval_system.search("return policy")
context = '\n'.join(relevant_docs)
prompt = f"Context: {context}\n\nQuestion: What is our return policy?\nAnswer:"
answer = llm(prompt)
# LLM: "Our return policy allows returns within 30 days..." (from retrieved docs)
```

**When to use RAG:**
- ✅ Private data (company docs, internal knowledge base)
- ✅ Recent data (news, updates since LLM training cutoff)
- ✅ Large knowledge base (can't fit in prompt/fine-tuning)
- ✅ Need citations (retrieval provides source documents)
- ✅ Changing information (update docs, not model)

**When NOT to use RAG:**
- ❌ General knowledge (already in LLM)
- ❌ Small knowledge base (< 100 docs → few-shot examples in prompt)
- ❌ Reasoning tasks (RAG provides facts, not reasoning)


## RAG Architecture Overview

```
User Query
    ↓
1. Query Processing (optional: expansion, rewriting)
    ↓
2. Retrieval (dense + sparse hybrid search)
    ↓
3. Re-ranking (refine top results)
    ↓
4. Context Selection (top-k chunks)
    ↓
5. Prompt Construction (inject context)
    ↓
6. LLM Generation
    ↓
Answer (with citations)
```


## Component 1: Document Processing & Chunking

### Why Chunking?

**Problem:** Documents are long (10k-100k tokens), embeddings and LLMs have limits.

**Solution:** Split documents into chunks (500-1000 tokens each).

### Chunking Strategies

**1. Fixed-size chunking (simple, works for most cases):**

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Characters (roughly 750 tokens)
    chunk_overlap=200,  # Overlap for continuity
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_text(document)
```

**Parameters:**
- `chunk_size`: 500-1000 tokens typical (600-1500 characters)
- `chunk_overlap`: 10-20% of chunk_size (continuity between chunks)
- `separators`: Try semantic boundaries first (paragraphs > sentences > words)

**2. Semantic chunking (preserves meaning):**

```python
def semantic_chunking(text, max_chunk_size=1000):
    # Split on semantic boundaries
    sections = text.split('\n\n## ')  # Markdown headers

    chunks = []
    current_chunk = []
    current_size = 0

    for section in sections:
        section_size = len(section)

        if current_size + section_size <= max_chunk_size:
            current_chunk.append(section)
            current_size += section_size
        else:
            # Flush current chunk
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
            current_chunk = [section]
            current_size = section_size

    # Flush remaining
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks
```

**Benefits:** Preserves topic boundaries, more coherent chunks.

**3. Recursive chunking (LangChain default):**

```python
# Try splitting on larger boundaries first, fallback to smaller
separators = [
    "\n\n",  # Paragraphs (try first)
    "\n",    # Lines
    ". ",    # Sentences
    " ",     # Words
    ""       # Characters (last resort)
]

# For each separator:
# - If chunk fits: Done
# - If chunk too large: Try next separator
# Result: Largest semantic unit that fits in chunk_size
```

**Best for:** Mixed documents (code + prose, structured + unstructured).

### Chunking Best Practices

**Metadata preservation:**
```python
chunks = []
for page_num, page_text in enumerate(pdf_pages):
    page_chunks = splitter.split_text(page_text)

    for chunk_idx, chunk in enumerate(page_chunks):
        chunks.append({
            'text': chunk,
            'metadata': {
                'source': 'document.pdf',
                'page': page_num,
                'chunk_id': f"{page_num}_{chunk_idx}"
            }
        })

# Later: Cite sources in answer
# "According to page 42 of document.pdf..."
```

**Overlap for continuity:**
```python
# Without overlap: Sentence split across chunks (loss of context)
chunk1 = "...the process is simple. First,"
chunk2 = "you need to configure the settings..."

# With overlap (200 chars):
chunk1 = "...the process is simple. First, you need to configure"
chunk2 = "First, you need to configure the settings..."
# Overlap preserves context!
```

**Chunk size guidelines:**
```
Embedding model limit | Chunk size
----------------------|------------
512 tokens           | 400 tokens (leave room for overlap)
1024 tokens          | 800 tokens
2048 tokens          | 1500 tokens

Typical: 500-1000 tokens per chunk (balance precision vs context)
```


## Component 2: Embeddings

### What are Embeddings?

**Vector representation of text capturing semantic meaning.**

```python
text = "What is the return policy?"
embedding = embedding_model.encode(text)
# embedding: [0.234, -0.123, 0.891, ...] (384-1536 dimensions)

# Similar texts have similar embeddings (high cosine similarity)
query_emb = embed("return policy")
doc1_emb = embed("Returns accepted within 30 days")  # High similarity
doc2_emb = embed("Product specifications")  # Low similarity
```

### Embedding Models

**Popular models:**

```python
# 1. OpenAI embeddings (API-based)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Dimensions: 1536, Cost: $0.02 per 1M tokens

# 2. Sentence Transformers (open-source, local)
from sentence_transformers import SentenceTransformer
embeddings = SentenceTransformer('all-MiniLM-L6-v2')
# Dimensions: 384, Cost: $0 (local), Fast

# 3. Domain-specific
embeddings = SentenceTransformer('allenai-specter')  # Scientific papers
embeddings = SentenceTransformer('msmarco-distilbert-base-v4')  # Search/QA
```

**Selection criteria:**

| Model | Dimensions | Speed | Quality | Cost | Use Case |
|-------|------------|-------|---------|------|----------|
| OpenAI text-3-small | 1536 | Medium | Very Good | $0.02/1M | General (API) |
| OpenAI text-3-large | 3072 | Slow | Excellent | $0.13/1M | High quality |
| all-MiniLM-L6-v2 | 384 | Fast | Good | $0 | General (local) |
| all-mpnet-base-v2 | 768 | Medium | Very Good | $0 | General (local) |
| msmarco-* | 768 | Medium | Excellent | $0 | Search/QA |

**Evaluation:**
```python
# Test on your domain!
from sentence_transformers import util

query = "What is the return policy?"
docs = ["Returns within 30 days", "Shipping takes 5-7 days", "Product warranty"]

for model_name in ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'msmarco-distilbert-base-v4']:
    model = SentenceTransformer(model_name)

    query_emb = model.encode(query)
    doc_embs = model.encode(docs)

    similarities = util.cos_sim(query_emb, doc_embs)[0]
    print(f"{model_name}: {similarities}")

# Pick model with highest similarity for relevant doc
```


## Component 3: Vector Databases

**Store and retrieve embeddings efficiently.**

### Popular Vector DBs:

```python
# 1. Chroma (simple, local)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_texts(chunks, embeddings)

# 2. Pinecone (managed, scalable)
import pinecone
pinecone.init(api_key="...", environment="...")
vectorstore = Pinecone.from_texts(chunks, embeddings, index_name="my-index")

# 3. Weaviate (open-source, scalable)
from langchain.vectorstores import Weaviate
vectorstore = Weaviate.from_texts(chunks, embeddings)

# 4. FAISS (Facebook, local, fast)
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_texts(chunks, embeddings)
```

### Vector DB Selection:

| Database | Type | Scale | Cost | Hosting | Best For |
|----------|------|-------|------|---------|----------|
| Chroma | Local | Small (< 1M) | $0 | Self | Development |
| FAISS | Local | Medium (< 10M) | $0 | Self | Production (self-hosted) |
| Pinecone | Cloud | Large (billions) | $70+/mo | Managed | Production (managed) |
| Weaviate | Both | Large | $0-$200/mo | Both | Production (flexible) |

### Similarity Search:

```python
# Basic similarity search
query = "What is the return policy?"
results = vectorstore.similarity_search(query, k=5)
# Returns: Top 5 most similar chunks

# With scores
results = vectorstore.similarity_search_with_score(query, k=5)
# Returns: [(chunk, similarity_score), ...]
# similarity_score: 0.0-1.0 (higher = more similar)

# With threshold
results = vectorstore.similarity_search_with_score(query, k=10)
filtered = [(chunk, score) for chunk, score in results if score > 0.7]
# Only keep highly similar results
```


## Component 4: Retrieval Strategies

### 1. Dense Retrieval (Semantic)

**Uses embeddings (what we've discussed).**

```python
query_embedding = embedding_model.encode(query)
# Find docs with embeddings most similar to query_embedding
results = vectorstore.similarity_search(query, k=10)
```

**Pros:**
- ✅ Semantic similarity (understands meaning, not just keywords)
- ✅ Handles synonyms, paraphrasing

**Cons:**
- ❌ Misses exact keyword matches
- ❌ Can confuse similar-sounding but different concepts

### 2. Sparse Retrieval (Keyword)

**Classic information retrieval (BM25, TF-IDF).**

```python
from langchain.retrievers import BM25Retriever

# BM25: Keyword-based ranking
bm25_retriever = BM25Retriever.from_texts(chunks)
results = bm25_retriever.get_relevant_documents(query)
```

**How BM25 works:**
```
Score(query, doc) = sum over query terms of:
  IDF(term) * (TF(term) * (k1 + 1)) / (TF(term) + k1 * (1 - b + b * doc_length / avg_doc_length))

Where:
- TF = term frequency (how often term appears in doc)
- IDF = inverse document frequency (rarity of term)
- k1, b = tuning parameters
```

**Pros:**
- ✅ Exact keyword matches (important for IDs, SKUs, technical terms)
- ✅ Fast (no neural network)
- ✅ Explainable (can see which keywords matched)

**Cons:**
- ❌ No semantic understanding (misses synonyms, paraphrasing)
- ❌ Sensitive to exact wording

### 3. Hybrid Retrieval (Dense + Sparse)

**Combine both for best results!**

```python
from langchain.retrievers import EnsembleRetriever

# Dense retriever (semantic)
dense_retriever = vectorstore.as_retriever(search_kwargs={'k': 20})

# Sparse retriever (keyword)
sparse_retriever = BM25Retriever.from_texts(chunks)

# Ensemble (hybrid)
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.5, 0.5]  # Equal weight (tune based on evaluation)
)

results = hybrid_retriever.get_relevant_documents(query)
```

**When hybrid helps:**

```python
# Query: "What is the SKU for product ABC-123?"

# Dense only:
# - Might retrieve: "product catalog", "product specifications"
# - Misses: Exact SKU "ABC-123" (keyword)

# Sparse only:
# - Retrieves: "ABC-123" (keyword match)
# - Misses: Semantically similar products

# Hybrid:
# - Retrieves: Exact SKU + related products
# - Best of both worlds!
```

**Weight tuning:**
```python
# Evaluate different weights
for dense_weight in [0.3, 0.5, 0.7]:
    sparse_weight = 1 - dense_weight

    retriever = EnsembleRetriever(
        retrievers=[dense_retriever, sparse_retriever],
        weights=[dense_weight, sparse_weight]
    )

    mrr = evaluate_retrieval(retriever, test_set)
    print(f"Dense:{dense_weight}, Sparse:{sparse_weight} → MRR:{mrr:.3f}")

# Example output:
# Dense:0.3, Sparse:0.7 → MRR:0.65
# Dense:0.5, Sparse:0.5 → MRR:0.72  # Best!
# Dense:0.7, Sparse:0.3 → MRR:0.68
```


## Component 5: Re-Ranking

**Refine coarse retrieval ranking with cross-encoder.**

### Why Re-Ranking?

```
Retrieval (bi-encoder):
- Encodes query and docs separately
- Fast: O(1) for pre-computed doc embeddings
- Coarse: Single similarity score

Re-ranking (cross-encoder):
- Jointly encodes query + doc
- Slow: O(n) for n docs (must process each pair)
- Precise: Sees query-doc interactions
```

**Pipeline:**
```
1. Retrieval: Get top 20-50 (fast, broad)
2. Re-ranking: Refine to top 5-10 (slow, precise)
```

### Implementation:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load cross-encoder for re-ranking
model = AutoModelForSequenceClassification.from_pretrained(
    'cross-encoder/ms-marco-MiniLM-L-6-v2'
)
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, retrieved_docs, top_k=5):
    # Score each doc with cross-encoder
    scores = []
    for doc in retrieved_docs:
        inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            score = model(**inputs).logits[0][0].item()
        scores.append((doc, score))

    # Sort by score (descending)
    reranked = sorted(scores, key=lambda x: x[1], reverse=True)

    # Return top-k
    return [doc for doc, score in reranked[:top_k]]

# Usage
initial_results = vectorstore.similarity_search(query, k=20)  # Over-retrieve
final_results = rerank(query, initial_results, top_k=5)  # Re-rank
```

### Re-Ranking Models:

| Model | Size | Speed | Quality | Use Case |
|-------|------|-------|---------|----------|
| ms-marco-MiniLM-L-6-v2 | 80MB | Fast | Good | General |
| ms-marco-MiniLM-L-12-v2 | 120MB | Medium | Very Good | Better quality |
| cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 | 120MB | Medium | Very Good | Multilingual |

### Impact of Re-Ranking:

```python
# Without re-ranking:
results = vectorstore.similarity_search(query, k=5)
mrr = 0.55  # First relevant at rank ~2

# With re-ranking:
initial = vectorstore.similarity_search(query, k=20)
results = rerank(query, initial, top_k=5)
mrr = 0.82  # First relevant at rank ~1.2

# Improvement: 27% better ranking!
```


## Component 6: Query Processing

### Query Expansion

**Expand query with synonyms, related terms.**

```python
def expand_query(query, llm):
    prompt = f"""
    Generate 3 alternative phrasings of this query:

    Original: {query}

    Alternatives (semantically similar):
    1.
    2.
    3.
    """

    alternatives = llm(prompt)
    # Retrieve using all variants, merge results
    all_results = []
    for alt_query in [query] + alternatives:
        results = vectorstore.similarity_search(alt_query, k=10)
        all_results.extend(results)

    # Deduplicate and re-rank
    unique_results = list(set(all_results))
    return rerank(query, unique_results, top_k=5)
```

### Query Rewriting

**Simplify or decompose complex queries.**

```python
def rewrite_query(query, llm):
    # Complex query
    if is_complex(query):
        prompt = f"""
        Break this complex query into simpler sub-queries:

        Query: {query}

        Sub-queries:
        1.
        2.
        """
        sub_queries = llm(prompt)

        # Retrieve for each sub-query
        all_results = []
        for sub_q in sub_queries:
            results = vectorstore.similarity_search(sub_q, k=5)
            all_results.extend(results)

        return all_results

    return vectorstore.similarity_search(query, k=5)
```

### HyDE (Hypothetical Document Embeddings)

**Generate hypothetical answer, retrieve similar docs.**

```python
def hyde_retrieval(query, llm, vectorstore):
    # Generate hypothetical answer
    prompt = f"Answer this question in detail: {query}"
    hypothetical_answer = llm(prompt)

    # Retrieve docs similar to hypothetical answer (not query)
    results = vectorstore.similarity_search(hypothetical_answer, k=5)

    return results

# Why this works:
# - Queries are short, sparse
# - Answers are longer, richer
# - Doc-to-doc similarity (answer vs docs) better than query-to-doc
```


## Component 7: Context Management

### Context Budget

```python
max_context_tokens = 4000  # Budget for retrieved context

selected_chunks = []
total_tokens = 0

for chunk in reranked_results:
    chunk_tokens = count_tokens(chunk)

    if total_tokens + chunk_tokens <= max_context_tokens:
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens
    else:
        break  # Stop when budget exceeded

# Result: Best chunks that fit in budget
```

### Lost in the Middle Problem

**LLMs prioritize start and end of context, miss middle.**

```python
# Research finding: Place most important info at start or end

def order_for_llm(chunks):
    # Best chunks at start and end
    if len(chunks) <= 2:
        return chunks

    # Put most relevant at positions 0 and -1
    ordered = [chunks[0]]  # Most relevant (start)
    ordered.extend(chunks[1:-1])  # Less relevant (middle)
    ordered.append(chunks[-1])  # Second most relevant (end)

    return ordered
```

### Contextual Compression

**Filter retrieved chunks to most relevant sentences.**

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor: Extract relevant sentences
compressor = LLMChainExtractor.from_llm(llm)

# Wrap retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

# Retrieved chunks are automatically filtered to relevant parts
compressed_docs = compression_retriever.get_relevant_documents(query)
```


## Component 8: Prompt Construction

### Basic RAG Prompt:

```python
context = '\n\n'.join(retrieved_chunks)

prompt = f"""
Answer the question based on the context below. If the answer is not in the context, say "I don't have enough information to answer that."

Context:
{context}

Question: {query}

Answer:
"""

answer = llm(prompt)
```

### With Citations:

```python
context_with_ids = []
for i, chunk in enumerate(retrieved_chunks):
    context_with_ids.append(f"[{i+1}] {chunk['text']}")

context = '\n\n'.join(context_with_ids)

prompt = f"""
Answer the question based on the context below. Cite sources using [number] format.

Context:
{context}

Question: {query}

Answer (with citations):
"""

answer = llm(prompt)
# Output: "The return policy allows returns within 30 days [1]. Shipping takes 5-7 business days [3]."
```

### With Metadata:

```python
context_with_metadata = []
for chunk in retrieved_chunks:
    source = chunk['metadata']['source']
    page = chunk['metadata']['page']
    context_with_metadata.append(f"From {source} (page {page}):\n{chunk['text']}")

context = '\n\n'.join(context_with_metadata)

prompt = f"""
Answer the question and cite your sources.

Context:
{context}

Question: {query}

Answer:
"""
```


## Evaluation Metrics

### Retrieval Metrics

**1. Mean Reciprocal Rank (MRR):**

```python
def calculate_mrr(retrieval_results, relevant_docs):
    """
    MRR = average of (1 / rank of first relevant doc)

    Example:
    Query 1: First relevant at rank 2 → 1/2 = 0.5
    Query 2: First relevant at rank 1 → 1/1 = 1.0
    Query 3: No relevant docs → 0
    MRR = (0.5 + 1.0 + 0) / 3 = 0.5
    """
    mrr_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        for i, result in enumerate(results):
            if result in relevant:
                mrr_scores.append(1 / (i + 1))
                break
        else:
            mrr_scores.append(0)  # No relevant found

    return np.mean(mrr_scores)

# Interpretation:
# MRR = 1.0: First result always relevant (perfect!)
# MRR = 0.5: First relevant at rank ~2 (good)
# MRR = 0.3: First relevant at rank ~3-4 (okay)
# MRR < 0.3: Poor retrieval (needs improvement)
```

**2. Precision@k:**

```python
def calculate_precision_at_k(retrieval_results, relevant_docs, k=5):
    """
    Precision@k = (# relevant docs in top-k) / k

    Example:
    Top 5 results: [relevant, irrelevant, relevant, irrelevant, irrelevant]
    Precision@5 = 2/5 = 0.4
    """
    precision_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        precision_scores.append(relevant_in_topk / k)

    return np.mean(precision_scores)

# Target: Precision@5 > 0.7 (70% of top-5 are relevant)
```

**3. Recall@k:**

```python
def calculate_recall_at_k(retrieval_results, relevant_docs, k=5):
    """
    Recall@k = (# relevant docs in top-k) / (total relevant docs)

    Example:
    Total relevant: 5
    Found in top-5: 2
    Recall@5 = 2/5 = 0.4
    """
    recall_scores = []

    for results, relevant in zip(retrieval_results, relevant_docs):
        top_k = results[:k]
        relevant_in_topk = len([r for r in top_k if r in relevant])
        recall_scores.append(relevant_in_topk / len(relevant))

    return np.mean(recall_scores)

# Interpretation:
# Recall@5 = 1.0: All relevant docs in top-5 (perfect!)
# Recall@5 = 0.5: Half of relevant docs in top-5
```

**4. NDCG (Normalized Discounted Cumulative Gain):**

```python
def calculate_ndcg(retrieval_results, relevance_scores, k=5):
    """
    NDCG considers position and graded relevance (0, 1, 2, 3...)

    DCG = sum of (relevance / log2(rank + 1))
    NDCG = DCG / ideal_DCG (normalized to 0-1)
    """
    from sklearn.metrics import ndcg_score

    # relevance_scores: 2D array of relevance (0-3) for each result
    # Higher relevance = more relevant

    ndcg = ndcg_score(relevance_scores, retrieval_results, k=k)
    return ndcg

# NDCG = 1.0: Perfect ranking
# NDCG > 0.7: Good ranking
# NDCG < 0.5: Poor ranking
```

### Generation Metrics

**1. Exact Match:**

```python
def calculate_exact_match(predictions, ground_truth):
    """Percentage of predictions that exactly match ground truth."""
    matches = [pred == truth for pred, truth in zip(predictions, ground_truth)]
    return np.mean(matches)
```

**2. F1 Score (token-level):**

```python
def calculate_f1(prediction, ground_truth):
    """F1 score based on token overlap."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()

    common = set(pred_tokens) & set(truth_tokens)

    if len(common) == 0:
        return 0.0

    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)

    return f1
```

**3. LLM-as-Judge:**

```python
def evaluate_with_llm(answer, ground_truth, llm):
    """Use LLM to judge answer quality."""
    prompt = f"""
    Rate the quality of this answer on a scale of 1-5:
    1 = Completely wrong
    2 = Mostly wrong
    3 = Partially correct
    4 = Mostly correct
    5 = Completely correct

    Ground truth: {ground_truth}
    Answer to evaluate: {answer}

    Rating (1-5):
    """

    rating = llm(prompt)
    return int(rating)
```

### End-to-End Evaluation

```python
def evaluate_rag_system(rag_system, test_set):
    """
    Complete evaluation: retrieval + generation
    """
    # Retrieval metrics
    retrieval_results = []
    relevant_docs = []

    # Generation metrics
    predictions = []
    ground_truth = []

    for test_case in test_set:
        query = test_case['query']

        # Retrieve
        retrieved = rag_system.retrieve(query)
        retrieval_results.append(retrieved)
        relevant_docs.append(test_case['relevant_docs'])

        # Generate
        answer = rag_system.generate(query, retrieved)
        predictions.append(answer)
        ground_truth.append(test_case['expected_answer'])

    # Calculate metrics
    metrics = {
        'retrieval_mrr': calculate_mrr(retrieval_results, relevant_docs),
        'retrieval_precision@5': calculate_precision_at_k(retrieval_results, relevant_docs, k=5),
        'generation_f1': np.mean([calculate_f1(p, t) for p, t in zip(predictions, ground_truth)]),
        'generation_exact_match': calculate_exact_match(predictions, ground_truth),
    }

    return metrics
```


## Complete RAG Pipeline

### Basic Implementation:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load documents
documents = load_documents('docs/')

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What is the return policy?"})
answer = result['result']
sources = result['source_documents']
```

### Advanced Implementation (Hybrid + Re-ranking):

```python
from langchain.retrievers import EnsembleRetriever, BM25Retriever
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class AdvancedRAG:
    def __init__(self, documents):
        # Chunk
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        self.chunks = splitter.split_documents(documents)

        # Embeddings
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = Chroma.from_documents(self.chunks, self.embeddings)

        # Hybrid retrieval
        dense_retriever = self.vectorstore.as_retriever(search_kwargs={'k': 20})
        sparse_retriever = BM25Retriever.from_documents(self.chunks)

        self.retriever = EnsembleRetriever(
            retrievers=[dense_retriever, sparse_retriever],
            weights=[0.5, 0.5]
        )

        # Re-ranker
        self.rerank_model = AutoModelForSequenceClassification.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )
        self.rerank_tokenizer = AutoTokenizer.from_pretrained(
            'cross-encoder/ms-marco-MiniLM-L-6-v2'
        )

        # LLM
        self.llm = OpenAI(temperature=0)

    def retrieve(self, query, k=5):
        # Hybrid retrieval (over-retrieve)
        initial_results = self.retriever.get_relevant_documents(query)[:20]

        # Re-rank
        scores = []
        for doc in initial_results:
            inputs = self.rerank_tokenizer(
                query, doc.page_content,
                return_tensors='pt',
                truncation=True,
                max_length=512
            )
            score = self.rerank_model(**inputs).logits[0][0].item()
            scores.append((doc, score))

        # Sort by score
        reranked = sorted(scores, key=lambda x: x[1], reverse=True)

        # Return top-k
        return [doc for doc, score in reranked[:k]]

    def generate(self, query, retrieved_docs):
        # Build context
        context = '\n\n'.join([f"[{i+1}] {doc.page_content}"
                               for i, doc in enumerate(retrieved_docs)])

        # Construct prompt
        prompt = f"""
        Answer the question based on the context below. Cite sources using [number].
        If the answer is not in the context, say "I don't have enough information."

        Context:
        {context}

        Question: {query}

        Answer:
        """

        # Generate
        answer = self.llm(prompt)

        return answer, retrieved_docs

    def query(self, query):
        retrieved_docs = self.retrieve(query, k=5)
        answer, sources = self.generate(query, retrieved_docs)

        return {
            'answer': answer,
            'sources': sources
        }

# Usage
rag = AdvancedRAG(documents)
result = rag.query("What is the return policy?")
print(result['answer'])
print(f"Sources: {[doc.metadata for doc in result['sources']]}")
```


## Optimization Strategies

### 1. Caching

```python
import functools

@functools.lru_cache(maxsize=1000)
def cached_retrieval(query):
    """Cache retrieval results for common queries."""
    return vectorstore.similarity_search(query, k=5)

# Saves embedding + retrieval cost for repeated queries
```

### 2. Async Retrieval

```python
import asyncio

async def async_retrieve(queries, vectorstore):
    """Retrieve for multiple queries in parallel."""
    tasks = [vectorstore.asimilarity_search(q, k=5) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
```

### 3. Metadata Filtering

```python
# Filter by metadata before similarity search
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"source": "product_docs"}  # Only search product docs
)

# Faster (smaller search space) + more relevant (right domain)
```

### 4. Index Optimization

```python
# FAISS index optimization
import faiss

# 1. Train index on sample (faster search)
quantizer = faiss.IndexFlatL2(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters)
index.train(sample_embeddings)

# 2. Set search parameters
index.nprobe = 10  # Trade-off: accuracy vs speed

# Result: 5-10× faster search with minimal quality loss
```


## Common Pitfalls

### Pitfall 1: No chunking
**Problem:** Full docs → overflow, poor precision
**Fix:** Chunk to 500-1000 tokens

### Pitfall 2: Dense-only retrieval
**Problem:** Misses exact keyword matches
**Fix:** Hybrid search (dense + sparse)

### Pitfall 3: No re-ranking
**Problem:** Coarse ranking, wrong results prioritized
**Fix:** Over-retrieve (k=20), re-rank to top-5

### Pitfall 4: Too much context
**Problem:** > 10k tokens → cost, latency, 'lost in middle'
**Fix:** Top 5 chunks (5k tokens), optimize retrieval precision

### Pitfall 5: No evaluation
**Problem:** Can't measure or optimize
**Fix:** Build test set, measure MRR, Precision@k


## Summary

**Core principles:**

1. **Chunk documents**: 500-1000 tokens, semantic boundaries, overlap for continuity
2. **Hybrid retrieval**: Dense (semantic) + Sparse (keyword) = best results
3. **Re-rank**: Over-retrieve (k=20-50), refine to top-5 with cross-encoder
4. **Evaluate systematically**: MRR, Precision@k, Recall@k, NDCG for retrieval; F1, Exact Match for generation
5. **Keep context focused**: Top 5 chunks (~5k tokens), optimize retrieval not context size

**Pipeline:**
```
Documents → Chunk → Embed → Vector DB
Query → Hybrid Retrieval (k=20) → Re-rank (k=5) → Context → LLM → Answer
```

**Metrics targets:**
- MRR > 0.7 (first relevant in top ~1.4)
- Precision@5 > 0.7 (70% of top-5 relevant)
- Generation F1 > 0.8 (80% token overlap)

**Key insight:** RAG quality depends on retrieval precision. Optimize retrieval (chunking, hybrid search, re-ranking, evaluation) before adding context or changing LLMs.