Initial commit

Zhongwei Li
2025-11-29 18:28:30 +08:00
commit 171acedaa4
220 changed files with 85967 additions and 0 deletions


@@ -0,0 +1,194 @@
---
name: chunking-strategy
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
allowed-tools: Read, Write, Bash
category: artificial-intelligence
tags: [rag, chunking, vector-search, embeddings, document-processing]
version: 1.0.0
---
# Chunking Strategy for RAG Systems
## Overview
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
## When to Use
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
## Instructions
### Choose Chunking Strategy
Select appropriate chunking strategy based on document type and use case:
1. **Fixed-Size Chunking** (Level 1)
- Use for simple documents without clear structure
- Start with 512 tokens and 10-20% overlap
- Adjust size based on query type: 256 for factoid, 1024 for analytical
2. **Recursive Character Chunking** (Level 2)
- Use for documents with clear structural boundaries
- Implement hierarchical separators: paragraphs → sentences → words
   - Customize separators for document types such as HTML and Markdown (see the sketch after this list)
3. **Structure-Aware Chunking** (Level 3)
- Use for structured documents (Markdown, code, tables, PDFs)
- Preserve semantic units: functions, sections, table blocks
- Validate structure preservation post-splitting
4. **Semantic Chunking** (Level 4)
- Use for complex documents with thematic shifts
- Implement embedding-based boundary detection
- Configure similarity threshold (0.8) and buffer size (3-5 sentences)
5. **Advanced Methods** (Level 5)
- Use Late Chunking for long-context embedding models
- Apply Contextual Retrieval for high-precision requirements
- Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
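For example, the hierarchical separators from Level 2 can be biased toward Markdown boundaries simply by customizing the separator list. The sketch below uses LangChain's `RecursiveCharacterTextSplitter`; the separator order, the sizes, and the `markdown_text` variable are illustrative assumptions rather than fixed recommendations.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prefer heading and paragraph breaks before falling back to sentences and words
markdown_splitter = RecursiveCharacterTextSplitter(
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,     # characters (~250 tokens)
    chunk_overlap=150,   # ~15% overlap
)
chunks = markdown_splitter.split_text(markdown_text)  # markdown_text: your document string
```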
### Implement Chunking Pipeline
Follow these steps to implement effective chunking:
1. **Pre-process documents**
- Analyze document structure and content types
- Identify multi-modal content (tables, images, code)
- Assess information density and complexity
2. **Select strategy parameters**
- Choose chunk size based on embedding model context window
- Set overlap percentage (10-20% for most cases)
- Configure strategy-specific parameters
3. **Process and validate**
- Apply chosen chunking strategy
- Validate semantic coherence of chunks
- Test with representative documents
4. **Evaluate and iterate**
- Measure retrieval precision and recall
- Monitor processing latency and resource usage
- Optimize based on specific use case requirements
Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).
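The four steps above can be wired together in a few lines. This is a minimal sketch built on the splitter used in the examples below; the 1.3 tokens-per-word estimate, the 4-characters-per-token conversion, and the 20-word minimum chunk length are assumptions to tune per corpus.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text: str, analytical: bool = False) -> list[str]:
    # 1. Pre-process: rough token estimate to gauge document size
    est_tokens = int(len(text.split()) * 1.3)

    # 2. Select parameters: larger chunks for analytical queries, ~15% overlap
    chunk_size_tokens = 1024 if analytical else 512
    if est_tokens <= chunk_size_tokens:
        return [text]  # small documents stay whole
    chunk_chars = chunk_size_tokens * 4  # rough characters-per-token conversion
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_chars,
        chunk_overlap=int(chunk_chars * 0.15),
    )

    # 3. Process and validate: drop fragments too short to stand alone
    chunks = [c for c in splitter.split_text(text) if len(c.split()) > 20]

    # 4. Evaluate and iterate offline (see references/evaluation.md)
    return chunks
```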
### Evaluate Performance
Use these metrics to evaluate chunking effectiveness:
- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
- **End-to-End Accuracy**: Quality of final RAG responses
- **Processing Time**: Latency impact on overall system
- **Resource Usage**: Memory and computational costs
Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).
## Examples
### Basic Fixed-Size Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries; `documents` is a list of loaded Document objects
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)
```
### Structure-Aware Code Chunking
```python
import ast

def chunk_python_code(code):
    """Split Python code into semantic chunks (functions and classes)."""
    tree = ast.parse(code)
    chunks = []
    # Note: ast.walk also visits nested definitions, so class methods appear
    # both inside their class chunk and again as standalone chunks.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
```
### Semantic Chunking with Embeddings
```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def split_into_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text at semantic boundaries between adjacent sentences."""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings
        a, b = embeddings[i - 1], embeddings[i]
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
## Best Practices
### Core Principles
- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial
### Implementation Guidelines
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics
### Common Pitfalls to Avoid
- Over-chunking: Creating too many small, context-poor chunks
- Under-chunking: Missing relevant information due to oversized chunks
- Ignoring document structure and semantic boundaries
- Using one-size-fits-all approach for diverse content types
- Neglecting overlap for boundary-crossing information
## Constraints
### Resource Considerations
- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing
### Quality Requirements
- Validate chunk semantic coherence post-processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases
## References
Reference detailed documentation in the [references/](references/) folder:
- [strategies.md](references/strategies.md) - Detailed strategy implementations
- [implementation.md](references/implementation.md) - Complete implementation guidelines
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
- [tools.md](references/tools.md) - Recommended libraries and frameworks
- [research.md](references/research.md) - Key research papers and findings
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools



@@ -0,0 +1,904 @@
# Performance Evaluation Framework
This document provides comprehensive methodologies for evaluating chunking strategy performance and effectiveness.
## Evaluation Metrics
### Core Retrieval Metrics
#### Retrieval Precision
Measures the fraction of retrieved chunks that are relevant to the query.
```python
from typing import Dict, List

def calculate_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval precision
Precision = |Relevant ∩ Retrieved| / |Retrieved|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not retrieved_ids:
return 0.0
return len(intersection) / len(retrieved_ids)
```
#### Retrieval Recall
Measures the fraction of relevant chunks that are successfully retrieved.
```python
def calculate_recall(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval recall
Recall = |Relevant ∩ Retrieved| / |Relevant|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not relevant_ids:
return 0.0
return len(intersection) / len(relevant_ids)
```
#### F1-Score
Harmonic mean of precision and recall.
```python
def calculate_f1_score(precision: float, recall: float) -> float:
"""
Calculate F1-score
F1 = 2 * (Precision * Recall) / (Precision + Recall)
"""
if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)
```
### Mean Reciprocal Rank (MRR)
Measures the rank of the first relevant result.
```python
def calculate_mrr(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Reciprocal Rank
"""
reciprocal_ranks = []
for query, query_results in zip(queries, results):
relevant_found = False
for rank, result in enumerate(query_results, 1):
if result.get('is_relevant', False):
reciprocal_ranks.append(1.0 / rank)
relevant_found = True
break
if not relevant_found:
reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```
### Mean Average Precision (MAP)
Considers both precision and the ranking of relevant documents.
```python
def calculate_average_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate Average Precision for a single query
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
if not relevant_ids:
return 0.0
precisions = []
relevant_count = 0
for rank, chunk in enumerate(retrieved_chunks, 1):
if chunk.get('id') in relevant_ids:
relevant_count += 1
precision_at_rank = relevant_count / rank
precisions.append(precision_at_rank)
return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
def calculate_map(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Average Precision across multiple queries
"""
average_precisions = []
for query, query_results in zip(queries, results):
ap = calculate_average_precision(query_results, query.get('relevant_chunks', []))
average_precisions.append(ap)
return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
```
### Normalized Discounted Cumulative Gain (NDCG)
Measures ranking quality with emphasis on highly relevant results.
```python
import numpy as np

def calculate_dcg(retrieved_chunks: List[Dict]) -> float:
"""
Calculate Discounted Cumulative Gain
"""
dcg = 0.0
for rank, chunk in enumerate(retrieved_chunks, 1):
relevance = chunk.get('relevance_score', 0)
dcg += relevance / np.log2(rank + 1)
return dcg
def calculate_ndcg(retrieved_chunks: List[Dict], ideal_chunks: List[Dict]) -> float:
"""
Calculate Normalized Discounted Cumulative Gain
"""
dcg = calculate_dcg(retrieved_chunks)
idcg = calculate_dcg(ideal_chunks)
if idcg == 0:
return 0.0
return dcg / idcg
```
## End-to-End RAG Evaluation
### Answer Quality Metrics
#### Factual Consistency
Measures how well the generated answer aligns with retrieved chunks.
```python
from typing import List

import spacy
from transformers import pipeline
class FactualConsistencyEvaluator:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
self.nli_pipeline = pipeline("text-classification",
model="roberta-large-mnli")
def evaluate_consistency(self, answer: str, retrieved_chunks: List[str]) -> float:
"""
Evaluate factual consistency between answer and retrieved context
"""
if not retrieved_chunks:
return 0.0
# Combine retrieved chunks as context
context = " ".join(retrieved_chunks[:3]) # Use top 3 chunks
# Use Natural Language Inference to check consistency
result = self.nli_pipeline(f"premise: {context} hypothesis: {answer}")
# Extract consistency score (entailment probability)
for item in result:
if item['label'] == 'ENTAILMENT':
return item['score']
elif item['label'] == 'CONTRADICTION':
return 1.0 - item['score']
return 0.5 # Neutral if NLI is inconclusive
```
#### Answer Completeness
Measures how completely the answer addresses the user's query.
```python
def evaluate_completeness(answer: str, query: str, reference_answer: str = None) -> float:
"""
Evaluate answer completeness
"""
# Extract key entities from query
query_entities = extract_entities(query)
answer_entities = extract_entities(answer)
# Calculate entity coverage
if not query_entities:
return 0.5 # Neutral if no entities in query
covered_entities = query_entities & answer_entities
entity_coverage = len(covered_entities) / len(query_entities)
# If reference answer is available, compare against it
if reference_answer:
reference_entities = extract_entities(reference_answer)
answer_reference_overlap = len(answer_entities & reference_entities) / max(len(reference_entities), 1)
return (entity_coverage + answer_reference_overlap) / 2
return entity_coverage
def extract_entities(text: str) -> set:
"""
Extract named entities from text (simplified)
"""
# This would use a proper NER model in practice
import re
# Simple noun phrase extraction as placeholder
noun_phrases = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
return set(noun_phrases)
```
#### Response Relevance
Measures how relevant the answer is to the original query.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class RelevanceEvaluator:
def __init__(self, model_name="all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def evaluate_relevance(self, query: str, answer: str) -> float:
"""
Evaluate semantic relevance between query and answer
"""
# Generate embeddings
query_embedding = self.model.encode([query])
answer_embedding = self.model.encode([answer])
# Calculate cosine similarity
similarity = cosine_similarity(query_embedding, answer_embedding)[0][0]
return float(similarity)
```
## Performance Metrics
### Processing Time
```python
import time
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class PerformanceMetrics:
total_time: float
chunking_time: float
embedding_time: float
search_time: float
generation_time: float
throughput: float # documents per second
class PerformanceProfiler:
def __init__(self):
self.timings = {}
self.start_times = {}
def start_timer(self, operation: str):
self.start_times[operation] = time.time()
def end_timer(self, operation: str):
if operation in self.start_times:
duration = time.time() - self.start_times[operation]
if operation not in self.timings:
self.timings[operation] = []
self.timings[operation].append(duration)
return duration
return 0.0
def get_performance_metrics(self, document_count: int) -> PerformanceMetrics:
total_time = sum(sum(times) for times in self.timings.values())
return PerformanceMetrics(
total_time=total_time,
chunking_time=sum(self.timings.get('chunking', [0])),
embedding_time=sum(self.timings.get('embedding', [0])),
search_time=sum(self.timings.get('search', [0])),
generation_time=sum(self.timings.get('generation', [0])),
throughput=document_count / total_time if total_time > 0 else 0
)
```
### Memory Usage
```python
import os
import time
from typing import Dict, List

import psutil
class MemoryProfiler:
def __init__(self):
self.process = psutil.Process(os.getpid())
self.memory_snapshots = []
def take_memory_snapshot(self, label: str):
"""Take a snapshot of current memory usage"""
memory_info = self.process.memory_info()
memory_mb = memory_info.rss / 1024 / 1024 # Convert to MB
self.memory_snapshots.append({
'label': label,
'memory_mb': memory_mb,
'timestamp': time.time()
})
def get_peak_memory_usage(self) -> float:
"""Get peak memory usage in MB"""
if not self.memory_snapshots:
return 0.0
return max(snapshot['memory_mb'] for snapshot in self.memory_snapshots)
def get_memory_usage_by_operation(self) -> Dict[str, float]:
"""Get memory usage breakdown by operation"""
if not self.memory_snapshots:
return {}
memory_by_op = {}
for i in range(1, len(self.memory_snapshots)):
prev_snapshot = self.memory_snapshots[i-1]
curr_snapshot = self.memory_snapshots[i]
operation = curr_snapshot['label']
memory_delta = curr_snapshot['memory_mb'] - prev_snapshot['memory_mb']
if operation not in memory_by_op:
memory_by_op[operation] = []
memory_by_op[operation].append(memory_delta)
return {op: sum(deltas) for op, deltas in memory_by_op.items()}
```
## Evaluation Datasets
### Standardized Test Sets
#### Question-Answer Pairs
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class EvaluationQuery:
id: str
question: str
reference_answer: Optional[str]
relevant_chunk_ids: List[str]
query_type: str # factoid, analytical, comparative
difficulty: str # easy, medium, hard
domain: str # finance, medical, legal, technical
class EvaluationDataset:
def __init__(self, name: str):
self.name = name
self.queries: List[EvaluationQuery] = []
self.documents: Dict[str, str] = {}
self.chunks: Dict[str, Dict] = {}
def add_query(self, query: EvaluationQuery):
self.queries.append(query)
def add_document(self, doc_id: str, content: str):
self.documents[doc_id] = content
def add_chunk(self, chunk_id: str, content: str, doc_id: str, metadata: Dict):
self.chunks[chunk_id] = {
'id': chunk_id,
'content': content,
'doc_id': doc_id,
'metadata': metadata
}
def save_to_file(self, filepath: str):
data = {
'name': self.name,
'queries': [
{
'id': q.id,
'question': q.question,
'reference_answer': q.reference_answer,
'relevant_chunk_ids': q.relevant_chunk_ids,
'query_type': q.query_type,
'difficulty': q.difficulty,
'domain': q.domain
}
for q in self.queries
],
'documents': self.documents,
'chunks': self.chunks
}
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
@classmethod
def load_from_file(cls, filepath: str):
with open(filepath, 'r') as f:
data = json.load(f)
dataset = cls(data['name'])
dataset.documents = data['documents']
dataset.chunks = data['chunks']
for q_data in data['queries']:
query = EvaluationQuery(
id=q_data['id'],
question=q_data['question'],
reference_answer=q_data.get('reference_answer'),
relevant_chunk_ids=q_data['relevant_chunk_ids'],
query_type=q_data['query_type'],
difficulty=q_data['difficulty'],
domain=q_data['domain']
)
dataset.add_query(query)
return dataset
```
### Dataset Generation
#### Synthetic Query Generation
```python
import random
from typing import List, Dict
class SyntheticQueryGenerator:
def __init__(self):
self.query_templates = {
'factoid': [
"What is {concept}?",
"When did {event} occur?",
"Who developed {technology}?",
"How many {items} are mentioned?",
"What is the value of {metric}?"
],
'analytical': [
"Compare and contrast {concept1} and {concept2}.",
"Analyze the impact of {concept} on {domain}.",
"What are the advantages and disadvantages of {technology}?",
"Explain the relationship between {concept1} and {concept2}.",
"Evaluate the effectiveness of {approach} for {problem}."
],
'comparative': [
"Which is better: {option1} or {option2}?",
"How does {method1} differ from {method2}?",
"Compare the performance of {system1} and {system2}.",
"What are the key differences between {approach1} and {approach2}?"
]
}
def generate_queries_from_chunks(self, chunks: List[Dict], num_queries: int = 100) -> List[EvaluationQuery]:
"""Generate synthetic queries from document chunks"""
queries = []
# Extract entities and concepts from chunks
entities = self._extract_entities_from_chunks(chunks)
for i in range(num_queries):
query_type = random.choice(['factoid', 'analytical', 'comparative'])
template = random.choice(self.query_templates[query_type])
# Fill template with extracted entities
query_text = self._fill_template(template, entities)
# Find relevant chunks for this query
relevant_chunks = self._find_relevant_chunks(query_text, chunks)
query = EvaluationQuery(
id=f"synthetic_{i}",
question=query_text,
reference_answer=None, # Would need generation model
relevant_chunk_ids=[chunk['id'] for chunk in relevant_chunks],
query_type=query_type,
difficulty=random.choice(['easy', 'medium', 'hard']),
domain='synthetic'
)
queries.append(query)
return queries
def _extract_entities_from_chunks(self, chunks: List[Dict]) -> Dict[str, List[str]]:
"""Extract entities, concepts, and relationships from chunks"""
# This would use proper NER in practice
entities = {
'concepts': [],
'technologies': [],
'methods': [],
'metrics': [],
'events': []
}
for chunk in chunks:
content = chunk['content']
# Simplified entity extraction
words = content.split()
entities['concepts'].extend([word for word in words if len(word) > 6])
entities['technologies'].extend([word for word in words if 'technology' in word.lower()])
entities['methods'].extend([word for word in words if 'method' in word.lower()])
entities['metrics'].extend([word for word in words if '%' in word or '$' in word])
# Remove duplicates and limit
for key in entities:
entities[key] = list(set(entities[key]))[:50]
return entities
def _fill_template(self, template: str, entities: Dict[str, List[str]]) -> str:
"""Fill query template with random entities"""
import re
def replace_placeholder(match):
placeholder = match.group(1)
# Map placeholders to entity types
entity_mapping = {
'concept': 'concepts',
'concept1': 'concepts',
'concept2': 'concepts',
'technology': 'technologies',
'method': 'methods',
'method1': 'methods',
'method2': 'methods',
'metric': 'metrics',
'event': 'events',
'items': 'concepts',
'option1': 'concepts',
'option2': 'concepts',
'approach': 'methods',
'problem': 'concepts',
'domain': 'concepts',
'system1': 'concepts',
'system2': 'concepts'
}
entity_type = entity_mapping.get(placeholder, 'concepts')
available_entities = entities.get(entity_type, ['something'])
if available_entities:
return random.choice(available_entities)
else:
return 'something'
return re.sub(r'\{(\w+)\}', replace_placeholder, template)
def _find_relevant_chunks(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
"""Find chunks most relevant to the query"""
# Simple keyword matching for synthetic generation
query_words = set(query.lower().split())
chunk_scores = []
for chunk in chunks:
chunk_words = set(chunk['content'].lower().split())
overlap = len(query_words & chunk_words)
chunk_scores.append((overlap, chunk))
# Sort by overlap and return top k
chunk_scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in chunk_scores[:k]]
```
## A/B Testing Framework
### Statistical Significance Testing
```python
import numpy as np
from scipy import stats
from typing import List, Dict, Tuple
class ABTestAnalyzer:
def __init__(self):
self.significance_level = 0.05
def compare_metrics(self, control_metrics: List[float],
treatment_metrics: List[float],
metric_name: str) -> Dict:
"""
Compare metrics between control and treatment groups
"""
control_mean = np.mean(control_metrics)
treatment_mean = np.mean(treatment_metrics)
control_std = np.std(control_metrics)
treatment_std = np.std(treatment_metrics)
# Perform t-test
t_statistic, p_value = stats.ttest_ind(control_metrics, treatment_metrics)
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(control_metrics) - 1) * control_std**2 +
(len(treatment_metrics) - 1) * treatment_std**2) /
(len(control_metrics) + len(treatment_metrics) - 2))
cohens_d = (treatment_mean - control_mean) / pooled_std if pooled_std > 0 else 0
# Determine significance
is_significant = p_value < self.significance_level
return {
'metric_name': metric_name,
'control_mean': control_mean,
'treatment_mean': treatment_mean,
'absolute_difference': treatment_mean - control_mean,
'relative_difference': ((treatment_mean - control_mean) / control_mean * 100) if control_mean != 0 else 0,
'control_std': control_std,
'treatment_std': treatment_std,
't_statistic': t_statistic,
'p_value': p_value,
'is_significant': is_significant,
'effect_size': cohens_d,
'significance_level': self.significance_level
}
def analyze_ab_test_results(self,
control_results: Dict[str, List[float]],
treatment_results: Dict[str, List[float]]) -> Dict:
"""
Analyze A/B test results across multiple metrics
"""
analysis_results = {}
# Ensure both dictionaries have the same keys
all_metrics = set(control_results.keys()) & set(treatment_results.keys())
for metric in all_metrics:
if metric in control_results and metric in treatment_results:
analysis_results[metric] = self.compare_metrics(
control_results[metric],
treatment_results[metric],
metric
)
# Calculate overall summary
significant_improvements = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] > 0)
significant_degradations = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] < 0)
analysis_results['summary'] = {
'total_metrics_compared': len(analysis_results),
'significant_improvements': significant_improvements,
'significant_degradations': significant_degradations,
'no_significant_change': len(analysis_results) - significant_improvements - significant_degradations
}
return analysis_results
```
## Automated Evaluation Pipeline
### End-to-End Evaluation
```python
import random
from typing import Any, Dict, List

import numpy as np

# ChunkQualityAssessor and DocumentAnalyzer are defined in references/implementation.md
class ChunkingEvaluationPipeline:
def __init__(self, strategies: Dict[str, Any], dataset: EvaluationDataset):
self.strategies = strategies
self.dataset = dataset
self.results = {}
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
def run_evaluation(self) -> Dict:
"""Run comprehensive evaluation of all strategies"""
evaluation_results = {}
for strategy_name, strategy in self.strategies.items():
print(f"Evaluating strategy: {strategy_name}")
# Reset profilers for each strategy
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
# Evaluate strategy
strategy_results = self._evaluate_strategy(strategy, strategy_name)
evaluation_results[strategy_name] = strategy_results
# Compare strategies
comparison_results = self._compare_strategies(evaluation_results)
return {
'individual_results': evaluation_results,
'comparison': comparison_results,
'recommendations': self._generate_recommendations(comparison_results)
}
def _evaluate_strategy(self, strategy: Any, strategy_name: str) -> Dict:
"""Evaluate a single chunking strategy"""
results = {
'strategy_name': strategy_name,
'retrieval_metrics': {},
'quality_metrics': {},
'performance_metrics': {}
}
# Track memory usage
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_start")
# Process all documents
self.profiler.start_timer('total_processing')
all_chunks = {}
for doc_id, content in self.dataset.documents.items():
self.profiler.start_timer('chunking')
chunks = strategy.chunk(content)
self.profiler.end_timer('chunking')
all_chunks[doc_id] = chunks
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_chunking")
# Generate embeddings for chunks
self.profiler.start_timer('embedding')
chunk_embeddings = self._generate_embeddings(all_chunks)
self.profiler.end_timer('embedding')
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_embedding")
# Evaluate retrieval performance
retrieval_results = self._evaluate_retrieval(all_chunks, chunk_embeddings)
results['retrieval_metrics'] = retrieval_results
# Evaluate chunk quality
quality_results = self._evaluate_chunk_quality(all_chunks)
results['quality_metrics'] = quality_results
# Get performance metrics
self.profiler.end_timer('total_processing')
performance_metrics = self.profiler.get_performance_metrics(len(self.dataset.documents))
results['performance_metrics'] = performance_metrics.__dict__
# Get memory metrics
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_end")
results['memory_metrics'] = {
'peak_memory_mb': self.memory_profiler.get_peak_memory_usage(),
'memory_by_operation': self.memory_profiler.get_memory_usage_by_operation()
}
return results
def _evaluate_retrieval(self, all_chunks: Dict, chunk_embeddings: Dict) -> Dict:
"""Evaluate retrieval performance"""
retrieval_metrics = {
'precision': [],
'recall': [],
'f1_score': [],
'mrr': [],
'map': []
}
for query in self.dataset.queries:
# Perform retrieval
self.profiler.start_timer('search')
retrieved_chunks = self._retrieve_chunks(query.question, chunk_embeddings, k=10)
self.profiler.end_timer('search')
# Get relevant chunks for this query
            # Ground truth: all relevant chunks for this query, not just those retrieved
            relevant_chunks = [{'id': chunk_id} for chunk_id in query.relevant_chunk_ids]
# Calculate metrics
precision = calculate_precision(retrieved_chunks, relevant_chunks)
recall = calculate_recall(retrieved_chunks, relevant_chunks)
f1 = calculate_f1_score(precision, recall)
retrieval_metrics['precision'].append(precision)
retrieval_metrics['recall'].append(recall)
retrieval_metrics['f1_score'].append(f1)
# Calculate averages
        # Average each metric, skipping any that were never populated (e.g. mrr, map)
        return {metric: float(np.mean(values)) for metric, values in retrieval_metrics.items() if values}
def _evaluate_chunk_quality(self, all_chunks: Dict) -> Dict:
"""Evaluate quality of generated chunks"""
quality_assessor = ChunkQualityAssessor()
quality_scores = []
for doc_id, chunks in all_chunks.items():
# Analyze document
content = self.dataset.documents[doc_id]
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze(content)
# Assess chunk quality
scores = quality_assessor.assess_chunks(chunks, analysis)
quality_scores.append(scores)
# Aggregate quality scores
if quality_scores:
avg_scores = {}
for metric in quality_scores[0].keys():
avg_scores[metric] = np.mean([scores[metric] for scores in quality_scores])
return avg_scores
return {}
def _compare_strategies(self, evaluation_results: Dict) -> Dict:
"""Compare performance across strategies"""
ab_analyzer = ABTestAnalyzer()
comparison = {}
# Compare each metric across strategies
strategy_names = list(evaluation_results.keys())
for i in range(len(strategy_names)):
for j in range(i + 1, len(strategy_names)):
strategy1 = strategy_names[i]
strategy2 = strategy_names[j]
comparison_key = f"{strategy1}_vs_{strategy2}"
comparison[comparison_key] = {}
# Compare retrieval metrics
for metric in ['precision', 'recall', 'f1_score']:
if (metric in evaluation_results[strategy1]['retrieval_metrics'] and
metric in evaluation_results[strategy2]['retrieval_metrics']):
comparison[comparison_key][f"retrieval_{metric}"] = ab_analyzer.compare_metrics(
[evaluation_results[strategy1]['retrieval_metrics'][metric]],
[evaluation_results[strategy2]['retrieval_metrics'][metric]],
f"retrieval_{metric}"
)
return comparison
def _generate_recommendations(self, comparison_results: Dict) -> Dict:
"""Generate recommendations based on evaluation results"""
recommendations = {
'best_overall': None,
'best_for_precision': None,
'best_for_recall': None,
'best_for_performance': None,
'trade_offs': []
}
# This would analyze the comparison results and generate specific recommendations
# Implementation depends on specific use case requirements
return recommendations
def _generate_embeddings(self, all_chunks: Dict) -> Dict:
"""Generate embeddings for all chunks"""
# This would use the actual embedding model
# Placeholder implementation
embeddings = {}
for doc_id, chunks in all_chunks.items():
embeddings[doc_id] = []
for chunk in chunks:
# Generate embedding for chunk content
embedding = np.random.rand(384) # Placeholder
embeddings[doc_id].append({
'chunk': chunk,
'embedding': embedding
})
return embeddings
def _retrieve_chunks(self, query: str, chunk_embeddings: Dict, k: int = 10) -> List[Dict]:
"""Retrieve most relevant chunks for a query"""
# This would use actual similarity search
# Placeholder implementation
all_chunks = []
for doc_embeddings in chunk_embeddings.values():
for chunk_data in doc_embeddings:
all_chunks.append(chunk_data['chunk'])
# Simple random selection as placeholder
selected = random.sample(all_chunks, min(k, len(all_chunks)))
return selected
```
This comprehensive evaluation framework provides the tools needed to thoroughly assess chunking strategies across multiple dimensions: retrieval effectiveness, answer quality, system performance, and statistical significance. The modular design allows for easy extension and customization based on specific requirements and use cases.


@@ -0,0 +1,709 @@
# Complete Implementation Guidelines
This document provides comprehensive implementation guidance for building effective chunking systems.
## System Architecture
### Core Components
```
Document Processor
├── Ingestion Layer
│   ├── Document Type Detection
│   ├── Format Parsing (PDF, HTML, Markdown, etc.)
│   └── Content Extraction
├── Analysis Layer
│   ├── Structure Analysis
│   ├── Content Type Identification
│   └── Complexity Assessment
├── Strategy Selection Layer
│   ├── Rule-based Selection
│   ├── ML-based Prediction
│   └── Adaptive Configuration
├── Chunking Layer
│   ├── Strategy Implementation
│   ├── Parameter Optimization
│   └── Quality Validation
└── Output Layer
    ├── Chunk Metadata Generation
    ├── Embedding Integration
    └── Storage Preparation
```
## Pre-processing Pipeline
### Document Analysis Framework
```python
from dataclasses import dataclass
from typing import List, Dict, Any
import re
@dataclass
class DocumentAnalysis:
doc_type: str
structure_score: float # 0-1, higher means more structured
complexity_score: float # 0-1, higher means more complex
content_types: List[str]
language: str
estimated_tokens: int
has_multimodal: bool
class DocumentAnalyzer:
def __init__(self):
self.structure_patterns = {
'markdown': [r'^#+\s', r'^\*\*.*\*\*$', r'^\* ', r'^\d+\. '],
'html': [r'<h[1-6]>', r'<p>', r'<div>', r'<table>'],
'latex': [r'\\section', r'\\subsection', r'\\begin\{', r'\\end\{'],
'academic': [r'^\d+\.', r'^\d+\.\d+', r'^[A-Z]\.', r'^Figure \d+']
}
def analyze(self, content: str) -> DocumentAnalysis:
doc_type = self.detect_document_type(content)
structure_score = self.calculate_structure_score(content, doc_type)
complexity_score = self.calculate_complexity_score(content)
content_types = self.identify_content_types(content)
language = self.detect_language(content)
estimated_tokens = self.estimate_tokens(content)
has_multimodal = self.detect_multimodal_content(content)
return DocumentAnalysis(
doc_type=doc_type,
structure_score=structure_score,
complexity_score=complexity_score,
content_types=content_types,
language=language,
estimated_tokens=estimated_tokens,
has_multimodal=has_multimodal
)
def detect_document_type(self, content: str) -> str:
content_lower = content.lower()
if '<html' in content_lower or '<body' in content_lower:
return 'html'
elif '#' in content and '##' in content:
return 'markdown'
elif '\\documentclass' in content_lower or '\\begin{' in content_lower:
return 'latex'
elif any(keyword in content_lower for keyword in ['abstract', 'introduction', 'conclusion', 'references']):
return 'academic'
elif 'def ' in content or 'class ' in content or 'function ' in content_lower:
return 'code'
else:
return 'plain'
def calculate_structure_score(self, content: str, doc_type: str) -> float:
patterns = self.structure_patterns.get(doc_type, [])
if not patterns:
return 0.5 # Default for unstructured content
line_count = len(content.split('\n'))
structured_lines = 0
for line in content.split('\n'):
for pattern in patterns:
if re.search(pattern, line.strip()):
structured_lines += 1
break
return min(structured_lines / max(line_count, 1), 1.0)
def calculate_complexity_score(self, content: str) -> float:
# Factors that increase complexity
avg_sentence_length = self.calculate_avg_sentence_length(content)
vocabulary_richness = self.calculate_vocabulary_richness(content)
nested_structure = self.detect_nested_structure(content)
# Normalize and combine
complexity = (
min(avg_sentence_length / 30, 1.0) * 0.3 +
vocabulary_richness * 0.4 +
nested_structure * 0.3
)
return min(complexity, 1.0)
def identify_content_types(self, content: str) -> List[str]:
types = []
if '```' in content or 'def ' in content or 'function ' in content.lower():
types.append('code')
if '|' in content and '\n' in content:
types.append('tables')
if re.search(r'\!\[.*\]\(.*\)', content):
types.append('images')
if re.search(r'http[s]?://', content):
types.append('links')
if re.search(r'\d+\.\d+', content) or re.search(r'\$\d', content):
types.append('numbers')
return types if types else ['text']
def detect_language(self, content: str) -> str:
# Simple language detection - can be enhanced with proper language detection libraries
if re.search(r'[\u4e00-\u9fff]', content):
return 'chinese'
        elif re.search(r'[\u0600-\u06ff]', content):
            return 'arabic'
        elif re.search(r'[\u0400-\u04ff]', content):
return 'russian'
else:
return 'english' # Default assumption
def estimate_tokens(self, content: str) -> int:
# Rough estimation - actual tokenization varies by model
word_count = len(content.split())
return int(word_count * 1.3) # Average tokens per word
def detect_multimodal_content(self, content: str) -> bool:
multimodal_indicators = [
r'\!\[.*\]\(.*\)', # Images
r'<iframe', # Embedded content
r'<object', # Embedded objects
r'<embed', # Embedded media
]
return any(re.search(pattern, content) for pattern in multimodal_indicators)
def calculate_avg_sentence_length(self, content: str) -> float:
sentences = re.split(r'[.!?]+', content)
sentences = [s.strip() for s in sentences if s.strip()]
if not sentences:
return 0
return sum(len(s.split()) for s in sentences) / len(sentences)
def calculate_vocabulary_richness(self, content: str) -> float:
words = content.lower().split()
if not words:
return 0
unique_words = set(words)
return len(unique_words) / len(words)
def detect_nested_structure(self, content: str) -> float:
# Detect nested lists, indented content, etc.
lines = content.split('\n')
indented_lines = 0
for line in lines:
if line.strip() and line.startswith(' '):
indented_lines += 1
return indented_lines / max(len(lines), 1)
```
### Strategy Selection Engine
```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List
class ChunkingStrategy(ABC):
@abstractmethod
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
pass
class StrategySelector:
def __init__(self):
self.strategies = {
'fixed_size': FixedSizeStrategy(),
'recursive': RecursiveStrategy(),
'structure_aware': StructureAwareStrategy(),
'semantic': SemanticStrategy(),
'adaptive': AdaptiveStrategy()
}
def select_strategy(self, analysis: DocumentAnalysis) -> str:
# Rule-based selection logic
if analysis.structure_score > 0.8 and analysis.doc_type in ['markdown', 'html', 'latex']:
return 'structure_aware'
elif analysis.complexity_score > 0.7 and analysis.estimated_tokens < 10000:
return 'semantic'
elif analysis.doc_type == 'code':
return 'structure_aware'
elif analysis.structure_score < 0.3:
return 'fixed_size'
elif analysis.complexity_score > 0.5:
return 'recursive'
else:
return 'adaptive'
def get_strategy(self, analysis: DocumentAnalysis) -> ChunkingStrategy:
strategy_name = self.select_strategy(analysis)
return self.strategies[strategy_name]
# Example strategy implementations
class FixedSizeStrategy(ChunkingStrategy):
def __init__(self, default_size=512, default_overlap=50):
self.default_size = default_size
self.default_overlap = default_overlap
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Adjust parameters based on analysis
if analysis.complexity_score > 0.7:
chunk_size = 1024
elif analysis.complexity_score < 0.3:
chunk_size = 256
else:
chunk_size = self.default_size
overlap = int(chunk_size * 0.1) # 10% overlap
# Implementation here...
return self._fixed_size_chunk(content, chunk_size, overlap)
def _fixed_size_chunk(self, content: str, chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
# Implementation using RecursiveCharacterTextSplitter or custom logic
pass
class AdaptiveStrategy(ChunkingStrategy):
    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Combine multiple strategies based on content characteristics
        structured_chunks: List[Dict[str, Any]] = []
        unstructured_chunks: List[Dict[str, Any]] = []
        if analysis.structure_score > 0.6:
            # Use structure-aware for structured parts
            structured_chunks = self._chunk_structured_parts(content, analysis)
        else:
            # Use fixed-size for unstructured parts
            unstructured_chunks = self._chunk_unstructured_parts(content, analysis)
        # Merge and optimize; one of the two lists stays empty
        return self._merge_chunks(structured_chunks + unstructured_chunks)
def _chunk_structured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for structured content
pass
def _chunk_unstructured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for unstructured content
pass
def _merge_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
# Implementation for merging and optimizing chunks
pass
```
## Quality Assurance Framework
### Chunk Quality Metrics
```python
import re
from typing import List, Dict, Any
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class ChunkQualityAssessor:
def __init__(self):
self.quality_weights = {
'coherence': 0.3,
'completeness': 0.25,
'size_appropriateness': 0.2,
'semantic_similarity': 0.15,
'boundary_quality': 0.1
}
def assess_chunks(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> Dict[str, float]:
scores = {}
# Coherence: Do chunks make sense on their own?
scores['coherence'] = self._assess_coherence(chunks)
# Completeness: Do chunks preserve important information?
scores['completeness'] = self._assess_completeness(chunks, analysis)
# Size appropriateness: Are chunks within optimal size range?
scores['size_appropriateness'] = self._assess_size(chunks)
# Semantic similarity: Are chunks thematically consistent?
scores['semantic_similarity'] = self._assess_semantic_consistency(chunks)
# Boundary quality: Are chunk boundaries placed well?
scores['boundary_quality'] = self._assess_boundary_quality(chunks)
# Calculate overall quality score
overall_score = sum(
score * self.quality_weights[metric]
for metric, score in scores.items()
)
scores['overall'] = overall_score
return scores
def _assess_coherence(self, chunks: List[Dict[str, Any]]) -> float:
# Simple heuristic-based coherence assessment
coherence_scores = []
for chunk in chunks:
content = chunk['content']
# Check for complete sentences
sentences = re.split(r'[.!?]+', content)
complete_sentences = sum(1 for s in sentences if s.strip())
coherence = complete_sentences / max(len(sentences), 1)
coherence_scores.append(coherence)
return np.mean(coherence_scores)
def _assess_completeness(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if important structural elements are preserved
if analysis.doc_type in ['markdown', 'html']:
return self._assess_structure_preservation(chunks, analysis)
else:
return self._assess_content_preservation(chunks)
def _assess_structure_preservation(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if headings, lists, and other structural elements are preserved
preserved_elements = 0
total_elements = 0
for chunk in chunks:
content = chunk['content']
# Count preserved structural elements
headings = len(re.findall(r'^#+\s', content, re.MULTILINE))
lists = len(re.findall(r'^\s*[-*+]\s', content, re.MULTILINE))
preserved_elements += headings + lists
total_elements += 1 # Simplified count
return preserved_elements / max(total_elements, 1)
def _assess_content_preservation(self, chunks: List[Dict[str, Any]]) -> float:
# Simple check based on content ratio
total_content = ''.join(chunk['content'] for chunk in chunks)
# This would need comparison with original content
return 0.8 # Placeholder
def _assess_size(self, chunks: List[Dict[str, Any]]) -> float:
optimal_min = 100 # tokens
optimal_max = 1000 # tokens
size_scores = []
for chunk in chunks:
token_count = self._estimate_tokens(chunk['content'])
if optimal_min <= token_count <= optimal_max:
score = 1.0
elif token_count < optimal_min:
score = token_count / optimal_min
else:
score = max(0, 1 - (token_count - optimal_max) / optimal_max)
size_scores.append(score)
return np.mean(size_scores)
def _assess_semantic_consistency(self, chunks: List[Dict[str, Any]]) -> float:
# This would require embedding models for actual implementation
# Placeholder implementation
return 0.7
def _assess_boundary_quality(self, chunks: List[Dict[str, Any]]) -> float:
# Check if boundaries don't split important content
boundary_scores = []
for i, chunk in enumerate(chunks):
content = chunk['content']
# Check for incomplete sentences at boundaries
if not content.strip().endswith(('.', '!', '?', '>', '}')):
boundary_scores.append(0.5)
else:
boundary_scores.append(1.0)
return np.mean(boundary_scores)
def _estimate_tokens(self, content: str) -> int:
# Simple token estimation
return len(content.split()) * 4 // 3 # Rough approximation
```
## Error Handling and Edge Cases
### Robust Error Handling
```python
import logging
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class ChunkingError:
error_type: str
message: str
chunk_index: Optional[int] = None
recovery_action: Optional[str] = None
class ChunkingErrorHandler:
def __init__(self):
self.logger = logging.getLogger(__name__)
self.error_handlers = {
'empty_content': self._handle_empty_content,
'oversized_chunk': self._handle_oversized_chunk,
'encoding_error': self._handle_encoding_error,
'memory_error': self._handle_memory_error,
'structure_parsing_error': self._handle_structure_parsing_error
}
def handle_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
error_type = self._classify_error(error)
handler = self.error_handlers.get(error_type, self._handle_generic_error)
return handler(error, context)
def _classify_error(self, error: Exception) -> str:
if isinstance(error, ValueError) and 'empty' in str(error).lower():
return 'empty_content'
elif isinstance(error, MemoryError):
return 'memory_error'
elif isinstance(error, UnicodeError):
return 'encoding_error'
elif 'too large' in str(error).lower():
return 'oversized_chunk'
elif 'parsing' in str(error).lower():
return 'structure_parsing_error'
else:
return 'generic_error'
def _handle_empty_content(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Empty content encountered: {error}")
return ChunkingError(
error_type='empty_content',
message=str(error),
recovery_action='skip_empty_content'
)
def _handle_oversized_chunk(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Oversized chunk detected: {error}")
return ChunkingError(
error_type='oversized_chunk',
message=str(error),
chunk_index=context.get('chunk_index'),
recovery_action='reduce_chunk_size'
)
def _handle_encoding_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Encoding error: {error}")
return ChunkingError(
error_type='encoding_error',
message=str(error),
recovery_action='fallback_encoding'
)
def _handle_memory_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Memory error during chunking: {error}")
return ChunkingError(
error_type='memory_error',
message=str(error),
recovery_action='process_in_batches'
)
def _handle_structure_parsing_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Structure parsing failed: {error}")
return ChunkingError(
error_type='structure_parsing_error',
message=str(error),
recovery_action='fallback_to_fixed_size'
)
def _handle_generic_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Unexpected error during chunking: {error}")
return ChunkingError(
error_type='generic_error',
message=str(error),
recovery_action='skip_and_continue'
)
```
## Performance Optimization
### Caching and Memoization
```python
import hashlib
import json
import logging
import pickle
from typing import Any, Dict, List, Optional

import redis
class ChunkingCache:
def __init__(self, redis_url: Optional[str] = None):
if redis_url:
self.redis_client = redis.from_url(redis_url)
else:
self.redis_client = None
self.local_cache = {}
def _generate_cache_key(self, content: str, strategy: str, params: Dict[str, Any]) -> str:
content_hash = hashlib.md5(content.encode()).hexdigest()
params_str = json.dumps(params, sort_keys=True)
params_hash = hashlib.md5(params_str.encode()).hexdigest()
return f"chunking:{strategy}:{content_hash}:{params_hash}"
def get(self, content: str, strategy: str, params: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
cache_key = self._generate_cache_key(content, strategy, params)
# Try local cache first
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Try Redis cache
if self.redis_client:
try:
cached_data = self.redis_client.get(cache_key)
if cached_data:
chunks = pickle.loads(cached_data)
self.local_cache[cache_key] = chunks # Cache locally too
return chunks
except Exception as e:
logging.warning(f"Redis cache error: {e}")
return None
def set(self, content: str, strategy: str, params: Dict[str, Any], chunks: List[Dict[str, Any]]) -> None:
cache_key = self._generate_cache_key(content, strategy, params)
# Store in local cache
self.local_cache[cache_key] = chunks
# Store in Redis cache
if self.redis_client:
try:
cached_data = pickle.dumps(chunks)
self.redis_client.setex(cache_key, 3600, cached_data) # 1 hour TTL
except Exception as e:
logging.warning(f"Redis cache set error: {e}")
def clear_local_cache(self):
self.local_cache.clear()
def clear_redis_cache(self):
if self.redis_client:
pattern = "chunking:*"
keys = self.redis_client.keys(pattern)
if keys:
self.redis_client.delete(*keys)
```
### Batch Processing
```python
import asyncio
import concurrent.futures
import logging
from typing import Any, Callable, Dict, List
class BatchChunkingProcessor:
def __init__(self, max_workers: int = 4, batch_size: int = 10):
self.max_workers = max_workers
self.batch_size = batch_size
def process_documents_batch(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process multiple documents in parallel"""
results = []
# Process in batches to avoid memory issues
for i in range(0, len(documents), self.batch_size):
batch = documents[i:i + self.batch_size]
with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_doc = {
executor.submit(chunking_function, doc): doc
for doc in batch
}
batch_results = []
for future in concurrent.futures.as_completed(future_to_doc):
try:
chunks = future.result()
batch_results.append(chunks)
except Exception as e:
logging.error(f"Error processing document: {e}")
batch_results.append([]) # Empty result for failed processing
results.extend(batch_results)
return results
async def process_documents_async(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process documents asynchronously"""
semaphore = asyncio.Semaphore(self.max_workers)
async def process_single_document(doc: str) -> List[Dict[str, Any]]:
async with semaphore:
# Run the synchronous chunking function in an executor
loop = asyncio.get_event_loop()
return await loop.run_in_executor(None, chunking_function, doc)
tasks = [process_single_document(doc) for doc in documents]
return await asyncio.gather(*tasks, return_exceptions=True)
```
## Monitoring and Observability
### Metrics Collection
```python
import time
from dataclasses import dataclass
from typing import Dict, Any, List
from collections import defaultdict
@dataclass
class ChunkingMetrics:
total_documents: int
total_chunks: int
avg_chunk_size: float
processing_time: float
memory_usage: float
error_count: int
strategy_distribution: Dict[str, int]
class MetricsCollector:
def __init__(self):
self.metrics = defaultdict(list)
self.start_time = None
def start_timing(self):
self.start_time = time.time()
def end_timing(self) -> float:
if self.start_time:
duration = time.time() - self.start_time
self.metrics['processing_time'].append(duration)
self.start_time = None
return duration
return 0.0
def record_chunk_count(self, count: int):
self.metrics['chunk_count'].append(count)
def record_chunk_size(self, size: int):
self.metrics['chunk_size'].append(size)
    def record_strategy_usage(self, strategy: str):
        # self.metrics defaults to lists, so keep strategy counts in a plain dict
        counts = self.metrics.setdefault('strategy', {})
        counts[strategy] = counts.get(strategy, 0) + 1
def record_error(self, error_type: str):
self.metrics['errors'].append(error_type)
def record_memory_usage(self, memory_mb: float):
self.metrics['memory_usage'].append(memory_mb)
def get_summary(self) -> ChunkingMetrics:
return ChunkingMetrics(
total_documents=len(self.metrics['processing_time']),
total_chunks=sum(self.metrics['chunk_count']),
avg_chunk_size=sum(self.metrics['chunk_size']) / max(len(self.metrics['chunk_size']), 1),
processing_time=sum(self.metrics['processing_time']),
memory_usage=sum(self.metrics['memory_usage']) / max(len(self.metrics['memory_usage']), 1),
error_count=len(self.metrics['errors']),
strategy_distribution=dict(self.metrics['strategy'])
)
def reset(self):
self.metrics.clear()
self.start_time = None
```
This implementation guide provides a comprehensive foundation for building robust, scalable chunking systems that can handle various document types and use cases while maintaining high quality and performance.


@@ -0,0 +1,366 @@
# Key Research Papers and Findings
This document summarizes important research papers and findings related to chunking strategies for RAG systems.
## Seminal Papers
### "Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG" (arXiv:2504.19754)
**Key Findings**:
- Page-level chunking achieved highest average accuracy (0.648) with lowest variance across different query types
- Optimal chunk size varies significantly by document type and query complexity
- Factoid queries perform better with smaller chunks (256-512 tokens)
- Complex analytical queries benefit from larger chunks (1024+ tokens)
**Methodology**:
- Evaluated 7 different chunking strategies across multiple document types
- Tested with both factoid and analytical queries
- Measured end-to-end RAG performance
**Practical Implications**:
- Start with page-level chunking for general-purpose RAG systems
- Adapt chunk size based on expected query patterns
- Consider hybrid approaches for mixed query types
### "Lost in the Middle: How Language Models Use Long Contexts"
**Key Findings**:
- Language models tend to pay more attention to information at the beginning and end of context
- Information in the middle of long contexts is often ignored
- Performance degradation is most severe for centrally located information
**Practical Implications**:
- Place most important information at chunk boundaries
- Consider chunk overlap to ensure important context appears multiple times
- Use ranking to prioritize relevant chunks for inclusion in context
### "Grounded Language Learning in a Simulated 3D World"
**Related Concepts**:
- Importance of grounding text in visual/contextual information
- Multi-modal learning approaches for better understanding
**Relevance to Chunking**:
- Supports contextual chunking approaches that preserve visual/contextual relationships
- Validates importance of maintaining document structure and relationships
## Industry Research
### NVIDIA Research: "Finding the Best Chunking Strategy for Accurate AI Responses"
**Key Findings**:
- Page-level chunking outperformed sentence and paragraph-level approaches
- Fixed-size chunking showed consistent but suboptimal performance
- Semantic chunking provided improvements for complex documents
**Technical Details**:
- Tested chunk sizes from 128 to 2048 tokens
- Evaluated across financial, technical, and legal documents
- Measured both retrieval accuracy and generation quality
**Recommendations**:
- Use 512-1024 token chunks as starting point
- Implement adaptive chunking based on document complexity
- Consider page boundaries as natural chunk separators
### Cohere Research: "Effective Chunking Strategies for RAG"
**Key Findings**:
- Recursive character splitting provides a good balance of performance and simplicity
- Document structure awareness improves retrieval by 15-20%
- Overlap of 10-20% provides optimal context preservation
**Methodology**:
- Compared 12 chunking strategies across 6 document types
- Measured retrieval precision, recall, and F1-score
- Tested with both dense and sparse retrieval
**Best Practices Identified**:
- Start with recursive character splitting with 10-20% overlap
- Preserve document structure (headings, lists, tables)
- Customize chunk size based on embedding model context window
### Anthropic: "Contextual Retrieval"
**Key Innovation**:
- Enhance each chunk with LLM-generated contextual information before embedding
- Improves retrieval precision by 25-30% for complex documents
- Particularly effective for technical and academic content
**Implementation Approach**:
1. Split document using traditional methods
2. For each chunk, generate contextual information using LLM
3. Prepend context to chunk before embedding
4. Use hybrid search (dense + sparse) with weighted ranking
**Trade-offs**:
- Significant computational overhead (2-3x processing time)
- Higher embedding storage requirements
- Improved retrieval precision justifies cost for high-value applications
## Algorithmic Advances
### Semantic Chunking Algorithms
#### "Semantic Segmentation of Text Documents"
**Core Idea**: Use cosine similarity between consecutive sentence embeddings to identify natural boundaries.
**Algorithm**:
1. Split document into sentences
2. Generate embeddings for each sentence
3. Calculate similarity between consecutive sentences
4. Create boundaries where similarity drops below threshold
5. Merge short segments with neighbors
**Performance**: 20-30% improvement in retrieval relevance over fixed-size chunking for technical documents.
#### "Hierarchical Semantic Chunking"
**Core Idea**: Multi-level semantic segmentation for document organization.
**Algorithm**:
1. Document-level semantic analysis
2. Section-level boundary detection
3. Paragraph-level segmentation
4. Sentence-level refinement
**Benefits**: Maintains document hierarchy while adapting to semantic structure.
### Advanced Embedding Techniques
#### "Late Chunking: Contextual Chunk Embeddings"
**Core Innovation**: Generate embeddings for entire document first, then create chunk embeddings from token-level embeddings.
**Advantages**:
- Preserves global document context
- Reduces context fragmentation
- Better for documents with complex inter-relationships
**Requirements**:
- Long-context embedding models (8k+ tokens)
- Significant computational resources
- Specialized implementation
#### "Hierarchical Embedding Retrieval"
**Approach**: Create embeddings at multiple granularities (document, section, paragraph, sentence).
**Implementation**:
1. Generate embeddings at each level
2. Store in hierarchical vector database
3. Query at appropriate granularity based on information needs (see the sketch below)
**Performance**: 15-25% improvement in precision for complex queries.
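A minimal sketch of the multi-granularity idea using `sentence-transformers` and an in-memory index; the level definitions and query routing are illustrative assumptions rather than a prescribed design:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_hierarchical_index(document):
    """Embed the same document at document, section, and sentence granularity."""
    levels = {
        "document": [document],
        "section": [s for s in document.split("\n\n") if s.strip()],
        "sentence": [s.strip() for s in document.split(".") if s.strip()],
    }
    return {name: (units, model.encode(units)) for name, units in levels.items()}

def query_hierarchical_index(index, query, level="section", top_k=3):
    """Search at whichever granularity matches the information need."""
    units, embeddings = index[level]
    scores = util.cos_sim(model.encode(query), embeddings)[0]
    best = scores.argsort(descending=True)[:top_k]
    return [(units[int(i)], float(scores[int(i)])) for i in best]
```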
## Evaluation Methodologies
### Retrieval-Augmented Generation Assessment Frameworks
#### RAGAS Framework
**Metrics**:
- **Faithfulness**: Consistency between generated answer and retrieved context
- **Answer Relevancy**: Relevance of generated answer to the question
- **Context Relevancy**: Relevance of retrieved context to the question
- **Context Recall**: Coverage of relevant information in retrieved context
**Evaluation Process**:
1. Generate questions from document corpus
2. Retrieve relevant chunks using different strategies
3. Generate answers using retrieved chunks
4. Evaluate using automated metrics and human judgment
#### ARES Framework
**Innovation**: Automated evaluation using synthetic questions and LLM-based assessment.
**Key Features**:
- Generates diverse question types (factoid, analytical, comparative)
- Uses LLMs to evaluate answer quality
- Provides scalable evaluation without human annotation
### Benchmark Datasets
#### Natural Questions (NQ)
**Description**: Real user questions from Google Search with relevant Wikipedia passages.
**Relevance**: Natural language queries with authentic relevance judgments.
#### MS MARCO
**Description**: Large-scale passage ranking dataset with real search queries.
**Relevance**: High-quality relevance judgments for passage retrieval.
#### HotpotQA
**Description**: Multi-hop question answering requiring information from multiple documents.
**Relevance**: Tests ability to retrieve and synthesize information from multiple chunks.
## Domain-Specific Research
### Medical Documents
#### "Optimal Chunking for Medical Question Answering"
**Key Findings**:
- Medical terminology requires specialized handling
- Section-based chunking (History, Diagnosis, Treatment) most effective
- Preserving doctor-patient dialogue context crucial
**Recommendations**:
- Use medical-specific tokenizers
- Preserve section headers and structure (sketched below)
- Maintain temporal relationships in medical histories
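A minimal sketch of section-based chunking for clinical notes; the header list and regex are illustrative assumptions, since section labels vary by institution:
```python
import re

SECTION_HEADERS = ["History", "Examination", "Diagnosis", "Treatment", "Plan"]
SECTION_PATTERN = re.compile(
    r"^(%s)\s*:" % "|".join(SECTION_HEADERS), re.IGNORECASE | re.MULTILINE
)

def chunk_clinical_note(note):
    """Split a note at section headers, keeping each header with its body."""
    matches = list(SECTION_PATTERN.finditer(note))
    if not matches:
        return [{"section": "Unlabeled", "content": note.strip()}]
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        chunks.append({
            "section": match.group(1).title(),
            "content": note[match.start():end].strip(),  # header stays in the chunk
        })
    return chunks
```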
### Legal Documents
#### "Chunking Strategies for Legal Document Analysis"
**Key Findings**:
- Legal citations and cross-references require special handling
- Contract clause boundaries serve as natural chunk separators
- Case law benefits from hierarchical chunking
**Best Practices**:
- Preserve legal citation structure
- Use clause and section boundaries (sketched below)
- Maintain context for legal definitions and references
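A minimal sketch of clause-boundary splitting for contracts; the numbering pattern is an illustrative assumption and would need adjustment for real filings and case law:
```python
import re

# Matches headings such as "Section 4", "Article 2.1", or bare "3.2" at line start
CLAUSE_PATTERN = re.compile(
    r"^\s*(?:Section|Article|Clause)?\s*\d+(?:\.\d+)*[.)]?\s", re.MULTILINE
)

def chunk_contract(text):
    """Split a contract at clause boundaries, keeping numbering with the clause text."""
    starts = [m.start() for m in CLAUSE_PATTERN.finditer(text)]
    if not starts:
        return [text.strip()]
    if starts[0] > 0:
        starts.insert(0, 0)  # keep any preamble before the first numbered clause
    starts.append(len(text))
    chunks = [text[starts[i]:starts[i + 1]].strip() for i in range(len(starts) - 1)]
    return [c for c in chunks if c]
```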
### Financial Documents
#### "SEC Filing Chunking for Financial Analysis"
**Key Findings**:
- Table preservation critical for financial data
- XBRL tagging provides natural segmentation
- Risk factors sections benefit from specialized treatment
**Approach**:
- Preserve complete tables when possible
- Use XBRL tags for structured data
- Create specialized chunks for risk sections
## Emerging Trends
### Multi-Modal Chunking
#### "Integrating Text, Tables, and Images in RAG Systems"
**Innovation**: Unified chunking approach for mixed-modal content.
**Approach**:
- Extract and describe images using vision models
- Preserve table structure and relationships
- Create unified embeddings for mixed content
**Results**: 35% improvement in complex document understanding.
### Adaptive Chunking
#### "Machine Learning-Based Chunk Size Optimization"
**Core Idea**: Use ML models to predict optimal chunking parameters.
**Features**:
- Document length and complexity
- Query type distribution
- Embedding model characteristics
- Performance requirements
**Benefits**: Dynamic optimization based on use case and content.
### Real-time Chunking
#### "Streaming Chunking for Live Document Processing"
**Innovation**: Process documents as they become available.
**Techniques**:
- Incremental boundary detection (sketched below)
- Dynamic chunk size adjustment
- Context preservation across chunks
**Applications**: Live news feeds, social media analysis, meeting transcripts.
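A minimal sketch of incremental chunking over a stream of text pieces (for example transcript segments); the sentence-boundary regex and size threshold are illustrative assumptions:
```python
import re

def stream_chunks(text_stream, max_chars=1000):
    """Yield chunks as text arrives, cutting only at sentence boundaries."""
    buffer = ""
    for piece in text_stream:
        buffer += piece
        while len(buffer) >= max_chars:
            sentences = re.split(r"(?<=[.!?])\s+", buffer)
            if len(sentences) < 2:
                break  # no boundary yet; wait for more text
            chunk, buffer = " ".join(sentences[:-1]), sentences[-1]
            yield chunk
    if buffer.strip():
        yield buffer.strip()
```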
## Implementation Challenges
### Computational Efficiency
#### "Scalable Chunking for Large Document Collections"
**Challenges**:
- Processing millions of documents efficiently
- Memory usage optimization
- Distributed processing requirements
**Solutions**:
- Batch processing with parallel execution
- Streaming approaches for large documents
- Distributed chunking with load balancing
### Quality Assurance
#### "Evaluating Chunk Quality at Scale"
**Challenges**:
- Automated quality assessment
- Detecting poor chunk boundaries
- Maintaining consistency across document types
**Approaches**:
- Heuristic-based quality metrics (sketched below)
- LLM-based evaluation
- Human-in-the-loop validation
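A minimal sketch of cheap heuristic checks that can run over millions of chunks before more expensive LLM-based or human review; the thresholds are illustrative assumptions:
```python
def chunk_quality_report(chunk, min_chars=200, max_chars=2000):
    """Flag common symptoms of poor chunk boundaries."""
    text = chunk.strip()
    issues = []
    if len(text) < min_chars:
        issues.append("too_short")
    if len(text) > max_chars:
        issues.append("too_long")
    if text and text[0].islower():
        issues.append("starts_mid_sentence")
    if text and text[-1] not in ".!?\"')":
        issues.append("ends_mid_sentence")
    return {"ok": not issues, "issues": issues}

# Aggregate over any list of chunk strings to spot systematic boundary problems
def corpus_failure_rate(chunks):
    reports = [chunk_quality_report(c) for c in chunks]
    return sum(not r["ok"] for r in reports) / max(len(reports), 1)
```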
## Future Research Directions
### Context-Aware Chunking
**Open Questions**:
- How to optimally preserve cross-chunk relationships?
- Can we predict chunk quality without human evaluation?
- What is the optimal balance between size and context?
### Domain Adaptation
**Research Areas**:
- Automatic domain detection and adaptation
- Transfer learning across domains
- Zero-shot chunking for new document types
### Evaluation Standards
**Needs**:
- Standardized evaluation benchmarks
- Cross-paper comparison methodologies
- Real-world performance metrics
## Practical Recommendations Based on Research
### Starting Points
1. **For General RAG Systems**: Page-level or recursive character chunking with 512-1024 tokens and 10-20% overlap
2. **For Technical Documents**: Structure-aware chunking with semantic boundary detection
3. **For High-Value Applications**: Contextual retrieval with LLM-generated context
### Evolution Strategy
1. **Begin**: Simple fixed-size chunking (512 tokens)
2. **Improve**: Add document structure awareness
3. **Optimize**: Implement semantic boundaries
4. **Advanced**: Consider contextual retrieval for critical use cases
### Key Success Factors
1. **Match strategy to document type and query patterns**
2. **Preserve document structure when beneficial**
3. **Use overlap to maintain context across boundaries**
4. **Monitor both accuracy and computational costs**
5. **Iterate based on specific use case requirements**
This research foundation provides evidence-based guidance for implementing effective chunking strategies across various domains and use cases.

File diff suppressed because it is too large

View File

@@ -0,0 +1,423 @@
# Detailed Chunking Strategies
This document provides comprehensive implementation details for all chunking strategies mentioned in the main skill.
## Level 1: Fixed-Size Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class FixedSizeChunker:
def __init__(self, chunk_size=512, chunk_overlap=50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
def chunk(self, documents):
return self.splitter.split_documents(documents)
```
### Parameter Recommendations
| Use Case | Chunk Size | Overlap | Rationale |
|----------|------------|---------|-----------|
| Factoid Queries | 256 | 25 | Small chunks for precise answers |
| General Q&A | 512 | 50 | Balanced approach for most cases |
| Analytical Queries | 1024 | 100 | Larger context for complex analysis |
| Code Documentation | 300 | 30 | Preserve code context while maintaining focus |
### Best Practices
- Start with 512 tokens and 10-20% overlap
- Adjust based on embedding model context window
- Use overlap for queries where context might span boundaries
- Monitor token count vs. character count based on model
## Level 2: Recursive Character Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class RecursiveChunker:
def __init__(self, chunk_size=512, separators=None):
self.chunk_size = chunk_size
self.separators = separators or ["\n\n", "\n", " ", ""]
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=0,
length_function=len,
separators=self.separators
)
def chunk(self, text):
return self.splitter.create_documents([text])
# Document-specific configurations
def get_chunker_for_document_type(doc_type):
configurations = {
"markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
"html": ["</div>", "</p>", "\n\n", "\n", " ", ""],
"code": ["\n\n", "\n", " ", ""],
"plain": ["\n\n", "\n", " ", ""]
}
return RecursiveChunker(separators=configurations.get(doc_type, ["\n\n", "\n", " ", ""]))
```
### Customization Guidelines
- **Markdown**: Use headings as primary separators
- **HTML**: Use block-level tags as separators
- **Code**: Preserve function and class boundaries
- **Academic papers**: Prioritize paragraph and section breaks
## Level 3: Structure-Aware Chunking
### Markdown Documents
```python
import markdown
from bs4 import BeautifulSoup
class MarkdownChunker:
def __init__(self, max_chunk_size=512):
self.max_chunk_size = max_chunk_size
def chunk(self, markdown_text):
html = markdown.markdown(markdown_text)
soup = BeautifulSoup(html, 'html.parser')
chunks = []
current_chunk = ""
current_heading = "Introduction"
for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'pre', 'table']):
if element.name.startswith('h'):
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_heading = element.get_text().strip()
current_chunk = f"{element}\n"
elif element.name in ['pre', 'table']:
# Preserve code blocks and tables intact
if len(current_chunk) + len(str(element)) > self.max_chunk_size:
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_chunk = f"{element}\n"
else:
current_chunk += f"{element}\n"
else:
current_chunk += str(element)
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
return chunks
```
### Code Documents
```python
import ast
import re
class CodeChunker:
def __init__(self, language='python'):
self.language = language
def chunk_python(self, code):
tree = ast.parse(code)
chunks = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
start_line = node.lineno - 1
end_line = node.end_lineno if hasattr(node, 'end_lineno') else start_line + 10
lines = code.split('\n')
chunk_lines = lines[start_line:end_line]
chunks.append('\n'.join(chunk_lines))
return chunks
def chunk_javascript(self, code):
# Use regex for languages without AST parsers
function_pattern = r'(function\s+\w+\s*\([^)]*\)\s*\{[^}]*\})'
class_pattern = r'(class\s+\w+\s*\{[^}]*\})'
patterns = [function_pattern, class_pattern]
chunks = []
for pattern in patterns:
matches = re.finditer(pattern, code, re.MULTILINE | re.DOTALL)
for match in matches:
chunks.append(match.group(1))
return chunks
def chunk(self, code):
if self.language == 'python':
return self.chunk_python(code)
elif self.language == 'javascript':
return self.chunk_javascript(code)
else:
# Fallback to line-based chunking
return self.chunk_by_lines(code)
def chunk_by_lines(self, code, max_lines=50):
lines = code.split('\n')
chunks = []
for i in range(0, len(lines), max_lines):
chunk = '\n'.join(lines[i:i+max_lines])
chunks.append(chunk)
return chunks
```
### Tabular Data
```python
import pandas as pd
from io import StringIO
class TableChunker:
def __init__(self, max_rows=100, summary_rows=5):
self.max_rows = max_rows
self.summary_rows = summary_rows
def chunk(self, table_data):
if isinstance(table_data, str):
df = pd.read_csv(StringIO(table_data))
else:
df = table_data
chunks = []
if len(df) <= self.max_rows:
# Small table - keep intact
chunks.append({
"type": "full_table",
"content": df.to_string(),
"metadata": {
"rows": len(df),
"columns": len(df.columns)
}
})
else:
# Large table - create summary + chunks
summary = df.head(self.summary_rows)
chunks.append({
"type": "table_summary",
"content": f"Table Summary ({len(df)} rows, {len(df.columns)} columns):\n{summary.to_string()}",
"metadata": {
"total_rows": len(df),
"summary_rows": self.summary_rows,
"columns": list(df.columns)
}
})
# Chunk the remaining data
for i in range(self.summary_rows, len(df), self.max_rows):
chunk_df = df.iloc[i:i+self.max_rows]
chunks.append({
"type": "table_chunk",
"content": f"Rows {i+1}-{min(i+self.max_rows, len(df))}:\n{chunk_df.to_string()}",
"metadata": {
"start_row": i + 1,
"end_row": min(i + self.max_rows, len(df)),
"columns": list(df.columns)
}
})
return chunks
```
## Level 4: Semantic Chunking
### Implementation
```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class SemanticChunker:
def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.8, buffer_size=3):
self.model = SentenceTransformer(model_name)
self.similarity_threshold = similarity_threshold
self.buffer_size = buffer_size
def split_into_sentences(self, text):
# Simple sentence splitting - can be enhanced with nltk/spacy
sentences = re.split(r'[.!?]+', text)
return [s.strip() for s in sentences if s.strip()]
def chunk(self, text):
sentences = self.split_into_sentences(text)
if len(sentences) <= self.buffer_size:
return [text]
# Create embeddings
embeddings = self.model.encode(sentences)
chunks = []
current_chunk_sentences = []
for i in range(len(sentences)):
current_chunk_sentences.append(sentences[i])
# Check if we should create a boundary
if i < len(sentences) - 1:
similarity = cosine_similarity(
[embeddings[i]],
[embeddings[i + 1]]
)[0][0]
if similarity < self.similarity_threshold and len(current_chunk_sentences) >= 2:
chunks.append(' '.join(current_chunk_sentences))
current_chunk_sentences = []
# Add remaining sentences
if current_chunk_sentences:
chunks.append(' '.join(current_chunk_sentences))
return chunks
```
### Parameter Tuning
| Parameter | Range | Effect |
|-----------|-------|--------|
| similarity_threshold | 0.5-0.9 | Higher values create more chunks |
| buffer_size | 1-10 | Larger buffers provide more context |
| model_name | Various | Different models for different domains |
### Optimization Tips
- Use domain-specific models for specialized content
- Adjust threshold based on content complexity
- Cache embeddings for repeated processing
- Consider batch processing for large documents
## Level 5: Advanced Contextual Methods
### Late Chunking
```python
import torch
from transformers import AutoTokenizer, AutoModel
class LateChunker:
def __init__(self, model_name="microsoft/DialoGPT-medium"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModel.from_pretrained(model_name)
def chunk(self, text, chunk_size=512):
# Tokenize entire document
tokens = self.tokenizer(text, return_tensors="pt", truncation=False)
# Get token-level embeddings
with torch.no_grad():
outputs = self.model(**tokens, output_hidden_states=True)
token_embeddings = outputs.last_hidden_state[0]
# Create chunk embeddings from token embeddings
chunks = []
for i in range(0, len(token_embeddings), chunk_size):
chunk_tokens = token_embeddings[i:i+chunk_size]
chunk_embedding = torch.mean(chunk_tokens, dim=0)
chunks.append({
"content": self.tokenizer.decode(tokens["input_ids"][0][i:i+chunk_size]),
"embedding": chunk_embedding.numpy()
})
return chunks
```
### Contextual Retrieval
```python
import openai
class ContextualChunker:
def __init__(self, api_key):
self.client = openai.OpenAI(api_key=api_key)
def generate_context(self, chunk, full_document):
prompt = f"""
Given the following document and a chunk from it, provide a brief context
that helps understand the chunk's meaning within the full document.
Document:
{full_document[:2000]}...
Chunk:
{chunk}
Context (max 50 words):
"""
response = self.client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
max_tokens=100,
temperature=0
)
return response.choices[0].message.content.strip()
def chunk_with_context(self, text, base_chunker):
# First create base chunks
base_chunks = base_chunker.chunk(text)
# Then add context to each chunk
contextualized_chunks = []
for chunk in base_chunks:
context = self.generate_context(chunk.page_content, text)
contextualized_content = f"Context: {context}\n\nContent: {chunk.page_content}"
contextualized_chunks.append({
"content": contextualized_content,
"original_content": chunk.page_content,
"context": context
})
return contextualized_chunks
```
## Performance Considerations
### Computational Cost Analysis
| Strategy | Time Complexity | Space Complexity | Relative Cost |
|----------|-----------------|------------------|---------------|
| Fixed-Size | O(n) | O(n) | Low |
| Recursive | O(n) | O(n) | Low |
| Structure-Aware | O(n log n) | O(n) | Medium |
| Semantic | O(n²) | O(n²) | High |
| Late Chunking | O(n) | O(n) | Very High |
| Contextual | O(n²) | O(n²) | Very High |
### Optimization Strategies
1. **Parallel Processing**: Process chunks concurrently when possible
2. **Caching**: Store embeddings and intermediate results
3. **Batch Operations**: Group similar operations together
4. **Progressive Loading**: Process large documents in streaming fashion
5. **Model Selection**: Choose appropriate models for task complexity

View File

@@ -0,0 +1,867 @@
# Recommended Libraries and Frameworks
This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.
## Core Chunking Libraries
### LangChain
**Overview**: Comprehensive framework for building applications with large language models, includes robust text splitting utilities.
**Installation**:
```bash
pip install langchain langchain-text-splitters
```
**Key Features**:
- Multiple text splitting strategies
- Integration with various document loaders
- Support for different content types (code, markdown, etc.)
- Customizable separators and parameters
**Example Usage**:
```python
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
TokenTextSplitter,
MarkdownTextSplitter,
PythonCodeTextSplitter
)
# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(large_text)
# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
```
**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types
**Cons**:
- Can be heavy dependency for simple use cases
- Some advanced features require LangChain ecosystem
### LlamaIndex
**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.
**Installation**:
```bash
pip install llama-index
```
**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases
**Example Usage**:
```python
from llama_index.core.node_parser import (
SentenceSplitter,
SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
# Basic sentence splitting
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=20
)
# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
buffer_size=1,
breakpoint_percentile_threshold=95,
embed_model=embed_model
)
# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```
**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support
**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires embedding model setup
### Unstructured
**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.
**Installation**:
```bash
pip install "unstructured[pdf,png,jpg]"
```
**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing
**Example Usage**:
```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
# Partition document by type
elements = partition(filename="document.pdf")
# Chunk by title/heading structure
chunks = chunk_by_title(
elements,
combine_text_under_n_chars=2000,
max_characters=10000,
new_after_n_chars=1500,
multipage_sections=True
)
# Access chunked content
for chunk in chunks:
print(f"Category: {chunk.category}")
print(f"Content: {chunk.text[:200]}...")
```
**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities
**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats
## Text Processing Libraries
### NLTK (Natural Language Toolkit)
**Installation**:
```bash
pip install nltk
```
**Key Features**:
- Sentence tokenization
- Language detection
- Text preprocessing
- Linguistic analysis
**Example Usage**:
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
# Download required data
nltk.download('punkt')
nltk.download('stopwords')
# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```
### spaCy
**Installation**:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```
**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection
**Example Usage**:
```python
import spacy
# Load language model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp("This is a sample sentence. This is another sentence.")
# Extract sentences
sentences = [sent.text for sent in doc.sents]
# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
# Dependency parsing for better chunking
for token in doc:
print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```
### Sentence Transformers
**Installation**:
```bash
pip install sentence-transformers
```
**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training
**Example Usage**:
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)
# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])
# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)
boundaries = [0]
for i in range(1, len(sentences)):
similarity = util.cos_sim(embeddings[i-1], embeddings[i])
if similarity < threshold:
boundaries.append(i)
return boundaries
```
## Vector Databases and Search
### ChromaDB
**Installation**:
```bash
pip install chromadb
```
**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering
**Example Usage**:
```python
import chromadb
from chromadb.utils import embedding_functions
# Initialize client
client = chromadb.Client()
# Create collection
collection = client.create_collection(
name="document_chunks",
embedding_function=embedding_functions.DefaultEmbeddingFunction()
)
# Add chunks
collection.add(
documents=[chunk["content"] for chunk in chunks],
metadatas=[chunk.get("metadata", {}) for chunk in chunks],
ids=[chunk["id"] for chunk in chunks]
)
# Search
results = collection.query(
query_texts=["What is chunking?"],
n_results=5
)
```
### Pinecone
**Installation**:
```bash
pip install pinecone-client
```
**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure
**Example Usage**:
```python
import pinecone
from sentence_transformers import SentenceTransformer
# Initialize
pinecone.init(api_key="your-api-key", environment="your-environment")
index_name = "document-chunks"
# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
pinecone.create_index(
name=index_name,
dimension=384, # Match embedding model
metric="cosine"
)
index = pinecone.Index(index_name)
# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
embedding = model.encode(chunk["content"])
index.upsert(
vectors=[{
"id": chunk["id"],
"values": embedding.tolist(),
"metadata": chunk.get("metadata", {})
}]
)
# Search
query_embedding = model.encode("search query")
results = index.query(
vector=query_embedding.tolist(),
top_k=5,
include_metadata=True
)
```
### Weaviate
**Installation**:
```bash
pip install weaviate-client
```
**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation
**Example Usage**:
```python
import weaviate
# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")
# Define schema
client.schema.create_class({
"class": "DocumentChunk",
"description": "A chunk of document content",
"properties": [
{
"name": "content",
"dataType": ["text"]
},
{
"name": "source",
"dataType": ["string"]
}
]
})
# Add data
for chunk in chunks:
client.data_object.create(
data_object={
"content": chunk["content"],
"source": chunk.get("source", "unknown")
},
class_name="DocumentChunk"
)
# Search
results = client.query.get(
"DocumentChunk",
["content", "source"]
).with_near_text({
"concepts": ["search query"]
}).with_limit(5).do()
```
## Evaluation and Testing
### RAGAS
**Installation**:
```bash
pip install ragas
```
**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation
**Example Usage**:
```python
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
)
from datasets import Dataset
# Prepare evaluation data
dataset = Dataset.from_dict({
"question": ["What is chunking?"],
"answer": ["Chunking is the process of breaking large documents into smaller segments"],
"contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
"ground_truth": ["Chunking is a document processing technique"]
})
# Evaluate
result = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
]
)
print(result)
```
### TruEra (TruLens)
**Installation**:
```bash
pip install trulens trulens-apps
```
**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring
**Example Usage**:
```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement
# Initialize session
session = TruSession()
# Define feedback functions
f_groundedness = GroundTruthAgreement(ground_truth)
# Evaluate chunks
@instrument
def chunk_and_query(text, query):
chunks = chunk_function(text)
relevant_chunks = search_function(chunks, query)
answer = generate_function(relevant_chunks, query)
return answer
# Record evaluation
with session:
chunk_and_query("large document text", "what is the main topic?")
```
## Document Processing
### PyPDF2
**Installation**:
```bash
pip install PyPDF2
```
**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing
**Example Usage**:
```python
import PyPDF2
def extract_text_from_pdf(pdf_path):
text = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for page in reader.pages:
text += page.extract_text()
return text
# Extract text by page for better chunking
def extract_pages(pdf_path):
pages = []
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for i, page in enumerate(reader.pages):
pages.append({
"page_number": i + 1,
"content": page.extract_text()
})
return pages
```
### python-docx
**Installation**:
```bash
pip install python-docx
```
**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access
**Example Usage**:
```python
from docx import Document
def extract_from_docx(docx_path):
doc = Document(docx_path)
content = []
for paragraph in doc.paragraphs:
if paragraph.text.strip():
content.append({
"type": "paragraph",
"text": paragraph.text,
"style": paragraph.style.name
})
for table in doc.tables:
table_text = []
for row in table.rows:
row_text = [cell.text for cell in row.cells]
table_text.append(" | ".join(row_text))
content.append({
"type": "table",
"text": "\n".join(table_text)
})
return content
```
## Specialized Libraries
### tiktoken (OpenAI)
**Installation**:
```bash
pip install tiktoken
```
**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Language model specific tokenization
**Example Usage**:
```python
import tiktoken
# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")
# Decode tokens
text = encoding.decode(tokens)
# Count tokens without full encoding
def count_tokens(text, model="gpt-3.5-turbo"):
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = encoding.encode(text)
chunks = []
for i in range(0, len(tokens), max_tokens):
chunk_tokens = tokens[i:i + max_tokens]
chunk_text = encoding.decode(chunk_tokens)
chunks.append(chunk_text)
return chunks
```
### PDFMiner
**Installation**:
```bash
pip install pdfminer.six
```
**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction
**Example Usage**:
```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
def extract_structured_text(pdf_path):
structured_content = []
for page_layout in extract_pages(pdf_path):
page_content = []
for element in page_layout:
if isinstance(element, LTTextContainer):
text = element.get_text()
                # LTTextContainer has no fontname attribute, so collect
                # font names from its LTChar children instead
                font_names = {
                    char.fontname
                    for line in element
                    for char in line
                    if isinstance(char, LTChar)
                }
                font_info = {
                    "font_size": element.height,  # container height as a rough proxy
                    "is_bold": any("Bold" in name for name in font_names),
                    "x0": element.x0,
                    "y0": element.y0
                }
page_content.append({
"text": text.strip(),
"font_info": font_info
})
structured_content.append({
"page_number": page_layout.pageid,
"content": page_content
})
return structured_content
```
## Performance and Optimization
### Dask
**Installation**:
```bash
pip install dask[complete]
```
**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas
**Example Usage**:
```python
import dask.bag as db
from dask.distributed import Client
# Setup distributed client
client = Client(n_workers=4)
# Parallel chunking of multiple documents
def chunk_document(document):
# Your chunking logic here
return chunk_function(document)
# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...] # List of document contents
document_bag = db.from_sequence(documents)
# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)
# Compute results
results = chunked_documents.compute()
```
### Ray
**Installation**:
```bash
pip install ray
```
**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration
**Example Usage**:
```python
import ray
# Initialize Ray
ray.init()
@ray.remote
class ChunkingWorker:
def __init__(self, strategy):
self.strategy = strategy
def chunk_documents(self, documents):
results = []
for doc in documents:
chunks = self.strategy.chunk(doc)
results.append(chunks)
return results
# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]
# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
for worker, batch in zip(workers, documents_batch)]
# Get results
results = ray.get(futures)
```
## Development and Testing
### pytest
**Installation**:
```bash
pip install pytest pytest-asyncio
```
**Example Tests**:
```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker
class TestFixedSizeChunker:
def test_chunk_size_respect(self):
chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
text = "word " * 50 # 50 words
chunks = chunker.chunk(text)
        for chunk in chunks:
            assert len(chunk) <= 100  # chunk_size counts characters via length_function
def test_overlap_consistency(self):
chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
text = "word " * 30
chunks = chunker.chunk(text)
# Check overlap between consecutive chunks
for i in range(1, len(chunks)):
chunk1_words = set(chunks[i-1].split()[-10:])
chunk2_words = set(chunks[i].split()[:10])
overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 1  # a 10-character overlap should share at least one word
@pytest.mark.asyncio
async def test_semantic_chunker():
chunker = SemanticChunker()
text = "First topic sentence. Another sentence about first topic. " \
"Now switching to second topic. More about second topic."
chunks = await chunker.chunk_async(text)
# Should detect topic change and create boundary
assert len(chunks) >= 2
assert "first topic" in chunks[0].lower()
assert "second topic" in chunks[1].lower()
```
### Memory Profiler
**Installation**:
```bash
pip install memory-profiler
```
**Example Usage**:
```python
from memory_profiler import profile
@profile
def chunk_large_document():
chunker = FixedSizeChunker(chunk_size=1000)
large_text = "word " * 100000 # Large document
chunks = chunker.chunk(large_text)
return chunks
# Run with: python -m memory_profiler your_script.py
```
This comprehensive toolset provides everything needed to implement, test, and optimize chunking strategies for various use cases, from simple text processing to production-grade RAG systems.

File diff suppressed because it is too large

View File

@@ -0,0 +1,302 @@
---
name: prompt-engineering
category: backend
tags: [prompt-engineering, few-shot-learning, chain-of-thought, optimization, templates, system-prompts, llm-performance, ai-patterns]
version: 1.0.0
description: This skill should be used when creating, optimizing, or implementing advanced prompt patterns including few-shot learning, chain-of-thought reasoning, prompt optimization workflows, template systems, and system prompt design. It provides comprehensive frameworks for building production-ready prompts with measurable performance improvements.
---
# Prompt Engineering
This skill provides comprehensive frameworks for creating, optimizing, and implementing advanced prompt patterns that significantly improve LLM performance across various tasks and models.
## When to Use This Skill
Use this skill when:
- Creating new prompts for complex reasoning or analytical tasks
- Optimizing existing prompts for better accuracy or efficiency
- Implementing few-shot learning with strategic example selection
- Designing chain-of-thought reasoning for multi-step problems
- Building reusable prompt templates and systems
- Developing system prompts for consistent model behavior
- Troubleshooting poor prompt performance or failure modes
- Scaling prompt systems for production use cases
## Core Prompt Engineering Patterns
### 1. Few-Shot Learning Implementation
Select examples using semantic similarity and diversity sampling to maximize learning within context window constraints; a minimal selection sketch follows the list below.
#### Example Selection Strategy
- Use `references/few-shot-patterns.md` for comprehensive selection frameworks
- Balance example count (3-5 optimal) with context window limitations
- Include edge cases and boundary conditions in example sets
- Prioritize diverse examples that cover problem space variations
- Order examples from simple to complex for progressive learning
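A minimal sketch of similarity-plus-diversity selection using `sentence-transformers`; the greedy loop, the 0.3 diversity weight, and the candidate dictionaries with input/output keys are illustrative assumptions:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(candidates, target_input, k=4, diversity=0.3):
    """Greedily pick examples similar to the target but dissimilar to each other."""
    cand_emb = model.encode([c["input"] for c in candidates])
    relevance = util.cos_sim(model.encode(target_input), cand_emb)[0]
    selected = []
    while len(selected) < min(k, len(candidates)):
        best_idx, best_score = None, float("-inf")
        for i in range(len(candidates)):
            if i in selected:
                continue
            redundancy = max(
                (float(util.cos_sim(cand_emb[i], cand_emb[j])[0][0]) for j in selected),
                default=0.0,
            )
            score = (1 - diversity) * float(relevance[i]) - diversity * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    # Order simple-to-complex, using input length as a rough complexity proxy
    return sorted((candidates[i] for i in selected), key=lambda c: len(c["input"]))
```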
#### Few-Shot Template Structure
```
Example 1 (Basic case):
Input: {representative_input}
Output: {expected_output}
Example 2 (Edge case):
Input: {challenging_input}
Output: {robust_output}
Example 3 (Error case):
Input: {problematic_input}
Output: {corrected_output}
Now handle: {target_input}
```
### 2. Chain-of-Thought Reasoning
Elicit step-by-step reasoning for complex problem-solving through structured thinking patterns.
#### Implementation Patterns
- Reference `references/cot-patterns.md` for detailed reasoning frameworks
- Use "Let's think step by step" for zero-shot CoT initiation
- Provide complete reasoning traces for few-shot CoT demonstrations
- Implement self-consistency by sampling multiple reasoning paths
- Include verification and validation steps in reasoning chains
#### CoT Template Structure
```
Let's approach this step-by-step:
Step 1: {break_down_the_problem}
Analysis: {detailed_reasoning}
Step 2: {identify_key_components}
Analysis: {component_analysis}
Step 3: {synthesize_solution}
Analysis: {solution_justification}
Final Answer: {conclusion_with_confidence}
```
### 3. Prompt Optimization Workflows
Implement iterative refinement processes with measurable performance metrics and systematic A/B testing; a minimal A/B harness is sketched after the process list below.
#### Optimization Process
- Use `references/optimization-frameworks.md` for comprehensive optimization strategies
- Measure baseline performance before optimization attempts
- Implement single-variable changes for accurate attribution
- Track metrics: accuracy, consistency, latency, token efficiency
- Use statistical significance testing for A/B validation
- Document optimization iterations and their impacts
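A minimal sketch of a single-variable A/B comparison over a labeled evaluation set; `run_prompt` stands in for the application's model call, and the exact-match scoring plus two-proportion z-test are illustrative choices:
```python
from math import sqrt
from statistics import NormalDist

def evaluate(prompt_template, eval_set, run_prompt):
    """Score one prompt variant with exact-match accuracy per example."""
    return [
        run_prompt(prompt_template.format(**ex["inputs"])).strip() == ex["expected"].strip()
        for ex in eval_set
    ]

def ab_test(prompt_a, prompt_b, eval_set, run_prompt):
    a, b = evaluate(prompt_a, eval_set, run_prompt), evaluate(prompt_b, eval_set, run_prompt)
    p_a, p_b, n = sum(a) / len(a), sum(b) / len(b), len(eval_set)
    pooled = (sum(a) + sum(b)) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n) or 1e-9  # two-proportion z-test
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"accuracy_a": p_a, "accuracy_b": p_b, "z": z, "p_value": p_value}
```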
#### Performance Metrics Framework
- **Accuracy**: Task completion rate and output correctness
- **Consistency**: Response stability across multiple runs
- **Efficiency**: Token usage and response time optimization
- **Robustness**: Performance across edge cases and variations
- **Safety**: Adherence to guidelines and harm prevention
### 4. Template Systems Architecture
Build modular, reusable prompt components with variable interpolation and conditional sections.
#### Template Design Principles
- Reference `references/template-systems.md` for modular template frameworks
- Use clear variable naming conventions (e.g., `{user_input}`, `{context}`)
- Implement conditional sections for different scenario handling
- Design role-based templates for specific use cases
- Create hierarchical template composition patterns
#### Template Structure Example
```
# System Context
You are a {role} with {expertise_level} expertise in {domain}.
# Task Context
{if background_information}
Background: {background_information}
{endif}
# Instructions
{task_instructions}
# Examples
{example_count}
# Output Format
{output_specification}
# Input
{user_query}
```
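A minimal renderer sketch for templates in this shape, supporting `{variable}` interpolation and `{if variable}...{endif}` blocks; the syntax handling is an illustrative assumption, not a full template engine:
```python
import re

def render_template(template, variables):
    """Drop {if var}...{endif} blocks whose variable is empty, then fill {var} slots."""
    def conditional(match):
        name, body = match.group(1), match.group(2)
        return body if variables.get(name) else ""
    rendered = re.sub(r"\{if (\w+)\}(.*?)\{endif\}", conditional, template, flags=re.DOTALL)
    return re.sub(r"\{(\w+)\}", lambda m: str(variables.get(m.group(1), "")), rendered)

prompt = render_template(
    "You are a {role}.\n{if context}Background: {context}\n{endif}Task: {task}",
    {"role": "support analyst", "context": "", "task": "summarize the ticket"},
)
```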
### 5. System Prompt Design
Design comprehensive system prompts that establish consistent model behavior, output formats, and safety constraints.
#### System Prompt Components
- Use `references/system-prompt-design.md` for detailed design guidelines
- Define clear role specification and expertise boundaries
- Establish output format requirements and structural constraints
- Include safety guidelines and content policy adherence
- Set context for background information and domain knowledge
#### System Prompt Framework
```
You are an expert {role} specializing in {domain} with {experience_level} of experience.
## Core Capabilities
- List specific capabilities and expertise areas
- Define scope of knowledge and limitations
## Behavioral Guidelines
- Specify interaction style and communication approach
- Define error handling and uncertainty protocols
- Establish quality standards and verification requirements
## Output Requirements
- Specify format expectations and structural requirements
- Define content inclusion and exclusion criteria
- Establish consistency and validation requirements
## Safety and Ethics
- Include content policy adherence
- Specify bias mitigation requirements
- Define harm prevention protocols
```
## Implementation Workflows
### Workflow 1: Create New Prompt from Requirements
1. **Analyze Requirements**
- Identify task complexity and reasoning requirements
- Determine target model capabilities and limitations
- Define success criteria and evaluation metrics
- Assess need for few-shot learning or CoT reasoning
2. **Select Pattern Strategy**
- Use few-shot learning for classification or transformation tasks
- Apply CoT for complex reasoning or multi-step problems
- Implement template systems for reusable prompt architecture
- Design system prompts for consistent behavior requirements
3. **Draft Initial Prompt**
- Structure prompt with clear sections and logical flow
- Include relevant examples or reasoning demonstrations
- Specify output format and quality requirements
- Incorporate safety guidelines and constraints
4. **Validate and Test**
- Test with diverse input scenarios including edge cases
- Measure performance against defined success criteria
- Iterate refinement based on testing results
- Document optimization decisions and their rationale
### Workflow 2: Optimize Existing Prompt
1. **Performance Analysis**
- Measure current prompt performance metrics
- Identify failure modes and error patterns
- Analyze token efficiency and response latency
- Assess consistency across multiple runs
2. **Optimization Strategy**
- Apply systematic A/B testing with single-variable changes
- Use few-shot learning to improve task adherence
- Implement CoT reasoning for complex task components
- Refine template structure for better clarity
3. **Implementation and Testing**
- Deploy optimized prompts with controlled rollout
- Monitor performance metrics in production environment
- Compare against baseline using statistical significance
- Document improvements and lessons learned
### Workflow 3: Scale Prompt Systems
1. **Modular Architecture Design**
- Decompose complex prompts into reusable components
- Create template inheritance hierarchies
- Implement dynamic example selection systems
- Build automated quality assurance frameworks
2. **Production Integration**
- Implement prompt versioning and rollback capabilities
- Create performance monitoring and alerting systems
- Build automated testing frameworks for prompt validation
- Establish update and deployment workflows
## Quality Assurance
### Validation Requirements
- Test prompts with at least 10 diverse scenarios
- Include edge cases, boundary conditions, and failure modes
- Verify output format compliance and structural consistency
- Validate safety guideline adherence and harm prevention
- Measure performance across multiple model runs
### Performance Standards
- Achieve >90% task completion for well-defined use cases
- Maintain <5% variance across multiple runs for consistency
- Optimize token usage without sacrificing accuracy
- Ensure response latency meets application requirements
- Demonstrate robust handling of edge cases and unexpected inputs
## Integration with Other Skills
This skill integrates seamlessly with:
- **langchain4j-ai-services-patterns**: Interface-based prompt design
- **langchain4j-rag-implementation-patterns**: Context-enhanced prompting
- **langchain4j-testing-strategies**: Prompt validation frameworks
- **unit-test-parameterized**: Systematic prompt testing approaches
## Resources and References
- `references/few-shot-patterns.md`: Comprehensive few-shot learning frameworks
- `references/cot-patterns.md`: Chain-of-thought reasoning patterns and examples
- `references/optimization-frameworks.md`: Systematic prompt optimization methodologies
- `references/template-systems.md`: Modular template design and implementation
- `references/system-prompt-design.md`: System prompt architecture and best practices
## Usage Examples
### Example 1: Classification Task with Few-Shot Learning
```
Classify customer feedback into categories using semantic similarity for example selection and diversity sampling for edge case coverage.
```
### Example 2: Complex Reasoning with Chain-of-Thought
```
Implement step-by-step reasoning for financial analysis with verification steps and confidence scoring.
```
### Example 3: Template System for Customer Service
```
Create modular templates with role-based components and conditional sections for different inquiry types.
```
### Example 4: System Prompt for Code Generation
```
Design comprehensive system prompt with behavioral guidelines, output requirements, and safety constraints.
```
## Common Pitfalls and Solutions
- **Overfitting examples**: Use diverse example sets with semantic variety
- **Context window overflow**: Implement strategic example selection and compression
- **Inconsistent outputs**: Specify clear output formats and validation requirements
- **Poor generalization**: Include edge cases and boundary conditions in training examples
- **Safety violations**: Incorporate comprehensive content policies and harm prevention
## Performance Optimization
- Monitor token usage and implement compression strategies
- Use caching for repeated prompt components
- Optimize example selection for maximum learning efficiency
- Implement progressive disclosure for complex prompt systems
- Balance prompt complexity with response quality requirements
This skill provides the foundational patterns and methodologies for building production-ready prompt systems that consistently deliver high performance across diverse use cases and model types.

View File

@@ -0,0 +1,426 @@
# Chain-of-Thought Reasoning Patterns
This reference provides comprehensive frameworks for implementing effective chain-of-thought (CoT) reasoning that improves model performance on complex, multi-step problems.
## Core Principles
### Step-by-Step Reasoning Elicitation
#### Problem Decomposition Strategy
- Break complex problems into manageable sub-problems
- Identify dependencies and relationships between components
- Establish logical flow and sequence of reasoning steps
- Define clear decision points and validation criteria
#### Verification and Validation Integration
- Include self-checking mechanisms at critical junctures
- Implement consistency checks across reasoning steps
- Add confidence scoring for uncertain conclusions
- Provide fallback strategies for ambiguous situations
## Zero-Shot Chain-of-Thought Patterns
### Basic CoT Initiation
```
Let's think step by step to solve this problem:
1. First, I need to understand what the question is asking for
2. Then, I'll identify the key information and constraints
3. Next, I'll consider different approaches to solve it
4. I'll work through the solution methodically
5. Finally, I'll verify my answer makes sense
Problem: {problem_statement}
Step 1: Understanding the question
{analysis}
Step 2: Key information and constraints
{information_analysis}
Step 3: Solution approach
{approach_analysis}
Step 4: Working through the solution
{detailed_solution}
Step 5: Verification
{verification}
Final Answer: {conclusion}
```
### Enhanced CoT with Confidence
```
Let me think through this systematically, breaking down the problem and checking my reasoning at each step.
**Problem**: {problem_description}
**Step 1: Problem Analysis**
- What am I being asked to solve?
- What information is provided?
- What are the constraints?
- My confidence in understanding: {score}/10
**Step 2: Strategy Selection**
- Possible approaches:
1. {approach_1}
2. {approach_2}
3. {approach_3}
- Selected approach: {chosen_approach}
- Rationale: {reasoning_for_choice}
**Step 3: Execution**
- {detailed_step_by_step_solution}
**Step 4: Verification**
- Does the answer make sense?
- Have I addressed all parts of the question?
- Confidence in final answer: {score}/10
**Final Answer**: {solution_with_confidence_score}
```
## Few-Shot Chain-of-Thought Patterns
### Mathematical Reasoning Template
```
Solve the following math problem step by step.
Example 1:
Problem: A store sells apples and oranges at whole-dollar prices. John buys 4 apples and 2 oranges and spends exactly $14. What could each fruit cost?
Step 1: Set up the equation
Let a = cost of apples, o = cost of oranges
4a + 2o = 14
Step 2: Simplify the equation
Divide both sides by 2: 2a + o = 7
Step 3: Test whole-dollar solutions
If a = 1, then 2(1) + o = 7 → o = 5
If a = 2, then 2(2) + o = 7 → o = 3
If a = 3, then 2(3) + o = 7 → o = 1
Step 4: Verify against the original problem
For a=1, o=5: 4(1) + 2(5) = 4 + 10 = 14 ✓
For a=2, o=3: 4(2) + 2(3) = 8 + 6 = 14 ✓
For a=3, o=1: 4(3) + 2(1) = 12 + 2 = 14 ✓
All three work, so the problem is under-determined: apples and oranges could cost ($1, $5), ($2, $3), or ($3, $1) without more information.
Example 2:
Problem: Sarah is twice as old as Tom was 3 years ago. In 5 years, Sarah will be 1.5 times Tom's current age. How old are they now?
Step 1: Define variables
Let S = Sarah's current age, T = Tom's current age
Step 2: Set up equations
"Sarah is twice as old as Tom was 3 years ago": S = 2(T - 3)
"In 5 years, Sarah will be 1.5 times Tom's current age": S + 5 = 1.5T
Step 3: Solve the system
From first equation: S = 2T - 6
Substitute into second: (2T - 6) + 5 = 1.5T
2T - 1 = 1.5T
0.5T = 1
T = 2
Step 4: Find Sarah's age
S = 2(2) - 6 = 4 - 6 = -2
Step 5: Check for validity
Negative age is impossible, so either I made an algebraic error or the problem is inconsistent.
Re-checking the algebra:
S = 2(T - 3) = 2T - 6
(2T - 6) + 5 = 1.5T → 2T - 1 = 1.5T → 0.5T = 1 → T = 2
S = 2(2 - 3) = -2
The algebra checks out, so the problem as stated has no valid (non-negative) solution; I should flag it rather than force an answer.
Now solve:
Problem: {math_problem}
Step 1: Define variables
{variable_definitions}
Step 2: Set up equations
{equation_setup}
Step 3: Solve the system
{solution_process}
Step 4: Verify the solution
{verification}
Final Answer: {answer}
```
### Logical Reasoning Template
```
Analyze the logical argument and determine if it's valid.
Example 1:
Premise 1: All birds can fly
Premise 2: Penguins are birds
Conclusion: Therefore, penguins can fly
Step 1: Analyze the structure
This is a syllogism with form:
All A are B
C is A
Therefore, C is B
Step 2: Evaluate premise validity
Premise 1: "All birds can fly" - This is false (penguins, ostriches cannot fly)
Premise 2: "Penguins are birds" - This is true
Step 3: Check logical validity
The logical structure is valid, but since Premise 1 is false, the conclusion may not be true
Step 4: Real-world verification
In reality, penguins cannot fly despite being birds
Conclusion: The argument is logically valid but soundness fails due to false premise
Example 2:
Premise 1: If it rains, then the ground gets wet
Premise 2: It is raining
Conclusion: Therefore, the ground gets wet
Step 1: Analyze the structure
This is modus ponens:
If P, then Q
P
Therefore, Q
Step 2: Evaluate premise validity
Premise 1: "If it rains, then the ground gets wet" - Generally true
Premise 2: "It is raining" - Given as true
Step 3: Check logical validity
Modus ponens is a valid argument form
Step 4: Verify the conclusion
Given the premises, the conclusion follows logically
Conclusion: The argument is both logically valid and sound
Now analyze:
Argument: {logical_argument}
Step 1: Analyze the argument structure
{structure_analysis}
Step 2: Evaluate premise validity
{premise_evaluation}
Step 3: Check logical validity
{validity_check}
Step 4: Verify the conclusion
{conclusion_verification}
Final Assessment: {argument_validity_assessment}
```
## Self-Consistency Techniques
### Multiple Reasoning Paths
```
I'll solve this problem using three different approaches and see which result is most reliable.
**Problem**: {complex_problem}
**Approach 1: Direct Calculation**
{first_approach_reasoning}
Result 1: {result_1}
**Approach 2: Logical Deduction**
{second_approach_reasoning}
Result 2: {result_2}
**Approach 3: Pattern Recognition**
{third_approach_reasoning}
Result 3: {result_3}
**Consistency Analysis:**
- Approach 1 and 2 agree: {yes/no}
- Approach 1 and 3 agree: {yes/no}
- Approach 2 and 3 agree: {yes/no}
**Final Decision:**
{majority_result} appears in {count} out of 3 approaches.
Confidence: {high/medium/low}
Most Likely Answer: {final_answer_with_confidence}
```
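A minimal sketch of running this pattern programmatically: sample several reasoning paths at non-zero temperature and keep the majority final answer. The `openai` client call, the model name, and the final-answer extraction rule are illustrative assumptions:
```python
from collections import Counter
import openai

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(problem, n_paths=5, model="gpt-4o-mini"):
    """Sample several chain-of-thought paths and return the majority answer."""
    answers = []
    for _ in range(n_paths):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"{problem}\n\nThink step by step, then finish with "
                           "'Final Answer:' followed by the answer only.",
            }],
            temperature=0.7,  # encourages diverse reasoning paths
        )
        text = response.choices[0].message.content
        if "Final Answer:" in text:
            answers.append(text.rsplit("Final Answer:", 1)[1].strip())
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_paths  # answer plus agreement rate as a confidence proxy
```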
### Verification Loop Pattern
```
Let me solve this step by step and verify each step.
**Problem**: {problem_description}
**Step 1: Initial Analysis**
{initial_analysis}
Verification: Does this make sense? {verification_1}
**Step 2: Solution Development**
{solution_development}
Verification: Does this logically follow from step 1? {verification_2}
**Step 3: Result Calculation**
{result_calculation}
Verification: Does this answer the original question? {verification_3}
**Step 4: Cross-Check**
Let me try a different approach to confirm:
{alternative_approach}
Results comparison: {comparison_analysis}
**Final Answer:**
{conclusion_with_verification_status}
```
## Specialized CoT Patterns
### Code Debugging CoT
```
Debug the following code by analyzing it step by step.
**Code:**
{code_snippet}
**Step 1: Understand the Code's Purpose**
{purpose_analysis}
**Step 2: Identify Expected Behavior**
{expected_behavior}
**Step 3: Trace the Execution**
{execution_trace}
**Step 4: Find the Error**
{error_identification}
**Step 5: Propose Fix**
{fix_proposal}
**Step 6: Verify the Fix**
{fix_verification}
**Fixed Code:**
{corrected_code}
```
### Data Analysis CoT
```
Analyze this data systematically to draw meaningful conclusions.
**Data:**
{dataset}
**Step 1: Understand the Data Structure**
{data_structure_analysis}
**Step 2: Identify Patterns and Trends**
{pattern_identification}
**Step 3: Calculate Key Metrics**
{metrics_calculation}
**Step 4: Compare with Benchmarks**
{benchmark_comparison}
**Step 5: Formulate Insights**
{insight_generation}
**Step 6: Validate Conclusions**
{conclusion_validation}
**Key Findings:**
{summary_of_insights}
```
### Creative Problem Solving CoT
```
Generate creative solutions to this challenging problem.
**Problem:**
{creative_problem}
**Step 1: Reframe the Problem**
{problem_reframing}
**Step 2: Brainstorm Multiple Angles**
- Technical approach: {technical_ideas}
- Business approach: {business_ideas}
- User experience approach: {ux_ideas}
- Unconventional approach: {unconventional_ideas}
**Step 3: Evaluate Each Approach**
{approach_evaluation}
**Step 4: Synthesize Best Elements**
{synthesis_process}
**Step 5: Develop Final Solution**
{solution_development}
**Step 6: Test for Feasibility**
{feasibility_testing}
**Recommended Solution:**
{final_creative_solution}
```
## Implementation Guidelines
### When to Use Chain-of-Thought
- **Multi-step problems**: Tasks requiring sequential reasoning
- **Complex calculations**: Mathematical or logical derivations
- **Problem decomposition**: Tasks that benefit from breaking down
- **Verification needs**: When accuracy is critical
- **Educational contexts**: When showing reasoning is valuable
### CoT Effectiveness Factors
- **Problem complexity**: Higher benefit for complex problems
- **Task type**: Mathematical, logical, and analytical tasks benefit most
- **Model capability**: Newer models handle CoT more effectively
- **Context window**: Ensure sufficient space for reasoning steps
- **Output requirements**: Detailed explanations benefit from CoT
### Common Pitfalls to Avoid
- **Over-explaining simple steps**: Keep proportional detail
- **Circular reasoning**: Ensure logical progression
- **Missing verification**: Always include validation steps
- **Inconsistent confidence**: Use realistic confidence scoring
- **Premature conclusions**: Don't jump to answers without full reasoning
## Integration with Other Techniques
### CoT + Few-Shot Learning
- Include reasoning traces in examples
- Show step-by-step problem-solving demonstrations
- Teach verification and self-checking patterns
### CoT + Template Systems
- Embed CoT patterns within structured templates
- Use conditional CoT based on problem complexity (see the sketch below)
- Implement adaptive reasoning depth
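One way to realize conditional CoT is to pick a reasoning depth from an estimated complexity score and expand the prompt accordingly. A minimal sketch; the thresholds and step wording are illustrative assumptions rather than fixed values:

```python
def build_cot_prompt(task: str, complexity: float) -> str:
    """Assemble a prompt whose reasoning depth scales with estimated task complexity (0..1)."""
    if complexity < 0.3:
        steps = ["Answer directly and briefly justify your answer."]
    elif complexity < 0.7:
        steps = [
            "Step 1: Restate the problem in your own words.",
            "Step 2: Work through the solution.",
            "Step 3: State the final answer.",
        ]
    else:
        steps = [
            "Step 1: Restate the problem and list what is known.",
            "Step 2: Break the problem into sub-problems.",
            "Step 3: Solve each sub-problem in order.",
            "Step 4: Verify the intermediate results.",
            "Step 5: State the final answer with a confidence level.",
        ]
    return "\n".join([f"Task: {task}", "Think through this carefully.", *steps])
```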
### CoT + Prompt Optimization
- Test different CoT formulations
- Optimize reasoning step granularity
- Balance detail with efficiency
This framework provides comprehensive patterns for implementing effective chain-of-thought reasoning across diverse problem types and applications.

View File

@@ -0,0 +1,273 @@
# Few-Shot Learning Patterns
This reference provides comprehensive frameworks for implementing effective few-shot learning strategies that maximize model performance within context window constraints.
## Core Principles
### Example Selection Strategy
#### Semantic Similarity Selection
- Use embedding similarity to find examples closest to target input
- Cluster similar examples to avoid redundancy
- Select diverse representatives from different semantic regions
- Prioritize examples that cover key variations in problem space
#### Diversity Sampling Approach
- Ensure coverage of different input types and patterns
- Include boundary cases and edge conditions
- Balance simple and complex examples
- Select examples that demonstrate different solution strategies
#### Progressive Complexity Ordering
- Start with simplest, most straightforward examples
- Progress to increasingly complex scenarios
- Include challenging edge cases last
- Use this ordering to build understanding incrementally
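A minimal sketch of how the three selection strategies above can be combined, assuming an `embed(text)` function that returns a vector (any sentence-embedding model would do) and an example bank of `{"input": ..., "output": ...}` dicts; input length is used here as a crude stand-in for complexity:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_few_shot_examples(query, example_bank, embed, k=3, min_diversity=0.15):
    """Pick k examples similar to the query while skipping near-duplicates, simplest first."""
    query_vec = embed(query)
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(embed(ex["input"]), query_vec),
                    reverse=True)
    selected = []
    for ex in ranked:
        vec = embed(ex["input"])
        # Diversity check: skip examples too close to ones already chosen
        if all(1 - cosine(vec, embed(s["input"])) >= min_diversity for s in selected):
            selected.append(ex)
        if len(selected) == k:
            break
    return sorted(selected, key=lambda ex: len(ex["input"]))   # progressive complexity ordering
```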
## Example Templates
### Classification Tasks
#### Binary Classification Template
```
Classify if the text expresses positive or negative sentiment.
Example 1:
Text: "I love this product! It works exactly as advertised and exceeded my expectations."
Sentiment: Positive
Reasoning: Contains enthusiastic language, positive adjectives, and satisfaction indicators
Example 2:
Text: "The customer service was terrible and the product broke after one day of use."
Sentiment: Negative
Reasoning: Contains negative adjectives, complaint language, and dissatisfaction indicators
Example 3:
Text: "It's okay, nothing special but does the basic job."
Sentiment: Negative
Reasoning: Contains lukewarm language, lack of enthusiasm, minimal positive elements
Now classify:
Text: {input_text}
Sentiment:
Reasoning:
```
#### Multi-Class Classification Template
```
Categorize the customer inquiry into one of: Technical Support, Billing, Sales, or General.
Example 1:
Inquiry: "My account was charged twice for the same subscription this month"
Category: Billing
Key indicators: "charged twice", "subscription", "account", financial terms
Example 2:
Inquiry: "The app keeps crashing when I try to upload files larger than 10MB"
Category: Technical Support
Key indicators: "crashing", "upload files", "technical issue", "error report"
Example 3:
Inquiry: "What are your pricing plans for enterprise customers?"
Category: Sales
Key indicators: "pricing plans", "enterprise", business inquiry, sales question
Now categorize:
Inquiry: {inquiry_text}
Category:
Key indicators:
```
### Transformation Tasks
#### Text Transformation Template
```
Convert formal business text into casual, friendly language.
Example 1:
Formal: "We regret to inform you that your request cannot be processed at this time due to insufficient documentation."
Casual: "Sorry, but we can't process your request right now because some documents are missing."
Example 2:
Formal: "The aforementioned individual has demonstrated exceptional proficiency in the designated responsibilities."
Casual: "They've done a great job with their tasks and really know what they're doing."
Example 3:
Formal: "Please be advised that the scheduled meeting has been postponed pending further notice."
Casual: "Hey, just letting you know that we've put off the meeting for now and will let you know when it's rescheduled."
Now convert:
Formal: {formal_text}
Casual:
```
#### Data Extraction Template
```
Extract key information from the job posting into structured format.
Example 1:
Job Posting: "We are seeking a Senior Software Engineer with 5+ years of experience in Python and cloud technologies. This is a remote position offering $120k-$150k salary plus equity."
Extracted:
- Position: Senior Software Engineer
- Experience Required: 5+ years
- Skills: Python, cloud technologies
- Location: Remote
- Salary: $120k-$150k plus equity
Example 2:
Job Posting: "Marketing Manager needed for growing startup. Must have 3 years experience in digital marketing, social media management, and content creation. San Francisco office, competitive compensation."
Extracted:
- Position: Marketing Manager
- Experience Required: 3 years
- Skills: Digital marketing, social media management, content creation
- Location: San Francisco
- Salary: Competitive compensation
Now extract:
Job Posting: {job_posting_text}
Extracted:
```
### Generation Tasks
#### Creative Writing Template
```
Generate compelling product descriptions following the shown patterns.
Example 1:
Product: Wireless headphones with noise cancellation
Description: "Immerse yourself in crystal-clear audio with our premium wireless headphones. Advanced noise cancellation technology blocks out distractions while 30-hour battery life keeps you connected all day long."
Example 2:
Product: Smart home security camera
Description: "Protect what matters most with intelligent monitoring that alerts you to activity instantly. AI-powered detection distinguishes between people, pets, and vehicles for truly smart security."
Example 3:
Product: Portable espresso maker
Description: "Barista-quality espresso anywhere, anytime. Compact design meets professional-grade extraction in this revolutionary portable machine that delivers perfect shots in under 30 seconds."
Now generate:
Product: {product_description}
Description:
```
### Error Correction Patterns
#### Error Detection and Correction Template
```
Identify and correct errors in the given text.
Example 1:
Text with errors: "Their going to the park to play there new game with they're friends."
Correction: "They're going to the park to play their new game with their friends."
Errors fixed: "Their → They're", "there → their", "they're → their"
Example 2:
Text with errors: "The company's new policy effects every employee and there morale."
Correction: "The company's new policy affects every employee and their morale."
Errors fixed: "effects → affects", "there → their"
Example 3:
Text with errors: "Its important to review you're work carefully before submiting."
Correction: "It's important to review your work carefully before submitting."
Errors fixed: "Its → It's", "you're → your", "submiting → submitting"
Now correct:
Text with errors: {text_with_errors}
Correction:
Errors fixed:
```
## Advanced Strategies
### Dynamic Example Selection
#### Context-Aware Selection
```python
def select_examples(input_text, example_database, max_examples=3):
"""
Select most relevant examples based on semantic similarity and diversity.
"""
# 1. Calculate similarity scores
similarities = calculate_similarity(input_text, example_database)
# 2. Sort by similarity
sorted_examples = sort_by_similarity(similarities)
# 3. Apply diversity sampling
diverse_examples = diversity_sampling(sorted_examples, max_examples)
# 4. Order by complexity
final_examples = order_by_complexity(diverse_examples)
return final_examples
```
#### Adaptive Example Count
```python
def determine_example_count(input_complexity, context_limit):
"""
Adjust example count based on input complexity and available context.
"""
base_count = 3
# Complex inputs benefit from more examples
if input_complexity > 0.8:
return min(base_count + 2, context_limit)
elif input_complexity > 0.5:
return base_count + 1
else:
return max(base_count - 1, 2)
```
### Quality Metrics for Examples
#### Example Effectiveness Scoring
```python
def score_example_effectiveness(example, test_cases):
"""
Score how effectively an example teaches the desired pattern.
"""
metrics = {
'coverage': measure_pattern_coverage(example),
'clarity': measure_instructional_clarity(example),
'uniqueness': measure_uniqueness_from_other_examples(example),
'difficulty': measure_appropriateness_difficulty(example)
}
return weighted_average(metrics, weights=[0.3, 0.3, 0.2, 0.2])
```
## Best Practices
### Example Quality Guidelines
- **Clarity**: Examples should clearly demonstrate the desired pattern
- **Accuracy**: Input-output pairs must be correct and consistent
- **Relevance**: Examples should be representative of target task
- **Diversity**: Include variation in input types and complexity levels
- **Completeness**: Cover edge cases and boundary conditions
### Context Management
- **Token Efficiency**: Optimize example length while maintaining clarity (see the budgeting sketch after this list)
- **Progressive Disclosure**: Start simple, increase complexity gradually
- **Redundancy Elimination**: Remove overlapping or duplicate examples
- **Compression**: Use concise representations where possible
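For the token-efficiency point above, one practical approach is to count tokens before the prompt is assembled and stop adding examples once a budget is reached. A sketch assuming the `tiktoken` tokenizer and a pre-prioritized example list; any tokenizer that matches the target model would work:

```python
import tiktoken

def fit_examples_to_budget(examples, max_tokens=1500, encoding_name="cl100k_base"):
    """Keep examples (highest priority first) until the token budget would be exceeded."""
    enc = tiktoken.get_encoding(encoding_name)
    kept, used = [], 0
    for example in examples:          # assumed to be sorted by priority already
        cost = len(enc.encode(example))
        if used + cost > max_tokens:
            break
        kept.append(example)
        used += cost
    return kept, used
```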
### Common Pitfalls to Avoid
- **Overfitting**: Don't include too many examples from same pattern
- **Under-representation**: Ensure coverage of important variations
- **Ambiguity**: Examples should have clear, unambiguous solutions
- **Context Overflow**: Balance example count with window limitations
- **Poor Ordering**: Place examples in logical progression order
## Integration with Other Patterns
Few-shot learning combines effectively with:
- **Chain-of-Thought**: Add reasoning steps to examples
- **Template Systems**: Use few-shot within structured templates
- **Prompt Optimization**: Test different example selections
- **System Prompts**: Establish few-shot learning expectations in system prompts
This framework provides the foundation for implementing effective few-shot learning across diverse tasks and model types.

View File

@@ -0,0 +1,488 @@
# Prompt Optimization Frameworks
This reference provides systematic methodologies for iteratively improving prompt performance through structured testing, measurement, and refinement processes.
## Optimization Process Overview
### Iterative Improvement Cycle
```mermaid
graph TD
A[Baseline Measurement] --> B[Hypothesis Generation]
B --> C[Controlled Test]
C --> D[Performance Analysis]
D --> E[Statistical Validation]
E --> F[Implementation Decision]
F --> G[Monitor Impact]
G --> H[Learn & Iterate]
H --> B
```
### Core Optimization Principles
- **Single Variable Testing**: Change one element at a time for accurate attribution
- **Measurable Metrics**: Define quantitative success criteria
- **Statistical Significance**: Use proper sample sizes and validation methods
- **Controlled Environment**: Test conditions must be consistent
- **Baseline Comparison**: Always measure against established baseline
## Performance Metrics Framework
### Primary Metrics
#### Task Success Rate
```python
def calculate_success_rate(results, expected_outputs):
"""
Measure percentage of tasks completed correctly.
"""
correct = sum(1 for result, expected in zip(results, expected_outputs)
if result == expected)
return (correct / len(results)) * 100
```
#### Response Consistency
```python
def measure_consistency(prompt, test_cases, num_runs=5):
"""
Measure response stability across multiple runs.
"""
responses = {}
for test_case in test_cases:
test_responses = []
for _ in range(num_runs):
response = execute_prompt(prompt, test_case)
test_responses.append(response)
# Calculate similarity score for consistency
consistency = calculate_similarity(test_responses)
responses[test_case] = consistency
return sum(responses.values()) / len(responses)
```
#### Token Efficiency
```python
def calculate_token_efficiency(prompt, test_cases):
"""
Measure token usage per successful task completion.
"""
total_tokens = 0
successful_tasks = 0
for test_case in test_cases:
response = execute_prompt_with_metrics(prompt, test_case)
total_tokens += response.token_count
if response.is_successful:
successful_tasks += 1
return total_tokens / successful_tasks if successful_tasks > 0 else float('inf')
```
#### Response Latency
```python
import time

def measure_response_time(prompt, test_cases):
"""
Measure average response time.
"""
times = []
for test_case in test_cases:
start_time = time.time()
execute_prompt(prompt, test_case)
end_time = time.time()
times.append(end_time - start_time)
return sum(times) / len(times)
```
### Secondary Metrics
#### Output Quality Score
```python
def assess_output_quality(response, criteria):
"""
Multi-dimensional quality assessment.
"""
scores = {
'accuracy': measure_accuracy(response),
'completeness': measure_completeness(response),
'coherence': measure_coherence(response),
'relevance': measure_relevance(response),
'format_compliance': measure_format_compliance(response)
}
weights = [0.3, 0.2, 0.2, 0.2, 0.1]
return sum(score * weight for score, weight in zip(scores.values(), weights))
```
#### Safety Compliance
```python
def check_safety_compliance(response):
"""
Measure adherence to safety guidelines.
"""
violations = []
# Check for various safety issues
if contains_harmful_content(response):
violations.append('harmful_content')
if has_bias(response):
violations.append('bias')
if violates_privacy(response):
violations.append('privacy_violation')
safety_score = max(0, 100 - len(violations) * 25)
return safety_score, violations
```
## A/B Testing Methodology
### Controlled Test Design
```python
import random

def design_ab_test(baseline_prompt, variant_prompt, test_cases):
"""
Design controlled A/B test with proper statistical power.
"""
# Calculate required sample size
effect_size = estimate_effect_size(baseline_prompt, variant_prompt)
sample_size = calculate_sample_size(effect_size, power=0.8, alpha=0.05)
# Random assignment
randomized_cases = random.sample(test_cases, sample_size)
split_point = len(randomized_cases) // 2
group_a = randomized_cases[:split_point]
group_b = randomized_cases[split_point:]
return {
'baseline_group': group_a,
'variant_group': group_b,
'sample_size': sample_size,
'statistical_power': 0.8,
'significance_level': 0.05
}
```
### Statistical Analysis
```python
import numpy as np
from scipy import stats

def analyze_ab_results(baseline_results, variant_results):
"""
Perform statistical analysis of A/B test results.
"""
# Calculate means and standard deviations
baseline_mean = np.mean(baseline_results)
variant_mean = np.mean(variant_results)
baseline_std = np.std(baseline_results)
variant_std = np.std(variant_results)
# Perform t-test
t_statistic, p_value = stats.ttest_ind(baseline_results, variant_results)
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(baseline_results) - 1) * baseline_std**2 +
(len(variant_results) - 1) * variant_std**2) /
(len(baseline_results) + len(variant_results) - 2))
cohens_d = (variant_mean - baseline_mean) / pooled_std
return {
'baseline_mean': baseline_mean,
'variant_mean': variant_mean,
'improvement': ((variant_mean - baseline_mean) / baseline_mean) * 100,
'p_value': p_value,
'statistical_significance': p_value < 0.05,
'effect_size': cohens_d,
'recommendation': 'implement_variant' if p_value < 0.05 and cohens_d > 0.2 else 'keep_baseline'
}
```
## Optimization Strategies
### Strategy 1: Progressive Enhancement
#### Stepwise Improvement Process
```python
def progressive_optimization(base_prompt, test_cases, max_iterations=10):
"""
Incrementally improve prompt through systematic testing.
"""
current_prompt = base_prompt
current_performance = evaluate_prompt(current_prompt, test_cases)
optimization_history = []
for iteration in range(max_iterations):
# Generate improvement hypotheses
hypotheses = generate_improvement_hypotheses(current_prompt, current_performance)
best_improvement = None
best_performance = current_performance
for hypothesis in hypotheses:
# Test hypothesis
test_prompt = apply_hypothesis(current_prompt, hypothesis)
test_performance = evaluate_prompt(test_prompt, test_cases)
# Validate improvement
if is_statistically_significant(current_performance, test_performance):
if test_performance.overall_score > best_performance.overall_score:
best_improvement = hypothesis
best_performance = test_performance
# Apply best improvement if found
if best_improvement:
current_prompt = apply_hypothesis(current_prompt, best_improvement)
optimization_history.append({
'iteration': iteration,
'hypothesis': best_improvement,
'performance_before': current_performance,
'performance_after': best_performance,
'improvement': best_performance.overall_score - current_performance.overall_score
})
current_performance = best_performance
else:
break # No further improvements found
return current_prompt, optimization_history
```
### Strategy 2: Multi-Objective Optimization
#### Pareto Optimization Framework
```python
def multi_objective_optimization(prompt_variants, objectives):
"""
Optimize for multiple competing objectives using Pareto efficiency.
"""
results = []
for variant in prompt_variants:
scores = {}
for objective in objectives:
scores[objective] = evaluate_objective(variant, objective)
results.append({
'prompt': variant,
'scores': scores,
'dominates': []
})
# Find Pareto optimal solutions
pareto_optimal = []
for i, result_i in enumerate(results):
is_dominated = False
for j, result_j in enumerate(results):
if i != j and dominates(result_j, result_i):
is_dominated = True
break
if not is_dominated:
pareto_optimal.append(result_i)
return pareto_optimal
def dominates(result_a, result_b):
    """
    Check if result_a Pareto-dominates result_b: at least as good in every objective
    and strictly better in at least one.
    """
    objectives = result_a['scores']
    at_least_as_good = all(result_a['scores'][obj] >= result_b['scores'][obj] for obj in objectives)
    strictly_better = any(result_a['scores'][obj] > result_b['scores'][obj] for obj in objectives)
    return at_least_as_good and strictly_better
```
### Strategy 3: Adaptive Testing
#### Dynamic Test Allocation
```python
def adaptive_testing(prompt_variants, initial_budget):
"""
Dynamically allocate testing budget to promising variants.
"""
# Initial exploration phase
exploration_results = {}
    budget_per_variant = initial_budget // len(prompt_variants)
    for variant in prompt_variants:
        exploration_results[variant] = test_prompt(variant, budget_per_variant)
    # Exploitation phase - allocate more budget to promising variants
    total_budget_spent = len(prompt_variants) * budget_per_variant
remaining_budget = initial_budget - total_budget_spent
# Sort by performance
sorted_variants = sorted(exploration_results.items(),
key=lambda x: x[1].overall_score, reverse=True)
# Allocate remaining budget proportionally to performance
final_results = {}
for i, (variant, initial_result) in enumerate(sorted_variants):
if remaining_budget > 0:
additional_budget = max(1, remaining_budget // (len(sorted_variants) - i))
final_results[variant] = test_prompt(variant, additional_budget)
remaining_budget -= additional_budget
else:
final_results[variant] = initial_result
return final_results
```
## Optimization Hypotheses
### Common Optimization Areas
#### Instruction Clarity
```python
instruction_clarity_hypotheses = [
"Add numbered steps to instructions",
"Include specific output format examples",
"Clarify role and expertise level",
"Add context and background information",
"Specify constraints and boundaries",
"Include success criteria and evaluation standards"
]
```
#### Example Quality
```python
example_optimization_hypotheses = [
"Increase number of examples from 3 to 5",
"Add edge case examples",
"Reorder examples by complexity",
"Include negative examples",
"Add reasoning traces to examples",
"Improve example diversity and coverage"
]
```
#### Structure Optimization
```python
structure_hypotheses = [
"Add clear section headings",
"Reorganize content flow",
"Include summary at the beginning",
"Add checklist for verification",
"Separate instructions from examples",
"Add troubleshooting section"
]
```
#### Model-Specific Optimization
```python
model_specific_hypotheses = {
'claude': [
"Use XML tags for structure",
"Add <thinking> sections for reasoning",
"Include constitutional AI principles",
"Use system message format",
"Add safety guidelines and constraints"
],
'gpt-4': [
"Use numbered sections with ### headers",
"Include JSON format specifications",
"Add function calling patterns",
"Use bullet points for clarity",
"Include error handling instructions"
],
'gemini': [
"Use bold headers with ** formatting",
"Include step-by-step process descriptions",
"Add validation checkpoints",
"Use conversational tone",
"Include confidence scoring"
]
}
```
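These hypothesis strings only become actionable once each maps to a concrete prompt transformation that `progressive_optimization` above can apply. A minimal sketch of such a dispatcher; the three transformations shown are illustrative placeholders for whatever edits your pipeline supports:

```python
def apply_hypothesis(prompt: str, hypothesis: str) -> str:
    """Apply a named optimization hypothesis to a prompt, or return it unchanged."""
    transformations = {
        "Add numbered steps to instructions":
            lambda p: p + "\n\nFollow these steps:\n1. Read the input.\n2. Apply the rules.\n3. Produce the output.",
        "Include specific output format examples":
            lambda p: p + '\n\nExample output:\n{"result": "..."}',
        "Add clear section headings":
            lambda p: "## Instructions\n" + p,
    }
    transform = transformations.get(hypothesis)
    return transform(prompt) if transform else prompt
```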
## Continuous Monitoring
### Production Performance Tracking
```python
def setup_monitoring(prompt, alert_thresholds):
"""
Set up continuous monitoring for deployed prompts.
"""
monitors = {
'success_rate': MetricMonitor('success_rate', alert_thresholds['success_rate']),
'response_time': MetricMonitor('response_time', alert_thresholds['response_time']),
'token_cost': MetricMonitor('token_cost', alert_thresholds['token_cost']),
'safety_score': MetricMonitor('safety_score', alert_thresholds['safety_score'])
}
def monitor_performance():
recent_data = collect_recent_performance(prompt)
alerts = []
for metric_name, monitor in monitors.items():
if metric_name in recent_data:
alert = monitor.check(recent_data[metric_name])
if alert:
alerts.append(alert)
return alerts
return monitor_performance
```
### Automated Rollback System
```python
def automated_rollback_system(prompts, monitoring_data):
"""
Automatically rollback to previous version if performance degrades.
"""
def check_and_rollback(current_prompt, baseline_prompt):
current_metrics = monitoring_data.get_metrics(current_prompt)
baseline_metrics = monitoring_data.get_metrics(baseline_prompt)
# Check if performance degradation exceeds threshold
degradation_threshold = 0.1 # 10% degradation
for metric in current_metrics:
if current_metrics[metric] < baseline_metrics[metric] * (1 - degradation_threshold):
return True, f"Performance degradation in {metric}"
return False, "Performance acceptable"
return check_and_rollback
```
## Optimization Tools and Utilities
### Prompt Variation Generator
```python
def generate_prompt_variations(base_prompt):
"""
Generate systematic variations for testing.
"""
variations = {}
# Instruction variations
variations['more_detailed'] = add_detailed_instructions(base_prompt)
variations['simplified'] = simplify_instructions(base_prompt)
variations['structured'] = add_structured_format(base_prompt)
# Example variations
variations['more_examples'] = add_examples(base_prompt)
variations['better_examples'] = improve_example_quality(base_prompt)
variations['diverse_examples'] = add_example_diversity(base_prompt)
# Format variations
variations['numbered_steps'] = add_numbered_steps(base_prompt)
variations['bullet_points'] = use_bullet_points(base_prompt)
variations['sections'] = add_section_headers(base_prompt)
return variations
```
### Performance Dashboard
```python
def create_performance_dashboard(optimization_history):
"""
Create visualization of optimization progress.
"""
# Generate performance metrics over time
metrics_over_time = {
'iterations': [h['iteration'] for h in optimization_history],
'success_rates': [h['performance_after'].success_rate for h in optimization_history],
'token_efficiency': [h['performance_after'].token_efficiency for h in optimization_history],
'response_times': [h['performance_after'].response_time for h in optimization_history]
}
return PerformanceDashboard(metrics_over_time)
```
This comprehensive framework provides systematic methodologies for continuous prompt improvement through data-driven optimization and rigorous testing processes.

View File

@@ -0,0 +1,494 @@
# System Prompt Design
This reference provides comprehensive frameworks for designing effective system prompts that establish consistent model behavior, define clear boundaries, and ensure reliable performance across diverse applications.
## System Prompt Architecture
### Core Components Structure
```
1. Role Definition & Expertise
2. Behavioral Guidelines & Constraints
3. Interaction Protocols
4. Output Format Specifications
5. Safety & Ethical Guidelines
6. Context & Background Information
7. Quality Standards & Verification
8. Error Handling & Uncertainty Protocols
```
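The eight components above can be assembled mechanically once each one has been written. A minimal sketch, where the per-component texts are placeholders supplied by the prompt author:

```python
SECTION_ORDER = [
    "Role Definition & Expertise",
    "Behavioral Guidelines & Constraints",
    "Interaction Protocols",
    "Output Format Specifications",
    "Safety & Ethical Guidelines",
    "Context & Background Information",
    "Quality Standards & Verification",
    "Error Handling & Uncertainty Protocols",
]

def assemble_system_prompt(sections: dict) -> str:
    """Join the provided component texts in the canonical order, skipping empty ones."""
    parts = []
    for name in SECTION_ORDER:
        body = sections.get(name, "").strip()
        if body:
            parts.append(f"## {name}\n{body}")
    return "\n\n".join(parts)
```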
## Component Design Patterns
### 1. Role Definition Framework
#### Comprehensive Role Specification
```markdown
## Role Definition
You are an expert {role} with {experience_level} of specialized experience in {domain}. Your expertise includes:
### Core Competencies
- {competency_1}
- {competency_2}
- {competency_3}
- {competency_4}
### Knowledge Boundaries
- You have deep knowledge of {strength_area_1} and {strength_area_2}
- Your knowledge is current as of {knowledge_cutoff_date}
- You should acknowledge limitations in {limitation_area}
- When uncertain about recent developments, state this explicitly
### Professional Standards
- Adhere to {industry_standard_1} guidelines
- Follow {industry_standard_2} best practices
- Maintain {professional_attribute} in all interactions
- Ensure compliance with {regulatory_framework}
```
#### Specialized Role Templates
##### Technical Expert Role
```markdown
## Technical Expert Role
You are a Senior {domain} Engineer with {years} years of experience in {specialization}. Your expertise encompasses:
### Technical Proficiency
- Deep understanding of {technology_stack}
- Experience with {specific_frameworks} and {tools}
- Knowledge of {design_patterns} and {architectures}
- Proficiency in {programming_languages} and {development_methodologies}
### Problem-Solving Approach
- Analyze problems systematically using {methodology}
- Consider multiple solution approaches before recommending
- Evaluate trade-offs between {criteria_1}, {criteria_2}, and {criteria_3}
- Provide scalable and maintainable solutions
### Communication Style
- Explain technical concepts clearly to both technical and non-technical audiences
- Use precise terminology when appropriate
- Provide concrete examples and code snippets when helpful
- Structure responses with clear sections and logical flow
```
##### Analyst Role
```markdown
## Analyst Role
You are a professional {analysis_type} Analyst with expertise in {data_domain} and {methodology}. Your analytical approach includes:
### Analytical Framework
- Apply {analytical_methodology} for systematic analysis
- Use {statistical_techniques} for data interpretation
- Consider {contextual_factors} in your analysis
- Validate findings through {verification_methods}
### Critical Thinking Process
- Question assumptions and identify potential biases
- Evaluate evidence quality and source reliability
- Consider alternative explanations and perspectives
- Synthesize information from multiple sources
### Reporting Standards
- Present findings with appropriate confidence levels
- Distinguish between facts, interpretations, and recommendations
- Provide evidence-based conclusions
- Acknowledge limitations and uncertainties
```
### 2. Behavioral Guidelines Design
#### Comprehensive Behavior Framework
```markdown
## Behavioral Guidelines
### Interaction Style
- Maintain {tone} tone throughout all interactions
- Use {communication_approach} when explaining complex concepts
- Be {responsiveness_level} in addressing user questions
- Demonstrate {empathy_level} when dealing with user challenges
### Response Standards
- Provide responses that are {length_preference} and {detail_preference}
- Structure information using {organization_pattern}
- Include {frequency} examples and illustrations
- Use {format_preference} formatting for clarity
### Quality Expectations
- Ensure all information is {accuracy_standard}
- Provide citations for {information_type} when available
- Cross-verify information using {verification_method}
- Update knowledge based on {update_criteria}
```
#### Model-Specific Behavior Patterns
##### Claude 3.5/4 Specific Guidelines
```markdown
## Claude-Specific Behavioral Guidelines
### Constitutional Alignment
- Follow constitutional AI principles in all responses
- Prioritize helpfulness while maintaining safety
- Consider multiple perspectives before concluding
- Avoid harmful content while remaining useful
### Output Formatting
- Use XML tags for structured information: <tag>content</tag>
- Include thinking blocks for complex reasoning: <thinking>...</thinking>
- Provide clear section headers with proper hierarchy
- Use markdown formatting for improved readability
### Safety Protocols
- Apply content policies consistently
- Identify and flag potentially harmful requests
- Provide safe alternatives when appropriate
- Maintain transparency about limitations
```
##### GPT-4 Specific Guidelines
```markdown
## GPT-4 Specific Behavioral Guidelines
### Structured Response Patterns
- Use numbered lists for step-by-step processes
- Implement clear section boundaries with ### headers
- Provide JSON formatted outputs when specified
- Use consistent indentation and formatting
### Function Calling Integration
- Recognize when function calling would be appropriate
- Structure responses to facilitate tool usage
- Provide clear parameter specifications
- Handle function results systematically
### Optimization Behaviors
- Balance conciseness with comprehensiveness
- Prioritize information relevance and importance
- Use efficient language patterns
- Minimize redundancy while maintaining clarity
```
### 3. Output Format Specifications
#### Comprehensive Format Framework
```markdown
## Output Format Requirements
### Structure Standards
- Begin responses with {opening_pattern}
- Use {section_pattern} for major sections
- Implement {hierarchy_pattern} for information organization
- Include {closing_pattern} for response completion
### Content Organization
- Present information in {presentation_order}
- Group related information using {grouping_method}
- Use {transition_pattern} between sections
- Include {summary_element} for complex responses
### Format Specifications
{if json_format_required}
- Provide responses in valid JSON format
- Use consistent key naming conventions
- Include all required fields
- Validate JSON syntax before output
{endif}
{if markdown_format_required}
- Use markdown for formatting and emphasis
- Include appropriate heading levels
- Use code blocks for technical content
- Implement tables for structured data
{endif}
```
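When the JSON branch above applies, it helps to validate the model's output before passing it downstream. A minimal sketch; the required field names are assumptions standing in for whatever schema your application defines:

```python
import json

REQUIRED_FIELDS = ["category", "confidence", "reasoning"]   # assumed schema

def validate_json_response(raw_response: str):
    """Return (is_valid, problems) for a response that is supposed to be JSON."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc}"]
    problems = [f"missing required field: {field}"
                for field in REQUIRED_FIELDS if field not in payload]
    return len(problems) == 0, problems
```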
### 4. Safety and Ethical Guidelines
#### Comprehensive Safety Framework
```markdown
## Safety and Ethical Guidelines
### Content Policies
- Avoid generating {prohibited_content_1}
- Do not provide {prohibited_content_2}
- Flag {sensitive_topics} for human review
- Provide {safe_alternatives} when appropriate
### Ethical Considerations
- Consider {ethical_principle_1} in all responses
- Evaluate potential {ethical_impact} of provided information
- Balance helpfulness with {safety_consideration}
- Maintain {transparency_standard} about limitations
### Bias Mitigation
- Actively identify and mitigate {bias_type_1}
- Present information {neutrality_standard}
- Include {diverse_perspectives} when appropriate
- Avoid {stereotype_patterns}
### Harm Prevention
- Identify potential {harm_type_1} in responses
- Implement {prevention_mechanism} for harmful content
- Provide {warning_system} for sensitive topics
- Include {escalation_protocol} for concerning requests
```
### 5. Error Handling and Uncertainty
#### Comprehensive Error Management
```markdown
## Error Handling and Uncertainty Protocols
### Uncertainty Management
- Explicitly state confidence levels for uncertain information
- Use phrases like "I believe," "It appears that," "Based on available information"
- Acknowledge when information may be {uncertainty_type}
- Provide {verification_method} for uncertain claims
### Error Recognition
- Identify when {error_pattern} might have occurred
- Implement {self_checking_mechanism} for accuracy
- Use {validation_process} for important information
- Provide {correction_protocol} when errors are identified
### Limitation Acknowledgment
- Clearly state {knowledge_limitation} when relevant
- Explain {limitation_reason} when unable to provide complete information
- Suggest {alternative_approach} when direct assistance isn't possible
- Provide {escalation_option} for complex scenarios
### Correction Procedures
- Implement {correction_workflow} for identified errors
- Provide {explanation_format} for corrections
- Use {acknowledgment_pattern} for mistakes
- Include {improvement_commitment} for future accuracy
```
## Specialized System Prompt Templates
### 1. Educational Assistant System Prompt
```markdown
# Educational Assistant System Prompt
## Role Definition
You are an expert educational assistant specializing in {subject_area} with {experience_level} of teaching experience. Your pedagogical approach emphasizes {teaching_philosophy} and adapts to different learning styles.
## Educational Philosophy
- Create inclusive and supportive learning environments
- Adapt explanations to match learner's comprehension level
- Use scaffolding techniques to build understanding progressively
- Encourage critical thinking and independent learning
## Teaching Standards
- Provide accurate, up-to-date information verified through {verification_sources}
- Use clear, accessible language appropriate for the target audience
- Include relevant examples and analogies to enhance understanding
- Structure learning objectives with clear progression
## Interaction Protocols
- Assess learner's current understanding before providing explanations
- Ask clarifying questions to tailor responses appropriately
- Provide opportunities for learner questions and feedback
- Offer additional resources for extended learning
## Output Format
- Begin with brief assessment of learner's needs
- Use clear headings and organized structure
- Include summary points for key takeaways
- Provide practice exercises when appropriate
- End with suggestions for further learning
## Safety Guidelines
- Create psychologically safe learning environments
- Avoid language that might discourage or intimidate learners
- Be patient and supportive when learners struggle with concepts
- Respect diverse backgrounds and learning abilities
## Uncertainty Handling
- Acknowledge when topics are beyond current expertise
- Suggest reliable resources for additional information
- Be transparent about the limits of available knowledge
- Encourage critical thinking and independent verification
```
### 2. Technical Documentation Generator System Prompt
```markdown
# Technical Documentation System Prompt
## Role Definition
You are a Senior Technical Writer with {years} of experience creating documentation for {technology_domain}. Your expertise encompasses {documentation_types} and you follow {industry_standards} for technical communication.
## Documentation Standards
- Follow {style_guide} for consistent formatting and terminology
- Ensure clarity and accuracy in all technical explanations
- Include practical examples and code snippets when helpful
- Structure content with clear hierarchy and logical flow
## Quality Requirements
- Maintain technical accuracy verified through {review_process}
- Use consistent terminology throughout documentation
- Provide comprehensive coverage of topics without overwhelming detail
- Include troubleshooting information for common issues
## Audience Considerations
- Target documentation at {audience_level} technical proficiency
- Define technical terms and concepts appropriately
- Provide progressive disclosure of complex information
- Include context and motivation for technical decisions
## Format Specifications
- Use markdown formatting for clear structure and readability
- Include code blocks with syntax highlighting
- Implement consistent section headings and numbering
- Provide navigation aids and cross-references
## Review Process
- Verify technical accuracy through {verification_method}
- Test all code examples and procedures
- Ensure completeness of coverage for documented features
- Validate clarity and comprehensibility with target audience
## Safety and Compliance
- Include security considerations where relevant
- Document potential risks and mitigation strategies
- Follow industry compliance requirements
- Maintain confidentiality for sensitive information
```
### 3. Data Analysis System Prompt
```markdown
# Data Analysis System Prompt
## Role Definition
You are an expert Data Analyst specializing in {data_domain} with {years} of experience in {analysis_methodologies}. Your analytical approach combines {technical_skills} with {business_acumen} to deliver actionable insights.
## Analytical Framework
- Apply {statistical_methods} for rigorous data analysis
- Use {visualization_techniques} for effective data communication
- Implement {quality_assurance} processes for data validation
- Follow {ethical_guidelines} for responsible data handling
## Analysis Standards
- Ensure methodological soundness in all analyses
- Provide clear documentation of analytical processes
- Include appropriate statistical measures and confidence intervals
- Validate findings through {validation_methods}
## Communication Requirements
- Present findings with appropriate technical depth for the audience
- Use clear visualizations and narrative explanations
- Highlight actionable insights and recommendations
- Acknowledge limitations and uncertainties in analyses
## Output Structure
```json
{
"executive_summary": "High-level overview of key findings",
"methodology": "Description of analytical approach and methods used",
"data_overview": "Summary of data sources, quality, and limitations",
"key_findings": [
{
"finding": "Specific discovery or insight",
"evidence": "Supporting data and statistical measures",
"confidence": "Confidence level in the finding",
"implications": "Business or operational implications"
}
],
"recommendations": [
{
"action": "Recommended action",
"priority": "High/Medium/Low",
"expected_impact": "Anticipated outcome",
"implementation_considerations": "Factors to consider"
}
],
"limitations": "Constraints and limitations of the analysis",
"next_steps": "Suggested follow-up analyses or actions"
}
```
## Ethical Considerations
- Protect privacy and confidentiality of data subjects
- Ensure unbiased analysis and interpretation
- Consider potential impact of findings on stakeholders
- Maintain transparency about analytical limitations
```
## System Prompt Testing and Validation
### Validation Framework
```python
class SystemPromptValidator:
def __init__(self):
self.validation_criteria = {
'role_clarity': 0.2,
'instruction_specificity': 0.2,
'safety_completeness': 0.15,
'output_format_clarity': 0.15,
'error_handling_coverage': 0.1,
'behavioral_consistency': 0.1,
'ethical_considerations': 0.1
}
def validate_prompt(self, system_prompt):
"""Validate system prompt against quality criteria."""
scores = {}
scores['role_clarity'] = self.assess_role_clarity(system_prompt)
scores['instruction_specificity'] = self.assess_instruction_specificity(system_prompt)
scores['safety_completeness'] = self.assess_safety_completeness(system_prompt)
scores['output_format_clarity'] = self.assess_output_format_clarity(system_prompt)
scores['error_handling_coverage'] = self.assess_error_handling(system_prompt)
scores['behavioral_consistency'] = self.assess_behavioral_consistency(system_prompt)
scores['ethical_considerations'] = self.assess_ethical_considerations(system_prompt)
# Calculate overall score
overall_score = sum(score * weight for score, weight in
zip(scores.values(), self.validation_criteria.values()))
return {
'overall_score': overall_score,
'individual_scores': scores,
'recommendations': self.generate_recommendations(scores)
}
def test_prompt_consistency(self, system_prompt, test_scenarios):
"""Test prompt behavior consistency across different scenarios."""
results = []
for scenario in test_scenarios:
response = execute_with_system_prompt(system_prompt, scenario)
# Analyze response consistency
consistency_score = self.analyze_response_consistency(response, system_prompt)
results.append({
'scenario': scenario,
'response': response,
'consistency_score': consistency_score
})
average_consistency = sum(r['consistency_score'] for r in results) / len(results)
return {
'average_consistency': average_consistency,
'scenario_results': results,
'recommendations': self.generate_consistency_recommendations(results)
}
```
## Best Practices Summary
### Design Principles
- **Clarity First**: Ensure role and instructions are unambiguous
- **Comprehensive Coverage**: Address all aspects of model behavior
- **Consistency Focus**: Maintain consistent behavior across scenarios
- **Safety Priority**: Include robust safety guidelines and constraints
- **Flexibility Built-in**: Allow for adaptation to different contexts
### Common Pitfalls to Avoid
- **Vague Instructions**: Be specific about expected behaviors
- **Over-constraining**: Allow room for intelligent adaptation
- **Missing Safety Guidelines**: Always include comprehensive safety measures
- **Inconsistent Formatting**: Use consistent structure throughout
- **Ignoring Model Capabilities**: Design prompts that leverage model strengths
This comprehensive system prompt design framework provides the foundation for creating effective, reliable, and safe AI system behaviors across diverse applications and use cases.

View File

@@ -0,0 +1,599 @@
# Template Systems Architecture
This reference provides comprehensive frameworks for building modular, reusable prompt templates with variable interpolation, conditional sections, and hierarchical composition.
## Template Design Principles
### Modularity and Reusability
- **Single Responsibility**: Each template handles one specific type of task
- **Composability**: Templates can be combined to create complex prompts
- **Parameterization**: Variables allow customization without core changes
- **Inheritance**: Base templates can be extended for specific use cases
### Clear Variable Naming Conventions
```
{user_input} - Direct input from user
{context} - Background information
{examples} - Few-shot learning examples
{constraints} - Task limitations and requirements
{output_format} - Desired output structure
{role} - AI role or persona
{expertise_level} - Level of expertise for the role
{domain} - Specific domain or field
{difficulty} - Task complexity level
{language} - Output language specification
```
## Core Template Components
### 1. Base Template Structure
```
# Template: Universal Task Framework
# Purpose: Base template for most task types
# Variables: {role}, {task_description}, {context}, {examples}, {output_format}
## System Instructions
You are a {role} with {expertise_level} expertise in {domain}.
## Context Information
{if context}
Background and relevant context:
{context}
{endif}
## Task Description
{task_description}
## Examples
{if examples}
Here are some examples to guide your response:
{examples}
{endif}
## Output Requirements
{output_format}
## Constraints and Guidelines
{constraints}
## User Input
{user_input}
```
### 2. Conditional Sections Framework
```python
def process_conditional_template(template, variables):
"""
Process template with conditional sections.
"""
# Process if/endif blocks
while '{if ' in template:
start = template.find('{if ')
end_condition = template.find('}', start)
condition = template[start+4:end_condition].strip()
start_endif = template.find('{endif}', end_condition)
if_content = template[end_condition+1:start_endif].strip()
# Evaluate condition
if evaluate_condition(condition, variables):
            template = template[:start] + if_content + template[start_endif + 7:]   # skip the full '{endif}' marker (7 characters)
        else:
            template = template[:start] + template[start_endif + 7:]
# Replace variables
for key, value in variables.items():
template = template.replace(f'{{{key}}}', str(value))
return template
```
### 3. Variable Interpolation System
```python
class TemplateEngine:
def __init__(self):
self.variables = {}
self.functions = {
'upper': str.upper,
'lower': str.lower,
'capitalize': str.capitalize,
'pluralize': self.pluralize,
'format_date': self.format_date,
'truncate': self.truncate
}
def set_variable(self, name, value):
"""Set a template variable."""
self.variables[name] = value
def render(self, template):
"""Render template with variable substitution."""
# Process function calls {variable|function}
template = self.process_functions(template)
# Replace variables
for key, value in self.variables.items():
template = template.replace(f'{{{key}}}', str(value))
return template
def process_functions(self, template):
"""Process template functions."""
import re
pattern = r'\{(\w+)\|(\w+)\}'
def replace_function(match):
var_name, func_name = match.groups()
value = self.variables.get(var_name, '')
if func_name in self.functions:
return self.functions[func_name](str(value))
return value
        return re.sub(pattern, replace_function, template)

    def pluralize(self, value):
        """Naive pluralization, included so the functions table above is usable."""
        return value if value.endswith("s") else value + "s"

    def format_date(self, value):
        """Normalize an ISO date string; pass the value through if it does not parse."""
        from datetime import datetime
        try:
            return datetime.fromisoformat(value).strftime("%Y-%m-%d")
        except ValueError:
            return value

    def truncate(self, value, length=80):
        """Trim long values so they stay within prompt budgets."""
        return value if len(value) <= length else value[:length - 3] + "..."
```
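A short usage example of the engine above (with the helper methods defined in the class):

```python
engine = TemplateEngine()
engine.set_variable("role", "data analyst")
engine.set_variable("item", "insight")
print(engine.render("You are a {role|capitalize}. Provide three {item|pluralize}."))
# -> You are a Data analyst. Provide three insights.
```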
## Specialized Template Types
### 1. Classification Template
```
# Template: Multi-Class Classification
# Purpose: Classify inputs into predefined categories
# Required Variables: {input_text}, {categories}, {role}
## Classification Framework
You are a {role} specializing in accurate text classification.
## Classification Categories
{categories}
## Classification Process
1. Analyze the input text carefully
2. Identify key indicators and features
3. Match against category definitions
4. Select the most appropriate category
5. Provide confidence score
## Input to Classify
{input_text}
## Output Format
```json
{{
"category": "selected_category",
"confidence": 0.95,
"reasoning": "Brief explanation of classification logic",
"key_indicators": ["indicator1", "indicator2"]
}}
```
```
### 2. Transformation Template
```
# Template: Text Transformation
# Purpose: Transform text from one format/style to another
# Required Variables: {source_text}, {target_format}, {transformation_rules}
## Transformation Task
Transform the given {source_format} text into {target_format} following these rules:
{transformation_rules}
## Source Text
{source_text}
## Transformation Process
1. Analyze the structure and content of the source text
2. Apply the specified transformation rules
3. Maintain the core meaning and intent
4. Ensure proper {target_format} formatting
5. Verify completeness and accuracy
## Transformed Output
```
### 3. Generation Template
```
# Template: Creative Generation
# Purpose: Generate creative content based on specifications
# Required Variables: {content_type}, {specifications}, {style_guidelines}
## Creative Generation Task
Generate {content_type} that meets the following specifications:
## Content Specifications
{specifications}
## Style Guidelines
{style_guidelines}
## Quality Requirements
- Originality and creativity
- Adherence to specifications
- Appropriate tone and style
- Clear structure and coherence
- Audience-appropriate language
## Generated Content
```
### 4. Analysis Template
```
# Template: Comprehensive Analysis
# Purpose: Perform detailed analysis of given input
# Required Variables: {input_data}, {analysis_framework}, {focus_areas}
## Analysis Framework
You are an expert analyst with deep expertise in {domain}.
## Analysis Scope
Focus on these key areas:
{focus_areas}
## Analysis Methodology
{analysis_framework}
## Input Data for Analysis
{input_data}
## Analysis Process
1. Initial assessment and context understanding
2. Detailed examination of each focus area
3. Pattern and trend identification
4. Comparative analysis with benchmarks
5. Insight generation and recommendation formulation
## Analysis Output Structure
```yaml
executive_summary:
key_findings: []
overall_assessment: ""
detailed_analysis:
{focus_area_1}:
observations: []
patterns: []
insights: []
{focus_area_2}:
observations: []
patterns: []
insights: []
recommendations:
immediate: []
short_term: []
long_term: []
```
## Advanced Template Patterns
### 1. Hierarchical Template Composition
```python
class HierarchicalTemplate:
def __init__(self, name, content, parent=None):
self.name = name
self.content = content
self.parent = parent
self.children = []
self.variables = {}
def add_child(self, child_template):
"""Add a child template."""
child_template.parent = self
self.children.append(child_template)
def render(self, variables=None):
"""Render template with inherited variables."""
# Combine variables from parent hierarchy
combined_vars = {}
# Collect variables from parents
current = self.parent
while current:
combined_vars.update(current.variables)
current = current.parent
# Add current variables
combined_vars.update(self.variables)
# Override with provided variables
if variables:
combined_vars.update(variables)
# Render content
rendered_content = self.render_content(self.content, combined_vars)
# Render children
for child in self.children:
child_rendered = child.render(combined_vars)
rendered_content = rendered_content.replace(
f'{{child:{child.name}}}', child_rendered
)
return rendered_content
```
### 2. Role-Based Template System
```python
class RoleBasedTemplate:
def __init__(self):
self.roles = {
'analyst': {
'persona': 'You are a professional analyst with expertise in data interpretation and pattern recognition.',
'approach': 'systematic',
'output_style': 'detailed and evidence-based',
'verification': 'Always cross-check findings and cite sources'
},
'creative_writer': {
'persona': 'You are a creative writer with a talent for engaging storytelling and vivid descriptions.',
'approach': 'imaginative',
'output_style': 'descriptive and engaging',
'verification': 'Ensure narrative consistency and flow'
},
'technical_expert': {
'persona': 'You are a technical expert with deep knowledge of {domain} and practical implementation experience.',
'approach': 'methodical',
'output_style': 'precise and technical',
'verification': 'Include technical accuracy and best practices'
}
}
def create_prompt(self, role, task, domain=None):
"""Create role-specific prompt template."""
role_config = self.roles.get(role, self.roles['analyst'])
template = f"""
## Role Definition
{role_config['persona']}
## Approach
Use a {role_config['approach']} approach to this task.
## Task
{task}
## Output Style
{role_config['output_style']}
## Verification
{role_config['verification']}
"""
if domain and '{domain}' in role_config['persona']:
template = template.replace('{domain}', domain)
return template
```
### 3. Dynamic Template Selection
```python
class DynamicTemplateSelector:
def __init__(self):
self.templates = {}
self.selection_rules = {}
def register_template(self, name, template, selection_criteria):
"""Register a template with selection criteria."""
self.templates[name] = template
self.selection_rules[name] = selection_criteria
def select_template(self, task_characteristics):
"""Select the most appropriate template based on task characteristics."""
best_template = None
best_score = 0
for name, criteria in self.selection_rules.items():
score = self.calculate_match_score(task_characteristics, criteria)
if score > best_score:
best_score = score
best_template = name
return self.templates[best_template] if best_template else None
def calculate_match_score(self, task_characteristics, criteria):
"""Calculate how well task matches template criteria."""
score = 0
total_weight = 0
for characteristic, weight in criteria.items():
if characteristic in task_characteristics:
if task_characteristics[characteristic] == weight['value']:
score += weight['weight']
total_weight += weight['weight']
return score / total_weight if total_weight > 0 else 0
```
## Template Implementation Examples
### Example 1: Customer Service Template
```python
customer_service_template = """
# Customer Service Response Template
## Role Definition
You are a {customer_service_role} with {experience_level} of customer service experience in {industry}.
## Context
{if customer_history}
Customer History:
{customer_history}
{endif}
{if issue_context}
Issue Context:
{issue_context}
{endif}
## Response Guidelines
- Maintain {tone} tone throughout
- Address all aspects of the customer's inquiry
- Provide {level_of_detail} explanation
- Include {additional_elements}
- Follow company {communication_style} style
## Customer Inquiry
{customer_inquiry}
## Response Structure
1. Greeting and acknowledgment
2. Understanding and empathy
3. Solution or explanation
4. Additional assistance offered
5. Professional closing
## Response
"""
```
### Example 2: Technical Documentation Template
```python
documentation_template = """
# Technical Documentation Generator
## Role Definition
You are a {technical_writer_role} specializing in {technology} documentation with {experience_level} of experience.
## Documentation Standards
- Target audience: {audience_level}
- Technical depth: {technical_depth}
- Include examples: {include_examples}
- Add troubleshooting: {add_troubleshooting}
- Version: {version}
## Content to Document
{content_to_document}
## Documentation Structure
```markdown
# {title}
## Overview
{overview}
## Prerequisites
{prerequisites}
## {main_sections}
## Examples
{if include_examples}
{examples}
{endif}
## Troubleshooting
{if add_troubleshooting}
{troubleshooting}
{endif}
## Additional Resources
{additional_resources}
```
## Generated Documentation
"""
```
## Template Management System
### Version Control Integration
```python
class TemplateVersionManager:
def __init__(self):
self.versions = {}
self.current_versions = {}
def create_version(self, template_name, template_content, author, description):
"""Create a new version of a template."""
import datetime
import hashlib
version_id = hashlib.md5(template_content.encode()).hexdigest()[:8]
timestamp = datetime.datetime.now().isoformat()
version_info = {
'version_id': version_id,
'content': template_content,
'author': author,
'description': description,
'timestamp': timestamp,
'parent_version': self.current_versions.get(template_name)
}
if template_name not in self.versions:
self.versions[template_name] = []
self.versions[template_name].append(version_info)
self.current_versions[template_name] = version_id
return version_id
def rollback(self, template_name, version_id):
"""Rollback to a specific version."""
if template_name in self.versions:
for version in self.versions[template_name]:
if version['version_id'] == version_id:
self.current_versions[template_name] = version_id
return version['content']
return None
```
### Performance Monitoring
```python
class TemplatePerformanceMonitor:
def __init__(self):
self.usage_stats = {}
self.performance_metrics = {}
def track_usage(self, template_name, execution_time, success):
"""Track template usage and performance."""
if template_name not in self.usage_stats:
self.usage_stats[template_name] = {
'usage_count': 0,
'total_time': 0,
'success_count': 0,
'failure_count': 0
}
stats = self.usage_stats[template_name]
stats['usage_count'] += 1
stats['total_time'] += execution_time
if success:
stats['success_count'] += 1
else:
stats['failure_count'] += 1
def get_performance_report(self, template_name):
"""Generate performance report for a template."""
if template_name not in self.usage_stats:
return None
stats = self.usage_stats[template_name]
avg_time = stats['total_time'] / stats['usage_count']
success_rate = stats['success_count'] / stats['usage_count']
return {
'template_name': template_name,
'total_usage': stats['usage_count'],
'average_execution_time': avg_time,
'success_rate': success_rate,
'failure_rate': 1 - success_rate
}
```
## Best Practices
### Template Quality Guidelines
- **Clear Documentation**: Include purpose, variables, and usage examples
- **Consistent Naming**: Use standardized variable naming conventions
- **Error Handling**: Include fallback mechanisms for missing variables
- **Performance Optimization**: Minimize template complexity and rendering time
- **Testing**: Implement comprehensive template testing frameworks
### Security Considerations
- **Input Validation**: Sanitize all template variables (a sketch follows this list)
- **Injection Prevention**: Prevent code injection in template rendering
- **Access Control**: Implement proper authorization for template modifications
- **Audit Trail**: Track template changes and usage
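A minimal sketch of that validation step for the simple `{variable}` interpolation used throughout this reference: it caps length, drops control characters, and replaces braces so user-supplied values cannot introduce new template directives. The length cap is an assumed value to tune per application:

```python
import re

MAX_VALUE_LENGTH = 4000   # assumed cap; tune per application

def sanitize_template_variable(value) -> str:
    """Make a user-supplied value safe to interpolate into a prompt template."""
    value = str(value)[:MAX_VALUE_LENGTH]
    value = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", value)   # strip control characters
    value = value.replace("{", "(").replace("}", ")")            # neutralize template braces
    return value

def sanitize_variables(variables: dict) -> dict:
    return {name: sanitize_template_variable(val) for name, val in variables.items()}
```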
This comprehensive template system architecture provides the foundation for building scalable, maintainable prompt templates that can be efficiently managed and optimized across diverse use cases.

286
skills/ai/rag/SKILL.md Normal file
View File

@@ -0,0 +1,286 @@
---
name: rag-implementation
description: Build Retrieval-Augmented Generation (RAG) systems for AI applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
allowed-tools: Read, Write, Bash
category: ai-engineering
tags: [rag, vector-databases, embeddings, retrieval, semantic-search]
version: 1.0.0
---
# RAG Implementation
Build Retrieval-Augmented Generation systems that extend AI capabilities with external knowledge sources.
## Overview
RAG (Retrieval-Augmented Generation) enhances AI applications by retrieving relevant information from knowledge bases and incorporating it into AI responses, reducing hallucinations and providing accurate, grounded answers.
## When to Use
Use this skill when:
- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling AI systems to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
- Developing knowledge management systems
## Core Components
### Vector Databases
Store and efficiently retrieve document embeddings for semantic search.
**Key Options:**
- **Pinecone**: Managed, scalable, production-ready
- **Weaviate**: Open-source, hybrid search capabilities
- **Milvus**: High performance, on-premise deployment
- **Chroma**: Lightweight, easy local development
- **Qdrant**: Fast, advanced filtering
- **FAISS**: Meta's library, full control
### Embedding Models
Convert text to numerical vectors for similarity search.
**Popular Models:**
- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dimensions
- **all-MiniLM-L6-v2**: Fast, lightweight, 384 dimensions
- **e5-large-v2**: High quality; multilingual variants available (multilingual-e5-large)
- **bge-large-en-v1.5**: State-of-the-art performance
### Retrieval Strategies
Find relevant content based on user queries.
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse for best results
- **Multi-Query**: Generate multiple query variations
- **Contextual Compression**: Extract only relevant parts
## Quick Implementation
### Basic RAG Setup
```java
// Load documents from file system
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
// Create embedding store
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
// Ingest documents into the store
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
// Create AI service with RAG capability
Assistant assistant = AiServices.builder(Assistant.class)
.chatModel(chatModel)
.chatMemory(MessageWindowChatMemory.withMaxMessages(10))
.contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
.build();
```
### Document Processing Pipeline
```java
// Split documents into chunks
DocumentSplitter splitter = DocumentSplitters.recursive(
500, // max segment size in characters
100 // overlap in characters
);
// Create embedding model
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey("your-api-key")
.build();
// Create embedding store
EmbeddingStore<TextSegment> embeddingStore = PgVectorEmbeddingStore.builder()
.host("localhost")
.port(5432)
.database("postgres")
.user("postgres")
.password("password")
.table("embeddings")
.dimension(1536)
.build();
// Process and store documents
for (Document document : documents) {
List<TextSegment> segments = splitter.split(document);
for (TextSegment segment : segments) {
Embedding embedding = embeddingModel.embed(segment).content();
embeddingStore.add(embedding, segment);
}
}
```
## Implementation Patterns
### Pattern 1: Simple Document Q&A
Create a basic Q&A system over your documents.
```java
public interface DocumentAssistant {
String answer(String question);
}
DocumentAssistant assistant = AiServices.builder(DocumentAssistant.class)
.chatModel(chatModel)
.contentRetriever(retriever)
.build();
```
### Pattern 2: Metadata-Filtered Retrieval
Filter results based on document metadata.
```java
// Add metadata during document loading
Document document = Document.builder()
.text("Content here")
.metadata("source", "technical-manual.pdf")
.metadata("category", "technical")
.metadata("date", "2024-01-15")
.build();
// Filter during retrieval
EmbeddingStoreContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
.embeddingStore(embeddingStore)
.embeddingModel(embeddingModel)
.maxResults(5)
.minScore(0.7)
.filter(metadataKey("category").isEqualTo("technical"))
.build();
```
### Pattern 3: Multi-Source Retrieval
Combine results from multiple knowledge sources.
```java
ContentRetriever webRetriever = EmbeddingStoreContentRetriever.from(webStore);
ContentRetriever documentRetriever = EmbeddingStoreContentRetriever.from(documentStore);
ContentRetriever databaseRetriever = EmbeddingStoreContentRetriever.from(databaseStore);
// Combine results
List<Content> allResults = new ArrayList<>();
allResults.addAll(webRetriever.retrieve(query));
allResults.addAll(documentRetriever.retrieve(query));
allResults.addAll(databaseRetriever.retrieve(query));
// Rerank combined results
List<Content> rerankedResults = reranker.reorder(query, allResults);
```
## Best Practices
### Document Preparation
- Clean and preprocess documents before ingestion
- Remove irrelevant content and formatting artifacts
- Standardize document structure for consistent processing
- Add relevant metadata for filtering and context
### Chunking Strategy
- Use 500-1000 tokens per chunk for optimal balance
- Include 10-20% overlap to preserve context at boundaries
- Consider document structure when determining chunk boundaries
- Test different chunk sizes for your specific use case
### Retrieval Optimization
- Start with high k values (10-20) then filter/rerank
- Use metadata filtering to improve relevance
- Combine multiple retrieval strategies for better coverage
- Monitor retrieval quality and user feedback
### Performance Considerations
- Cache embeddings for frequently accessed content
- Use batch processing for document ingestion
- Optimize vector store configuration for your scale
- Monitor query performance and system resources
## Common Issues and Solutions
### Poor Retrieval Quality
**Problem**: Retrieved documents don't match user queries
**Solutions**:
- Improve document preprocessing and cleaning
- Adjust chunk size and overlap parameters
- Try different embedding models
- Use hybrid search combining semantic and keyword matching
### Irrelevant Results
**Problem**: Retrieved documents are topically related but not specific enough to answer the query
**Solutions**:
- Add metadata filtering for domain-specific constraints
- Implement reranking with cross-encoder models
- Use contextual compression to extract relevant parts
- Fine-tune retrieval parameters (k values, similarity thresholds)
### Performance Issues
**Problem**: Slow response times during retrieval
**Solutions**:
- Optimize vector store configuration and indexing
- Implement caching for frequently retrieved content
- Use smaller embedding models for faster inference
- Consider approximate nearest neighbor algorithms
### Hallucination Prevention
**Problem**: AI generates information not present in retrieved documents
**Solutions**:
- Improve prompt engineering to emphasize grounding
- Add verification steps to check answer alignment (see the sketch below)
- Include confidence scoring for responses
- Implement fact-checking mechanisms
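One way to implement the verification step referenced above is to flag answer sentences that have no close match in the retrieved context. The sketch below is shown in Python for brevity; the model name, sentence splitting, and threshold are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def flag_ungrounded_sentences(answer: str, retrieved_chunks: list[str], threshold: float = 0.6):
    """Return answer sentences whose best similarity to any retrieved chunk falls below the threshold."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    chunk_embeddings = model.encode(retrieved_chunks, convert_to_tensor=True)
    flagged = []
    for sentence in sentences:
        sentence_embedding = model.encode(sentence, convert_to_tensor=True)
        best_score = util.cos_sim(sentence_embedding, chunk_embeddings).max().item()
        if best_score < threshold:
            flagged.append((sentence, round(best_score, 3)))
    return flagged
```

Flagged sentences can be dropped, rewritten, or surfaced to the user with a lower confidence score.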
## Evaluation Framework
### Retrieval Metrics
- **Precision@k**: Percentage of relevant documents in top-k results
- **Recall@k**: Percentage of all relevant documents found in top-k results
- **Mean Reciprocal Rank (MRR)**: Average rank of first relevant result
- **Normalized Discounted Cumulative Gain (nDCG)**: Ranking quality metric
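The first three metrics above reduce to a few lines once each query is represented as a list of retrieved document IDs and a set of relevant IDs; a minimal sketch (function names are illustrative):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / len(relevant_ids)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved; average this over queries for MRR."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


print(precision_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))  # 0.33...
print(recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))     # 0.5
print(reciprocal_rank(["d3", "d1", "d7"], {"d1", "d2"}))      # 0.5
```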
### Answer Quality Metrics
- **Faithfulness**: Degree to which answers are grounded in retrieved documents
- **Answer Relevance**: How well answers address user questions
- **Context Recall**: Percentage of relevant context used in answers
- **Context Precision**: Percentage of retrieved context that is relevant
### User Experience Metrics
- **Response Time**: Time from query to answer
- **User Satisfaction**: Feedback ratings on answer quality
- **Task Completion**: Rate of successful task completion
- **Engagement**: User interaction patterns with the system
## Resources
### Reference Documentation
- [Vector Database Comparison](references/vector-databases.md) - Detailed comparison of vector database options
- [Embedding Models Guide](references/embedding-models.md) - Model selection and optimization
- [Retrieval Strategies](references/retrieval-strategies.md) - Advanced retrieval techniques
- [Document Chunking](references/document-chunking.md) - Chunking strategies and best practices
- [LangChain4j RAG Guide](references/langchain4j-rag-guide.md) - Official implementation patterns
### Assets
- `assets/vector-store-config.yaml` - Configuration templates for different vector stores
- `assets/retriever-pipeline.java` - Complete RAG pipeline implementation
- `assets/evaluation-metrics.java` - Evaluation framework code
## Constraints and Limitations
1. **Token Limits**: Respect model context window limitations
2. **API Rate Limits**: Manage external API rate limits and costs
3. **Data Privacy**: Ensure compliance with data protection regulations
4. **Resource Requirements**: Consider memory and computational requirements
5. **Maintenance**: Plan for regular updates and system monitoring
## Security Considerations
- Secure access to vector databases and embedding services
- Implement proper authentication and authorization
- Validate and sanitize user inputs
- Monitor for abuse and unusual usage patterns
- Regular security audits and penetration testing

View File

@@ -0,0 +1,307 @@
package com.example.rag;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import dev.langchain4j.store.embedding.chroma.ChromaEmbeddingStore;
import dev.langchain4j.store.embedding.qdrant.QdrantEmbeddingStore;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.filter.Filter;
import dev.langchain4j.store.embedding.filter.MetadataFilterBuilder;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
/**
* Complete RAG Pipeline Implementation
*
* This class provides a comprehensive implementation of a RAG (Retrieval-Augmented Generation)
* system with support for multiple vector stores and advanced retrieval strategies.
*/
public class RAGPipeline {
private final EmbeddingModel embeddingModel;
private final EmbeddingStore<TextSegment> embeddingStore;
private final DocumentSplitter documentSplitter;
private final RAGConfig config;
/**
* Configuration class for RAG pipeline
*/
public static class RAGConfig {
private String vectorStoreType = "chroma";
private String openAiApiKey;
private String pineconeApiKey;
private String pineconeEnvironment;
private String pineconeIndex = "rag-documents";
private String chromaCollection = "rag-documents";
private String chromaPersistPath = "./chroma_db";
private String qdrantHost = "localhost";
private int qdrantPort = 6333;
private String qdrantCollection = "rag-documents";
private int chunkSize = 1000;
private int chunkOverlap = 200;
private int embeddingDimension = 1536;
// Getters and setters
public String getVectorStoreType() { return vectorStoreType; }
public void setVectorStoreType(String vectorStoreType) { this.vectorStoreType = vectorStoreType; }
public String getOpenAiApiKey() { return openAiApiKey; }
public void setOpenAiApiKey(String openAiApiKey) { this.openAiApiKey = openAiApiKey; }
public String getPineconeApiKey() { return pineconeApiKey; }
public void setPineconeApiKey(String pineconeApiKey) { this.pineconeApiKey = pineconeApiKey; }
public String getPineconeEnvironment() { return pineconeEnvironment; }
public void setPineconeEnvironment(String pineconeEnvironment) { this.pineconeEnvironment = pineconeEnvironment; }
public String getPineconeIndex() { return pineconeIndex; }
public void setPineconeIndex(String pineconeIndex) { this.pineconeIndex = pineconeIndex; }
public String getChromaCollection() { return chromaCollection; }
public void setChromaCollection(String chromaCollection) { this.chromaCollection = chromaCollection; }
public String getChromaPersistPath() { return chromaPersistPath; }
public void setChromaPersistPath(String chromaPersistPath) { this.chromaPersistPath = chromaPersistPath; }
public String getQdrantHost() { return qdrantHost; }
public void setQdrantHost(String qdrantHost) { this.qdrantHost = qdrantHost; }
public int getQdrantPort() { return qdrantPort; }
public void setQdrantPort(int qdrantPort) { this.qdrantPort = qdrantPort; }
public String getQdrantCollection() { return qdrantCollection; }
public void setQdrantCollection(String qdrantCollection) { this.qdrantCollection = qdrantCollection; }
public int getChunkSize() { return chunkSize; }
public void setChunkSize(int chunkSize) { this.chunkSize = chunkSize; }
public int getChunkOverlap() { return chunkOverlap; }
public void setChunkOverlap(int chunkOverlap) { this.chunkOverlap = chunkOverlap; }
public int getEmbeddingDimension() { return embeddingDimension; }
public void setEmbeddingDimension(int embeddingDimension) { this.embeddingDimension = embeddingDimension; }
}
/**
* Constructor
*/
public RAGPipeline(RAGConfig config) {
this.config = config;
this.embeddingModel = createEmbeddingModel();
this.embeddingStore = createEmbeddingStore();
this.documentSplitter = createDocumentSplitter();
}
/**
* Create embedding model based on configuration
*/
private EmbeddingModel createEmbeddingModel() {
return OpenAiEmbeddingModel.builder()
.apiKey(config.getOpenAiApiKey())
.modelName("text-embedding-ada-002")
.build();
}
/**
* Create embedding store based on configuration
*/
private EmbeddingStore<TextSegment> createEmbeddingStore() {
switch (config.getVectorStoreType().toLowerCase()) {
case "pinecone":
return PineconeEmbeddingStore.builder()
.apiKey(config.getPineconeApiKey())
.environment(config.getPineconeEnvironment())
.index(config.getPineconeIndex())
.dimension(config.getEmbeddingDimension())
.build();
case "chroma":
return ChromaEmbeddingStore.builder()
.collectionName(config.getChromaCollection())
.persistDirectory(config.getChromaPersistPath())
.build();
case "qdrant":
return QdrantEmbeddingStore.builder()
.host(config.getQdrantHost())
.port(config.getQdrantPort())
.collectionName(config.getQdrantCollection())
.dimension(config.getEmbeddingDimension())
.build();
case "memory":
default:
return new InMemoryEmbeddingStore<>();
}
}
/**
* Create document splitter
*/
private DocumentSplitter createDocumentSplitter() {
return DocumentSplitters.recursive(
config.getChunkSize(),
config.getChunkOverlap()
);
}
/**
* Load documents from directory
*/
public List<Document> loadDocuments(String directoryPath) {
try {
Path directory = Paths.get(directoryPath);
List<Document> documents = FileSystemDocumentLoader.loadDocuments(directory);
// Add metadata to documents (mutate each document's Metadata in place;
// reassigning the loop variable would not update the returned list)
for (Document document : documents) {
document.metadata().put("loaded_at", System.currentTimeMillis());
document.metadata().put("source_directory", directoryPath);
}
return documents;
} catch (Exception e) {
throw new RuntimeException("Failed to load documents from " + directoryPath, e);
}
}
/**
* Process and ingest documents
*/
public void ingestDocuments(List<Document> documents) {
// Split documents into segments
List<TextSegment> segments = documentSplitter.splitAll(documents);
// Add additional metadata to each segment
for (int i = 0; i < segments.size(); i++) {
TextSegment segment = segments.get(i);
segment.metadata().put("segment_index", i);
segment.metadata().put("total_segments", segments.size());
segment.metadata().put("processed_at", System.currentTimeMillis());
}
// Embed the segments and store them together with their text
List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
embeddingStore.addAll(embeddings, segments);
System.out.println("Ingested " + documents.size() + " documents into " +
segments.size() + " segments");
}
/**
* Search documents with optional filtering
*/
public List<TextSegment> search(String query, int maxResults, Filter filter) {
Embedding queryEmbedding = embeddingModel.embed(query).content();
EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(maxResults)
.filter(filter)
.build();
return embeddingStore.search(request).matches().stream()
.map(EmbeddingMatch::embedded)
.collect(Collectors.toList());
}
/**
* Search documents with metadata filtering
*/
public List<TextSegment> searchWithMetadataFilter(String query, int maxResults,
Map<String, Object> metadataFilters) {
Filter filter = null;
if (metadataFilters != null && !metadataFilters.isEmpty()) {
for (Map.Entry<String, Object> entry : metadataFilters.entrySet()) {
String key = entry.getKey();
Object value = entry.getValue();
Filter clause;
if (value instanceof Number) {
clause = MetadataFilterBuilder.metadataKey(key).isEqualTo(((Number) value).doubleValue());
} else {
clause = MetadataFilterBuilder.metadataKey(key).isEqualTo(String.valueOf(value));
}
// Combine multiple conditions with logical AND
filter = (filter == null) ? clause : filter.and(clause);
}
}
return search(query, maxResults, filter);
}
/**
* Get statistics about the stored documents
*/
public RAGStatistics getStatistics() {
// This is a simplified implementation
// In practice, you might want to track more detailed statistics
return new RAGStatistics(
embeddingStore.getClass().getSimpleName(),
config.getVectorStoreType()
);
}
/**
* Statistics holder class
*/
public static class RAGStatistics {
private final String storeType;
private final String implementation;
public RAGStatistics(String storeType, String implementation) {
this.storeType = storeType;
this.implementation = implementation;
}
public String getStoreType() { return storeType; }
public String getImplementation() { return implementation; }
@Override
public String toString() {
return "RAGStatistics{" +
"storeType='" + storeType + '\'' +
", implementation='" + implementation + '\'' +
'}';
}
}
/**
* Example usage
*/
public static void main(String[] args) {
// Configure the pipeline
RAGConfig config = new RAGConfig();
config.setVectorStoreType("chroma"); // or "pinecone", "qdrant", "memory"
config.setOpenAiApiKey("your-openai-api-key");
config.setChunkSize(1000);
config.setChunkOverlap(200);
// Create pipeline
RAGPipeline pipeline = new RAGPipeline(config);
// Load documents
List<Document> documents = pipeline.loadDocuments("./documents");
// Ingest documents
pipeline.ingestDocuments(documents);
// Search for relevant content
List<TextSegment> results = pipeline.search("What is machine learning?", 5, null);
// Print results
for (int i = 0; i < results.size(); i++) {
TextSegment segment = results.get(i);
System.out.println("Result " + (i + 1) + ":");
System.out.println("Content: " + segment.text().substring(0, Math.min(200, segment.text().length())) + "...");
System.out.println("Metadata: " + segment.metadata());
System.out.println();
}
// Print statistics
System.out.println("Pipeline Statistics: " + pipeline.getStatistics());
}
}

View File

@@ -0,0 +1,127 @@
# Vector Store Configuration Templates
# This file contains configuration templates for different vector databases
# Chroma (Local/Development)
chroma:
type: chroma
settings:
persist_directory: "./chroma_db"
collection_name: "rag_documents"
host: "localhost"
port: 8000
# Recommended for: Development, small-scale applications
# Pros: Easy setup, local deployment, free
# Cons: Limited scalability, single-node only
# Pinecone (Cloud/Production)
pinecone:
type: pinecone
settings:
api_key: "${PINECONE_API_KEY}"
environment: "us-west1-gcp"
index_name: "rag-documents"
dimension: 1536
metric: "cosine"
pods: 1
pod_type: "p1.x1"
# Recommended for: Production applications, large-scale
# Pros: Managed service, scalable, fast
# Cons: Cost, requires internet connection
# Weaviate (Open-source/Cloud)
weaviate:
type: weaviate
settings:
url: "http://localhost:8080"
api_key: "${WEAVIATE_API_KEY}"
class_name: "Document"
text_key: "content"
vectorizer: "text2vec-openai"
module_config:
text2vec-openai:
model: "ada"
modelVersion: "002"
type: "text"
baseUrl: "https://api.openai.com/v1"
# Recommended for: Hybrid search, GraphQL API
# Pros: Open-source, hybrid search, flexible
# Cons: More complex setup
# Qdrant (Performance-focused)
qdrant:
type: qdrant
settings:
host: "localhost"
port: 6333
collection_name: "rag_documents"
vector_size: 1536
distance: "Cosine"
api_key: "${QDRANT_API_KEY}"
# Recommended for: Performance, advanced filtering
# Pros: Fast, good filtering, open-source
# Cons: Newer project, smaller community
# Milvus (Enterprise/Scale)
milvus:
type: milvus
settings:
host: "localhost"
port: 19530
collection_name: "rag_documents"
dimension: 1536
index_type: "IVF_FLAT"
metric_type: "COSINE"
nlist: 1024
# Recommended for: Enterprise, large-scale deployments
# Pros: High performance, distributed
# Cons: Complex setup, resource intensive
# FAISS (Local/Research)
faiss:
type: faiss
settings:
index_type: "IndexFlatL2"
dimension: 1536
save_path: "./faiss_index"
# Recommended for: Research, local processing
# Pros: Fast, in-process, no external service required
# Cons: Library only (no server); persistence and metadata filtering must be handled by the application
# Common Configuration Parameters
common:
chunking:
chunk_size: 1000
chunk_overlap: 200
separators: ["\n\n", "\n", " ", ""]
embedding:
model: "text-embedding-ada-002"
batch_size: 100
max_retries: 3
timeout: 30
retrieval:
default_k: 5
similarity_threshold: 0.7
max_results: 20
performance:
cache_embeddings: true
cache_size: 1000
parallel_processing: true
batch_size: 50
# Environment Variables Template
# Copy this to .env file and fill in your values
environment:
OPENAI_API_KEY: "your-openai-api-key-here"
PINECONE_API_KEY: "your-pinecone-api-key-here"
PINECONE_ENVIRONMENT: "us-west1-gcp"
WEAVIATE_API_KEY: "your-weaviate-api-key-here"
QDRANT_API_KEY: "your-qdrant-api-key-here"

View File

@@ -0,0 +1,137 @@
# Document Chunking Strategies
## Overview
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
## Chunking Strategies
### 1. Recursive Character Text Splitter
**Method**: Split text based on character count, trying separators in order
**Use Case**: General purpose text splitting
**Advantages**: Preserves sentence and paragraph boundaries when possible
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""] # Try these in order
)
chunks = splitter.split_documents(documents)
```
### 2. Token-Based Splitting
**Method**: Split based on token count rather than characters
**Use Case**: When working with token limits of language models
**Advantages**: Better control over context window usage
```python
from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=512,
chunk_overlap=50
)
chunks = splitter.split_documents(documents)
```
### 3. Semantic Chunking
**Method**: Split based on semantic similarity
**Use Case**: When maintaining semantic coherence is important
**Advantages**: Chunks are more semantically meaningful
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile"
)
chunks = splitter.split_documents(documents)
```
### 4. Markdown Header Splitter
**Method**: Split based on markdown headers
**Use Case**: Structured documents with clear hierarchical organization
**Advantages**: Maintains document structure and context
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)  # takes the raw markdown string, not Document objects
```
### 5. HTML Splitter
**Method**: Split based on HTML tags
**Use Case**: Web pages and HTML documents
**Advantages**: Preserves HTML structure and metadata
```python
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_string)  # or split_text_from_url(url) for live pages
```
## Parameter Tuning
### Chunk Size
- **Small chunks (200-400 tokens)**: More precise retrieval, but may lose context
- **Medium chunks (500-1000 tokens)**: Good balance of precision and context
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval
### Chunk Overlap
- **Purpose**: Preserve context at chunk boundaries
- **Typical range**: 10-20% of chunk size
- **Higher overlap**: Better context preservation, but more redundancy
- **Lower overlap**: Less redundancy, but may lose important context
### Separators
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then smaller (sentences)
- **Custom separators**: Add domain-specific separators for better results
- **Language-specific**: Adjust for different languages and writing styles
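A small parameter sweep makes the trade-offs above concrete before committing to a configuration; the sizes and overlap ratio below are only example values:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def sweep_chunk_parameters(documents, sizes=(400, 800, 1200), overlap_ratio=0.15):
    """Report chunk count and average chunk length for several chunk-size settings."""
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=int(size * overlap_ratio),
        )
        chunks = splitter.split_documents(documents)
        avg_len = sum(len(c.page_content) for c in chunks) / max(len(chunks), 1)
        print(f"chunk_size={size}: {len(chunks)} chunks, avg {avg_len:.0f} characters")
```

Pair the counts with retrieval-quality spot checks on representative queries rather than choosing on chunk statistics alone.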
## Best Practices
1. **Preserve Context**: Ensure chunks contain enough surrounding context
2. **Maintain Coherence**: Keep semantically related content together
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
4. **Consider Query Types**: Adapt chunking strategy to typical user queries
5. **Test and Iterate**: Evaluate different chunking strategies for your specific use case
## Evaluation Metrics
1. **Retrieval Quality**: How well chunks answer user queries
2. **Context Preservation**: Whether important context is maintained
3. **Chunk Distribution**: Evenness of chunk sizes
4. **Boundary Quality**: How natural chunk boundaries are
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
## Advanced Techniques
### Adaptive Chunking
Adjust chunk size based on document structure and content density (see the sketch at the end of this section).
### Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios.
### Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns.
### Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).
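A rough sketch of the adaptive idea; the density heuristic and thresholds are illustrative assumptions rather than an established recipe:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def adaptive_chunk(document, dense_size=400, sparse_size=1200, density_threshold=0.05):
    """Pick a smaller chunk size for dense, fact-packed text and a larger one for sparse prose."""
    text = document.page_content
    # Heuristic: digits and structural punctuation as a proxy for tables, code, and dense facts
    density = sum(ch.isdigit() or ch in ",;:()[]{}" for ch in text) / max(len(text), 1)
    size = dense_size if density > density_threshold else sparse_size
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=int(size * 0.15))
    return splitter.split_documents([document])
```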

View File

@@ -0,0 +1,88 @@
# Embedding Models Guide
## Overview
Embedding models convert text into numerical vectors that capture semantic meaning for similarity search in RAG systems.
## Popular Embedding Models
### 1. text-embedding-ada-002 (OpenAI)
- **Dimensions**: 1536
- **Type**: General purpose
- **Use Case**: Most applications requiring high quality embeddings
- **Performance**: Excellent balance of quality and speed
### 2. all-MiniLM-L6-v2 (Sentence Transformers)
- **Dimensions**: 384
- **Type**: Lightweight
- **Use Case**: Applications requiring fast inference
- **Performance**: Good quality, very fast
### 3. e5-large-v2
- **Dimensions**: 1024
- **Type**: High quality
- **Use Case**: Applications needing superior performance
- **Performance**: Excellent quality; multilingual support via the multilingual-e5-large variant
### 4. Instructor
- **Dimensions**: Variable (768)
- **Type**: Task-specific
- **Use Case**: Domain-specific applications
- **Performance**: Can be fine-tuned for specific tasks
### 5. bge-large-en-v1.5
- **Dimensions**: 1024
- **Type**: State-of-the-art
- **Use Case**: Applications requiring best possible quality
- **Performance**: SOTA performance on benchmarks
## Selection Criteria
1. **Quality vs Speed**: Balance between embedding quality and inference speed
2. **Dimension Size**: Impact on storage and retrieval performance
3. **Domain**: Specific language or domain requirements
4. **Cost**: API costs vs local deployment
5. **Batch Size**: Throughput requirements
6. **Language**: Multilingual support needs
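A quick side-by-side probe helps weigh quality against speed before committing to a model; the model names and sample texts below are arbitrary examples:

```python
import time

from sentence_transformers import SentenceTransformer, util


def probe_model(model_name, query, documents):
    """Print similarity scores and embedding latency for one candidate model."""
    model = SentenceTransformer(model_name)
    start = time.perf_counter()
    query_vec = model.encode(query, convert_to_tensor=True)
    doc_vecs = model.encode(documents, convert_to_tensor=True)
    elapsed = time.perf_counter() - start
    scores = util.cos_sim(query_vec, doc_vecs)[0].tolist()
    print(f"{model_name}: scores={[round(s, 3) for s in scores]}, {elapsed:.2f}s")


docs = ["RAG retrieves supporting passages.", "Bananas are rich in potassium."]
for name in ["all-MiniLM-L6-v2", "BAAI/bge-large-en-v1.5"]:
    probe_model(name, "How does retrieval-augmented generation work?", docs)
```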
## Usage Examples
### OpenAI Embeddings
```python
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("Your text here")
```
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("Your text here")
```
### Hugging Face Models
```python
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```
## Optimization Tips
1. **Batch Processing**: Process multiple texts together for efficiency
2. **Model Quantization**: Reduce model size for faster inference
3. **Caching**: Cache embeddings for frequently used texts (see the sketch below)
4. **GPU Acceleration**: Use GPU for faster processing when available
5. **Model Selection**: Choose appropriate model size for your use case
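A minimal sketch combining the batching and caching tips; the in-memory dict and model name are illustrative stand-ins for a real cache and your chosen model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
_embedding_cache: dict[str, list[float]] = {}


def embed_with_cache(texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Embed only unseen texts, in batches, and reuse cached vectors for the rest."""
    missing = [text for text in texts if text not in _embedding_cache]
    if missing:
        vectors = model.encode(missing, batch_size=batch_size)
        for text, vector in zip(missing, vectors):
            _embedding_cache[text] = vector.tolist()
    return [_embedding_cache[text] for text in texts]
```

In production the dict would typically be replaced by a persistent key-value store keyed on a hash of the text.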
## Evaluation Metrics
1. **Semantic Similarity**: How well embeddings capture meaning
2. **Retrieval Performance**: Quality of retrieved documents
3. **Speed**: Inference time per document
4. **Memory Usage**: RAM requirements for the model
5. **Cost**: API costs or infrastructure requirements

View File

@@ -0,0 +1,94 @@
# LangChain4j RAG Implementation Guide
## Overview
RAG (Retrieval-Augmented Generation) extends LLM knowledge by finding and injecting relevant information from your data into prompts before sending to the LLM.
## What is RAG?
RAG helps LLMs answer questions using domain-specific knowledge by retrieving relevant information to reduce hallucinations.
## RAG Flavors in LangChain4j
### 1. Easy RAG
Simplest way to start with minimal setup. Handles document loading, splitting, and embedding automatically.
### 2. Core RAG APIs
Modular components including:
- Document
- TextSegment
- EmbeddingModel
- EmbeddingStore
- DocumentSplitter
### 3. Advanced RAG
Complex pipelines supporting:
- Query transformation
- Multi-source retrieval
- Re-ranking, assembled from components such as QueryTransformer, ContentRetriever, and ContentAggregator
## RAG Stages
### 1. Indexing
Pre-process documents for efficient search
### 2. Retrieval
Find relevant content based on user queries
## Core Components
### Documents with metadata
Structured representation of your content with associated metadata for filtering and context.
### Text segments (chunks)
Smaller, manageable pieces of documents that are embedded and stored in vector databases.
### Embedding models
Convert text segments into numerical vectors for similarity search.
### Embedding stores (vector databases)
Store and efficiently retrieve embedded text segments.
### Content retrievers
Find relevant content based on user queries.
### Query transformers
Transform and optimize user queries for better retrieval.
### Content aggregators
Combine and rank retrieved content.
## Advanced Features
- Query transformation and routing
- Multiple retrievers for different data sources
- Re-ranking models for improved relevance
- Metadata filtering for targeted retrieval
- Parallel processing for performance
## Implementation Example (Easy RAG)
```java
// Load documents
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
// Create embedding store
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
// Ingest documents
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
// Create AI service
Assistant assistant = AiServices.builder(Assistant.class)
.chatModel(chatModel)
.chatMemory(MessageWindowChatMemory.withMaxMessages(10))
.contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
.build();
```
## Best Practices
1. **Document Preparation**: Clean and structure documents before ingestion
2. **Chunk Size**: Balance between context preservation and retrieval precision
3. **Metadata Strategy**: Include relevant metadata for filtering and context
4. **Embedding Model Selection**: Choose models appropriate for your domain
5. **Retrieval Strategy**: Select appropriate k values and filtering criteria
6. **Evaluation**: Continuously evaluate retrieval quality and answer accuracy

View File

@@ -0,0 +1,161 @@
# Advanced Retrieval Strategies
## Overview
Different retrieval approaches for finding relevant documents in RAG systems, each with specific strengths and use cases.
## Retrieval Approaches
### 1. Dense Retrieval
**Method**: Semantic similarity via embeddings
**Use Case**: Understanding meaning and context
**Example**: Finding documents about "machine learning" when query is "AI algorithms"
```python
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(chunks, embeddings)
results = vectorstore.similarity_search("query", k=5)
```
### 2. Sparse Retrieval
**Method**: Keyword matching (BM25, TF-IDF)
**Use Case**: Exact term matching and keyword-specific queries
**Example**: Finding documents containing specific technical terms
```python
from langchain.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
results = bm25_retriever.get_relevant_documents("query")
```
### 3. Hybrid Search
**Method**: Combine dense + sparse retrieval
**Use Case**: Balance between semantic understanding and keyword matching
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with weights
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, embedding_retriever],
weights=[0.3, 0.7]
)
```
### 4. Multi-Query Retrieval
**Method**: Generate multiple query variations
**Use Case**: Complex queries that can be interpreted in multiple ways
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=OpenAI()
)
# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
```
### 5. HyDE (Hypothetical Document Embeddings)
**Method**: Generate hypothetical documents for better retrieval
**Use Case**: When queries are very different from document style
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
# Generate a hypothetical document that answers the query
hypothetical_doc = llm.invoke(f"Write a short passage that answers: {query}").content
# Use the hypothetical document, not the raw query, for retrieval
results = vectorstore.similarity_search(hypothetical_doc, k=5)
```
## Advanced Retrieval Patterns
### Contextual Compression
Compress retrieved documents to only include relevant parts
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
```
### Parent Document Retriever
Store small chunks for retrieval, return larger chunks for context
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
store = InMemoryStore()
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter
)
```
## Retrieval Optimization Techniques
### 1. Metadata Filtering
Filter results based on document metadata
```python
results = vectorstore.similarity_search(
"query",
filter={"category": "technical", "date": {"$gte": "2023-01-01"}},
k=5
)
```
### 2. Maximal Marginal Relevance (MMR)
Balance relevance with diversity
```python
results = vectorstore.max_marginal_relevance_search(
"query",
k=5,
fetch_k=20,
lambda_mult=0.5 # 0=max diversity, 1=max relevance
)
```
### 3. Reranking
Improve top results with cross-encoder
```python
from sentence_transformers import CrossEncoder

query = "your question here"
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
candidates = vectorstore.similarity_search(query, k=20)
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```
## Selection Guidelines
1. **Query Type**: Choose strategy based on typical query patterns
2. **Document Type**: Consider document structure and content
3. **Performance Requirements**: Balance quality vs speed
4. **Domain Knowledge**: Leverage domain-specific patterns
5. **User Expectations**: Match retrieval behavior to user expectations
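To illustrate the query-type guideline, a crude router can send exact-term lookups to the sparse retriever and natural-language questions to the dense one; the keyword heuristic below is only an example, and `bm25_retriever` / `embedding_retriever` refer to the retrievers built earlier in this document:

```python
import re


def route_query(query, sparse_retriever, dense_retriever):
    """Heuristically pick sparse retrieval for quoted or code-like queries, dense retrieval otherwise."""
    looks_exact = bool(re.search(r'"[^"]+"|[A-Z]{2,}|\w+\(\)', query))
    retriever = sparse_retriever if looks_exact else dense_retriever
    return retriever.get_relevant_documents(query)


# results = route_query('What does "ECONNRESET" mean?', bm25_retriever, embedding_retriever)
```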

View File

@@ -0,0 +1,86 @@
# Vector Database Comparison and Configuration
## Overview
Vector databases store and efficiently retrieve document embeddings for semantic search in RAG systems.
## Popular Vector Database Options
### 1. Pinecone
- **Type**: Managed cloud service
- **Features**: Scalable, fast queries, managed infrastructure
- **Use Case**: Production applications requiring high availability
### 2. Weaviate
- **Type**: Open-source, hybrid search
- **Features**: Combines vector and keyword search, GraphQL API
- **Use Case**: Applications needing both semantic and traditional search
### 3. Milvus
- **Type**: High performance, on-premise
- **Features**: Distributed architecture, GPU acceleration
- **Use Case**: Large-scale deployments with custom infrastructure
### 4. Chroma
- **Type**: Lightweight, easy to use
- **Features**: Local deployment, simple API
- **Use Case**: Development and small-scale applications
### 5. Qdrant
- **Type**: Fast, filtered search
- **Features**: Advanced filtering, payload support
- **Use Case**: Applications requiring complex metadata filtering
### 6. FAISS
- **Type**: Meta's library, local deployment
- **Features**: High performance, CPU/GPU optimized
- **Use Case**: Research and applications needing full control
## Configuration Examples
### Pinecone Setup
```python
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")
vectorstore = Pinecone(index, embeddings.embed_query, "text")
```
### Weaviate Setup
```python
import weaviate
from langchain.vectorstores import Weaviate
client = weaviate.Client("http://localhost:8080")
vectorstore = Weaviate(client, "Document", "content", embeddings)
```
### Chroma Local Setup
```python
from langchain.vectorstores import Chroma
vectorstore = Chroma(
collection_name="my_collection",
embedding_function=embeddings,
persist_directory="./chroma_db"
)
```
## Selection Criteria
1. **Scale**: Number of documents and expected query volume
2. **Performance**: Latency requirements and throughput needs
3. **Deployment**: Cloud vs on-premise preferences
4. **Features**: Filtering, hybrid search, metadata support
5. **Cost**: Budget constraints and operational overhead
6. **Maintenance**: Team expertise and available resources
## Best Practices
1. **Indexing Strategy**: Choose appropriate distance metrics (cosine, euclidean); see the sketch below
2. **Sharding**: Distribute data for large-scale deployments
3. **Monitoring**: Track query performance and system health
4. **Backups**: Implement regular backup procedures
5. **Security**: Secure access to sensitive data
6. **Optimization**: Tune parameters for your specific use case