Initial commit

Zhongwei Li
2025-11-29 18:28:34 +08:00
commit 390afca02b
220 changed files with 86013 additions and 0 deletions


@@ -0,0 +1,194 @@
---
name: chunking-strategy
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
allowed-tools: Read, Write, Bash
category: artificial-intelligence
tags: [rag, chunking, vector-search, embeddings, document-processing]
version: 1.0.0
---
# Chunking Strategy for RAG Systems
## Overview
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
## When to Use
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
## Instructions
### Choose Chunking Strategy
Select appropriate chunking strategy based on document type and use case:
1. **Fixed-Size Chunking** (Level 1)
- Use for simple documents without clear structure
- Start with 512 tokens and 10-20% overlap
- Adjust size based on query type: 256 for factoid, 1024 for analytical
2. **Recursive Character Chunking** (Level 2)
- Use for documents with clear structural boundaries
- Implement hierarchical separators: paragraphs → sentences → words
- Customize separators for document types (HTML, Markdown)
3. **Structure-Aware Chunking** (Level 3)
- Use for structured documents (Markdown, code, tables, PDFs)
- Preserve semantic units: functions, sections, table blocks
- Validate structure preservation post-splitting
4. **Semantic Chunking** (Level 4)
- Use for complex documents with thematic shifts
- Implement embedding-based boundary detection
   - Configure a similarity threshold (e.g., 0.8) and buffer size (3-5 sentences)
5. **Advanced Methods** (Level 5)
- Use Late Chunking for long-context embedding models
- Apply Contextual Retrieval for high-precision requirements
- Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
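A minimal sketch of Level 2 with document-type-specific separators (here Markdown; the separator list and `markdown_text` variable are illustrative assumptions to tune per corpus):
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hierarchical separators tried in order: headings first, then paragraphs,
# sentences, and finally single spaces as a last resort
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = markdown_splitter.split_text(markdown_text)
```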
### Implement Chunking Pipeline
Follow these steps to implement effective chunking:
1. **Pre-process documents**
- Analyze document structure and content types
- Identify multi-modal content (tables, images, code)
- Assess information density and complexity
2. **Select strategy parameters**
- Choose chunk size based on embedding model context window
- Set overlap percentage (10-20% for most cases)
- Configure strategy-specific parameters
3. **Process and validate**
- Apply chosen chunking strategy
- Validate semantic coherence of chunks
- Test with representative documents
4. **Evaluate and iterate**
- Measure retrieval precision and recall
- Monitor processing latency and resource usage
- Optimize based on specific use case requirements
Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).
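A minimal sketch wiring these four steps together (the 5,000-token and 20-word thresholds are illustrative assumptions, not tuned values):
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunking_pipeline(document: str, embed_context_window: int = 512):
    # 1. Pre-process: cheap structural signals
    has_headings = "\n#" in document
    est_tokens = int(len(document.split()) * 1.3)

    # 2. Select parameters within the embedding model's context window
    chunk_size = min(embed_context_window, 1024 if est_tokens > 5000 else 512)
    overlap = int(chunk_size * 0.15)  # within the 10-20% guideline

    # 3. Process: separator-aware splitting for structured documents
    separators = ["\n## ", "\n\n", "\n", " "] if has_headings else ["\n\n", "\n", " "]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap, separators=separators)
    chunks = splitter.split_text(document)

    # 4. Validate: drop fragments too short to stand alone
    return [c for c in chunks if len(c.split()) >= 20]
```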
### Evaluate Performance
Use these metrics to evaluate chunking effectiveness:
- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
- **End-to-End Accuracy**: Quality of final RAG responses
- **Processing Time**: Latency impact on overall system
- **Resource Usage**: Memory and computational costs
Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).
## Examples
### Basic Fixed-Size Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks with ~10% overlap.
# Note: length_function=len measures characters; pass a tokenizer-based
# length function instead to work in tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len
)
chunks = splitter.split_documents(documents)  # `documents`: list of LangChain Documents
```
### Structure-Aware Code Chunking
```python
import ast

def chunk_python_code(code):
    """Split Python code into semantic chunks (top-level functions and classes)"""
    tree = ast.parse(code)
    chunks = []
    # Top-level nodes only: ast.walk() would also yield nested definitions,
    # producing duplicate, overlapping chunks
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
```
### Semantic Chunking with Embeddings
```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text based on semantic boundaries"""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
## Best Practices
### Core Principles
- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial
### Implementation Guidelines
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics
### Common Pitfalls to Avoid
- Over-chunking: Creating too many small, context-poor chunks
- Under-chunking: Missing relevant information due to oversized chunks
- Ignoring document structure and semantic boundaries
- Using one-size-fits-all approach for diverse content types
- Neglecting overlap for boundary-crossing information
## Constraints
### Resource Considerations
- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing
### Quality Requirements
- Validate chunk semantic coherence post-processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases
## References
Reference detailed documentation in the [references/](references/) folder:
- [strategies.md](references/strategies.md) - Detailed strategy implementations
- [implementation.md](references/implementation.md) - Complete implementation guidelines
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
- [tools.md](references/tools.md) - Recommended libraries and frameworks
- [research.md](references/research.md) - Key research papers and findings
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools



@@ -0,0 +1,904 @@
# Performance Evaluation Framework
This document provides comprehensive methodologies for evaluating chunking strategy performance and effectiveness.
## Evaluation Metrics
### Core Retrieval Metrics
#### Retrieval Precision
Measures the fraction of retrieved chunks that are relevant to the query.
```python
from typing import Dict, List

def calculate_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval precision
Precision = |Relevant ∩ Retrieved| / |Retrieved|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not retrieved_ids:
return 0.0
return len(intersection) / len(retrieved_ids)
```
#### Retrieval Recall
Measures the fraction of relevant chunks that are successfully retrieved.
```python
def calculate_recall(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval recall
Recall = |Relevant ∩ Retrieved| / |Relevant|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not relevant_ids:
return 0.0
return len(intersection) / len(relevant_ids)
```
#### F1-Score
Harmonic mean of precision and recall.
```python
def calculate_f1_score(precision: float, recall: float) -> float:
"""
Calculate F1-score
F1 = 2 * (Precision * Recall) / (Precision + Recall)
"""
if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)
```
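A quick worked example of these three metrics on toy data:
```python
retrieved = [{'id': 'c1'}, {'id': 'c2'}, {'id': 'c3'}, {'id': 'c4'}]
relevant = [{'id': 'c2'}, {'id': 'c4'}, {'id': 'c7'}]

p = calculate_precision(retrieved, relevant)   # 2/4 = 0.50
r = calculate_recall(retrieved, relevant)      # 2/3 ≈ 0.67
f1 = calculate_f1_score(p, r)                  # ≈ 0.57
```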
#### Mean Reciprocal Rank (MRR)
Measures the rank of the first relevant result.
```python
def calculate_mrr(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Reciprocal Rank
"""
reciprocal_ranks = []
for query, query_results in zip(queries, results):
relevant_found = False
for rank, result in enumerate(query_results, 1):
if result.get('is_relevant', False):
reciprocal_ranks.append(1.0 / rank)
relevant_found = True
break
if not relevant_found:
reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```
#### Mean Average Precision (MAP)
Considers both precision and the ranking of relevant documents.
```python
def calculate_average_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate Average Precision for a single query
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
if not relevant_ids:
return 0.0
precisions = []
relevant_count = 0
for rank, chunk in enumerate(retrieved_chunks, 1):
if chunk.get('id') in relevant_ids:
relevant_count += 1
precision_at_rank = relevant_count / rank
precisions.append(precision_at_rank)
return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
def calculate_map(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Average Precision across multiple queries
"""
average_precisions = []
for query, query_results in zip(queries, results):
ap = calculate_average_precision(query_results, query.get('relevant_chunks', []))
average_precisions.append(ap)
return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
```
#### Normalized Discounted Cumulative Gain (NDCG)
Measures ranking quality with emphasis on highly relevant results.
```python
import numpy as np

def calculate_dcg(retrieved_chunks: List[Dict]) -> float:
"""
Calculate Discounted Cumulative Gain
"""
dcg = 0.0
for rank, chunk in enumerate(retrieved_chunks, 1):
relevance = chunk.get('relevance_score', 0)
dcg += relevance / np.log2(rank + 1)
return dcg
def calculate_ndcg(retrieved_chunks: List[Dict], ideal_chunks: List[Dict]) -> float:
"""
Calculate Normalized Discounted Cumulative Gain
"""
dcg = calculate_dcg(retrieved_chunks)
idcg = calculate_dcg(ideal_chunks)
if idcg == 0:
return 0.0
return dcg / idcg
```
## End-to-End RAG Evaluation
### Answer Quality Metrics
#### Factual Consistency
Measures how well the generated answer aligns with retrieved chunks.
```python
from typing import List
from transformers import pipeline

class FactualConsistencyEvaluator:
    def __init__(self):
        # NLI classifier; labels are CONTRADICTION / NEUTRAL / ENTAILMENT
        self.nli_pipeline = pipeline("text-classification",
                                     model="roberta-large-mnli",
                                     top_k=None)
    def evaluate_consistency(self, answer: str, retrieved_chunks: List[str]) -> float:
        """
        Evaluate factual consistency between answer and retrieved context
        """
        if not retrieved_chunks:
            return 0.0
        # Combine retrieved chunks as context
        context = " ".join(retrieved_chunks[:3])  # Use top 3 chunks
        # NLI takes a (premise, hypothesis) sentence pair
        output = self.nli_pipeline({"text": context, "text_pair": answer})
        # Some transformers versions nest the per-label list one level deeper
        results = output[0] if isinstance(output[0], list) else output
        scores = {item['label']: item['score'] for item in results}
        top_label = max(scores, key=scores.get)
        if top_label == 'ENTAILMENT':
            return scores['ENTAILMENT']
        if top_label == 'CONTRADICTION':
            return 1.0 - scores['CONTRADICTION']
        return 0.5  # Neutral if NLI is inconclusive
```
#### Answer Completeness
Measures how completely the answer addresses the user's query.
```python
def evaluate_completeness(answer: str, query: str, reference_answer: str = None) -> float:
"""
Evaluate answer completeness
"""
# Extract key entities from query
query_entities = extract_entities(query)
answer_entities = extract_entities(answer)
# Calculate entity coverage
if not query_entities:
return 0.5 # Neutral if no entities in query
covered_entities = query_entities & answer_entities
entity_coverage = len(covered_entities) / len(query_entities)
# If reference answer is available, compare against it
if reference_answer:
reference_entities = extract_entities(reference_answer)
answer_reference_overlap = len(answer_entities & reference_entities) / max(len(reference_entities), 1)
return (entity_coverage + answer_reference_overlap) / 2
return entity_coverage
def extract_entities(text: str) -> set:
"""
Extract named entities from text (simplified)
"""
# This would use a proper NER model in practice
import re
# Simple noun phrase extraction as placeholder
noun_phrases = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
return set(noun_phrases)
```
#### Response Relevance
Measures how relevant the answer is to the original query.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class RelevanceEvaluator:
def __init__(self, model_name="all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def evaluate_relevance(self, query: str, answer: str) -> float:
"""
Evaluate semantic relevance between query and answer
"""
# Generate embeddings
query_embedding = self.model.encode([query])
answer_embedding = self.model.encode([answer])
# Calculate cosine similarity
similarity = cosine_similarity(query_embedding, answer_embedding)[0][0]
return float(similarity)
```
## Performance Metrics
### Processing Time
```python
import time
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class PerformanceMetrics:
total_time: float
chunking_time: float
embedding_time: float
search_time: float
generation_time: float
throughput: float # documents per second
class PerformanceProfiler:
def __init__(self):
self.timings = {}
self.start_times = {}
def start_timer(self, operation: str):
self.start_times[operation] = time.time()
def end_timer(self, operation: str):
if operation in self.start_times:
duration = time.time() - self.start_times[operation]
if operation not in self.timings:
self.timings[operation] = []
self.timings[operation].append(duration)
return duration
return 0.0
def get_performance_metrics(self, document_count: int) -> PerformanceMetrics:
total_time = sum(sum(times) for times in self.timings.values())
return PerformanceMetrics(
total_time=total_time,
chunking_time=sum(self.timings.get('chunking', [0])),
embedding_time=sum(self.timings.get('embedding', [0])),
search_time=sum(self.timings.get('search', [0])),
generation_time=sum(self.timings.get('generation', [0])),
throughput=document_count / total_time if total_time > 0 else 0
)
```
### Memory Usage
```python
import psutil
import os
import time
from typing import Dict, List
class MemoryProfiler:
def __init__(self):
self.process = psutil.Process(os.getpid())
self.memory_snapshots = []
def take_memory_snapshot(self, label: str):
"""Take a snapshot of current memory usage"""
memory_info = self.process.memory_info()
memory_mb = memory_info.rss / 1024 / 1024 # Convert to MB
self.memory_snapshots.append({
'label': label,
'memory_mb': memory_mb,
'timestamp': time.time()
})
def get_peak_memory_usage(self) -> float:
"""Get peak memory usage in MB"""
if not self.memory_snapshots:
return 0.0
return max(snapshot['memory_mb'] for snapshot in self.memory_snapshots)
def get_memory_usage_by_operation(self) -> Dict[str, float]:
"""Get memory usage breakdown by operation"""
if not self.memory_snapshots:
return {}
memory_by_op = {}
for i in range(1, len(self.memory_snapshots)):
prev_snapshot = self.memory_snapshots[i-1]
curr_snapshot = self.memory_snapshots[i]
operation = curr_snapshot['label']
memory_delta = curr_snapshot['memory_mb'] - prev_snapshot['memory_mb']
if operation not in memory_by_op:
memory_by_op[operation] = []
memory_by_op[operation].append(memory_delta)
return {op: sum(deltas) for op, deltas in memory_by_op.items()}
```
## Evaluation Datasets
### Standardized Test Sets
#### Question-Answer Pairs
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class EvaluationQuery:
id: str
question: str
reference_answer: Optional[str]
relevant_chunk_ids: List[str]
query_type: str # factoid, analytical, comparative
difficulty: str # easy, medium, hard
domain: str # finance, medical, legal, technical
class EvaluationDataset:
def __init__(self, name: str):
self.name = name
self.queries: List[EvaluationQuery] = []
self.documents: Dict[str, str] = {}
self.chunks: Dict[str, Dict] = {}
def add_query(self, query: EvaluationQuery):
self.queries.append(query)
def add_document(self, doc_id: str, content: str):
self.documents[doc_id] = content
def add_chunk(self, chunk_id: str, content: str, doc_id: str, metadata: Dict):
self.chunks[chunk_id] = {
'id': chunk_id,
'content': content,
'doc_id': doc_id,
'metadata': metadata
}
def save_to_file(self, filepath: str):
data = {
'name': self.name,
'queries': [
{
'id': q.id,
'question': q.question,
'reference_answer': q.reference_answer,
'relevant_chunk_ids': q.relevant_chunk_ids,
'query_type': q.query_type,
'difficulty': q.difficulty,
'domain': q.domain
}
for q in self.queries
],
'documents': self.documents,
'chunks': self.chunks
}
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
@classmethod
def load_from_file(cls, filepath: str):
with open(filepath, 'r') as f:
data = json.load(f)
dataset = cls(data['name'])
dataset.documents = data['documents']
dataset.chunks = data['chunks']
for q_data in data['queries']:
query = EvaluationQuery(
id=q_data['id'],
question=q_data['question'],
reference_answer=q_data.get('reference_answer'),
relevant_chunk_ids=q_data['relevant_chunk_ids'],
query_type=q_data['query_type'],
difficulty=q_data['difficulty'],
domain=q_data['domain']
)
dataset.add_query(query)
return dataset
```
### Dataset Generation
#### Synthetic Query Generation
```python
import random
from typing import List, Dict
class SyntheticQueryGenerator:
def __init__(self):
self.query_templates = {
'factoid': [
"What is {concept}?",
"When did {event} occur?",
"Who developed {technology}?",
"How many {items} are mentioned?",
"What is the value of {metric}?"
],
'analytical': [
"Compare and contrast {concept1} and {concept2}.",
"Analyze the impact of {concept} on {domain}.",
"What are the advantages and disadvantages of {technology}?",
"Explain the relationship between {concept1} and {concept2}.",
"Evaluate the effectiveness of {approach} for {problem}."
],
'comparative': [
"Which is better: {option1} or {option2}?",
"How does {method1} differ from {method2}?",
"Compare the performance of {system1} and {system2}.",
"What are the key differences between {approach1} and {approach2}?"
]
}
def generate_queries_from_chunks(self, chunks: List[Dict], num_queries: int = 100) -> List[EvaluationQuery]:
"""Generate synthetic queries from document chunks"""
queries = []
# Extract entities and concepts from chunks
entities = self._extract_entities_from_chunks(chunks)
for i in range(num_queries):
query_type = random.choice(['factoid', 'analytical', 'comparative'])
template = random.choice(self.query_templates[query_type])
# Fill template with extracted entities
query_text = self._fill_template(template, entities)
# Find relevant chunks for this query
relevant_chunks = self._find_relevant_chunks(query_text, chunks)
query = EvaluationQuery(
id=f"synthetic_{i}",
question=query_text,
reference_answer=None, # Would need generation model
relevant_chunk_ids=[chunk['id'] for chunk in relevant_chunks],
query_type=query_type,
difficulty=random.choice(['easy', 'medium', 'hard']),
domain='synthetic'
)
queries.append(query)
return queries
def _extract_entities_from_chunks(self, chunks: List[Dict]) -> Dict[str, List[str]]:
"""Extract entities, concepts, and relationships from chunks"""
# This would use proper NER in practice
entities = {
'concepts': [],
'technologies': [],
'methods': [],
'metrics': [],
'events': []
}
for chunk in chunks:
content = chunk['content']
# Simplified entity extraction
words = content.split()
entities['concepts'].extend([word for word in words if len(word) > 6])
entities['technologies'].extend([word for word in words if 'technology' in word.lower()])
entities['methods'].extend([word for word in words if 'method' in word.lower()])
entities['metrics'].extend([word for word in words if '%' in word or '$' in word])
# Remove duplicates and limit
for key in entities:
entities[key] = list(set(entities[key]))[:50]
return entities
def _fill_template(self, template: str, entities: Dict[str, List[str]]) -> str:
"""Fill query template with random entities"""
import re
def replace_placeholder(match):
placeholder = match.group(1)
# Map placeholders to entity types
entity_mapping = {
'concept': 'concepts',
'concept1': 'concepts',
'concept2': 'concepts',
'technology': 'technologies',
'method': 'methods',
'method1': 'methods',
'method2': 'methods',
'metric': 'metrics',
'event': 'events',
'items': 'concepts',
'option1': 'concepts',
'option2': 'concepts',
'approach': 'methods',
'problem': 'concepts',
'domain': 'concepts',
'system1': 'concepts',
'system2': 'concepts'
}
entity_type = entity_mapping.get(placeholder, 'concepts')
available_entities = entities.get(entity_type, ['something'])
if available_entities:
return random.choice(available_entities)
else:
return 'something'
return re.sub(r'\{(\w+)\}', replace_placeholder, template)
def _find_relevant_chunks(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
"""Find chunks most relevant to the query"""
# Simple keyword matching for synthetic generation
query_words = set(query.lower().split())
chunk_scores = []
for chunk in chunks:
chunk_words = set(chunk['content'].lower().split())
overlap = len(query_words & chunk_words)
chunk_scores.append((overlap, chunk))
# Sort by overlap and return top k
chunk_scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in chunk_scores[:k]]
```
## A/B Testing Framework
### Statistical Significance Testing
```python
import numpy as np
from scipy import stats
from typing import List, Dict, Tuple
class ABTestAnalyzer:
def __init__(self):
self.significance_level = 0.05
def compare_metrics(self, control_metrics: List[float],
treatment_metrics: List[float],
metric_name: str) -> Dict:
"""
Compare metrics between control and treatment groups
"""
control_mean = np.mean(control_metrics)
treatment_mean = np.mean(treatment_metrics)
        # Sample standard deviation (ddof=1), matching the pooled-variance formula below
        control_std = np.std(control_metrics, ddof=1)
        treatment_std = np.std(treatment_metrics, ddof=1)
# Perform t-test
t_statistic, p_value = stats.ttest_ind(control_metrics, treatment_metrics)
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(control_metrics) - 1) * control_std**2 +
(len(treatment_metrics) - 1) * treatment_std**2) /
(len(control_metrics) + len(treatment_metrics) - 2))
cohens_d = (treatment_mean - control_mean) / pooled_std if pooled_std > 0 else 0
# Determine significance
is_significant = p_value < self.significance_level
return {
'metric_name': metric_name,
'control_mean': control_mean,
'treatment_mean': treatment_mean,
'absolute_difference': treatment_mean - control_mean,
'relative_difference': ((treatment_mean - control_mean) / control_mean * 100) if control_mean != 0 else 0,
'control_std': control_std,
'treatment_std': treatment_std,
't_statistic': t_statistic,
'p_value': p_value,
'is_significant': is_significant,
'effect_size': cohens_d,
'significance_level': self.significance_level
}
def analyze_ab_test_results(self,
control_results: Dict[str, List[float]],
treatment_results: Dict[str, List[float]]) -> Dict:
"""
Analyze A/B test results across multiple metrics
"""
analysis_results = {}
# Ensure both dictionaries have the same keys
all_metrics = set(control_results.keys()) & set(treatment_results.keys())
for metric in all_metrics:
if metric in control_results and metric in treatment_results:
analysis_results[metric] = self.compare_metrics(
control_results[metric],
treatment_results[metric],
metric
)
# Calculate overall summary
significant_improvements = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] > 0)
significant_degradations = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] < 0)
analysis_results['summary'] = {
'total_metrics_compared': len(analysis_results),
'significant_improvements': significant_improvements,
'significant_degradations': significant_degradations,
'no_significant_change': len(analysis_results) - significant_improvements - significant_degradations
}
return analysis_results
```
## Automated Evaluation Pipeline
### End-to-End Evaluation
```python
import random
from typing import Any, Dict, List

import numpy as np

class ChunkingEvaluationPipeline:
def __init__(self, strategies: Dict[str, Any], dataset: EvaluationDataset):
self.strategies = strategies
self.dataset = dataset
self.results = {}
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
def run_evaluation(self) -> Dict:
"""Run comprehensive evaluation of all strategies"""
evaluation_results = {}
for strategy_name, strategy in self.strategies.items():
print(f"Evaluating strategy: {strategy_name}")
# Reset profilers for each strategy
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
# Evaluate strategy
strategy_results = self._evaluate_strategy(strategy, strategy_name)
evaluation_results[strategy_name] = strategy_results
# Compare strategies
comparison_results = self._compare_strategies(evaluation_results)
return {
'individual_results': evaluation_results,
'comparison': comparison_results,
'recommendations': self._generate_recommendations(comparison_results)
}
def _evaluate_strategy(self, strategy: Any, strategy_name: str) -> Dict:
"""Evaluate a single chunking strategy"""
results = {
'strategy_name': strategy_name,
'retrieval_metrics': {},
'quality_metrics': {},
'performance_metrics': {}
}
# Track memory usage
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_start")
# Process all documents
self.profiler.start_timer('total_processing')
all_chunks = {}
for doc_id, content in self.dataset.documents.items():
self.profiler.start_timer('chunking')
chunks = strategy.chunk(content)
self.profiler.end_timer('chunking')
all_chunks[doc_id] = chunks
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_chunking")
# Generate embeddings for chunks
self.profiler.start_timer('embedding')
chunk_embeddings = self._generate_embeddings(all_chunks)
self.profiler.end_timer('embedding')
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_embedding")
# Evaluate retrieval performance
retrieval_results = self._evaluate_retrieval(all_chunks, chunk_embeddings)
results['retrieval_metrics'] = retrieval_results
# Evaluate chunk quality
quality_results = self._evaluate_chunk_quality(all_chunks)
results['quality_metrics'] = quality_results
# Get performance metrics
self.profiler.end_timer('total_processing')
performance_metrics = self.profiler.get_performance_metrics(len(self.dataset.documents))
results['performance_metrics'] = performance_metrics.__dict__
# Get memory metrics
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_end")
results['memory_metrics'] = {
'peak_memory_mb': self.memory_profiler.get_peak_memory_usage(),
'memory_by_operation': self.memory_profiler.get_memory_usage_by_operation()
}
return results
def _evaluate_retrieval(self, all_chunks: Dict, chunk_embeddings: Dict) -> Dict:
"""Evaluate retrieval performance"""
        # MRR and MAP could be added here via calculate_mrr/calculate_map
        retrieval_metrics = {
            'precision': [],
            'recall': [],
            'f1_score': []
        }
for query in self.dataset.queries:
# Perform retrieval
self.profiler.start_timer('search')
retrieved_chunks = self._retrieve_chunks(query.question, chunk_embeddings, k=10)
self.profiler.end_timer('search')
            # Build the full relevant set for this query; intersecting it with the
            # retrieved list here would make recall trivially 1.0
            relevant_chunks = [{'id': cid} for cid in query.relevant_chunk_ids]
# Calculate metrics
precision = calculate_precision(retrieved_chunks, relevant_chunks)
recall = calculate_recall(retrieved_chunks, relevant_chunks)
f1 = calculate_f1_score(precision, recall)
retrieval_metrics['precision'].append(precision)
retrieval_metrics['recall'].append(recall)
retrieval_metrics['f1_score'].append(f1)
# Calculate averages
return {metric: np.mean(values) for metric, values in retrieval_metrics.items()}
def _evaluate_chunk_quality(self, all_chunks: Dict) -> Dict:
"""Evaluate quality of generated chunks"""
quality_assessor = ChunkQualityAssessor()
quality_scores = []
for doc_id, chunks in all_chunks.items():
# Analyze document
content = self.dataset.documents[doc_id]
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze(content)
# Assess chunk quality
scores = quality_assessor.assess_chunks(chunks, analysis)
quality_scores.append(scores)
# Aggregate quality scores
if quality_scores:
avg_scores = {}
for metric in quality_scores[0].keys():
avg_scores[metric] = np.mean([scores[metric] for scores in quality_scores])
return avg_scores
return {}
def _compare_strategies(self, evaluation_results: Dict) -> Dict:
"""Compare performance across strategies"""
ab_analyzer = ABTestAnalyzer()
comparison = {}
# Compare each metric across strategies
strategy_names = list(evaluation_results.keys())
for i in range(len(strategy_names)):
for j in range(i + 1, len(strategy_names)):
strategy1 = strategy_names[i]
strategy2 = strategy_names[j]
comparison_key = f"{strategy1}_vs_{strategy2}"
comparison[comparison_key] = {}
                # Compare retrieval metrics. Note: these are single aggregate values;
                # for a meaningful t-test, pass per-query metric lists instead.
                for metric in ['precision', 'recall', 'f1_score']:
if (metric in evaluation_results[strategy1]['retrieval_metrics'] and
metric in evaluation_results[strategy2]['retrieval_metrics']):
comparison[comparison_key][f"retrieval_{metric}"] = ab_analyzer.compare_metrics(
[evaluation_results[strategy1]['retrieval_metrics'][metric]],
[evaluation_results[strategy2]['retrieval_metrics'][metric]],
f"retrieval_{metric}"
)
return comparison
def _generate_recommendations(self, comparison_results: Dict) -> Dict:
"""Generate recommendations based on evaluation results"""
recommendations = {
'best_overall': None,
'best_for_precision': None,
'best_for_recall': None,
'best_for_performance': None,
'trade_offs': []
}
# This would analyze the comparison results and generate specific recommendations
# Implementation depends on specific use case requirements
return recommendations
def _generate_embeddings(self, all_chunks: Dict) -> Dict:
"""Generate embeddings for all chunks"""
# This would use the actual embedding model
# Placeholder implementation
embeddings = {}
for doc_id, chunks in all_chunks.items():
embeddings[doc_id] = []
for chunk in chunks:
# Generate embedding for chunk content
embedding = np.random.rand(384) # Placeholder
embeddings[doc_id].append({
'chunk': chunk,
'embedding': embedding
})
return embeddings
def _retrieve_chunks(self, query: str, chunk_embeddings: Dict, k: int = 10) -> List[Dict]:
"""Retrieve most relevant chunks for a query"""
# This would use actual similarity search
# Placeholder implementation
all_chunks = []
for doc_embeddings in chunk_embeddings.values():
for chunk_data in doc_embeddings:
all_chunks.append(chunk_data['chunk'])
# Simple random selection as placeholder
selected = random.sample(all_chunks, min(k, len(all_chunks)))
return selected
```
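A sketch of driving the pipeline; `my_fixed_strategy` and `my_semantic_strategy` are hypothetical objects exposing the one-argument `chunk(content)` interface the pipeline calls:
```python
dataset = EvaluationDataset.load_from_file("eval_dataset.json")
pipeline = ChunkingEvaluationPipeline(
    strategies={'fixed_512': my_fixed_strategy, 'semantic': my_semantic_strategy},
    dataset=dataset,
)
report = pipeline.run_evaluation()
print(report['comparison'])
```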
This comprehensive evaluation framework provides the tools needed to thoroughly assess chunking strategies across multiple dimensions: retrieval effectiveness, answer quality, system performance, and statistical significance. The modular design allows for easy extension and customization based on specific requirements and use cases.


@@ -0,0 +1,709 @@
# Complete Implementation Guidelines
This document provides comprehensive implementation guidance for building effective chunking systems.
## System Architecture
### Core Components
```
Document Processor
├── Ingestion Layer
│ ├── Document Type Detection
│ ├── Format Parsing (PDF, HTML, Markdown, etc.)
│ └── Content Extraction
├── Analysis Layer
│ ├── Structure Analysis
│ ├── Content Type Identification
│ └── Complexity Assessment
├── Strategy Selection Layer
│ ├── Rule-based Selection
│ ├── ML-based Prediction
│ └── Adaptive Configuration
├── Chunking Layer
│ ├── Strategy Implementation
│ ├── Parameter Optimization
│ └── Quality Validation
└── Output Layer
├── Chunk Metadata Generation
├── Embedding Integration
└── Storage Preparation
```
## Pre-processing Pipeline
### Document Analysis Framework
```python
from dataclasses import dataclass
from typing import List, Dict, Any
import re
@dataclass
class DocumentAnalysis:
doc_type: str
structure_score: float # 0-1, higher means more structured
complexity_score: float # 0-1, higher means more complex
content_types: List[str]
language: str
estimated_tokens: int
has_multimodal: bool
class DocumentAnalyzer:
def __init__(self):
self.structure_patterns = {
'markdown': [r'^#+\s', r'^\*\*.*\*\*$', r'^\* ', r'^\d+\. '],
'html': [r'<h[1-6]>', r'<p>', r'<div>', r'<table>'],
'latex': [r'\\section', r'\\subsection', r'\\begin\{', r'\\end\{'],
'academic': [r'^\d+\.', r'^\d+\.\d+', r'^[A-Z]\.', r'^Figure \d+']
}
def analyze(self, content: str) -> DocumentAnalysis:
doc_type = self.detect_document_type(content)
structure_score = self.calculate_structure_score(content, doc_type)
complexity_score = self.calculate_complexity_score(content)
content_types = self.identify_content_types(content)
language = self.detect_language(content)
estimated_tokens = self.estimate_tokens(content)
has_multimodal = self.detect_multimodal_content(content)
return DocumentAnalysis(
doc_type=doc_type,
structure_score=structure_score,
complexity_score=complexity_score,
content_types=content_types,
language=language,
estimated_tokens=estimated_tokens,
has_multimodal=has_multimodal
)
def detect_document_type(self, content: str) -> str:
content_lower = content.lower()
if '<html' in content_lower or '<body' in content_lower:
return 'html'
elif '#' in content and '##' in content:
return 'markdown'
elif '\\documentclass' in content_lower or '\\begin{' in content_lower:
return 'latex'
elif any(keyword in content_lower for keyword in ['abstract', 'introduction', 'conclusion', 'references']):
return 'academic'
elif 'def ' in content or 'class ' in content or 'function ' in content_lower:
return 'code'
else:
return 'plain'
def calculate_structure_score(self, content: str, doc_type: str) -> float:
patterns = self.structure_patterns.get(doc_type, [])
if not patterns:
return 0.5 # Default for unstructured content
line_count = len(content.split('\n'))
structured_lines = 0
for line in content.split('\n'):
for pattern in patterns:
if re.search(pattern, line.strip()):
structured_lines += 1
break
return min(structured_lines / max(line_count, 1), 1.0)
def calculate_complexity_score(self, content: str) -> float:
# Factors that increase complexity
avg_sentence_length = self.calculate_avg_sentence_length(content)
vocabulary_richness = self.calculate_vocabulary_richness(content)
nested_structure = self.detect_nested_structure(content)
# Normalize and combine
complexity = (
min(avg_sentence_length / 30, 1.0) * 0.3 +
vocabulary_richness * 0.4 +
nested_structure * 0.3
)
return min(complexity, 1.0)
def identify_content_types(self, content: str) -> List[str]:
types = []
if '```' in content or 'def ' in content or 'function ' in content.lower():
types.append('code')
if '|' in content and '\n' in content:
types.append('tables')
if re.search(r'\!\[.*\]\(.*\)', content):
types.append('images')
if re.search(r'http[s]?://', content):
types.append('links')
if re.search(r'\d+\.\d+', content) or re.search(r'\$\d', content):
types.append('numbers')
return types if types else ['text']
def detect_language(self, content: str) -> str:
# Simple language detection - can be enhanced with proper language detection libraries
if re.search(r'[\u4e00-\u9fff]', content):
return 'chinese'
        elif re.search(r'[\u0600-\u06ff]', content):
            return 'arabic'
        elif re.search(r'[\u0400-\u04ff]', content):
return 'russian'
else:
return 'english' # Default assumption
def estimate_tokens(self, content: str) -> int:
# Rough estimation - actual tokenization varies by model
word_count = len(content.split())
return int(word_count * 1.3) # Average tokens per word
def detect_multimodal_content(self, content: str) -> bool:
multimodal_indicators = [
r'\!\[.*\]\(.*\)', # Images
r'<iframe', # Embedded content
r'<object', # Embedded objects
r'<embed', # Embedded media
]
return any(re.search(pattern, content) for pattern in multimodal_indicators)
def calculate_avg_sentence_length(self, content: str) -> float:
sentences = re.split(r'[.!?]+', content)
sentences = [s.strip() for s in sentences if s.strip()]
if not sentences:
return 0
return sum(len(s.split()) for s in sentences) / len(sentences)
def calculate_vocabulary_richness(self, content: str) -> float:
words = content.lower().split()
if not words:
return 0
unique_words = set(words)
return len(unique_words) / len(words)
def detect_nested_structure(self, content: str) -> float:
# Detect nested lists, indented content, etc.
lines = content.split('\n')
indented_lines = 0
for line in lines:
if line.strip() and line.startswith(' '):
indented_lines += 1
return indented_lines / max(len(lines), 1)
```
### Strategy Selection Engine
```python
from abc import ABC, abstractmethod
from typing import Dict, Any
class ChunkingStrategy(ABC):
@abstractmethod
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
pass
class StrategySelector:
def __init__(self):
        # RecursiveStrategy, StructureAwareStrategy, and SemanticStrategy are
        # assumed to be implemented alongside the examples below
        self.strategies = {
'fixed_size': FixedSizeStrategy(),
'recursive': RecursiveStrategy(),
'structure_aware': StructureAwareStrategy(),
'semantic': SemanticStrategy(),
'adaptive': AdaptiveStrategy()
}
def select_strategy(self, analysis: DocumentAnalysis) -> str:
# Rule-based selection logic
if analysis.structure_score > 0.8 and analysis.doc_type in ['markdown', 'html', 'latex']:
return 'structure_aware'
elif analysis.complexity_score > 0.7 and analysis.estimated_tokens < 10000:
return 'semantic'
elif analysis.doc_type == 'code':
return 'structure_aware'
elif analysis.structure_score < 0.3:
return 'fixed_size'
elif analysis.complexity_score > 0.5:
return 'recursive'
else:
return 'adaptive'
def get_strategy(self, analysis: DocumentAnalysis) -> ChunkingStrategy:
strategy_name = self.select_strategy(analysis)
return self.strategies[strategy_name]
# Example strategy implementations
class FixedSizeStrategy(ChunkingStrategy):
def __init__(self, default_size=512, default_overlap=50):
self.default_size = default_size
self.default_overlap = default_overlap
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Adjust parameters based on analysis
if analysis.complexity_score > 0.7:
chunk_size = 1024
elif analysis.complexity_score < 0.3:
chunk_size = 256
else:
chunk_size = self.default_size
overlap = int(chunk_size * 0.1) # 10% overlap
# Implementation here...
return self._fixed_size_chunk(content, chunk_size, overlap)
    def _fixed_size_chunk(self, content: str, chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
        # Simple word-based sliding window; a tokenizer-based splitter such as
        # RecursiveCharacterTextSplitter can be substituted here
        words = content.split()
        step = max(chunk_size - overlap, 1)
        return [
            {'content': ' '.join(words[i:i + chunk_size]),
             'metadata': {'start_word': i}}
            for i in range(0, len(words), step)
        ]
class AdaptiveStrategy(ChunkingStrategy):
    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Combine multiple strategies based on content characteristics;
        # initialize both lists so the merge below never sees an unbound name
        structured_chunks: List[Dict[str, Any]] = []
        unstructured_chunks: List[Dict[str, Any]] = []
        if analysis.structure_score > 0.6:
            # Use structure-aware for structured parts
            structured_chunks = self._chunk_structured_parts(content, analysis)
        else:
            # Use fixed-size for unstructured parts
            unstructured_chunks = self._chunk_unstructured_parts(content, analysis)
        # Merge and optimize
        return self._merge_chunks(structured_chunks + unstructured_chunks)
def _chunk_structured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for structured content
pass
def _chunk_unstructured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for unstructured content
pass
def _merge_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
# Implementation for merging and optimizing chunks
pass
```
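Putting the analyzer and selector together (a sketch; `content` is any loaded document string):
```python
analyzer = DocumentAnalyzer()
selector = StrategySelector()

analysis = analyzer.analyze(content)
strategy = selector.get_strategy(analysis)
chunks = strategy.chunk(content, analysis)
print(f"{analysis.doc_type}: {len(chunks)} chunks via {selector.select_strategy(analysis)}")
```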
## Quality Assurance Framework
### Chunk Quality Metrics
```python
import re
from typing import List, Dict, Any
import numpy as np
class ChunkQualityAssessor:
def __init__(self):
self.quality_weights = {
'coherence': 0.3,
'completeness': 0.25,
'size_appropriateness': 0.2,
'semantic_similarity': 0.15,
'boundary_quality': 0.1
}
def assess_chunks(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> Dict[str, float]:
scores = {}
# Coherence: Do chunks make sense on their own?
scores['coherence'] = self._assess_coherence(chunks)
# Completeness: Do chunks preserve important information?
scores['completeness'] = self._assess_completeness(chunks, analysis)
# Size appropriateness: Are chunks within optimal size range?
scores['size_appropriateness'] = self._assess_size(chunks)
# Semantic similarity: Are chunks thematically consistent?
scores['semantic_similarity'] = self._assess_semantic_consistency(chunks)
# Boundary quality: Are chunk boundaries placed well?
scores['boundary_quality'] = self._assess_boundary_quality(chunks)
# Calculate overall quality score
overall_score = sum(
score * self.quality_weights[metric]
for metric, score in scores.items()
)
scores['overall'] = overall_score
return scores
def _assess_coherence(self, chunks: List[Dict[str, Any]]) -> float:
# Simple heuristic-based coherence assessment
coherence_scores = []
for chunk in chunks:
content = chunk['content']
# Check for complete sentences
sentences = re.split(r'[.!?]+', content)
complete_sentences = sum(1 for s in sentences if s.strip())
coherence = complete_sentences / max(len(sentences), 1)
coherence_scores.append(coherence)
return np.mean(coherence_scores)
def _assess_completeness(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if important structural elements are preserved
if analysis.doc_type in ['markdown', 'html']:
return self._assess_structure_preservation(chunks, analysis)
else:
return self._assess_content_preservation(chunks)
def _assess_structure_preservation(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if headings, lists, and other structural elements are preserved
preserved_elements = 0
total_elements = 0
for chunk in chunks:
content = chunk['content']
# Count preserved structural elements
headings = len(re.findall(r'^#+\s', content, re.MULTILINE))
lists = len(re.findall(r'^\s*[-*+]\s', content, re.MULTILINE))
preserved_elements += headings + lists
total_elements += 1 # Simplified count
return preserved_elements / max(total_elements, 1)
def _assess_content_preservation(self, chunks: List[Dict[str, Any]]) -> float:
# Simple check based on content ratio
total_content = ''.join(chunk['content'] for chunk in chunks)
# This would need comparison with original content
return 0.8 # Placeholder
def _assess_size(self, chunks: List[Dict[str, Any]]) -> float:
optimal_min = 100 # tokens
optimal_max = 1000 # tokens
size_scores = []
for chunk in chunks:
token_count = self._estimate_tokens(chunk['content'])
if optimal_min <= token_count <= optimal_max:
score = 1.0
elif token_count < optimal_min:
score = token_count / optimal_min
else:
score = max(0, 1 - (token_count - optimal_max) / optimal_max)
size_scores.append(score)
return np.mean(size_scores)
def _assess_semantic_consistency(self, chunks: List[Dict[str, Any]]) -> float:
# This would require embedding models for actual implementation
# Placeholder implementation
return 0.7
def _assess_boundary_quality(self, chunks: List[Dict[str, Any]]) -> float:
# Check if boundaries don't split important content
boundary_scores = []
for i, chunk in enumerate(chunks):
content = chunk['content']
# Check for incomplete sentences at boundaries
if not content.strip().endswith(('.', '!', '?', '>', '}')):
boundary_scores.append(0.5)
else:
boundary_scores.append(1.0)
return np.mean(boundary_scores)
def _estimate_tokens(self, content: str) -> int:
# Simple token estimation
return len(content.split()) * 4 // 3 # Rough approximation
```
## Error Handling and Edge Cases
### Robust Error Handling
```python
import logging
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class ChunkingError:
error_type: str
message: str
chunk_index: Optional[int] = None
recovery_action: Optional[str] = None
class ChunkingErrorHandler:
def __init__(self):
self.logger = logging.getLogger(__name__)
self.error_handlers = {
'empty_content': self._handle_empty_content,
'oversized_chunk': self._handle_oversized_chunk,
'encoding_error': self._handle_encoding_error,
'memory_error': self._handle_memory_error,
'structure_parsing_error': self._handle_structure_parsing_error
}
def handle_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
error_type = self._classify_error(error)
handler = self.error_handlers.get(error_type, self._handle_generic_error)
return handler(error, context)
def _classify_error(self, error: Exception) -> str:
if isinstance(error, ValueError) and 'empty' in str(error).lower():
return 'empty_content'
elif isinstance(error, MemoryError):
return 'memory_error'
elif isinstance(error, UnicodeError):
return 'encoding_error'
elif 'too large' in str(error).lower():
return 'oversized_chunk'
elif 'parsing' in str(error).lower():
return 'structure_parsing_error'
else:
return 'generic_error'
def _handle_empty_content(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Empty content encountered: {error}")
return ChunkingError(
error_type='empty_content',
message=str(error),
recovery_action='skip_empty_content'
)
def _handle_oversized_chunk(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Oversized chunk detected: {error}")
return ChunkingError(
error_type='oversized_chunk',
message=str(error),
chunk_index=context.get('chunk_index'),
recovery_action='reduce_chunk_size'
)
def _handle_encoding_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Encoding error: {error}")
return ChunkingError(
error_type='encoding_error',
message=str(error),
recovery_action='fallback_encoding'
)
def _handle_memory_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Memory error during chunking: {error}")
return ChunkingError(
error_type='memory_error',
message=str(error),
recovery_action='process_in_batches'
)
def _handle_structure_parsing_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Structure parsing failed: {error}")
return ChunkingError(
error_type='structure_parsing_error',
message=str(error),
recovery_action='fallback_to_fixed_size'
)
def _handle_generic_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Unexpected error during chunking: {error}")
return ChunkingError(
error_type='generic_error',
message=str(error),
recovery_action='skip_and_continue'
)
```
## Performance Optimization
### Caching and Memoization
```python
import hashlib
import json
import logging
import pickle
from typing import Any, Dict, List, Optional

import redis
class ChunkingCache:
def __init__(self, redis_url: Optional[str] = None):
if redis_url:
self.redis_client = redis.from_url(redis_url)
else:
self.redis_client = None
self.local_cache = {}
def _generate_cache_key(self, content: str, strategy: str, params: Dict[str, Any]) -> str:
content_hash = hashlib.md5(content.encode()).hexdigest()
params_str = json.dumps(params, sort_keys=True)
params_hash = hashlib.md5(params_str.encode()).hexdigest()
return f"chunking:{strategy}:{content_hash}:{params_hash}"
def get(self, content: str, strategy: str, params: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
cache_key = self._generate_cache_key(content, strategy, params)
# Try local cache first
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Try Redis cache
if self.redis_client:
try:
cached_data = self.redis_client.get(cache_key)
if cached_data:
chunks = pickle.loads(cached_data)
self.local_cache[cache_key] = chunks # Cache locally too
return chunks
except Exception as e:
logging.warning(f"Redis cache error: {e}")
return None
def set(self, content: str, strategy: str, params: Dict[str, Any], chunks: List[Dict[str, Any]]) -> None:
cache_key = self._generate_cache_key(content, strategy, params)
# Store in local cache
self.local_cache[cache_key] = chunks
# Store in Redis cache
if self.redis_client:
try:
cached_data = pickle.dumps(chunks)
self.redis_client.setex(cache_key, 3600, cached_data) # 1 hour TTL
except Exception as e:
logging.warning(f"Redis cache set error: {e}")
def clear_local_cache(self):
self.local_cache.clear()
def clear_redis_cache(self):
if self.redis_client:
pattern = "chunking:*"
keys = self.redis_client.keys(pattern)
if keys:
self.redis_client.delete(*keys)
```
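Typical read-through usage (a sketch; `chunk_fn` stands in for any chunking callable):
```python
cache = ChunkingCache(redis_url="redis://localhost:6379/0")
params = {'chunk_size': 512, 'overlap': 50}

chunks = cache.get(content, 'fixed_size', params)
if chunks is None:  # cache miss: compute once, then store
    chunks = chunk_fn(content, **params)
    cache.set(content, 'fixed_size', params, chunks)
```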
### Batch Processing
```python
import asyncio
import concurrent.futures
import logging
from typing import Any, Callable, Dict, List
class BatchChunkingProcessor:
def __init__(self, max_workers: int = 4, batch_size: int = 10):
self.max_workers = max_workers
self.batch_size = batch_size
def process_documents_batch(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process multiple documents in parallel"""
results = []
# Process in batches to avoid memory issues
for i in range(0, len(documents), self.batch_size):
batch = documents[i:i + self.batch_size]
with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_doc = {
executor.submit(chunking_function, doc): doc
for doc in batch
}
batch_results = []
for future in concurrent.futures.as_completed(future_to_doc):
try:
chunks = future.result()
batch_results.append(chunks)
except Exception as e:
logging.error(f"Error processing document: {e}")
batch_results.append([]) # Empty result for failed processing
results.extend(batch_results)
return results
async def process_documents_async(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process documents asynchronously"""
semaphore = asyncio.Semaphore(self.max_workers)
async def process_single_document(doc: str) -> List[Dict[str, Any]]:
async with semaphore:
# Run the synchronous chunking function in an executor
loop = asyncio.get_event_loop()
return await loop.run_in_executor(None, chunking_function, doc)
tasks = [process_single_document(doc) for doc in documents]
return await asyncio.gather(*tasks, return_exceptions=True)
```
## Monitoring and Observability
### Metrics Collection
```python
import time
from dataclasses import dataclass
from typing import Dict, Any, List
from collections import defaultdict
@dataclass
class ChunkingMetrics:
total_documents: int
total_chunks: int
avg_chunk_size: float
processing_time: float
memory_usage: float
error_count: int
strategy_distribution: Dict[str, int]
class MetricsCollector:
    def __init__(self):
        self.metrics = defaultdict(list)
        # Separate counter: storing counts inside a defaultdict(list) would fail
        self.strategy_counts = defaultdict(int)
        self.start_time = None
def start_timing(self):
self.start_time = time.time()
def end_timing(self) -> float:
if self.start_time:
duration = time.time() - self.start_time
self.metrics['processing_time'].append(duration)
self.start_time = None
return duration
return 0.0
def record_chunk_count(self, count: int):
self.metrics['chunk_count'].append(count)
def record_chunk_size(self, size: int):
self.metrics['chunk_size'].append(size)
    def record_strategy_usage(self, strategy: str):
        self.strategy_counts[strategy] += 1
def record_error(self, error_type: str):
self.metrics['errors'].append(error_type)
def record_memory_usage(self, memory_mb: float):
self.metrics['memory_usage'].append(memory_mb)
def get_summary(self) -> ChunkingMetrics:
return ChunkingMetrics(
total_documents=len(self.metrics['processing_time']),
total_chunks=sum(self.metrics['chunk_count']),
avg_chunk_size=sum(self.metrics['chunk_size']) / max(len(self.metrics['chunk_size']), 1),
processing_time=sum(self.metrics['processing_time']),
memory_usage=sum(self.metrics['memory_usage']) / max(len(self.metrics['memory_usage']), 1),
error_count=len(self.metrics['errors']),
            strategy_distribution=dict(self.strategy_counts)
)
    def reset(self):
        self.metrics.clear()
        self.strategy_counts.clear()
        self.start_time = None
```
This implementation guide provides a comprehensive foundation for building robust, scalable chunking systems that can handle various document types and use cases while maintaining high quality and performance.


@@ -0,0 +1,366 @@
# Key Research Papers and Findings
This document summarizes important research papers and findings related to chunking strategies for RAG systems.
## Seminal Papers
### "Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG" (arXiv:2504.19754)
**Key Findings**:
- Page-level chunking achieved highest average accuracy (0.648) with lowest variance across different query types
- Optimal chunk size varies significantly by document type and query complexity
- Factoid queries perform better with smaller chunks (256-512 tokens)
- Complex analytical queries benefit from larger chunks (1024+ tokens)
**Methodology**:
- Evaluated 7 different chunking strategies across multiple document types
- Tested with both factoid and analytical queries
- Measured end-to-end RAG performance
**Practical Implications**:
- Start with page-level chunking for general-purpose RAG systems
- Adapt chunk size based on expected query patterns
- Consider hybrid approaches for mixed query types
### "Lost in the Middle: How Language Models Use Long Contexts"
**Key Findings**:
- Language models tend to pay more attention to information at the beginning and end of context
- Information in the middle of long contexts is often ignored
- Performance degradation is most severe for centrally located information
**Practical Implications**:
- Place most important information at chunk boundaries
- Consider chunk overlap to ensure important context appears multiple times
- Use ranking to prioritize relevant chunks for inclusion in context
### "Grounded Language Learning in a Simulated 3D World"
**Related Concepts**:
- Importance of grounding text in visual/contextual information
- Multi-modal learning approaches for better understanding
**Relevance to Chunking**:
- Supports contextual chunking approaches that preserve visual/contextual relationships
- Validates importance of maintaining document structure and relationships
## Industry Research
### NVIDIA Research: "Finding the Best Chunking Strategy for Accurate AI Responses"
**Key Findings**:
- Page-level chunking outperformed sentence and paragraph-level approaches
- Fixed-size chunking showed consistent but suboptimal performance
- Semantic chunking provided improvements for complex documents
**Technical Details**:
- Tested chunk sizes from 128 to 2048 tokens
- Evaluated across financial, technical, and legal documents
- Measured both retrieval accuracy and generation quality
**Recommendations**:
- Use 512-1024 token chunks as starting point
- Implement adaptive chunking based on document complexity
- Consider page boundaries as natural chunk separators
### Cohere Research: "Effective Chunking Strategies for RAG"
**Key Findings**:
- Recursive character splitting provides good balance of performance and simplicity
- Document structure awareness improves retrieval by 15-20%
- Overlap of 10-20% provides optimal context preservation
**Methodology**:
- Compared 12 chunking strategies across 6 document types
- Measured retrieval precision, recall, and F1-score
- Tested with both dense and sparse retrieval
**Best Practices Identified**:
- Start with recursive character splitting with 10-20% overlap
- Preserve document structure (headings, lists, tables)
- Customize chunk size based on embedding model context window
### Anthropic: "Contextual Retrieval"
**Key Innovation**:
- Enhance each chunk with LLM-generated contextual information before embedding
- Improves retrieval precision by 25-30% for complex documents
- Particularly effective for technical and academic content
**Implementation Approach** (sketched in code after this list):
1. Split document using traditional methods
2. For each chunk, generate contextual information using LLM
3. Prepend context to chunk before embedding
4. Use hybrid search (dense + sparse) with weighted ranking
**Trade-offs**:
- Significant computational overhead (2-3x processing time)
- Higher embedding storage requirements
- Improved retrieval precision justifies cost for high-value applications
## Algorithmic Advances
### Semantic Chunking Algorithms
#### "Semantic Segmentation of Text Documents"
**Core Idea**: Use cosine similarity between consecutive sentence embeddings to identify natural boundaries.
**Algorithm**:
1. Split document into sentences
2. Generate embeddings for each sentence
3. Calculate similarity between consecutive sentences
4. Create boundaries where similarity drops below threshold
5. Merge short segments with neighbors
**Performance**: 20-30% improvement in retrieval relevance over fixed-size chunking for technical documents.
#### "Hierarchical Semantic Chunking"
**Core Idea**: Multi-level semantic segmentation for document organization.
**Algorithm**:
1. Document-level semantic analysis
2. Section-level boundary detection
3. Paragraph-level segmentation
4. Sentence-level refinement
**Benefits**: Maintains document hierarchy while adapting to semantic structure.
### Advanced Embedding Techniques
#### "Late Chunking: Contextual Chunk Embeddings"
**Core Innovation**: Generate embeddings for entire document first, then create chunk embeddings from token-level embeddings.
**Advantages**:
- Preserves global document context
- Reduces context fragmentation
- Better for documents with complex inter-relationships
**Requirements**:
- Long-context embedding models (8k+ tokens)
- Significant computational resources
- Specialized implementation
#### "Hierarchical Embedding Retrieval"
**Approach**: Create embeddings at multiple granularities (document, section, paragraph, sentence).
**Implementation**:
1. Generate embeddings at each level
2. Store in hierarchical vector database
3. Query at appropriate granularity based on information needs
**Performance**: 15-25% improvement in precision for complex queries.
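A rough sketch of the multi-granularity idea, assuming a sentence-transformers model and a naive query-routing heuristic (both are illustrative choices, not prescriptions from the paper):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_hierarchical_index(document, sections):
    """Embed the same content at three granularities so queries can be
    answered at the level that matches their scope."""
    sentences = [s.strip() for sec in sections for s in sec.split(".") if s.strip()]
    return {
        "document": model.encode([document]),
        "section": model.encode(sections),
        "sentence": model.encode(sentences),
    }

def pick_level(query):
    # Assumed routing rule: short factoid queries -> sentences,
    # broader analytical queries -> sections
    return "sentence" if len(query.split()) <= 8 else "section"
```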
## Evaluation Methodologies
### Retrieval-Augmented Generation Assessment Frameworks
#### RAGAS Framework
**Metrics**:
- **Faithfulness**: Consistency between generated answer and retrieved context
- **Answer Relevancy**: Relevance of generated answer to the question
- **Context Relevancy**: Relevance of retrieved context to the question
- **Context Recall**: Coverage of relevant information in retrieved context
**Evaluation Process**:
1. Generate questions from document corpus
2. Retrieve relevant chunks using different strategies
3. Generate answers using retrieved chunks
4. Evaluate using automated metrics and human judgment
#### ARES Framework
**Innovation**: Automated evaluation using synthetic questions and LLM-based assessment.
**Key Features**:
- Generates diverse question types (factoid, analytical, comparative)
- Uses LLMs to evaluate answer quality
- Provides scalable evaluation without human annotation
### Benchmark Datasets
#### Natural Questions (NQ)
**Description**: Real user questions from Google Search with relevant Wikipedia passages.
**Relevance**: Natural language queries with authentic relevance judgments.
#### MS MARCO
**Description**: Large-scale passage ranking dataset with real search queries.
**Relevance**: High-quality relevance judgments for passage retrieval.
#### HotpotQA
**Description**: Multi-hop question answering requiring information from multiple documents.
**Relevance**: Tests ability to retrieve and synthesize information from multiple chunks.
## Domain-Specific Research
### Medical Documents
#### "Optimal Chunking for Medical Question Answering"
**Key Findings**:
- Medical terminology requires specialized handling
- Section-based chunking (History, Diagnosis, Treatment) most effective
- Preserving doctor-patient dialogue context crucial
**Recommendations**:
- Use medical-specific tokenizers
- Preserve section headers and structure (see the section-splitting sketch below)
- Maintain temporal relationships in medical histories
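As a concrete illustration of the section-based approach, this sketch keys on a few common clinical headings; the heading list and regex are assumptions for demonstration, not a validated clinical schema.
```python
import re

# Illustrative clinical headings - extend for your note format
MEDICAL_SECTIONS = r"(History|Diagnosis|Treatment|Medications|Plan)\s*:"

def split_medical_note(note):
    parts = re.split(MEDICAL_SECTIONS, note)
    # re.split keeps the captured headings at odd indices
    return [
        f"{parts[i]}: {parts[i + 1].strip()}"
        for i in range(1, len(parts) - 1, 2)
    ]

note = "History: fall from ladder. Diagnosis: wrist fracture. Treatment: cast for six weeks."
print(split_medical_note(note))
```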
### Legal Documents
#### "Chunking Strategies for Legal Document Analysis"
**Key Findings**:
- Legal citations and cross-references require special handling
- Contract clause boundaries serve as natural chunk separators
- Case law benefits from hierarchical chunking
**Best Practices**:
- Preserve legal citation structure
- Use clause and section boundaries
- Maintain context for legal definitions and references
### Financial Documents
#### "SEC Filing Chunking for Financial Analysis"
**Key Findings**:
- Table preservation critical for financial data
- XBRL tagging provides natural segmentation
- Risk factors sections benefit from specialized treatment
**Approach**:
- Preserve complete tables when possible
- Use XBRL tags for structured data
- Create specialized chunks for risk sections
## Emerging Trends
### Multi-Modal Chunking
#### "Integrating Text, Tables, and Images in RAG Systems"
**Innovation**: Unified chunking approach for mixed-modal content.
**Approach**:
- Extract and describe images using vision models
- Preserve table structure and relationships
- Create unified embeddings for mixed content
**Results**: 35% improvement in complex document understanding.
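A structural sketch of the unified approach; `describe_image` stands in for whatever vision-model wrapper is available and is entirely hypothetical here.
```python
def chunk_mixed_content(elements, describe_image=None):
    """Normalize text, table, and image elements into text chunks that can
    share a single embedding space."""
    chunks = []
    for el in elements:
        if el["type"] == "text":
            chunks.append(el["content"])
        elif el["type"] == "table":
            # Keep the table serialization intact rather than splitting rows
            chunks.append("TABLE:\n" + el["content"])
        elif el["type"] == "image" and describe_image is not None:
            chunks.append("IMAGE DESCRIPTION: " + describe_image(el["path"]))
    return chunks
```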
### Adaptive Chunking
#### "Machine Learning-Based Chunk Size Optimization"
**Core Idea**: Use ML models to predict optimal chunking parameters.
**Features**:
- Document length and complexity
- Query type distribution
- Embedding model characteristics
- Performance requirements
**Benefits**: Dynamic optimization based on use case and content.
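Absent a trained model, a heuristic stand-in shows the shape of the idea; every threshold below is an illustrative assumption, not a learned parameter.
```python
def suggest_chunk_params(doc_text, query_type="factoid"):
    """Pick chunking parameters from simple document features."""
    digit_ratio = sum(ch.isdigit() for ch in doc_text) / max(len(doc_text), 1)
    base = 256 if query_type == "factoid" else 1024
    if digit_ratio > 0.05:
        base = min(base, 512)  # keep numerically dense content tightly scoped
    return {
        "chunk_size": base,
        "chunk_overlap": int(base * 0.15),  # within the 10-20% guidance above
    }

print(suggest_chunk_params("Revenue grew 14% to $2.1B in Q3...", query_type="analytical"))
```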
### Real-time Chunking
#### "Streaming Chunking for Live Document Processing"
**Innovation**: Process documents as they become available.
**Techniques**:
- Incremental boundary detection
- Dynamic chunk size adjustment
- Context preservation across chunks
**Applications**: Live news feeds, social media analysis, meeting transcripts.
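A minimal incremental chunker, assuming the stream delivers sentence-sized pieces; the buffer size and carry-over count are illustrative defaults.
```python
class StreamingChunker:
    """Emit chunks as soon as enough streamed text accumulates, carrying
    trailing sentences forward so context survives each boundary."""
    def __init__(self, max_chars=800, carry=1):
        self.max_chars = max_chars
        self.carry = carry
        self.buffer = []

    def feed(self, sentence):
        self.buffer.append(sentence)
        if sum(len(s) for s in self.buffer) >= self.max_chars:
            chunk = " ".join(self.buffer)
            self.buffer = self.buffer[-self.carry:]  # overlap across chunks
            return chunk
        return None

chunker = StreamingChunker(max_chars=60)
for piece in ["First update.", "Second update arrives.", "Third, longer update text here."]:
    emitted = chunker.feed(piece)
    if emitted:
        print(emitted)
```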
## Implementation Challenges
### Computational Efficiency
#### "Scalable Chunking for Large Document Collections"
**Challenges**:
- Processing millions of documents efficiently
- Memory usage optimization
- Distributed processing requirements
**Solutions**:
- Batch processing with parallel execution (a minimal sketch follows this list)
- Streaming approaches for large documents
- Distributed chunking with load balancing
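A minimal parallelism sketch with the standard library; `chunk_one` is a placeholder for any chunker in this document, and the worker and batch counts are arbitrary.
```python
from concurrent.futures import ProcessPoolExecutor

def chunk_one(doc):
    # Placeholder: substitute any chunking strategy from this document
    return [doc[i:i + 512] for i in range(0, len(doc), 512)]

def chunk_collection(docs, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches documents per worker to cut IPC overhead
        return list(pool.map(chunk_one, docs, chunksize=16))
```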
### Quality Assurance
#### "Evaluating Chunk Quality at Scale"
**Challenges**:
- Automated quality assessment
- Detecting poor chunk boundaries
- Maintaining consistency across document types
**Approaches**:
- Heuristic-based quality metrics (sketched after this list)
- LLM-based evaluation
- Human-in-the-loop validation
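The heuristic route can be as simple as a few descriptive statistics; the signals below (sentence-final punctuation, a 50-character "tiny chunk" cutoff) are assumptions to tune per corpus.
```python
import statistics

def chunk_quality_report(chunks):
    """Cheap signals for flagging poor chunk boundaries at scale."""
    lengths = [len(c) for c in chunks]
    return {
        "n_chunks": len(chunks),
        "length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        # Chunks that stop mid-sentence often indicate a bad boundary
        "dangling_endings": sum(
            not c.rstrip().endswith((".", "!", "?")) for c in chunks
        ),
        "tiny_chunks": sum(length < 50 for length in lengths),
    }
```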
## Future Research Directions
### Context-Aware Chunking
**Open Questions**:
- How to optimally preserve cross-chunk relationships?
- Can we predict chunk quality without human evaluation?
- What is the optimal balance between size and context?
### Domain Adaptation
**Research Areas**:
- Automatic domain detection and adaptation
- Transfer learning across domains
- Zero-shot chunking for new document types
### Evaluation Standards
**Needs**:
- Standardized evaluation benchmarks
- Cross-paper comparison methodologies
- Real-world performance metrics
## Practical Recommendations Based on Research
### Starting Points
1. **For General RAG Systems**: Page-level or recursive character chunking with 512-1024 tokens and 10-20% overlap
2. **For Technical Documents**: Structure-aware chunking with semantic boundary detection
3. **For High-Value Applications**: Contextual retrieval with LLM-generated context
### Evolution Strategy
1. **Begin**: Simple fixed-size chunking (512 tokens)
2. **Improve**: Add document structure awareness
3. **Optimize**: Implement semantic boundaries
4. **Advanced**: Consider contextual retrieval for critical use cases
### Key Success Factors
1. **Match strategy to document type and query patterns**
2. **Preserve document structure when beneficial**
3. **Use overlap to maintain context across boundaries**
4. **Monitor both accuracy and computational costs**
5. **Iterate based on specific use case requirements**
This research foundation provides evidence-based guidance for implementing effective chunking strategies across various domains and use cases.

# Detailed Chunking Strategies
This document provides comprehensive implementation details for all chunking strategies mentioned in the main skill.
## Level 1: Fixed-Size Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class FixedSizeChunker:
def __init__(self, chunk_size=512, chunk_overlap=50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
def chunk(self, documents):
return self.splitter.split_documents(documents)
```
### Parameter Recommendations
| Use Case | Chunk Size | Overlap | Rationale |
|----------|------------|---------|-----------|
| Factoid Queries | 256 | 25 | Small chunks for precise answers |
| General Q&A | 512 | 50 | Balanced approach for most cases |
| Analytical Queries | 1024 | 100 | Larger context for complex analysis |
| Code Documentation | 300 | 30 | Preserve code context while maintaining focus |
### Best Practices
- Start with 512 tokens and 10-20% overlap
- Adjust based on embedding model context window
- Use overlap for queries where context might span boundaries
- Measure length in tokens rather than characters when the embedding model budgets in tokens (see the sketch below)
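When the budget is in tokens, the splitter can measure length with the model's own tokenizer. A minimal sketch using tiktoken (the model name is illustrative):
```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # now measured in tokens, not characters
    chunk_overlap=50,
    length_function=lambda text: len(enc.encode(text)),
    separators=["\n\n", "\n", " ", ""],
)
```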
## Level 2: Recursive Character Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class RecursiveChunker:
def __init__(self, chunk_size=512, separators=None):
self.chunk_size = chunk_size
self.separators = separators or ["\n\n", "\n", " ", ""]
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=0,
length_function=len,
separators=self.separators
)
def chunk(self, text):
return self.splitter.create_documents([text])
# Document-specific configurations
def get_chunker_for_document_type(doc_type):
configurations = {
"markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
"html": ["</div>", "</p>", "\n\n", "\n", " ", ""],
"code": ["\n\n", "\n", " ", ""],
"plain": ["\n\n", "\n", " ", ""]
}
return RecursiveChunker(separators=configurations.get(doc_type, ["\n\n", "\n", " ", ""]))
```
### Customization Guidelines
- **Markdown**: Use headings as primary separators
- **HTML**: Use block-level tags as separators
- **Code**: Preserve function and class boundaries
- **Academic papers**: Prioritize paragraph and section breaks
## Level 3: Structure-Aware Chunking
### Markdown Documents
```python
import markdown
from bs4 import BeautifulSoup
class MarkdownChunker:
def __init__(self, max_chunk_size=512):
self.max_chunk_size = max_chunk_size
def chunk(self, markdown_text):
html = markdown.markdown(markdown_text)
soup = BeautifulSoup(html, 'html.parser')
chunks = []
current_chunk = ""
current_heading = "Introduction"
for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'pre', 'table']):
if element.name.startswith('h'):
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_heading = element.get_text().strip()
current_chunk = f"{element}\n"
elif element.name in ['pre', 'table']:
# Preserve code blocks and tables intact
if len(current_chunk) + len(str(element)) > self.max_chunk_size:
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_chunk = f"{element}\n"
else:
current_chunk += f"{element}\n"
            else:
                # Flush before appending so paragraphs also respect max_chunk_size
                if len(current_chunk) + len(str(element)) > self.max_chunk_size and current_chunk.strip():
                    chunks.append({
                        "content": current_chunk.strip(),
                        "heading": current_heading
                    })
                    current_chunk = ""
                current_chunk += f"{element}\n"
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
return chunks
```
### Code Documents
```python
import ast
import re
class CodeChunker:
def __init__(self, language='python'):
self.language = language
def chunk_python(self, code):
tree = ast.parse(code)
chunks = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
start_line = node.lineno - 1
end_line = node.end_lineno if hasattr(node, 'end_lineno') else start_line + 10
lines = code.split('\n')
chunk_lines = lines[start_line:end_line]
chunks.append('\n'.join(chunk_lines))
return chunks
def chunk_javascript(self, code):
        # Regex fallback for languages without an AST parser; these naive
        # patterns do not handle nested braces and will truncate functions
        # that contain inner blocks
        function_pattern = r'(function\s+\w+\s*\([^)]*\)\s*\{[^}]*\})'
class_pattern = r'(class\s+\w+\s*\{[^}]*\})'
patterns = [function_pattern, class_pattern]
chunks = []
for pattern in patterns:
matches = re.finditer(pattern, code, re.MULTILINE | re.DOTALL)
for match in matches:
chunks.append(match.group(1))
return chunks
def chunk(self, code):
if self.language == 'python':
return self.chunk_python(code)
elif self.language == 'javascript':
return self.chunk_javascript(code)
else:
# Fallback to line-based chunking
return self.chunk_by_lines(code)
def chunk_by_lines(self, code, max_lines=50):
lines = code.split('\n')
chunks = []
for i in range(0, len(lines), max_lines):
chunk = '\n'.join(lines[i:i+max_lines])
chunks.append(chunk)
return chunks
```
### Tabular Data
```python
import pandas as pd
from io import StringIO  # needed when table_data arrives as a CSV string
class TableChunker:
def __init__(self, max_rows=100, summary_rows=5):
self.max_rows = max_rows
self.summary_rows = summary_rows
def chunk(self, table_data):
if isinstance(table_data, str):
df = pd.read_csv(StringIO(table_data))
else:
df = table_data
chunks = []
if len(df) <= self.max_rows:
# Small table - keep intact
chunks.append({
"type": "full_table",
"content": df.to_string(),
"metadata": {
"rows": len(df),
"columns": len(df.columns)
}
})
else:
# Large table - create summary + chunks
summary = df.head(self.summary_rows)
chunks.append({
"type": "table_summary",
"content": f"Table Summary ({len(df)} rows, {len(df.columns)} columns):\n{summary.to_string()}",
"metadata": {
"total_rows": len(df),
"summary_rows": self.summary_rows,
"columns": list(df.columns)
}
})
# Chunk the remaining data
for i in range(self.summary_rows, len(df), self.max_rows):
chunk_df = df.iloc[i:i+self.max_rows]
chunks.append({
"type": "table_chunk",
"content": f"Rows {i+1}-{min(i+self.max_rows, len(df))}:\n{chunk_df.to_string()}",
"metadata": {
"start_row": i + 1,
"end_row": min(i + self.max_rows, len(df)),
"columns": list(df.columns)
}
})
return chunks
```
## Level 4: Semantic Chunking
### Implementation
```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class SemanticChunker:
def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.8, buffer_size=3):
self.model = SentenceTransformer(model_name)
self.similarity_threshold = similarity_threshold
self.buffer_size = buffer_size
def split_into_sentences(self, text):
# Simple sentence splitting - can be enhanced with nltk/spacy
sentences = re.split(r'[.!?]+', text)
return [s.strip() for s in sentences if s.strip()]
def chunk(self, text):
sentences = self.split_into_sentences(text)
if len(sentences) <= self.buffer_size:
return [text]
# Create embeddings
embeddings = self.model.encode(sentences)
chunks = []
current_chunk_sentences = []
for i in range(len(sentences)):
current_chunk_sentences.append(sentences[i])
# Check if we should create a boundary
if i < len(sentences) - 1:
similarity = cosine_similarity(
[embeddings[i]],
[embeddings[i + 1]]
)[0][0]
if similarity < self.similarity_threshold and len(current_chunk_sentences) >= 2:
chunks.append(' '.join(current_chunk_sentences))
current_chunk_sentences = []
# Add remaining sentences
if current_chunk_sentences:
chunks.append(' '.join(current_chunk_sentences))
return chunks
```
### Parameter Tuning
| Parameter | Range | Effect |
|-----------|-------|--------|
| similarity_threshold | 0.5-0.9 | Higher values create more chunks |
| buffer_size | 1-10 | Larger buffers provide more context |
| model_name | Various | Different models for different domains |
### Optimization Tips
- Use domain-specific models for specialized content
- Adjust threshold based on content complexity
- Cache embeddings for repeated processing (see the wrapper sketch below)
- Consider batch processing for large documents
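For the caching tip, a small memoizing wrapper around any sentence-transformers-style encoder is often enough. This is a sketch; the SHA-1 keying is an arbitrary choice.
```python
import hashlib

class CachedEncoder:
    """Memoize sentence embeddings so repeated chunking passes skip
    redundant encode calls."""
    def __init__(self, model):
        self.model = model
        self._cache = {}

    def encode(self, sentences):
        missing = [s for s in sentences if self._key(s) not in self._cache]
        if missing:
            for s, emb in zip(missing, self.model.encode(missing)):
                self._cache[self._key(s)] = emb
        return [self._cache[self._key(s)] for s in sentences]

    @staticmethod
    def _key(sentence):
        return hashlib.sha1(sentence.encode("utf-8")).hexdigest()
```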
## Level 5: Advanced Contextual Methods
### Late Chunking
```python
import torch
from transformers import AutoTokenizer, AutoModel
class LateChunker:
def __init__(self, model_name="microsoft/DialoGPT-medium"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModel.from_pretrained(model_name)
def chunk(self, text, chunk_size=512):
# Tokenize entire document
tokens = self.tokenizer(text, return_tensors="pt", truncation=False)
# Get token-level embeddings
with torch.no_grad():
outputs = self.model(**tokens, output_hidden_states=True)
token_embeddings = outputs.last_hidden_state[0]
# Create chunk embeddings from token embeddings
chunks = []
for i in range(0, len(token_embeddings), chunk_size):
chunk_tokens = token_embeddings[i:i+chunk_size]
chunk_embedding = torch.mean(chunk_tokens, dim=0)
chunks.append({
"content": self.tokenizer.decode(tokens["input_ids"][0][i:i+chunk_size]),
"embedding": chunk_embedding.numpy()
})
return chunks
```
### Contextual Retrieval
```python
import openai
class ContextualChunker:
def __init__(self, api_key):
self.client = openai.OpenAI(api_key=api_key)
def generate_context(self, chunk, full_document):
prompt = f"""
Given the following document and a chunk from it, provide a brief context
that helps understand the chunk's meaning within the full document.
Document:
{full_document[:2000]}...
Chunk:
{chunk}
Context (max 50 words):
"""
response = self.client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
max_tokens=100,
temperature=0
)
return response.choices[0].message.content.strip()
def chunk_with_context(self, text, base_chunker):
# First create base chunks
base_chunks = base_chunker.chunk(text)
# Then add context to each chunk
contextualized_chunks = []
for chunk in base_chunks:
context = self.generate_context(chunk.page_content, text)
contextualized_content = f"Context: {context}\n\nContent: {chunk.page_content}"
contextualized_chunks.append({
"content": contextualized_content,
"original_content": chunk.page_content,
"context": context
})
return contextualized_chunks
```
## Performance Considerations
### Computational Cost Analysis
| Strategy | Time Complexity | Space Complexity | Relative Cost |
|----------|-----------------|------------------|---------------|
| Fixed-Size | O(n) | O(n) | Low |
| Recursive | O(n) | O(n) | Low |
| Structure-Aware | O(n log n) | O(n) | Medium |
| Semantic | O(n²) | O(n²) | High |
| Late Chunking | O(n) | O(n) | Very High |
| Contextual | O(n²) | O(n²) | Very High |
### Optimization Strategies
1. **Parallel Processing**: Process chunks concurrently when possible
2. **Caching**: Store embeddings and intermediate results
3. **Batch Operations**: Group similar operations together
4. **Progressive Loading**: Process large documents in streaming fashion
5. **Model Selection**: Choose appropriate models for task complexity

# Recommended Libraries and Frameworks
This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.
## Core Chunking Libraries
### LangChain
**Overview**: Comprehensive framework for building applications with large language models, includes robust text splitting utilities.
**Installation**:
```bash
pip install langchain langchain-text-splitters
```
**Key Features**:
- Multiple text splitting strategies
- Integration with various document loaders
- Support for different content types (code, markdown, etc.)
- Customizable separators and parameters
**Example Usage**:
```python
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
TokenTextSplitter,
MarkdownTextSplitter,
PythonCodeTextSplitter
)
# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(large_text)
# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
```
**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types
**Cons**:
- Can be heavy dependency for simple use cases
- Some advanced features require LangChain ecosystem
### LlamaIndex
**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.
**Installation**:
```bash
pip install llama-index
```
**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases
**Example Usage**:
```python
from llama_index.core.node_parser import (
SentenceSplitter,
SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
# Basic sentence splitting
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=20
)
# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
buffer_size=1,
breakpoint_percentile_threshold=95,
embed_model=embed_model
)
# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```
**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support
**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires embedding model setup
### Unstructured
**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.
**Installation**:
```bash
pip install "unstructured[pdf,png,jpg]"
```
**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing
**Example Usage**:
```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
# Partition document by type
elements = partition(filename="document.pdf")
# Chunk by title/heading structure
chunks = chunk_by_title(
elements,
combine_text_under_n_chars=2000,
max_characters=10000,
new_after_n_chars=1500,
multipage_sections=True
)
# Access chunked content
for chunk in chunks:
print(f"Category: {chunk.category}")
print(f"Content: {chunk.text[:200]}...")
```
**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities
**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats
## Text Processing Libraries
### NLTK (Natural Language Toolkit)
**Installation**:
```bash
pip install nltk
```
**Key Features**:
- Sentence tokenization
- Language detection
- Text preprocessing
- Linguistic analysis
**Example Usage**:
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
# Download required data
nltk.download('punkt')
nltk.download('stopwords')
# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```
### spaCy
**Installation**:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```
**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection
**Example Usage**:
```python
import spacy
# Load language model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp("This is a sample sentence. This is another sentence.")
# Extract sentences
sentences = [sent.text for sent in doc.sents]
# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
# Dependency parsing for better chunking
for token in doc:
print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```
### Sentence Transformers
**Installation**:
```bash
pip install sentence-transformers
```
**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training
**Example Usage**:
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)
# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])
# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)
boundaries = [0]
for i in range(1, len(sentences)):
similarity = util.cos_sim(embeddings[i-1], embeddings[i])
if similarity < threshold:
boundaries.append(i)
return boundaries
```
## Vector Databases and Search
### ChromaDB
**Installation**:
```bash
pip install chromadb
```
**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering
**Example Usage**:
```python
import chromadb
from chromadb.utils import embedding_functions
# Initialize client
client = chromadb.Client()
# Create collection
collection = client.create_collection(
name="document_chunks",
embedding_function=embedding_functions.DefaultEmbeddingFunction()
)
# Add chunks
collection.add(
documents=[chunk["content"] for chunk in chunks],
metadatas=[chunk.get("metadata", {}) for chunk in chunks],
ids=[chunk["id"] for chunk in chunks]
)
# Search
results = collection.query(
query_texts=["What is chunking?"],
n_results=5
)
```
### Pinecone
**Installation**:
```bash
pip install pinecone
```
**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure
**Example Usage**:
```python
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
# Initialize client (the v3+ SDK replaces the older pinecone.init pattern)
pc = Pinecone(api_key="your-api-key")
index_name = "document-chunks"
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Match embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)
# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
embedding = model.encode(chunk["content"])
index.upsert(
vectors=[{
"id": chunk["id"],
"values": embedding.tolist(),
"metadata": chunk.get("metadata", {})
}]
)
# Search
query_embedding = model.encode("search query")
results = index.query(
vector=query_embedding.tolist(),
top_k=5,
include_metadata=True
)
```
### Weaviate
**Installation**:
```bash
pip install weaviate-client
```
**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation
**Example Usage**:
```python
import weaviate
# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")
# Define schema
client.schema.create_class({
"class": "DocumentChunk",
"description": "A chunk of document content",
"properties": [
{
"name": "content",
"dataType": ["text"]
},
{
"name": "source",
"dataType": ["string"]
}
]
})
# Add data
for chunk in chunks:
client.data_object.create(
data_object={
"content": chunk["content"],
"source": chunk.get("source", "unknown")
},
class_name="DocumentChunk"
)
# Search
results = client.query.get(
"DocumentChunk",
["content", "source"]
).with_near_text({
"concepts": ["search query"]
}).with_limit(5).do()
```
## Evaluation and Testing
### RAGAS
**Installation**:
```bash
pip install ragas
```
**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation
**Example Usage**:
```python
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
)
from datasets import Dataset
# Prepare evaluation data
dataset = Dataset.from_dict({
"question": ["What is chunking?"],
"answer": ["Chunking is the process of breaking large documents into smaller segments"],
"contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
"ground_truth": ["Chunking is a document processing technique"]
})
# Evaluate
result = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
]
)
print(result)
```
### TruEra (TruLens)
**Installation**:
```bash
pip install trulens trulens-apps
```
**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring
**Example Usage**:
```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement
# Initialize session
session = TruSession()
# Define feedback functions
f_groundedness = GroundTruthAgreement(ground_truth)
# Evaluate chunks
@instrument
def chunk_and_query(text, query):
chunks = chunk_function(text)
relevant_chunks = search_function(chunks, query)
answer = generate_function(relevant_chunks, query)
return answer
# Record evaluation
with session:
chunk_and_query("large document text", "what is the main topic?")
```
## Document Processing
### PyPDF2
**Installation**:
```bash
pip install PyPDF2
```
**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing
**Example Usage**:
```python
import PyPDF2
def extract_text_from_pdf(pdf_path):
text = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for page in reader.pages:
text += page.extract_text()
return text
# Extract text by page for better chunking
def extract_pages(pdf_path):
pages = []
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for i, page in enumerate(reader.pages):
pages.append({
"page_number": i + 1,
"content": page.extract_text()
})
return pages
```
### python-docx
**Installation**:
```bash
pip install python-docx
```
**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access
**Example Usage**:
```python
from docx import Document
def extract_from_docx(docx_path):
doc = Document(docx_path)
content = []
for paragraph in doc.paragraphs:
if paragraph.text.strip():
content.append({
"type": "paragraph",
"text": paragraph.text,
"style": paragraph.style.name
})
for table in doc.tables:
table_text = []
for row in table.rows:
row_text = [cell.text for cell in row.cells]
table_text.append(" | ".join(row_text))
content.append({
"type": "table",
"text": "\n".join(table_text)
})
return content
```
## Specialized Libraries
### tiktoken (OpenAI)
**Installation**:
```bash
pip install tiktoken
```
**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Language model specific tokenization
**Example Usage**:
```python
import tiktoken
# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")
# Decode tokens
text = encoding.decode(tokens)
# Count tokens without full encoding
def count_tokens(text, model="gpt-3.5-turbo"):
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = encoding.encode(text)
chunks = []
for i in range(0, len(tokens), max_tokens):
chunk_tokens = tokens[i:i + max_tokens]
chunk_text = encoding.decode(chunk_tokens)
chunks.append(chunk_text)
return chunks
```
### PDFMiner
**Installation**:
```bash
pip install pdfminer.six
```
**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction
**Example Usage**:
```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
def extract_structured_text(pdf_path):
    structured_content = []
    for page_layout in extract_pages(pdf_path):
        page_content = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # Font details live on individual characters (LTChar), not on
                # the container, so sample the first character we encounter
                font_size, is_bold = None, False
                for line in element:
                    if isinstance(line, LTChar):
                        font_size, is_bold = line.size, "Bold" in line.fontname
                        break
                    if not isinstance(line, LTTextContainer):
                        continue
                    for char in line:
                        if isinstance(char, LTChar):
                            font_size = char.size
                            is_bold = "Bold" in char.fontname
                            break
                    if font_size is not None:
                        break
                page_content.append({
                    "text": element.get_text().strip(),
                    "font_info": {
                        "font_size": font_size,
                        "is_bold": is_bold,
                        "x0": element.x0,
                        "y0": element.y0
                    }
                })
        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })
    return structured_content
```
## Performance and Optimization
### Dask
**Installation**:
```bash
pip install dask[complete]
```
**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas
**Example Usage**:
```python
import dask.bag as db
from dask.distributed import Client
# Setup distributed client
client = Client(n_workers=4)
# Parallel chunking of multiple documents
def chunk_document(document):
# Your chunking logic here
return chunk_function(document)
# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...] # List of document contents
document_bag = db.from_sequence(documents)
# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)
# Compute results
results = chunked_documents.compute()
```
### Ray
**Installation**:
```bash
pip install ray
```
**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration
**Example Usage**:
```python
import ray
# Initialize Ray
ray.init()
@ray.remote
class ChunkingWorker:
def __init__(self, strategy):
self.strategy = strategy
def chunk_documents(self, documents):
results = []
for doc in documents:
chunks = self.strategy.chunk(doc)
results.append(chunks)
return results
# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]
# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
for worker, batch in zip(workers, documents_batch)]
# Get results
results = ray.get(futures)
```
## Development and Testing
### pytest
**Installation**:
```bash
pip install pytest pytest-asyncio
```
**Example Tests**:
```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker
class TestFixedSizeChunker:
    def test_chunk_size_respect(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 250 characters
        chunks = chunker.chunk(text)
        for chunk in chunks:
            # chunk_size is measured in characters for this chunker
            assert len(chunk) <= 100
    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = "word " * 30
        chunks = chunker.chunk(text)
        # Each chunk should start with text that also closes the previous one
        for i in range(1, len(chunks)):
            overlap_head = chunks[i][:10].strip()
            assert overlap_head and overlap_head in chunks[i - 1]
@pytest.mark.asyncio
async def test_semantic_chunker():
chunker = SemanticChunker()
text = "First topic sentence. Another sentence about first topic. " \
"Now switching to second topic. More about second topic."
chunks = await chunker.chunk_async(text)
# Should detect topic change and create boundary
assert len(chunks) >= 2
assert "first topic" in chunks[0].lower()
assert "second topic" in chunks[1].lower()
```
### Memory Profiler
**Installation**:
```bash
pip install memory-profiler
```
**Example Usage**:
```python
from memory_profiler import profile
@profile
def chunk_large_document():
chunker = FixedSizeChunker(chunk_size=1000)
large_text = "word " * 100000 # Large document
chunks = chunker.chunk(large_text)
return chunks
# Run with: python -m memory_profiler your_script.py
```
This comprehensive toolset provides everything needed to implement, test, and optimize chunking strategies for various use cases, from simple text processing to production-grade RAG systems.
