Initial commit

194	skills/chunking-strategy/SKILL.md	Normal file
@@ -0,0 +1,194 @@
---
name: chunking-strategy
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
allowed-tools: Read, Write, Bash
category: artificial-intelligence
tags: [rag, chunking, vector-search, embeddings, document-processing]
version: 1.0.0
---

# Chunking Strategy for RAG Systems

## Overview

Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.

## When to Use

Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.

## Instructions

### Choose Chunking Strategy

Select the appropriate chunking strategy based on document type and use case:

1. **Fixed-Size Chunking** (Level 1)
   - Use for simple documents without clear structure
   - Start with 512 tokens and 10-20% overlap
   - Adjust size based on query type: 256 tokens for factoid queries, 1024 for analytical ones

2. **Recursive Character Chunking** (Level 2)
   - Use for documents with clear structural boundaries
   - Implement hierarchical separators: paragraphs → sentences → words
   - Customize separators for document types (HTML, Markdown); see the sketch after this list

3. **Structure-Aware Chunking** (Level 3)
   - Use for structured documents (Markdown, code, tables, PDFs)
   - Preserve semantic units: functions, sections, table blocks
   - Validate structure preservation after splitting

4. **Semantic Chunking** (Level 4)
   - Use for complex documents with thematic shifts
   - Implement embedding-based boundary detection
   - Configure a similarity threshold (around 0.8) and a buffer size of 3-5 sentences

5. **Advanced Methods** (Level 5)
   - Use Late Chunking for long-context embedding models
   - Apply Contextual Retrieval for high-precision requirements
   - Monitor computational costs vs. retrieval improvements

Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
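For Level 2, a recursive splitter can be given markup-aware separators so splits prefer heading and paragraph boundaries. A minimal sketch using LangChain's `RecursiveCharacterTextSplitter`; the separator list, sizes, and the `markdown_text` input are illustrative assumptions, not fixed recommendations:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try Markdown heading boundaries first, then paragraphs, sentences, words.
markdown_splitter = RecursiveCharacterTextSplitter(
    separators=["\n## ", "\n### ", "\n\n", ". ", " "],
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap, within the 10-20% guideline
)

chunks = markdown_splitter.split_text(markdown_text)
```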

### Implement Chunking Pipeline

Follow these steps to implement effective chunking:

1. **Pre-process documents**
   - Analyze document structure and content types
   - Identify multi-modal content (tables, images, code)
   - Assess information density and complexity

2. **Select strategy parameters**
   - Choose chunk size based on the embedding model's context window
   - Set overlap percentage (10-20% for most cases)
   - Configure strategy-specific parameters

3. **Process and validate**
   - Apply the chosen chunking strategy
   - Validate the semantic coherence of chunks
   - Test with representative documents

4. **Evaluate and iterate**
   - Measure retrieval precision and recall
   - Monitor processing latency and resource usage
   - Optimize based on specific use case requirements

Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).
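The four steps can be wired together as a simple driver. The helper names below are hypothetical placeholders for the components described above, not a fixed API:

```python
def chunking_pipeline(document: str):
    """Hypothetical end-to-end flow mirroring the four steps above."""
    analysis = analyze_document(document)       # 1. pre-process: structure, content types
    params = select_parameters(analysis)        # 2. chunk size, overlap, strategy settings
    chunks = apply_strategy(document, params)   # 3. chunk and validate coherence
    report = evaluate_chunks(chunks, document)  # 4. precision/recall, latency, cost
    return chunks, report
```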

### Evaluate Performance

Use these metrics to evaluate chunking effectiveness:

- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
- **End-to-End Accuracy**: Quality of final RAG responses
- **Processing Time**: Latency impact on overall system
- **Resource Usage**: Memory and computational costs

Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).
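As a quick illustration, precision and recall reduce to set operations over chunk IDs (toy values; full implementations live in references/evaluation.md):

```python
retrieved = {"c1", "c2", "c3", "c4"}
relevant = {"c2", "c4", "c7"}

precision = len(retrieved & relevant) / len(retrieved)  # 2/4 = 0.50
recall = len(retrieved & relevant) / len(relevant)      # 2/3 ≈ 0.67
```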

## Examples

### Basic Fixed-Size Chunking

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks, ~10% overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len
)

chunks = splitter.split_documents(documents)
```

### Structure-Aware Code Chunking

```python
import ast

def chunk_python_code(code):
    """Split Python code into semantic chunks (function and class definitions)."""
    tree = ast.parse(code)
    chunks = []

    # Note: ast.walk also visits nested definitions, so inner functions
    # become chunks too; iterate over tree.body for top-level units only.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))

    return chunks
```

### Semantic Chunking with Embeddings

```python
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text based on semantic boundaries.

    Assumes split_into_sentences, generate_embeddings, and
    cosine_similarity helpers are provided by the surrounding pipeline.
    """
    sentences = split_into_sentences(text)
    embeddings = generate_embeddings(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])

        # A drop in similarity between adjacent sentences marks a topic boundary
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks
```

## Best Practices

### Core Principles
- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial

### Implementation Guidelines
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics

### Common Pitfalls to Avoid
- Over-chunking: creating too many small, context-poor chunks
- Under-chunking: missing relevant information due to oversized chunks
- Ignoring document structure and semantic boundaries
- Using a one-size-fits-all approach for diverse content types
- Neglecting overlap for boundary-crossing information

## Constraints

### Resource Considerations
- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing

### Quality Requirements
- Validate chunk semantic coherence post-processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases

## References

Reference detailed documentation in the [references/](references/) folder:
- [strategies.md](references/strategies.md) - Detailed strategy implementations
- [implementation.md](references/implementation.md) - Complete implementation guidelines
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
- [tools.md](references/tools.md) - Recommended libraries and frameworks
- [research.md](references/research.md) - Key research papers and findings
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools

1358	skills/chunking-strategy/references/advanced-strategies.md	Normal file
File diff suppressed because it is too large

904	skills/chunking-strategy/references/evaluation.md	Normal file
@@ -0,0 +1,904 @@

# Performance Evaluation Framework

This document provides comprehensive methodologies for evaluating chunking strategy performance and effectiveness.

## Evaluation Metrics

### Core Retrieval Metrics

#### Retrieval Precision
Measures the fraction of retrieved chunks that are relevant to the query.

```python
from typing import Dict, List

def calculate_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
    """
    Calculate retrieval precision

    Precision = |Relevant ∩ Retrieved| / |Retrieved|
    """
    retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
    relevant_ids = {chunk.get('id') for chunk in relevant_chunks}

    intersection = retrieved_ids & relevant_ids

    if not retrieved_ids:
        return 0.0

    return len(intersection) / len(retrieved_ids)
```

#### Retrieval Recall
Measures the fraction of relevant chunks that are successfully retrieved.

```python
def calculate_recall(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
    """
    Calculate retrieval recall

    Recall = |Relevant ∩ Retrieved| / |Relevant|
    """
    retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
    relevant_ids = {chunk.get('id') for chunk in relevant_chunks}

    intersection = retrieved_ids & relevant_ids

    if not relevant_ids:
        return 0.0

    return len(intersection) / len(relevant_ids)
```

#### F1-Score
The harmonic mean of precision and recall.

```python
def calculate_f1_score(precision: float, recall: float) -> float:
    """
    Calculate F1-score

    F1 = 2 * (Precision * Recall) / (Precision + Recall)
    """
    if precision + recall == 0:
        return 0.0

    return 2 * (precision * recall) / (precision + recall)
```

### Mean Reciprocal Rank (MRR)
Measures the rank of the first relevant result.

```python
def calculate_mrr(queries: List[Dict], results: List[List[Dict]]) -> float:
    """
    Calculate Mean Reciprocal Rank
    """
    reciprocal_ranks = []

    for query, query_results in zip(queries, results):
        relevant_found = False

        for rank, result in enumerate(query_results, 1):
            if result.get('is_relevant', False):
                reciprocal_ranks.append(1.0 / rank)
                relevant_found = True
                break

        if not relevant_found:
            reciprocal_ranks.append(0.0)

    if not reciprocal_ranks:
        return 0.0
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

### Mean Average Precision (MAP)
Considers both precision and the ranking of relevant documents.

```python
def calculate_average_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
    """
    Calculate Average Precision for a single query
    """
    relevant_ids = {chunk.get('id') for chunk in relevant_chunks}

    if not relevant_ids:
        return 0.0

    precisions = []
    relevant_count = 0

    for rank, chunk in enumerate(retrieved_chunks, 1):
        if chunk.get('id') in relevant_ids:
            relevant_count += 1
            precision_at_rank = relevant_count / rank
            precisions.append(precision_at_rank)

    return sum(precisions) / len(relevant_ids)


def calculate_map(queries: List[Dict], results: List[List[Dict]]) -> float:
    """
    Calculate Mean Average Precision across multiple queries
    """
    average_precisions = []

    for query, query_results in zip(queries, results):
        ap = calculate_average_precision(query_results, query.get('relevant_chunks', []))
        average_precisions.append(ap)

    return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
```

### Normalized Discounted Cumulative Gain (NDCG)
Measures ranking quality with emphasis on highly relevant results.

```python
import numpy as np

def calculate_dcg(retrieved_chunks: List[Dict]) -> float:
    """
    Calculate Discounted Cumulative Gain
    """
    dcg = 0.0

    for rank, chunk in enumerate(retrieved_chunks, 1):
        relevance = chunk.get('relevance_score', 0)
        dcg += relevance / np.log2(rank + 1)

    return dcg


def calculate_ndcg(retrieved_chunks: List[Dict], ideal_chunks: List[Dict]) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain
    """
    dcg = calculate_dcg(retrieved_chunks)
    idcg = calculate_dcg(ideal_chunks)

    if idcg == 0:
        return 0.0

    return dcg / idcg
```

## End-to-End RAG Evaluation

### Answer Quality Metrics

#### Factual Consistency
Measures how well the generated answer aligns with retrieved chunks.

```python
from transformers import pipeline

class FactualConsistencyEvaluator:
    def __init__(self):
        self.nli_pipeline = pipeline("text-classification",
                                     model="roberta-large-mnli")

    def evaluate_consistency(self, answer: str, retrieved_chunks: List[str]) -> float:
        """
        Evaluate factual consistency between answer and retrieved context
        """
        if not retrieved_chunks:
            return 0.0

        # Combine retrieved chunks as context (top 3 chunks only)
        context = " ".join(retrieved_chunks[:3])

        # Use Natural Language Inference to check consistency. NLI models
        # typically expect a premise/hypothesis sentence pair; this
        # single-string format is a simplification.
        result = self.nli_pipeline(f"premise: {context} hypothesis: {answer}")

        # Extract consistency score (entailment probability)
        for item in result:
            if item['label'] == 'ENTAILMENT':
                return item['score']
            elif item['label'] == 'CONTRADICTION':
                return 1.0 - item['score']

        return 0.5  # Neutral if NLI is inconclusive
```

#### Answer Completeness
Measures how completely the answer addresses the user's query.

```python
import re

def evaluate_completeness(answer: str, query: str, reference_answer: str = None) -> float:
    """
    Evaluate answer completeness
    """
    # Extract key entities from query and answer
    query_entities = extract_entities(query)
    answer_entities = extract_entities(answer)

    # Calculate entity coverage
    if not query_entities:
        return 0.5  # Neutral if no entities in query

    covered_entities = query_entities & answer_entities
    entity_coverage = len(covered_entities) / len(query_entities)

    # If a reference answer is available, compare against it
    if reference_answer:
        reference_entities = extract_entities(reference_answer)
        answer_reference_overlap = len(answer_entities & reference_entities) / max(len(reference_entities), 1)
        return (entity_coverage + answer_reference_overlap) / 2

    return entity_coverage


def extract_entities(text: str) -> set:
    """
    Extract named entities from text (simplified)
    """
    # A proper NER model would be used in practice; capitalized
    # noun-phrase extraction serves as a placeholder.
    noun_phrases = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
    return set(noun_phrases)
```

#### Response Relevance
Measures how relevant the answer is to the original query.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class RelevanceEvaluator:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def evaluate_relevance(self, query: str, answer: str) -> float:
        """
        Evaluate semantic relevance between query and answer
        """
        # Generate embeddings
        query_embedding = self.model.encode([query])
        answer_embedding = self.model.encode([answer])

        # Calculate cosine similarity
        similarity = cosine_similarity(query_embedding, answer_embedding)[0][0]

        return float(similarity)
```

## Performance Metrics

### Processing Time

```python
import time
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PerformanceMetrics:
    total_time: float
    chunking_time: float
    embedding_time: float
    search_time: float
    generation_time: float
    throughput: float  # documents per second

class PerformanceProfiler:
    def __init__(self):
        self.timings = {}
        self.start_times = {}

    def start_timer(self, operation: str):
        self.start_times[operation] = time.time()

    def end_timer(self, operation: str):
        if operation in self.start_times:
            duration = time.time() - self.start_times[operation]
            if operation not in self.timings:
                self.timings[operation] = []
            self.timings[operation].append(duration)
            return duration
        return 0.0

    def get_performance_metrics(self, document_count: int) -> PerformanceMetrics:
        total_time = sum(sum(times) for times in self.timings.values())

        return PerformanceMetrics(
            total_time=total_time,
            chunking_time=sum(self.timings.get('chunking', [0])),
            embedding_time=sum(self.timings.get('embedding', [0])),
            search_time=sum(self.timings.get('search', [0])),
            generation_time=sum(self.timings.get('generation', [0])),
            throughput=document_count / total_time if total_time > 0 else 0
        )
```
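A minimal usage sketch (the `splitter.split_text` call stands in for any chunking step):

```python
profiler = PerformanceProfiler()

profiler.start_timer('chunking')
chunks = splitter.split_text(document)  # any chunking call
profiler.end_timer('chunking')

metrics = profiler.get_performance_metrics(document_count=1)
print(f"{metrics.chunking_time:.3f}s chunking, {metrics.throughput:.1f} docs/s")
```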

### Memory Usage

```python
import os
import time
from typing import Dict

import psutil

class MemoryProfiler:
    def __init__(self):
        self.process = psutil.Process(os.getpid())
        self.memory_snapshots = []

    def take_memory_snapshot(self, label: str):
        """Take a snapshot of current memory usage"""
        memory_info = self.process.memory_info()
        memory_mb = memory_info.rss / 1024 / 1024  # Convert to MB

        self.memory_snapshots.append({
            'label': label,
            'memory_mb': memory_mb,
            'timestamp': time.time()
        })

    def get_peak_memory_usage(self) -> float:
        """Get peak memory usage in MB"""
        if not self.memory_snapshots:
            return 0.0
        return max(snapshot['memory_mb'] for snapshot in self.memory_snapshots)

    def get_memory_usage_by_operation(self) -> Dict[str, float]:
        """Get memory usage breakdown by operation"""
        if not self.memory_snapshots:
            return {}

        memory_by_op = {}
        for i in range(1, len(self.memory_snapshots)):
            prev_snapshot = self.memory_snapshots[i-1]
            curr_snapshot = self.memory_snapshots[i]

            operation = curr_snapshot['label']
            memory_delta = curr_snapshot['memory_mb'] - prev_snapshot['memory_mb']

            if operation not in memory_by_op:
                memory_by_op[operation] = []
            memory_by_op[operation].append(memory_delta)

        return {op: sum(deltas) for op, deltas in memory_by_op.items()}
```

## Evaluation Datasets

### Standardized Test Sets

#### Question-Answer Pairs

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class EvaluationQuery:
    id: str
    question: str
    reference_answer: Optional[str]
    relevant_chunk_ids: List[str]
    query_type: str  # factoid, analytical, comparative
    difficulty: str  # easy, medium, hard
    domain: str      # finance, medical, legal, technical

class EvaluationDataset:
    def __init__(self, name: str):
        self.name = name
        self.queries: List[EvaluationQuery] = []
        self.documents: Dict[str, str] = {}
        self.chunks: Dict[str, Dict] = {}

    def add_query(self, query: EvaluationQuery):
        self.queries.append(query)

    def add_document(self, doc_id: str, content: str):
        self.documents[doc_id] = content

    def add_chunk(self, chunk_id: str, content: str, doc_id: str, metadata: Dict):
        self.chunks[chunk_id] = {
            'id': chunk_id,
            'content': content,
            'doc_id': doc_id,
            'metadata': metadata
        }

    def save_to_file(self, filepath: str):
        data = {
            'name': self.name,
            'queries': [
                {
                    'id': q.id,
                    'question': q.question,
                    'reference_answer': q.reference_answer,
                    'relevant_chunk_ids': q.relevant_chunk_ids,
                    'query_type': q.query_type,
                    'difficulty': q.difficulty,
                    'domain': q.domain
                }
                for q in self.queries
            ],
            'documents': self.documents,
            'chunks': self.chunks
        }

        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)

    @classmethod
    def load_from_file(cls, filepath: str):
        with open(filepath, 'r') as f:
            data = json.load(f)

        dataset = cls(data['name'])
        dataset.documents = data['documents']
        dataset.chunks = data['chunks']

        for q_data in data['queries']:
            query = EvaluationQuery(
                id=q_data['id'],
                question=q_data['question'],
                reference_answer=q_data.get('reference_answer'),
                relevant_chunk_ids=q_data['relevant_chunk_ids'],
                query_type=q_data['query_type'],
                difficulty=q_data['difficulty'],
                domain=q_data['domain']
            )
            dataset.add_query(query)

        return dataset
```
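A small usage sketch of building and saving a dataset (document, chunk, and query contents are illustrative):

```python
dataset = EvaluationDataset("rag-eval-v1")
dataset.add_document("doc1", "Chunking splits documents into retrievable units...")
dataset.add_chunk("c1", "Chunking splits documents into retrievable units...", "doc1", {"position": 0})
dataset.add_query(EvaluationQuery(
    id="q1", question="What does chunking do?",
    reference_answer=None, relevant_chunk_ids=["c1"],
    query_type="factoid", difficulty="easy", domain="technical",
))
dataset.save_to_file("eval_dataset.json")
```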

### Dataset Generation

#### Synthetic Query Generation

```python
import random
import re
from typing import Dict, List

class SyntheticQueryGenerator:
    def __init__(self):
        self.query_templates = {
            'factoid': [
                "What is {concept}?",
                "When did {event} occur?",
                "Who developed {technology}?",
                "How many {items} are mentioned?",
                "What is the value of {metric}?"
            ],
            'analytical': [
                "Compare and contrast {concept1} and {concept2}.",
                "Analyze the impact of {concept} on {domain}.",
                "What are the advantages and disadvantages of {technology}?",
                "Explain the relationship between {concept1} and {concept2}.",
                "Evaluate the effectiveness of {approach} for {problem}."
            ],
            'comparative': [
                "Which is better: {option1} or {option2}?",
                "How does {method1} differ from {method2}?",
                "Compare the performance of {system1} and {system2}.",
                "What are the key differences between {approach1} and {approach2}?"
            ]
        }

    def generate_queries_from_chunks(self, chunks: List[Dict], num_queries: int = 100) -> List[EvaluationQuery]:
        """Generate synthetic queries from document chunks"""
        queries = []

        # Extract entities and concepts from chunks
        entities = self._extract_entities_from_chunks(chunks)

        for i in range(num_queries):
            query_type = random.choice(['factoid', 'analytical', 'comparative'])
            template = random.choice(self.query_templates[query_type])

            # Fill template with extracted entities
            query_text = self._fill_template(template, entities)

            # Find relevant chunks for this query
            relevant_chunks = self._find_relevant_chunks(query_text, chunks)

            query = EvaluationQuery(
                id=f"synthetic_{i}",
                question=query_text,
                reference_answer=None,  # Would need a generation model
                relevant_chunk_ids=[chunk['id'] for chunk in relevant_chunks],
                query_type=query_type,
                difficulty=random.choice(['easy', 'medium', 'hard']),
                domain='synthetic'
            )

            queries.append(query)

        return queries

    def _extract_entities_from_chunks(self, chunks: List[Dict]) -> Dict[str, List[str]]:
        """Extract entities, concepts, and relationships from chunks"""
        # This would use proper NER in practice
        entities = {
            'concepts': [],
            'technologies': [],
            'methods': [],
            'metrics': [],
            'events': []
        }

        for chunk in chunks:
            content = chunk['content']
            # Simplified entity extraction
            words = content.split()
            entities['concepts'].extend([word for word in words if len(word) > 6])
            entities['technologies'].extend([word for word in words if 'technology' in word.lower()])
            entities['methods'].extend([word for word in words if 'method' in word.lower()])
            entities['metrics'].extend([word for word in words if '%' in word or '$' in word])

        # Remove duplicates and limit
        for key in entities:
            entities[key] = list(set(entities[key]))[:50]

        return entities

    def _fill_template(self, template: str, entities: Dict[str, List[str]]) -> str:
        """Fill query template with random entities"""
        def replace_placeholder(match):
            placeholder = match.group(1)

            # Map placeholders to entity types
            entity_mapping = {
                'concept': 'concepts',
                'concept1': 'concepts',
                'concept2': 'concepts',
                'technology': 'technologies',
                'method': 'methods',
                'method1': 'methods',
                'method2': 'methods',
                'metric': 'metrics',
                'event': 'events',
                'items': 'concepts',
                'option1': 'concepts',
                'option2': 'concepts',
                'approach': 'methods',
                'approach1': 'methods',
                'approach2': 'methods',
                'problem': 'concepts',
                'domain': 'concepts',
                'system1': 'concepts',
                'system2': 'concepts'
            }

            entity_type = entity_mapping.get(placeholder, 'concepts')
            available_entities = entities.get(entity_type, [])

            if available_entities:
                return random.choice(available_entities)
            return 'something'

        return re.sub(r'\{(\w+)\}', replace_placeholder, template)

    def _find_relevant_chunks(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
        """Find chunks most relevant to the query"""
        # Simple keyword matching for synthetic generation
        query_words = set(query.lower().split())

        chunk_scores = []
        for chunk in chunks:
            chunk_words = set(chunk['content'].lower().split())
            overlap = len(query_words & chunk_words)
            chunk_scores.append((overlap, chunk))

        # Sort by overlap and return top k
        chunk_scores.sort(key=lambda x: x[0], reverse=True)
        return [chunk for _, chunk in chunk_scores[:k]]
```

## A/B Testing Framework

### Statistical Significance Testing

```python
import numpy as np
from scipy import stats
from typing import Dict, List

class ABTestAnalyzer:
    def __init__(self):
        self.significance_level = 0.05

    def compare_metrics(self, control_metrics: List[float],
                        treatment_metrics: List[float],
                        metric_name: str) -> Dict:
        """
        Compare metrics between control and treatment groups
        """
        control_mean = np.mean(control_metrics)
        treatment_mean = np.mean(treatment_metrics)

        control_std = np.std(control_metrics)
        treatment_std = np.std(treatment_metrics)

        # Perform t-test
        t_statistic, p_value = stats.ttest_ind(control_metrics, treatment_metrics)

        # Calculate effect size (Cohen's d) via the pooled standard deviation
        pooled_std = np.sqrt(((len(control_metrics) - 1) * control_std**2 +
                              (len(treatment_metrics) - 1) * treatment_std**2) /
                             (len(control_metrics) + len(treatment_metrics) - 2))

        cohens_d = (treatment_mean - control_mean) / pooled_std if pooled_std > 0 else 0

        # Determine significance
        is_significant = p_value < self.significance_level

        return {
            'metric_name': metric_name,
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'absolute_difference': treatment_mean - control_mean,
            'relative_difference': ((treatment_mean - control_mean) / control_mean * 100) if control_mean != 0 else 0,
            'control_std': control_std,
            'treatment_std': treatment_std,
            't_statistic': t_statistic,
            'p_value': p_value,
            'is_significant': is_significant,
            'effect_size': cohens_d,
            'significance_level': self.significance_level
        }

    def analyze_ab_test_results(self,
                                control_results: Dict[str, List[float]],
                                treatment_results: Dict[str, List[float]]) -> Dict:
        """
        Analyze A/B test results across multiple metrics
        """
        analysis_results = {}

        # Only compare metrics present in both groups
        all_metrics = set(control_results.keys()) & set(treatment_results.keys())

        for metric in all_metrics:
            analysis_results[metric] = self.compare_metrics(
                control_results[metric],
                treatment_results[metric],
                metric
            )

        # Calculate overall summary
        significant_improvements = sum(1 for result in analysis_results.values()
                                       if result['is_significant'] and result['relative_difference'] > 0)
        significant_degradations = sum(1 for result in analysis_results.values()
                                       if result['is_significant'] and result['relative_difference'] < 0)

        analysis_results['summary'] = {
            'total_metrics_compared': len(analysis_results),
            'significant_improvements': significant_improvements,
            'significant_degradations': significant_degradations,
            'no_significant_change': len(analysis_results) - significant_improvements - significant_degradations
        }

        return analysis_results
```
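A minimal usage sketch with made-up per-query samples (the values are purely illustrative):

```python
analyzer = ABTestAnalyzer()

result = analyzer.compare_metrics(
    control_metrics=[0.61, 0.58, 0.64, 0.60],    # e.g. per-query precision, baseline chunking
    treatment_metrics=[0.68, 0.71, 0.66, 0.69],  # same metric under the new strategy
    metric_name="retrieval_precision",
)

print(result["p_value"], result["is_significant"], result["effect_size"])
```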

## Automated Evaluation Pipeline

### End-to-End Evaluation

```python
import random
from typing import Any, Dict, List

import numpy as np

class ChunkingEvaluationPipeline:
    def __init__(self, strategies: Dict[str, Any], dataset: EvaluationDataset):
        self.strategies = strategies
        self.dataset = dataset
        self.results = {}
        self.profiler = PerformanceProfiler()
        self.memory_profiler = MemoryProfiler()

    def run_evaluation(self) -> Dict:
        """Run comprehensive evaluation of all strategies"""
        evaluation_results = {}

        for strategy_name, strategy in self.strategies.items():
            print(f"Evaluating strategy: {strategy_name}")

            # Reset profilers for each strategy
            self.profiler = PerformanceProfiler()
            self.memory_profiler = MemoryProfiler()

            # Evaluate strategy
            strategy_results = self._evaluate_strategy(strategy, strategy_name)
            evaluation_results[strategy_name] = strategy_results

        # Compare strategies
        comparison_results = self._compare_strategies(evaluation_results)

        return {
            'individual_results': evaluation_results,
            'comparison': comparison_results,
            'recommendations': self._generate_recommendations(comparison_results)
        }

    def _evaluate_strategy(self, strategy: Any, strategy_name: str) -> Dict:
        """Evaluate a single chunking strategy"""
        results = {
            'strategy_name': strategy_name,
            'retrieval_metrics': {},
            'quality_metrics': {},
            'performance_metrics': {}
        }

        # Track memory usage
        self.memory_profiler.take_memory_snapshot(f"{strategy_name}_start")

        # Process all documents
        self.profiler.start_timer('total_processing')

        all_chunks = {}
        for doc_id, content in self.dataset.documents.items():
            self.profiler.start_timer('chunking')
            chunks = strategy.chunk(content)
            self.profiler.end_timer('chunking')

            all_chunks[doc_id] = chunks

        self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_chunking")

        # Generate embeddings for chunks
        self.profiler.start_timer('embedding')
        chunk_embeddings = self._generate_embeddings(all_chunks)
        self.profiler.end_timer('embedding')

        self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_embedding")

        # Evaluate retrieval performance
        retrieval_results = self._evaluate_retrieval(all_chunks, chunk_embeddings)
        results['retrieval_metrics'] = retrieval_results

        # Evaluate chunk quality
        quality_results = self._evaluate_chunk_quality(all_chunks)
        results['quality_metrics'] = quality_results

        # Get performance metrics
        self.profiler.end_timer('total_processing')
        performance_metrics = self.profiler.get_performance_metrics(len(self.dataset.documents))
        results['performance_metrics'] = performance_metrics.__dict__

        # Get memory metrics
        self.memory_profiler.take_memory_snapshot(f"{strategy_name}_end")
        results['memory_metrics'] = {
            'peak_memory_mb': self.memory_profiler.get_peak_memory_usage(),
            'memory_by_operation': self.memory_profiler.get_memory_usage_by_operation()
        }

        return results

    def _evaluate_retrieval(self, all_chunks: Dict, chunk_embeddings: Dict) -> Dict:
        """Evaluate retrieval performance"""
        # MRR and MAP can be added via calculate_mrr/calculate_map when
        # ranked relevance judgments are available.
        retrieval_metrics = {
            'precision': [],
            'recall': [],
            'f1_score': []
        }

        for query in self.dataset.queries:
            # Perform retrieval
            self.profiler.start_timer('search')
            retrieved_chunks = self._retrieve_chunks(query.question, chunk_embeddings, k=10)
            self.profiler.end_timer('search')

            # Ground-truth relevant chunks for this query
            relevant_chunks = [{'id': chunk_id} for chunk_id in query.relevant_chunk_ids]

            # Calculate metrics
            precision = calculate_precision(retrieved_chunks, relevant_chunks)
            recall = calculate_recall(retrieved_chunks, relevant_chunks)
            f1 = calculate_f1_score(precision, recall)

            retrieval_metrics['precision'].append(precision)
            retrieval_metrics['recall'].append(recall)
            retrieval_metrics['f1_score'].append(f1)

        # Calculate averages
        return {metric: np.mean(values) for metric, values in retrieval_metrics.items()}

    def _evaluate_chunk_quality(self, all_chunks: Dict) -> Dict:
        """Evaluate quality of generated chunks"""
        # ChunkQualityAssessor and DocumentAnalyzer are defined in
        # references/implementation.md.
        quality_assessor = ChunkQualityAssessor()
        quality_scores = []

        for doc_id, chunks in all_chunks.items():
            # Analyze document
            content = self.dataset.documents[doc_id]
            analyzer = DocumentAnalyzer()
            analysis = analyzer.analyze(content)

            # Assess chunk quality
            scores = quality_assessor.assess_chunks(chunks, analysis)
            quality_scores.append(scores)

        # Aggregate quality scores
        if quality_scores:
            avg_scores = {}
            for metric in quality_scores[0].keys():
                avg_scores[metric] = np.mean([scores[metric] for scores in quality_scores])
            return avg_scores

        return {}

    def _compare_strategies(self, evaluation_results: Dict) -> Dict:
        """Compare performance across strategies"""
        ab_analyzer = ABTestAnalyzer()

        comparison = {}

        # Compare each metric across every pair of strategies
        strategy_names = list(evaluation_results.keys())

        for i in range(len(strategy_names)):
            for j in range(i + 1, len(strategy_names)):
                strategy1 = strategy_names[i]
                strategy2 = strategy_names[j]

                comparison_key = f"{strategy1}_vs_{strategy2}"
                comparison[comparison_key] = {}

                # Compare retrieval metrics
                for metric in ['precision', 'recall', 'f1_score']:
                    if (metric in evaluation_results[strategy1]['retrieval_metrics'] and
                            metric in evaluation_results[strategy2]['retrieval_metrics']):

                        comparison[comparison_key][f"retrieval_{metric}"] = ab_analyzer.compare_metrics(
                            [evaluation_results[strategy1]['retrieval_metrics'][metric]],
                            [evaluation_results[strategy2]['retrieval_metrics'][metric]],
                            f"retrieval_{metric}"
                        )

        return comparison

    def _generate_recommendations(self, comparison_results: Dict) -> Dict:
        """Generate recommendations based on evaluation results"""
        recommendations = {
            'best_overall': None,
            'best_for_precision': None,
            'best_for_recall': None,
            'best_for_performance': None,
            'trade_offs': []
        }

        # This would analyze the comparison results and generate specific
        # recommendations; the implementation depends on use case requirements.

        return recommendations

    def _generate_embeddings(self, all_chunks: Dict) -> Dict:
        """Generate embeddings for all chunks"""
        # This would use the actual embedding model; placeholder implementation.
        embeddings = {}

        for doc_id, chunks in all_chunks.items():
            embeddings[doc_id] = []
            for chunk in chunks:
                # Generate embedding for chunk content
                embedding = np.random.rand(384)  # Placeholder
                embeddings[doc_id].append({
                    'chunk': chunk,
                    'embedding': embedding
                })

        return embeddings

    def _retrieve_chunks(self, query: str, chunk_embeddings: Dict, k: int = 10) -> List[Dict]:
        """Retrieve the most relevant chunks for a query"""
        # This would use actual similarity search; placeholder implementation.
        all_chunks = []

        for doc_embeddings in chunk_embeddings.values():
            for chunk_data in doc_embeddings:
                all_chunks.append(chunk_data['chunk'])

        # Simple random selection as placeholder
        return random.sample(all_chunks, min(k, len(all_chunks)))
```

This evaluation framework provides the tools needed to assess chunking strategies across multiple dimensions: retrieval effectiveness, answer quality, system performance, and statistical significance. The modular design allows for easy extension and customization based on specific requirements and use cases.

709	skills/chunking-strategy/references/implementation.md	Normal file
@@ -0,0 +1,709 @@

# Complete Implementation Guidelines

This document provides comprehensive implementation guidance for building effective chunking systems.

## System Architecture

### Core Components

```
Document Processor
├── Ingestion Layer
│   ├── Document Type Detection
│   ├── Format Parsing (PDF, HTML, Markdown, etc.)
│   └── Content Extraction
├── Analysis Layer
│   ├── Structure Analysis
│   ├── Content Type Identification
│   └── Complexity Assessment
├── Strategy Selection Layer
│   ├── Rule-based Selection
│   ├── ML-based Prediction
│   └── Adaptive Configuration
├── Chunking Layer
│   ├── Strategy Implementation
│   ├── Parameter Optimization
│   └── Quality Validation
└── Output Layer
    ├── Chunk Metadata Generation
    ├── Embedding Integration
    └── Storage Preparation
```

## Pre-processing Pipeline

### Document Analysis Framework

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentAnalysis:
    doc_type: str
    structure_score: float   # 0-1, higher means more structured
    complexity_score: float  # 0-1, higher means more complex
    content_types: List[str]
    language: str
    estimated_tokens: int
    has_multimodal: bool

class DocumentAnalyzer:
    def __init__(self):
        self.structure_patterns = {
            'markdown': [r'^#+\s', r'^\*\*.*\*\*$', r'^\* ', r'^\d+\. '],
            'html': [r'<h[1-6]>', r'<p>', r'<div>', r'<table>'],
            'latex': [r'\\section', r'\\subsection', r'\\begin\{', r'\\end\{'],
            'academic': [r'^\d+\.', r'^\d+\.\d+', r'^[A-Z]\.', r'^Figure \d+']
        }

    def analyze(self, content: str) -> DocumentAnalysis:
        doc_type = self.detect_document_type(content)
        structure_score = self.calculate_structure_score(content, doc_type)
        complexity_score = self.calculate_complexity_score(content)
        content_types = self.identify_content_types(content)
        language = self.detect_language(content)
        estimated_tokens = self.estimate_tokens(content)
        has_multimodal = self.detect_multimodal_content(content)

        return DocumentAnalysis(
            doc_type=doc_type,
            structure_score=structure_score,
            complexity_score=complexity_score,
            content_types=content_types,
            language=language,
            estimated_tokens=estimated_tokens,
            has_multimodal=has_multimodal
        )

    def detect_document_type(self, content: str) -> str:
        content_lower = content.lower()

        if '<html' in content_lower or '<body' in content_lower:
            return 'html'
        elif '#' in content and '##' in content:
            return 'markdown'
        elif '\\documentclass' in content_lower or '\\begin{' in content_lower:
            return 'latex'
        elif any(keyword in content_lower for keyword in ['abstract', 'introduction', 'conclusion', 'references']):
            return 'academic'
        elif 'def ' in content or 'class ' in content or 'function ' in content_lower:
            return 'code'
        else:
            return 'plain'

    def calculate_structure_score(self, content: str, doc_type: str) -> float:
        patterns = self.structure_patterns.get(doc_type, [])
        if not patterns:
            return 0.5  # Default for unstructured content

        line_count = len(content.split('\n'))
        structured_lines = 0

        for line in content.split('\n'):
            for pattern in patterns:
                if re.search(pattern, line.strip()):
                    structured_lines += 1
                    break

        return min(structured_lines / max(line_count, 1), 1.0)

    def calculate_complexity_score(self, content: str) -> float:
        # Factors that increase complexity
        avg_sentence_length = self.calculate_avg_sentence_length(content)
        vocabulary_richness = self.calculate_vocabulary_richness(content)
        nested_structure = self.detect_nested_structure(content)

        # Normalize and combine
        complexity = (
            min(avg_sentence_length / 30, 1.0) * 0.3 +
            vocabulary_richness * 0.4 +
            nested_structure * 0.3
        )

        return min(complexity, 1.0)

    def identify_content_types(self, content: str) -> List[str]:
        types = []

        if '```' in content or 'def ' in content or 'function ' in content.lower():
            types.append('code')
        if '|' in content and '\n' in content:
            types.append('tables')
        if re.search(r'\!\[.*\]\(.*\)', content):
            types.append('images')
        if re.search(r'http[s]?://', content):
            types.append('links')
        if re.search(r'\d+\.\d+', content) or re.search(r'\$\d', content):
            types.append('numbers')

        return types if types else ['text']

    def detect_language(self, content: str) -> str:
        # Simple script-based detection; a proper language detection
        # library would be used in practice.
        if re.search(r'[\u4e00-\u9fff]', content):
            return 'chinese'
        elif re.search(r'[\u0600-\u06ff]', content):
            return 'arabic'
        elif re.search(r'[\u0400-\u04ff]', content):
            return 'russian'
        else:
            return 'english'  # Default assumption

    def estimate_tokens(self, content: str) -> int:
        # Rough estimation - actual tokenization varies by model
        word_count = len(content.split())
        return int(word_count * 1.3)  # Average tokens per word

    def detect_multimodal_content(self, content: str) -> bool:
        multimodal_indicators = [
            r'\!\[.*\]\(.*\)',  # Images
            r'<iframe',         # Embedded content
            r'<object',         # Embedded objects
            r'<embed',          # Embedded media
        ]

        return any(re.search(pattern, content) for pattern in multimodal_indicators)

    def calculate_avg_sentence_length(self, content: str) -> float:
        sentences = re.split(r'[.!?]+', content)
        sentences = [s.strip() for s in sentences if s.strip()]
        if not sentences:
            return 0
        return sum(len(s.split()) for s in sentences) / len(sentences)

    def calculate_vocabulary_richness(self, content: str) -> float:
        words = content.lower().split()
        if not words:
            return 0
        unique_words = set(words)
        return len(unique_words) / len(words)

    def detect_nested_structure(self, content: str) -> float:
        # Detect nested lists, indented content, etc.
        lines = content.split('\n')
        indented_lines = 0

        for line in lines:
            if line.strip() and line.startswith(' '):
                indented_lines += 1

        return indented_lines / max(len(lines), 1)
```
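A quick usage sketch of the analyzer (the input text is illustrative):

```python
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze("# Title\n\n## Section\n\nBody text with a [link](https://example.com).")
print(analysis.doc_type, round(analysis.structure_score, 2), analysis.estimated_tokens)
```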

### Strategy Selection Engine

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ChunkingStrategy(ABC):
    @abstractmethod
    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        pass

class StrategySelector:
    def __init__(self):
        self.strategies = {
            'fixed_size': FixedSizeStrategy(),
            'recursive': RecursiveStrategy(),
            'structure_aware': StructureAwareStrategy(),
            'semantic': SemanticStrategy(),
            'adaptive': AdaptiveStrategy()
        }

    def select_strategy(self, analysis: DocumentAnalysis) -> str:
        # Rule-based selection logic
        if analysis.structure_score > 0.8 and analysis.doc_type in ['markdown', 'html', 'latex']:
            return 'structure_aware'
        elif analysis.complexity_score > 0.7 and analysis.estimated_tokens < 10000:
            return 'semantic'
        elif analysis.doc_type == 'code':
            return 'structure_aware'
        elif analysis.structure_score < 0.3:
            return 'fixed_size'
        elif analysis.complexity_score > 0.5:
            return 'recursive'
        else:
            return 'adaptive'

    def get_strategy(self, analysis: DocumentAnalysis) -> ChunkingStrategy:
        strategy_name = self.select_strategy(analysis)
        return self.strategies[strategy_name]

# Example strategy implementations (RecursiveStrategy, StructureAwareStrategy,
# and SemanticStrategy follow the same interface and are omitted here)
class FixedSizeStrategy(ChunkingStrategy):
    def __init__(self, default_size=512, default_overlap=50):
        self.default_size = default_size
        self.default_overlap = default_overlap

    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Adjust parameters based on analysis
        if analysis.complexity_score > 0.7:
            chunk_size = 1024
        elif analysis.complexity_score < 0.3:
            chunk_size = 256
        else:
            chunk_size = self.default_size

        overlap = int(chunk_size * 0.1)  # 10% overlap

        return self._fixed_size_chunk(content, chunk_size, overlap)

    def _fixed_size_chunk(self, content: str, chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
        # Minimal word-based sliding window; a tokenizer-aware splitter such
        # as RecursiveCharacterTextSplitter could be used here instead.
        words = content.split()
        step = max(chunk_size - overlap, 1)
        chunks = []
        for start in range(0, len(words), step):
            window = words[start:start + chunk_size]
            if window:
                chunks.append({'content': ' '.join(window), 'start_word': start})
        return chunks

class AdaptiveStrategy(ChunkingStrategy):
    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Combine multiple strategies based on content characteristics
        structured_chunks: List[Dict[str, Any]] = []
        unstructured_chunks: List[Dict[str, Any]] = []

        if analysis.structure_score > 0.6:
            # Use structure-aware chunking for structured parts
            structured_chunks = self._chunk_structured_parts(content, analysis)
        else:
            # Use fixed-size chunking for unstructured parts
            unstructured_chunks = self._chunk_unstructured_parts(content, analysis)

        # Merge and optimize
        return self._merge_chunks(structured_chunks + unstructured_chunks)

    def _chunk_structured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Implementation for structured content
        pass

    def _chunk_unstructured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Implementation for unstructured content
        pass

    def _merge_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Implementation for merging and optimizing chunks
        pass
```
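Putting the analyzer and selector together (a sketch; it assumes the remaining strategy classes are defined as noted above, and `document_text` is the raw document string):

```python
analyzer = DocumentAnalyzer()
selector = StrategySelector()

analysis = analyzer.analyze(document_text)
strategy = selector.get_strategy(analysis)
chunks = strategy.chunk(document_text, analysis)
```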

## Quality Assurance Framework

### Chunk Quality Metrics

```python
import re
from typing import Any, Dict, List

import numpy as np

class ChunkQualityAssessor:
    def __init__(self):
        self.quality_weights = {
            'coherence': 0.3,
            'completeness': 0.25,
            'size_appropriateness': 0.2,
            'semantic_similarity': 0.15,
            'boundary_quality': 0.1
        }

    def assess_chunks(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> Dict[str, float]:
        scores = {}

        # Coherence: do chunks make sense on their own?
        scores['coherence'] = self._assess_coherence(chunks)

        # Completeness: do chunks preserve important information?
        scores['completeness'] = self._assess_completeness(chunks, analysis)

        # Size appropriateness: are chunks within the optimal size range?
        scores['size_appropriateness'] = self._assess_size(chunks)

        # Semantic similarity: are chunks thematically consistent?
        scores['semantic_similarity'] = self._assess_semantic_consistency(chunks)

        # Boundary quality: are chunk boundaries placed well?
        scores['boundary_quality'] = self._assess_boundary_quality(chunks)

        # Calculate overall quality score
        overall_score = sum(
            score * self.quality_weights[metric]
            for metric, score in scores.items()
        )

        scores['overall'] = overall_score
        return scores

    def _assess_coherence(self, chunks: List[Dict[str, Any]]) -> float:
        # Simple heuristic-based coherence assessment
        coherence_scores = []

        for chunk in chunks:
            content = chunk['content']

            # Check for complete sentences
            sentences = re.split(r'[.!?]+', content)
            complete_sentences = sum(1 for s in sentences if s.strip())
            coherence = complete_sentences / max(len(sentences), 1)

            coherence_scores.append(coherence)

        return np.mean(coherence_scores)

    def _assess_completeness(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
        # Check if important structural elements are preserved
        if analysis.doc_type in ['markdown', 'html']:
            return self._assess_structure_preservation(chunks, analysis)
        else:
            return self._assess_content_preservation(chunks)

    def _assess_structure_preservation(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
        # Check if headings, lists, and other structural elements are preserved
        preserved_elements = 0
        total_elements = 0

        for chunk in chunks:
            content = chunk['content']

            # Count preserved structural elements
            headings = len(re.findall(r'^#+\s', content, re.MULTILINE))
            lists = len(re.findall(r'^\s*[-*+]\s', content, re.MULTILINE))

            preserved_elements += headings + lists
            total_elements += 1  # Simplified count

        return preserved_elements / max(total_elements, 1)

    def _assess_content_preservation(self, chunks: List[Dict[str, Any]]) -> float:
        # Simple check based on content ratio; this would need a
        # comparison with the original content.
        total_content = ''.join(chunk['content'] for chunk in chunks)
        return 0.8  # Placeholder

    def _assess_size(self, chunks: List[Dict[str, Any]]) -> float:
        optimal_min = 100   # tokens
        optimal_max = 1000  # tokens

        size_scores = []
        for chunk in chunks:
            token_count = self._estimate_tokens(chunk['content'])
            if optimal_min <= token_count <= optimal_max:
                score = 1.0
            elif token_count < optimal_min:
                score = token_count / optimal_min
            else:
                score = max(0, 1 - (token_count - optimal_max) / optimal_max)

            size_scores.append(score)

        return np.mean(size_scores)

    def _assess_semantic_consistency(self, chunks: List[Dict[str, Any]]) -> float:
        # This would require an embedding model; placeholder implementation.
        return 0.7

    def _assess_boundary_quality(self, chunks: List[Dict[str, Any]]) -> float:
        # Check that boundaries don't split important content
        boundary_scores = []

        for chunk in chunks:
            content = chunk['content']

            # Penalize incomplete sentences at boundaries
            if not content.strip().endswith(('.', '!', '?', '>', '}')):
                boundary_scores.append(0.5)
            else:
                boundary_scores.append(1.0)

        return np.mean(boundary_scores)

    def _estimate_tokens(self, content: str) -> int:
        # Simple token estimation
        return len(content.split()) * 4 // 3  # Rough approximation
```

## Error Handling and Edge Cases
|
||||
|
||||
### Robust Error Handling
|
||||
|
||||
```python
|
||||
import logging
|
||||
from typing import Optional, List
|
||||
from dataclasses import dataclass
|
||||
|
||||
@dataclass
|
||||
class ChunkingError:
|
||||
error_type: str
|
||||
message: str
|
||||
chunk_index: Optional[int] = None
|
||||
recovery_action: Optional[str] = None
|
||||
|
||||
class ChunkingErrorHandler:
|
||||
def __init__(self):
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.error_handlers = {
|
||||
'empty_content': self._handle_empty_content,
|
||||
'oversized_chunk': self._handle_oversized_chunk,
|
||||
'encoding_error': self._handle_encoding_error,
|
||||
'memory_error': self._handle_memory_error,
|
||||
'structure_parsing_error': self._handle_structure_parsing_error
|
||||
}
|
||||
|
||||
def handle_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
error_type = self._classify_error(error)
|
||||
handler = self.error_handlers.get(error_type, self._handle_generic_error)
|
||||
return handler(error, context)
|
||||
|
||||
def _classify_error(self, error: Exception) -> str:
|
||||
if isinstance(error, ValueError) and 'empty' in str(error).lower():
|
||||
return 'empty_content'
|
||||
elif isinstance(error, MemoryError):
|
||||
return 'memory_error'
|
||||
elif isinstance(error, UnicodeError):
|
||||
return 'encoding_error'
|
||||
elif 'too large' in str(error).lower():
|
||||
return 'oversized_chunk'
|
||||
elif 'parsing' in str(error).lower():
|
||||
return 'structure_parsing_error'
|
||||
else:
|
||||
return 'generic_error'
|
||||
|
||||
def _handle_empty_content(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Empty content encountered: {error}")
|
||||
return ChunkingError(
|
||||
error_type='empty_content',
|
||||
message=str(error),
|
||||
recovery_action='skip_empty_content'
|
||||
)
|
||||
|
||||
def _handle_oversized_chunk(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Oversized chunk detected: {error}")
|
||||
return ChunkingError(
|
||||
error_type='oversized_chunk',
|
||||
message=str(error),
|
||||
chunk_index=context.get('chunk_index'),
|
||||
recovery_action='reduce_chunk_size'
|
||||
)
|
||||
|
||||
def _handle_encoding_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Encoding error: {error}")
|
||||
return ChunkingError(
|
||||
error_type='encoding_error',
|
||||
message=str(error),
|
||||
recovery_action='fallback_encoding'
|
||||
)
|
||||
|
||||
def _handle_memory_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Memory error during chunking: {error}")
|
||||
return ChunkingError(
|
||||
error_type='memory_error',
|
||||
message=str(error),
|
||||
recovery_action='process_in_batches'
|
||||
)
|
||||
|
||||
def _handle_structure_parsing_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Structure parsing failed: {error}")
|
||||
return ChunkingError(
|
||||
error_type='structure_parsing_error',
|
||||
message=str(error),
|
||||
recovery_action='fallback_to_fixed_size'
|
||||
)
|
||||
|
||||
def _handle_generic_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Unexpected error during chunking: {error}")
|
||||
return ChunkingError(
|
||||
error_type='generic_error',
|
||||
message=str(error),
|
||||
recovery_action='skip_and_continue'
|
||||
)
|
||||
```
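
As a usage sketch, failures from any chunking entry point can be routed through the handler and its `recovery_action` inspected (`chunk_document` and `fixed_size_fallback` are assumed functions, not part of the handler's API):

```python
handler = ChunkingErrorHandler()

def safe_chunk(document: str, index: int):
    try:
        return chunk_document(document)  # assumed chunking entry point
    except Exception as exc:
        report = handler.handle_error(exc, {'chunk_index': index})
        if report.recovery_action == 'fallback_to_fixed_size':
            # Retry with a simple, robust strategy
            return fixed_size_fallback(document)  # assumed fallback
        return []  # skip on unrecoverable errors
```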

## Performance Optimization

### Caching and Memoization

```python
import hashlib
import json
import logging
import pickle
from typing import Any, Dict, List, Optional

import redis

class ChunkingCache:
    def __init__(self, redis_url: Optional[str] = None):
        if redis_url:
            self.redis_client = redis.from_url(redis_url)
        else:
            self.redis_client = None
        self.local_cache = {}

    def _generate_cache_key(self, content: str, strategy: str, params: Dict[str, Any]) -> str:
        # md5 is used only as a fast, non-cryptographic fingerprint
        content_hash = hashlib.md5(content.encode()).hexdigest()
        params_str = json.dumps(params, sort_keys=True)
        params_hash = hashlib.md5(params_str.encode()).hexdigest()
        return f"chunking:{strategy}:{content_hash}:{params_hash}"

    def get(self, content: str, strategy: str, params: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
        cache_key = self._generate_cache_key(content, strategy, params)

        # Try local cache first
        if cache_key in self.local_cache:
            return self.local_cache[cache_key]

        # Try Redis cache
        if self.redis_client:
            try:
                cached_data = self.redis_client.get(cache_key)
                if cached_data:
                    chunks = pickle.loads(cached_data)
                    self.local_cache[cache_key] = chunks  # Cache locally too
                    return chunks
            except Exception as e:
                logging.warning(f"Redis cache error: {e}")

        return None

    def set(self, content: str, strategy: str, params: Dict[str, Any], chunks: List[Dict[str, Any]]) -> None:
        cache_key = self._generate_cache_key(content, strategy, params)

        # Store in local cache
        self.local_cache[cache_key] = chunks

        # Store in Redis cache
        if self.redis_client:
            try:
                cached_data = pickle.dumps(chunks)
                self.redis_client.setex(cache_key, 3600, cached_data)  # 1 hour TTL
            except Exception as e:
                logging.warning(f"Redis cache set error: {e}")

    def clear_local_cache(self):
        self.local_cache.clear()

    def clear_redis_cache(self):
        if self.redis_client:
            pattern = "chunking:*"
            keys = self.redis_client.keys(pattern)
            if keys:
                self.redis_client.delete(*keys)
```
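
A usage sketch wrapping an arbitrary chunker with the cache (the strategy name and parameter dict are illustrative; they only feed the cache key):

```python
cache = ChunkingCache()  # local-only; pass redis_url for a shared cache

def cached_chunk(text: str, chunker) -> list:
    params = {'chunk_size': 512, 'chunk_overlap': 50}
    chunks = cache.get(text, 'fixed_size', params)
    if chunks is None:
        chunks = chunker.chunk(text)  # any chunker with a chunk() method
        cache.set(text, 'fixed_size', params, chunks)
    return chunks
```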

### Batch Processing

```python
import asyncio
import concurrent.futures
import logging
from typing import Any, Callable, Dict, List

class BatchChunkingProcessor:
    def __init__(self, max_workers: int = 4, batch_size: int = 10):
        self.max_workers = max_workers
        self.batch_size = batch_size

    def process_documents_batch(self, documents: List[str],
                                chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
        """Process multiple documents in parallel, preserving input order."""
        results: List[List[Dict[str, Any]]] = []

        # Process in batches to avoid memory issues
        for i in range(0, len(documents), self.batch_size):
            batch = documents[i:i + self.batch_size]

            with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                # Map each future back to its position so results stay aligned
                # with the input documents (as_completed yields out of order)
                future_to_index = {
                    executor.submit(chunking_function, doc): idx
                    for idx, doc in enumerate(batch)
                }

                batch_results: List[List[Dict[str, Any]]] = [[] for _ in batch]
                for future in concurrent.futures.as_completed(future_to_index):
                    idx = future_to_index[future]
                    try:
                        batch_results[idx] = future.result()
                    except Exception as e:
                        logging.error(f"Error processing document: {e}")
                        # Leave an empty result for failed documents

            results.extend(batch_results)

        return results

    async def process_documents_async(self, documents: List[str],
                                      chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
        """Process documents asynchronously."""
        semaphore = asyncio.Semaphore(self.max_workers)

        async def process_single_document(doc: str) -> List[Dict[str, Any]]:
            async with semaphore:
                # Run the synchronous chunking function in an executor
                loop = asyncio.get_event_loop()
                return await loop.run_in_executor(None, chunking_function, doc)

        tasks = [process_single_document(doc) for doc in documents]
        # Note: with return_exceptions=True, failed documents yield the
        # exception object in place of a chunk list
        return await asyncio.gather(*tasks, return_exceptions=True)
```
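
A usage sketch for a document collection (`documents` is a list of strings and `chunk_document` an assumed single-document chunking function):

```python
processor = BatchChunkingProcessor(max_workers=4, batch_size=10)
per_document_chunks = processor.process_documents_batch(documents, chunk_document)

# Or, inside an async application:
# per_document_chunks = await processor.process_documents_async(documents, chunk_document)
```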

## Monitoring and Observability

### Metrics Collection

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ChunkingMetrics:
    total_documents: int
    total_chunks: int
    avg_chunk_size: float
    processing_time: float
    memory_usage: float
    error_count: int
    strategy_distribution: Dict[str, int]

class MetricsCollector:
    def __init__(self):
        self.metrics: Dict[str, List[Any]] = defaultdict(list)
        # Strategy usage is a counter, not a list of samples
        self.strategy_counts: Dict[str, int] = defaultdict(int)
        self.start_time = None

    def start_timing(self):
        self.start_time = time.time()

    def end_timing(self) -> float:
        if self.start_time:
            duration = time.time() - self.start_time
            self.metrics['processing_time'].append(duration)
            self.start_time = None
            return duration
        return 0.0

    def record_chunk_count(self, count: int):
        self.metrics['chunk_count'].append(count)

    def record_chunk_size(self, size: int):
        self.metrics['chunk_size'].append(size)

    def record_strategy_usage(self, strategy: str):
        self.strategy_counts[strategy] += 1

    def record_error(self, error_type: str):
        self.metrics['errors'].append(error_type)

    def record_memory_usage(self, memory_mb: float):
        self.metrics['memory_usage'].append(memory_mb)

    def get_summary(self) -> ChunkingMetrics:
        return ChunkingMetrics(
            total_documents=len(self.metrics['processing_time']),
            total_chunks=sum(self.metrics['chunk_count']),
            avg_chunk_size=sum(self.metrics['chunk_size']) / max(len(self.metrics['chunk_size']), 1),
            processing_time=sum(self.metrics['processing_time']),
            memory_usage=sum(self.metrics['memory_usage']) / max(len(self.metrics['memory_usage']), 1),
            error_count=len(self.metrics['errors']),
            strategy_distribution=dict(self.strategy_counts)
        )

    def reset(self):
        self.metrics.clear()
        self.strategy_counts.clear()
        self.start_time = None
```
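
A usage sketch instrumenting one chunking run (`chunk_document` and `document` are assumed inputs):

```python
collector = MetricsCollector()

collector.start_timing()
chunks = chunk_document(document)  # assumed chunking entry point
collector.end_timing()

collector.record_chunk_count(len(chunks))
for chunk in chunks:
    collector.record_chunk_size(len(chunk['content']))
collector.record_strategy_usage('recursive')

print(collector.get_summary())
```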

This implementation guide provides a comprehensive foundation for building robust, scalable chunking systems that can handle various document types and use cases while maintaining high quality and performance.

File: skills/chunking-strategy/references/research.md (366 lines, new file)
# Key Research Papers and Findings

This document summarizes important research papers and findings related to chunking strategies for RAG systems.

## Seminal Papers

### "Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG" (arXiv:2504.19754)

**Key Findings**:
- Page-level chunking achieved the highest average accuracy (0.648) with the lowest variance across different query types
- Optimal chunk size varies significantly by document type and query complexity
- Factoid queries perform better with smaller chunks (256-512 tokens)
- Complex analytical queries benefit from larger chunks (1024+ tokens)

**Methodology**:
- Evaluated 7 different chunking strategies across multiple document types
- Tested with both factoid and analytical queries
- Measured end-to-end RAG performance

**Practical Implications**:
- Start with page-level chunking for general-purpose RAG systems
- Adapt chunk size based on expected query patterns
- Consider hybrid approaches for mixed query types

### "Lost in the Middle: How Language Models Use Long Contexts"

**Key Findings**:
- Language models tend to pay more attention to information at the beginning and end of the context
- Information in the middle of long contexts is often ignored
- Performance degradation is most severe for centrally located information

**Practical Implications**:
- Place the most important information at chunk boundaries
- Consider chunk overlap so that important context appears multiple times
- Use ranking to prioritize relevant chunks for inclusion in the context

### "Grounded Language Learning in a Simulated 3D World"

**Related Concepts**:
- Importance of grounding text in visual/contextual information
- Multi-modal learning approaches for better understanding

**Relevance to Chunking**:
- Supports contextual chunking approaches that preserve visual/contextual relationships
- Validates the importance of maintaining document structure and relationships

## Industry Research

### NVIDIA Research: "Finding the Best Chunking Strategy for Accurate AI Responses"

**Key Findings**:
- Page-level chunking outperformed sentence- and paragraph-level approaches
- Fixed-size chunking showed consistent but suboptimal performance
- Semantic chunking provided improvements for complex documents

**Technical Details**:
- Tested chunk sizes from 128 to 2048 tokens
- Evaluated across financial, technical, and legal documents
- Measured both retrieval accuracy and generation quality

**Recommendations**:
- Use 512-1024 token chunks as a starting point
- Implement adaptive chunking based on document complexity
- Consider page boundaries as natural chunk separators

### Cohere Research: "Effective Chunking Strategies for RAG"

**Key Findings**:
- Recursive character splitting provides a good balance of performance and simplicity
- Document structure awareness improves retrieval by 15-20%
- Overlap of 10-20% provides optimal context preservation

**Methodology**:
- Compared 12 chunking strategies across 6 document types
- Measured retrieval precision, recall, and F1-score
- Tested with both dense and sparse retrieval

**Best Practices Identified**:
- Start with recursive character splitting with 10-20% overlap
- Preserve document structure (headings, lists, tables)
- Customize chunk size based on the embedding model's context window

### Anthropic: "Contextual Retrieval"

**Key Innovation**:
- Enhance each chunk with LLM-generated contextual information before embedding
- Improves retrieval precision by 25-30% for complex documents
- Particularly effective for technical and academic content

**Implementation Approach**:
1. Split the document using traditional methods
2. For each chunk, generate contextual information using an LLM
3. Prepend the context to the chunk before embedding
4. Use hybrid search (dense + sparse) with weighted ranking

**Trade-offs**:
- Significant computational overhead (2-3x processing time)
- Higher embedding storage requirements
- Improved retrieval precision justifies the cost for high-value applications

## Algorithmic Advances

### Semantic Chunking Algorithms

#### "Semantic Segmentation of Text Documents"

**Core Idea**: Use cosine similarity between consecutive sentence embeddings to identify natural boundaries.

**Algorithm** (a minimal code sketch follows the list):
1. Split the document into sentences
2. Generate embeddings for each sentence
3. Calculate similarity between consecutive sentences
4. Create boundaries where similarity drops below a threshold
5. Merge short segments with neighbors
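
A minimal sketch of steps 1-4 (model name and the 0.8 threshold are illustrative; the short-segment merging of step 5 is omitted for brevity):

```python
from sentence_transformers import SentenceTransformer, util

def semantic_boundaries(sentences: list, threshold: float = 0.8) -> list:
    # Model choice is illustrative; any sentence-embedding model works
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(sentences)

    boundaries = [0]  # a new segment starts at each boundary index
    for i in range(1, len(sentences)):
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            boundaries.append(i)
    return boundaries
```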

**Performance**: 20-30% improvement in retrieval relevance over fixed-size chunking for technical documents.

#### "Hierarchical Semantic Chunking"

**Core Idea**: Multi-level semantic segmentation for document organization.

**Algorithm**:
1. Document-level semantic analysis
2. Section-level boundary detection
3. Paragraph-level segmentation
4. Sentence-level refinement

**Benefits**: Maintains document hierarchy while adapting to semantic structure.

### Advanced Embedding Techniques

#### "Late Chunking: Contextual Chunk Embeddings"

**Core Innovation**: Generate embeddings for the entire document first, then derive chunk embeddings from the token-level embeddings.

**Advantages**:
- Preserves global document context
- Reduces context fragmentation
- Better for documents with complex inter-relationships

**Requirements**:
- Long-context embedding models (8k+ tokens)
- Significant computational resources
- Specialized implementation

#### "Hierarchical Embedding Retrieval"

**Approach**: Create embeddings at multiple granularities (document, section, paragraph, sentence).

**Implementation**:
1. Generate embeddings at each level
2. Store them in a hierarchical vector database
3. Query at the appropriate granularity based on information needs

**Performance**: 15-25% improvement in precision for complex queries.

## Evaluation Methodologies

### Retrieval-Augmented Generation Assessment Frameworks

#### RAGAS Framework

**Metrics**:
- **Faithfulness**: Consistency between the generated answer and the retrieved context
- **Answer Relevancy**: Relevance of the generated answer to the question
- **Context Relevancy**: Relevance of the retrieved context to the question
- **Context Recall**: Coverage of relevant information in the retrieved context

**Evaluation Process**:
1. Generate questions from the document corpus
2. Retrieve relevant chunks using different strategies
3. Generate answers using the retrieved chunks
4. Evaluate using automated metrics and human judgment

#### ARES Framework

**Innovation**: Automated evaluation using synthetic questions and LLM-based assessment.

**Key Features**:
- Generates diverse question types (factoid, analytical, comparative)
- Uses LLMs to evaluate answer quality
- Provides scalable evaluation without human annotation

### Benchmark Datasets

#### Natural Questions (NQ)

**Description**: Real user questions from Google Search with relevant Wikipedia passages.

**Relevance**: Natural language queries with authentic relevance judgments.

#### MS MARCO

**Description**: Large-scale passage ranking dataset with real search queries.

**Relevance**: High-quality relevance judgments for passage retrieval.

#### HotpotQA

**Description**: Multi-hop question answering requiring information from multiple documents.

**Relevance**: Tests the ability to retrieve and synthesize information from multiple chunks.

## Domain-Specific Research

### Medical Documents

#### "Optimal Chunking for Medical Question Answering"

**Key Findings**:
- Medical terminology requires specialized handling
- Section-based chunking (History, Diagnosis, Treatment) is most effective
- Preserving doctor-patient dialogue context is crucial

**Recommendations**:
- Use medical-specific tokenizers
- Preserve section headers and structure
- Maintain temporal relationships in medical histories

### Legal Documents

#### "Chunking Strategies for Legal Document Analysis"

**Key Findings**:
- Legal citations and cross-references require special handling
- Contract clause boundaries serve as natural chunk separators
- Case law benefits from hierarchical chunking

**Best Practices**:
- Preserve legal citation structure
- Use clause and section boundaries
- Maintain context for legal definitions and references

### Financial Documents

#### "SEC Filing Chunking for Financial Analysis"

**Key Findings**:
- Table preservation is critical for financial data
- XBRL tagging provides natural segmentation
- Risk factors sections benefit from specialized treatment

**Approach**:
- Preserve complete tables when possible
- Use XBRL tags for structured data
- Create specialized chunks for risk sections

## Emerging Trends

### Multi-Modal Chunking

#### "Integrating Text, Tables, and Images in RAG Systems"

**Innovation**: Unified chunking approach for mixed-modal content.

**Approach**:
- Extract and describe images using vision models
- Preserve table structure and relationships
- Create unified embeddings for mixed content

**Results**: 35% improvement in complex document understanding.

### Adaptive Chunking

#### "Machine Learning-Based Chunk Size Optimization"

**Core Idea**: Use ML models to predict optimal chunking parameters.

**Features**:
- Document length and complexity
- Query type distribution
- Embedding model characteristics
- Performance requirements

**Benefits**: Dynamic optimization based on use case and content.

### Real-time Chunking

#### "Streaming Chunking for Live Document Processing"

**Innovation**: Process documents as they become available.

**Techniques**:
- Incremental boundary detection
- Dynamic chunk size adjustment
- Context preservation across chunks

**Applications**: Live news feeds, social media analysis, meeting transcripts.

## Implementation Challenges

### Computational Efficiency

#### "Scalable Chunking for Large Document Collections"

**Challenges**:
- Processing millions of documents efficiently
- Memory usage optimization
- Distributed processing requirements

**Solutions**:
- Batch processing with parallel execution
- Streaming approaches for large documents
- Distributed chunking with load balancing

### Quality Assurance

#### "Evaluating Chunk Quality at Scale"

**Challenges**:
- Automated quality assessment
- Detecting poor chunk boundaries
- Maintaining consistency across document types

**Approaches**:
- Heuristic-based quality metrics
- LLM-based evaluation
- Human-in-the-loop validation

## Future Research Directions

### Context-Aware Chunking

**Open Questions**:
- How can cross-chunk relationships be preserved optimally?
- Can chunk quality be predicted without human evaluation?
- What is the optimal balance between size and context?

### Domain Adaptation

**Research Areas**:
- Automatic domain detection and adaptation
- Transfer learning across domains
- Zero-shot chunking for new document types

### Evaluation Standards

**Needs**:
- Standardized evaluation benchmarks
- Cross-paper comparison methodologies
- Real-world performance metrics

## Practical Recommendations Based on Research

### Starting Points

1. **For General RAG Systems**: Page-level or recursive character chunking with 512-1024 tokens and 10-20% overlap
2. **For Technical Documents**: Structure-aware chunking with semantic boundary detection
3. **For High-Value Applications**: Contextual retrieval with LLM-generated context

### Evolution Strategy

1. **Begin**: Simple fixed-size chunking (512 tokens)
2. **Improve**: Add document structure awareness
3. **Optimize**: Implement semantic boundaries
4. **Advanced**: Consider contextual retrieval for critical use cases

### Key Success Factors

1. **Match strategy to document type and query patterns**
2. **Preserve document structure when beneficial**
3. **Use overlap to maintain context across boundaries**
4. **Monitor both accuracy and computational costs**
5. **Iterate based on specific use case requirements**

This research foundation provides evidence-based guidance for implementing effective chunking strategies across various domains and use cases.

File: skills/chunking-strategy/references/semantic-methods.md (1315 lines, new file; diff suppressed because it is too large)

File: skills/chunking-strategy/references/strategies.md (423 lines, new file)
# Detailed Chunking Strategies

This document provides comprehensive implementation details for all chunking strategies mentioned in the main skill.

## Level 1: Fixed-Size Chunking

### Implementation

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

class FixedSizeChunker:
    def __init__(self, chunk_size=512, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )

    def chunk(self, documents):
        return self.splitter.split_documents(documents)
```

### Parameter Recommendations

| Use Case | Chunk Size | Overlap | Rationale |
|----------|------------|---------|-----------|
| Factoid Queries | 256 | 25 | Small chunks for precise answers |
| General Q&A | 512 | 50 | Balanced approach for most cases |
| Analytical Queries | 1024 | 100 | Larger context for complex analysis |
| Code Documentation | 300 | 30 | Preserve code context while maintaining focus |

### Best Practices

- Start with 512 tokens and 10-20% overlap
- Adjust based on the embedding model's context window
- Use overlap for queries where context might span boundaries
- Monitor token count vs. character count for your model (a token-aware sketch follows)
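
Since most embedding models budget in tokens rather than characters, the splitter's `length_function` can count tokens instead of characters. A sketch using tiktoken (the model name is illustrative):

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def token_len(text: str) -> int:
    return len(encoding.encode(text))

# chunk_size and chunk_overlap are now measured in tokens, not characters
token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=token_len,
)
```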

## Level 2: Recursive Character Chunking

### Implementation

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

class RecursiveChunker:
    def __init__(self, chunk_size=512, separators=None):
        self.chunk_size = chunk_size
        self.separators = separators or ["\n\n", "\n", " ", ""]
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=0,
            length_function=len,
            separators=self.separators
        )

    def chunk(self, text):
        return self.splitter.create_documents([text])

# Document-specific configurations
def get_chunker_for_document_type(doc_type):
    configurations = {
        "markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
        "html": ["</div>", "</p>", "\n\n", "\n", " ", ""],
        "code": ["\n\n", "\n", " ", ""],
        "plain": ["\n\n", "\n", " ", ""]
    }
    return RecursiveChunker(separators=configurations.get(doc_type, ["\n\n", "\n", " ", ""]))
```

### Customization Guidelines

- **Markdown**: Use headings as primary separators
- **HTML**: Use block-level tags as separators
- **Code**: Preserve function and class boundaries
- **Academic papers**: Prioritize paragraph and section breaks

## Level 3: Structure-Aware Chunking

### Markdown Documents

```python
import markdown
from bs4 import BeautifulSoup

class MarkdownChunker:
    def __init__(self, max_chunk_size=512):
        self.max_chunk_size = max_chunk_size

    def chunk(self, markdown_text):
        html = markdown.markdown(markdown_text)
        soup = BeautifulSoup(html, 'html.parser')

        chunks = []
        current_chunk = ""
        current_heading = "Introduction"

        for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'pre', 'table']):
            if element.name.startswith('h'):
                if current_chunk.strip():
                    chunks.append({
                        "content": current_chunk.strip(),
                        "heading": current_heading
                    })
                current_heading = element.get_text().strip()
                current_chunk = f"{element}\n"
            elif element.name in ['pre', 'table']:
                # Preserve code blocks and tables intact
                if len(current_chunk) + len(str(element)) > self.max_chunk_size:
                    if current_chunk.strip():
                        chunks.append({
                            "content": current_chunk.strip(),
                            "heading": current_heading
                        })
                    current_chunk = f"{element}\n"
                else:
                    current_chunk += f"{element}\n"
            else:
                current_chunk += str(element)

        if current_chunk.strip():
            chunks.append({
                "content": current_chunk.strip(),
                "heading": current_heading
            })

        return chunks
```
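
A brief usage sketch for the chunker above (the input text is illustrative):

```python
chunker = MarkdownChunker(max_chunk_size=512)
chunks = chunker.chunk("# Intro\n\nSome text.\n\n## Details\n\nMore text.")

for chunk in chunks:
    # Each chunk carries the heading it belongs to
    print(chunk["heading"], "->", len(chunk["content"]), "chars")
```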

### Code Documents

```python
import ast
import re

class CodeChunker:
    def __init__(self, language='python'):
        self.language = language

    def chunk_python(self, code):
        tree = ast.parse(code)
        chunks = []

        # Note: ast.walk also yields nested functions and classes, so inner
        # definitions appear both inside their parent chunk and on their own
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                start_line = node.lineno - 1
                end_line = node.end_lineno if hasattr(node, 'end_lineno') else start_line + 10
                lines = code.split('\n')
                chunk_lines = lines[start_line:end_line]
                chunks.append('\n'.join(chunk_lines))

        return chunks

    def chunk_javascript(self, code):
        # Use regex for languages without AST parsers; these patterns do not
        # handle nested braces and are only a rough approximation
        function_pattern = r'(function\s+\w+\s*\([^)]*\)\s*\{[^}]*\})'
        class_pattern = r'(class\s+\w+\s*\{[^}]*\})'

        patterns = [function_pattern, class_pattern]
        chunks = []

        for pattern in patterns:
            matches = re.finditer(pattern, code, re.MULTILINE | re.DOTALL)
            for match in matches:
                chunks.append(match.group(1))

        return chunks

    def chunk(self, code):
        if self.language == 'python':
            return self.chunk_python(code)
        elif self.language == 'javascript':
            return self.chunk_javascript(code)
        else:
            # Fallback to line-based chunking
            return self.chunk_by_lines(code)

    def chunk_by_lines(self, code, max_lines=50):
        lines = code.split('\n')
        chunks = []

        for i in range(0, len(lines), max_lines):
            chunk = '\n'.join(lines[i:i+max_lines])
            chunks.append(chunk)

        return chunks
```

### Tabular Data

```python
from io import StringIO

import pandas as pd

class TableChunker:
    def __init__(self, max_rows=100, summary_rows=5):
        self.max_rows = max_rows
        self.summary_rows = summary_rows

    def chunk(self, table_data):
        if isinstance(table_data, str):
            df = pd.read_csv(StringIO(table_data))
        else:
            df = table_data

        chunks = []

        if len(df) <= self.max_rows:
            # Small table - keep intact
            chunks.append({
                "type": "full_table",
                "content": df.to_string(),
                "metadata": {
                    "rows": len(df),
                    "columns": len(df.columns)
                }
            })
        else:
            # Large table - create summary + chunks
            summary = df.head(self.summary_rows)
            chunks.append({
                "type": "table_summary",
                "content": f"Table Summary ({len(df)} rows, {len(df.columns)} columns):\n{summary.to_string()}",
                "metadata": {
                    "total_rows": len(df),
                    "summary_rows": self.summary_rows,
                    "columns": list(df.columns)
                }
            })

            # Chunk the remaining data
            for i in range(self.summary_rows, len(df), self.max_rows):
                chunk_df = df.iloc[i:i+self.max_rows]
                chunks.append({
                    "type": "table_chunk",
                    "content": f"Rows {i+1}-{min(i+self.max_rows, len(df))}:\n{chunk_df.to_string()}",
                    "metadata": {
                        "start_row": i + 1,
                        "end_row": min(i + self.max_rows, len(df)),
                        "columns": list(df.columns)
                    }
                })

        return chunks
```
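
A usage sketch with an in-memory DataFrame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ticker": ["AAA", "BBB"], "price": [10.0, 12.5]})
chunker = TableChunker(max_rows=100, summary_rows=5)

for chunk in chunker.chunk(df):
    # Small tables come back as a single "full_table" chunk
    print(chunk["type"], chunk["metadata"])
```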

## Level 4: Semantic Chunking

### Implementation

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticChunker:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.8, buffer_size=3):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.buffer_size = buffer_size

    def split_into_sentences(self, text):
        # Simple sentence splitting - can be enhanced with nltk/spacy
        sentences = re.split(r'[.!?]+', text)
        return [s.strip() for s in sentences if s.strip()]

    def chunk(self, text):
        sentences = self.split_into_sentences(text)

        if len(sentences) <= self.buffer_size:
            return [text]

        # Create embeddings
        embeddings = self.model.encode(sentences)

        chunks = []
        current_chunk_sentences = []

        for i in range(len(sentences)):
            current_chunk_sentences.append(sentences[i])

            # Check if we should create a boundary
            if i < len(sentences) - 1:
                similarity = cosine_similarity(
                    [embeddings[i]],
                    [embeddings[i + 1]]
                )[0][0]

                if similarity < self.similarity_threshold and len(current_chunk_sentences) >= 2:
                    chunks.append(' '.join(current_chunk_sentences))
                    current_chunk_sentences = []

        # Add remaining sentences
        if current_chunk_sentences:
            chunks.append(' '.join(current_chunk_sentences))

        return chunks
```
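
A usage sketch (the sample text is illustrative; the topic shift should produce a boundary):

```python
chunker = SemanticChunker(similarity_threshold=0.8, buffer_size=3)
chunks = chunker.chunk(
    "Solar panels convert sunlight into electricity. Inverters turn DC into AC. "
    "Meanwhile, the recipe calls for two eggs. Whisk them with the sugar."
)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```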

### Parameter Tuning

| Parameter | Range | Effect |
|-----------|-------|--------|
| similarity_threshold | 0.5-0.9 | Higher values create more chunks |
| buffer_size | 1-10 | Larger buffers provide more context |
| model_name | Various | Different models for different domains |

### Optimization Tips

- Use domain-specific models for specialized content
- Adjust the threshold based on content complexity
- Cache embeddings for repeated processing
- Consider batch processing for large documents

## Level 5: Advanced Contextual Methods

### Late Chunking

```python
import torch
from transformers import AutoTokenizer, AutoModel

class LateChunker:
    def __init__(self, model_name="microsoft/DialoGPT-medium"):
        # The model name is only an example; any long-context encoder
        # can be substituted here
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def chunk(self, text, chunk_size=512):
        # Tokenize the entire document so every token sees the full context
        tokens = self.tokenizer(text, return_tensors="pt", truncation=False)

        # Get token-level embeddings
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
            token_embeddings = outputs.last_hidden_state[0]

        # Create chunk embeddings from token embeddings
        chunks = []
        for i in range(0, len(token_embeddings), chunk_size):
            chunk_tokens = token_embeddings[i:i+chunk_size]
            chunk_embedding = torch.mean(chunk_tokens, dim=0)
            chunks.append({
                "content": self.tokenizer.decode(tokens["input_ids"][0][i:i+chunk_size]),
                "embedding": chunk_embedding.numpy()
            })

        return chunks
```

### Contextual Retrieval

```python
import openai

class ContextualChunker:
    def __init__(self, api_key):
        self.client = openai.OpenAI(api_key=api_key)

    def generate_context(self, chunk, full_document):
        prompt = f"""
        Given the following document and a chunk from it, provide a brief context
        that helps understand the chunk's meaning within the full document.

        Document:
        {full_document[:2000]}...

        Chunk:
        {chunk}

        Context (max 50 words):
        """

        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
            temperature=0
        )

        return response.choices[0].message.content.strip()

    def chunk_with_context(self, text, base_chunker):
        # First create base chunks; base_chunker is expected to return
        # LangChain-style Document objects with a page_content attribute
        base_chunks = base_chunker.chunk(text)

        # Then add context to each chunk
        contextualized_chunks = []
        for chunk in base_chunks:
            context = self.generate_context(chunk.page_content, text)
            contextualized_content = f"Context: {context}\n\nContent: {chunk.page_content}"

            contextualized_chunks.append({
                "content": contextualized_content,
                "original_content": chunk.page_content,
                "context": context
            })

        return contextualized_chunks
```

## Performance Considerations

### Computational Cost Analysis

| Strategy | Time Complexity | Space Complexity | Relative Cost |
|----------|-----------------|------------------|---------------|
| Fixed-Size | O(n) | O(n) | Low |
| Recursive | O(n) | O(n) | Low |
| Structure-Aware | O(n log n) | O(n) | Medium |
| Semantic | O(n²) | O(n²) | High |
| Late Chunking | O(n) | O(n) | Very High |
| Contextual | O(n²) | O(n²) | Very High |

### Optimization Strategies

1. **Parallel Processing**: Process chunks concurrently when possible
2. **Caching**: Store embeddings and intermediate results
3. **Batch Operations**: Group similar operations together
4. **Progressive Loading**: Process large documents in streaming fashion
5. **Model Selection**: Choose appropriate models for task complexity

File: skills/chunking-strategy/references/tools.md (867 lines, new file)
# Recommended Libraries and Frameworks

This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.

## Core Chunking Libraries

### LangChain

**Overview**: Comprehensive framework for building applications with large language models; includes robust text splitting utilities.

**Installation**:
```bash
pip install langchain langchain-text-splitters
```

**Key Features**:
- Multiple text splitting strategies
- Integration with various document loaders
- Support for different content types (code, markdown, etc.)
- Customizable separators and parameters

**Example Usage**:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter
)

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(large_text)

# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
```

**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types

**Cons**:
- Can be a heavy dependency for simple use cases
- Some advanced features require the LangChain ecosystem

### LlamaIndex

**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.

**Installation**:
```bash
pip install llama-index
```

**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases

**Example Usage**:

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Basic sentence splitting
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```

**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support

**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires an embedding model setup

### Unstructured

**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.

**Installation**:
```bash
pip install "unstructured[pdf,png,jpg]"
```

**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing

**Example Usage**:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition document by type
elements = partition(filename="document.pdf")

# Chunk by title/heading structure
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=2000,
    max_characters=10000,
    new_after_n_chars=1500,
    multipage_sections=True
)

# Access chunked content
for chunk in chunks:
    print(f"Category: {chunk.category}")
    print(f"Content: {chunk.text[:200]}...")
```

**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities

**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats

## Text Processing Libraries

### NLTK (Natural Language Toolkit)

**Installation**:
```bash
pip install nltk
```

**Key Features**:
- Sentence tokenization
- Language detection
- Text preprocessing
- Linguistic analysis

**Example Usage**:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download required data
nltk.download('punkt')
nltk.download('stopwords')

# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```

### spaCy

**Installation**:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```

**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection

**Example Usage**:

```python
import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("This is a sample sentence. This is another sentence.")

# Extract sentences
sentences = [sent.text for sent in doc.sents]

# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Dependency parsing for better chunking
for token in doc:
    print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```

### Sentence Transformers

**Installation**:
```bash
pip install sentence-transformers
```

**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training

**Example Usage**:

```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)

# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])

# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    embeddings = model.encode(sentences)

    boundaries = [0]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            boundaries.append(i)

    return boundaries
```
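
The boundary indices can then be folded back into text chunks; a small helper sketch building on `find_semantic_boundaries` above:

```python
def boundaries_to_chunks(text, model, threshold=0.8):
    # Re-split with the same naive rule used by find_semantic_boundaries
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    boundaries = find_semantic_boundaries(text, model, threshold)

    chunks = []
    # Pair each boundary with the next one (or the end of the document)
    for start, end in zip(boundaries, boundaries[1:] + [len(sentences)]):
        chunks.append('. '.join(sentences[start:end]) + '.')
    return chunks
```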

## Vector Databases and Search

### ChromaDB

**Installation**:
```bash
pip install chromadb
```

**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering

**Example Usage**:

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add chunks
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    metadatas=[chunk.get("metadata", {}) for chunk in chunks],
    ids=[chunk["id"] for chunk in chunks]
)

# Search
results = collection.query(
    query_texts=["What is chunking?"],
    n_results=5
)
```

### Pinecone

**Installation**:
```bash
pip install pinecone-client
```

**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure

**Example Usage**:

```python
import pinecone
from sentence_transformers import SentenceTransformer

# Initialize
pinecone.init(api_key="your-api-key", environment="your-environment")
index_name = "document-chunks"

# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,  # Match embedding model
        metric="cosine"
    )

index = pinecone.Index(index_name)

# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
    embedding = model.encode(chunk["content"])
    index.upsert(
        vectors=[{
            "id": chunk["id"],
            "values": embedding.tolist(),
            "metadata": chunk.get("metadata", {})
        }]
    )

# Search
query_embedding = model.encode("search query")
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)
```

### Weaviate

**Installation**:
```bash
pip install weaviate-client
```

**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation

**Example Usage**:

```python
import weaviate

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
client.schema.create_class({
    "class": "DocumentChunk",
    "description": "A chunk of document content",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
})

# Add data
for chunk in chunks:
    client.data_object.create(
        data_object={
            "content": chunk["content"],
            "source": chunk.get("source", "unknown")
        },
        class_name="DocumentChunk"
    )

# Search
results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_near_text({
    "concepts": ["search query"]
}).with_limit(5).do()
```

## Evaluation and Testing

### RAGAS

**Installation**:
```bash
pip install ragas
```

**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation

**Example Usage**:

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
dataset = Dataset.from_dict({
    "question": ["What is chunking?"],
    "answer": ["Chunking is the process of breaking large documents into smaller segments"],
    "contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
    "ground_truth": ["Chunking is a document processing technique"]
})

# Evaluate
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_relevancy,
        context_recall
    ]
)

print(result)
```

### TruEra (TruLens)

**Installation**:
```bash
pip install trulens trulens-apps
```

**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring

**Example Usage**:

```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement

# Initialize session
session = TruSession()

# Define feedback functions (ground_truth is a placeholder dataset)
f_groundedness = GroundTruthAgreement(ground_truth)

# Evaluate chunks; chunk_function, search_function, and generate_function
# are placeholders for your own pipeline components
@instrument
def chunk_and_query(text, query):
    chunks = chunk_function(text)
    relevant_chunks = search_function(chunks, query)
    answer = generate_function(relevant_chunks, query)
    return answer

# Record evaluation
with session:
    chunk_and_query("large document text", "what is the main topic?")
```

## Document Processing

### PyPDF2

**Installation**:
```bash
pip install PyPDF2
```

**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing

**Example Usage**:

```python
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Extract text by page for better chunking
def extract_pages(pdf_path):
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            pages.append({
                "page_number": i + 1,
                "content": page.extract_text()
            })
    return pages
```

### python-docx

**Installation**:
```bash
pip install python-docx
```

**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access

**Example Usage**:

```python
from docx import Document

def extract_from_docx(docx_path):
    doc = Document(docx_path)
    content = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            content.append({
                "type": "paragraph",
                "text": paragraph.text,
                "style": paragraph.style.name
            })

    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_text = [cell.text for cell in row.cells]
            table_text.append(" | ".join(row_text))

        content.append({
            "type": "table",
            "text": "\n".join(table_text)
        })

    return content
```

## Specialized Libraries

### tiktoken (OpenAI)

**Installation**:
```bash
pip install tiktoken
```

**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Language-model-specific tokenization

**Example Usage**:

```python
import tiktoken

# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")

# Decode tokens
text = encoding.decode(tokens)

# Count tokens without keeping the encoded result
def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    return chunks
```

### PDFMiner

**Installation**:
```bash
pip install pdfminer.six
```

**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction

**Example Usage**:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_structured_text(pdf_path):
    structured_content = []

    for page_layout in extract_pages(pdf_path):
        page_content = []

        for element in page_layout:
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                # Font names live on per-character LTChar objects, not on
                # the container, so fall back to an empty string here;
                # element height is only a coarse proxy for font size
                fontname = getattr(element, "fontname", "")
                font_info = {
                    "font_size": element.height,
                    "is_bold": "Bold" in fontname,
                    "x0": element.x0,
                    "y0": element.y0
                }
                page_content.append({
                    "text": text.strip(),
                    "font_info": font_info
                })

        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })

    return structured_content
```

## Performance and Optimization

### Dask

**Installation**:
```bash
pip install "dask[complete]"
```

**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas

**Example Usage**:

```python
import dask.bag as db
from dask.distributed import Client

# Set up a local distributed client with four worker processes
client = Client(n_workers=4)

# Parallel chunking of multiple documents
def chunk_document(document):
    # Placeholder: call your chunking strategy of choice here
    return chunk_function(document)

# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...]  # list of document contents
document_bag = db.from_sequence(documents)

# Apply the chunking function lazily across the bag
chunked_documents = document_bag.map(chunk_document)

# Trigger computation and gather results
results = chunked_documents.compute()
```
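
The pandas integration listed under Key Features means tabular corpora never have to leave a dataframe. A sketch, assuming a hypothetical `documents.csv` with one document per row in a `text` column, reusing the `chunk_document` placeholder above:

```python
import dask.dataframe as dd

# Lazily read the corpus; nothing is loaded until compute()
df = dd.read_csv("documents.csv")

# meta tells Dask the new column holds arbitrary Python objects (lists of chunks)
df["chunks"] = df["text"].map(chunk_document, meta=("chunks", "object"))

chunked = df.compute()  # materialize the result as a pandas DataFrame
```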

### Ray

**Installation**:
```bash
pip install ray
```

**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration

**Example Usage**:

```python
import ray

# Initialize Ray
ray.init()

@ray.remote
class ChunkingWorker:
    def __init__(self, strategy):
        # strategy is any object exposing a .chunk(text) method
        self.strategy = strategy

    def chunk_documents(self, documents):
        results = []
        for doc in documents:
            chunks = self.strategy.chunk(doc)
            results.append(chunks)
        return results

# Create four workers sharing the same (pre-built) strategy object
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]

# Distribute work: split the corpus into four interleaved batches
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
           for worker, batch in zip(workers, documents_batch)]

# Get results (blocks until all workers finish)
results = ray.get(futures)
```
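
The actor here is deliberate: unlike a plain `@ray.remote` function, an actor keeps per-worker state alive across calls, so an expensive resource inside `strategy` (for example, a loaded embedding model) is initialized once per worker rather than once per document. Call `ray.shutdown()` when finished to release cluster resources.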

## Development and Testing

### pytest

**Installation**:
```bash
pip install pytest pytest-asyncio
```

**Example Tests**:

```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker  # hypothetical module

class TestFixedSizeChunker:
    def test_respects_chunk_size(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 50 words

        chunks = chunker.chunk(text)

        for chunk in chunks:
            assert len(chunk.split()) <= 100  # account for word boundaries

    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        # Use distinct words; with a repeated word the set intersection
        # below would always be 1 and the assertion could never pass
        text = " ".join(f"word{i}" for i in range(30))

        chunks = chunker.chunk(text)

        # Check overlap between consecutive chunks
        for i in range(1, len(chunks)):
            chunk1_words = set(chunks[i - 1].split()[-10:])
            chunk2_words = set(chunks[i].split()[:10])
            overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 5  # allow some tolerance

@pytest.mark.asyncio
async def test_semantic_chunker():
    chunker = SemanticChunker()
    text = "First topic sentence. Another sentence about first topic. " \
           "Now switching to second topic. More about second topic."

    chunks = await chunker.chunk_async(text)

    # Should detect the topic change and create a boundary there
    assert len(chunks) >= 2
    assert "first topic" in chunks[0].lower()
    assert "second topic" in chunks[1].lower()
```
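
Degenerate inputs are where chunkers usually break; parametrized tests against the same hypothetical module are cheap to add:

```python
import pytest
from your_chunking_module import FixedSizeChunker  # hypothetical module, as above

@pytest.mark.parametrize("text", ["", "   ", "single"])
def test_degenerate_inputs(text):
    chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
    chunks = chunker.chunk(text)
    # No chunk should be empty, and tiny inputs should yield at most one chunk
    assert all(chunk.strip() for chunk in chunks)
    assert len(chunks) <= 1
```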

### Memory Profiler

**Installation**:
```bash
pip install memory-profiler
```

**Example Usage**:

```python
from memory_profiler import profile
from your_chunking_module import FixedSizeChunker  # hypothetical module, as above

@profile
def chunk_large_document():
    chunker = FixedSizeChunker(chunk_size=1000)
    large_text = "word " * 100000  # large synthetic document
    chunks = chunker.chunk(large_text)
    return chunks

# Run with: python -m memory_profiler your_script.py
```
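
For memory usage over time rather than per-line deltas, the same package ships the `mprof` command-line tool: `mprof run your_script.py` records a profile and `mprof plot` graphs it (plotting requires matplotlib).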

Together, these libraries cover the full lifecycle of a chunking pipeline: extraction, tokenization, parallel processing, testing, and profiling, whether for simple text processing or a production-grade RAG system.