Initial commit

Zhongwei Li
2025-11-29 18:28:34 +08:00
commit 390afca02b
220 changed files with 86013 additions and 0 deletions


@@ -0,0 +1,194 @@
---
name: chunking-strategy
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
allowed-tools: Read, Write, Bash
category: artificial-intelligence
tags: [rag, chunking, vector-search, embeddings, document-processing]
version: 1.0.0
---
# Chunking Strategy for RAG Systems
## Overview
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
## When to Use
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
## Instructions
### Choose Chunking Strategy
Select appropriate chunking strategy based on document type and use case:
1. **Fixed-Size Chunking** (Level 1)
- Use for simple documents without clear structure
- Start with 512 tokens and 10-20% overlap
- Adjust size based on query type: 256 for factoid, 1024 for analytical
2. **Recursive Character Chunking** (Level 2)
- Use for documents with clear structural boundaries
- Implement hierarchical separators: paragraphs → sentences → words
- Customize separators for document types (HTML, Markdown)
3. **Structure-Aware Chunking** (Level 3)
- Use for structured documents (Markdown, code, tables, PDFs)
- Preserve semantic units: functions, sections, table blocks
- Validate structure preservation post-splitting
4. **Semantic Chunking** (Level 4)
- Use for complex documents with thematic shifts
- Implement embedding-based boundary detection
   - Configure a similarity threshold (e.g., 0.8) and buffer size (3-5 sentences)
5. **Advanced Methods** (Level 5)
- Use Late Chunking for long-context embedding models
- Apply Contextual Retrieval for high-precision requirements
- Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
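A minimal sketch of Level 2 with document-type-specific separators (here Markdown; the separator list and `markdown_text` variable are illustrative assumptions to tune per corpus):
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hierarchical separators tried in order: headings first, then paragraphs,
# sentences, and finally single spaces as a last resort
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = markdown_splitter.split_text(markdown_text)
```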
### Implement Chunking Pipeline
Follow these steps to implement effective chunking:
1. **Pre-process documents**
- Analyze document structure and content types
- Identify multi-modal content (tables, images, code)
- Assess information density and complexity
2. **Select strategy parameters**
- Choose chunk size based on embedding model context window
- Set overlap percentage (10-20% for most cases)
- Configure strategy-specific parameters
3. **Process and validate**
- Apply chosen chunking strategy
- Validate semantic coherence of chunks
- Test with representative documents
4. **Evaluate and iterate**
- Measure retrieval precision and recall
- Monitor processing latency and resource usage
- Optimize based on specific use case requirements
Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).
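A minimal sketch wiring these four steps together (the 5,000-token and 20-word thresholds are illustrative assumptions, not tuned values):
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunking_pipeline(document: str, embed_context_window: int = 512):
    # 1. Pre-process: cheap structural signals
    has_headings = "\n#" in document
    est_tokens = int(len(document.split()) * 1.3)

    # 2. Select parameters within the embedding model's context window
    chunk_size = min(embed_context_window, 1024 if est_tokens > 5000 else 512)
    overlap = int(chunk_size * 0.15)  # within the 10-20% guideline

    # 3. Process: separator-aware splitting for structured documents
    separators = ["\n## ", "\n\n", "\n", " "] if has_headings else ["\n\n", "\n", " "]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap, separators=separators)
    chunks = splitter.split_text(document)

    # 4. Validate: drop fragments too short to stand alone
    return [c for c in chunks if len(c.split()) >= 20]
```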
### Evaluate Performance
Use these metrics to evaluate chunking effectiveness:
- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
- **End-to-End Accuracy**: Quality of final RAG responses
- **Processing Time**: Latency impact on overall system
- **Resource Usage**: Memory and computational costs
Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).
## Examples
### Basic Fixed-Size Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks with ~10% overlap.
# Note: length_function=len measures characters; pass a tokenizer-based
# length function instead to work in tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len
)
chunks = splitter.split_documents(documents)  # `documents`: list of LangChain Documents
```
### Structure-Aware Code Chunking
```python
import ast

def chunk_python_code(code):
    """Split Python code into semantic chunks (top-level functions and classes)"""
    tree = ast.parse(code)
    chunks = []
    # Top-level nodes only: ast.walk() would also yield nested definitions,
    # producing duplicate, overlapping chunks
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
```
### Semantic Chunking with Embeddings
```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text based on semantic boundaries"""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
## Best Practices
### Core Principles
- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial
### Implementation Guidelines
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics
### Common Pitfalls to Avoid
- Over-chunking: Creating too many small, context-poor chunks
- Under-chunking: Missing relevant information due to oversized chunks
- Ignoring document structure and semantic boundaries
- Using one-size-fits-all approach for diverse content types
- Neglecting overlap for boundary-crossing information
## Constraints
### Resource Considerations
- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing
### Quality Requirements
- Validate chunk semantic coherence post-processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases
## References
Reference detailed documentation in the [references/](references/) folder:
- [strategies.md](references/strategies.md) - Detailed strategy implementations
- [implementation.md](references/implementation.md) - Complete implementation guidelines
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
- [tools.md](references/tools.md) - Recommended libraries and frameworks
- [research.md](references/research.md) - Key research papers and findings
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools



@@ -0,0 +1,904 @@
# Performance Evaluation Framework
This document provides comprehensive methodologies for evaluating chunking strategy performance and effectiveness.
## Evaluation Metrics
### Core Retrieval Metrics
#### Retrieval Precision
Measures the fraction of retrieved chunks that are relevant to the query.
```python
from typing import Dict, List

def calculate_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval precision
Precision = |Relevant ∩ Retrieved| / |Retrieved|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not retrieved_ids:
return 0.0
return len(intersection) / len(retrieved_ids)
```
#### Retrieval Recall
Measures the fraction of relevant chunks that are successfully retrieved.
```python
def calculate_recall(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate retrieval recall
Recall = |Relevant ∩ Retrieved| / |Relevant|
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
intersection = retrieved_ids & relevant_ids
if not relevant_ids:
return 0.0
return len(intersection) / len(relevant_ids)
```
#### F1-Score
Harmonic mean of precision and recall.
```python
def calculate_f1_score(precision: float, recall: float) -> float:
"""
Calculate F1-score
F1 = 2 * (Precision * Recall) / (Precision + Recall)
"""
if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)
```
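A quick worked example of these three metrics on toy data:
```python
retrieved = [{'id': 'c1'}, {'id': 'c2'}, {'id': 'c3'}, {'id': 'c4'}]
relevant = [{'id': 'c2'}, {'id': 'c4'}, {'id': 'c7'}]

p = calculate_precision(retrieved, relevant)   # 2/4 = 0.50
r = calculate_recall(retrieved, relevant)      # 2/3 ≈ 0.67
f1 = calculate_f1_score(p, r)                  # ≈ 0.57
```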
#### Mean Reciprocal Rank (MRR)
Measures the rank of the first relevant result.
```python
def calculate_mrr(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Reciprocal Rank
"""
reciprocal_ranks = []
for query, query_results in zip(queries, results):
relevant_found = False
for rank, result in enumerate(query_results, 1):
if result.get('is_relevant', False):
reciprocal_ranks.append(1.0 / rank)
relevant_found = True
break
if not relevant_found:
reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```
#### Mean Average Precision (MAP)
Considers both precision and the ranking of relevant documents.
```python
def calculate_average_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
"""
Calculate Average Precision for a single query
"""
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
if not relevant_ids:
return 0.0
precisions = []
relevant_count = 0
for rank, chunk in enumerate(retrieved_chunks, 1):
if chunk.get('id') in relevant_ids:
relevant_count += 1
precision_at_rank = relevant_count / rank
precisions.append(precision_at_rank)
return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
def calculate_map(queries: List[Dict], results: List[List[Dict]]) -> float:
"""
Calculate Mean Average Precision across multiple queries
"""
average_precisions = []
for query, query_results in zip(queries, results):
ap = calculate_average_precision(query_results, query.get('relevant_chunks', []))
average_precisions.append(ap)
return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
```
#### Normalized Discounted Cumulative Gain (NDCG)
Measures ranking quality with emphasis on highly relevant results.
```python
import numpy as np

def calculate_dcg(retrieved_chunks: List[Dict]) -> float:
"""
Calculate Discounted Cumulative Gain
"""
dcg = 0.0
for rank, chunk in enumerate(retrieved_chunks, 1):
relevance = chunk.get('relevance_score', 0)
dcg += relevance / np.log2(rank + 1)
return dcg
def calculate_ndcg(retrieved_chunks: List[Dict], ideal_chunks: List[Dict]) -> float:
"""
Calculate Normalized Discounted Cumulative Gain
"""
dcg = calculate_dcg(retrieved_chunks)
idcg = calculate_dcg(ideal_chunks)
if idcg == 0:
return 0.0
return dcg / idcg
```
## End-to-End RAG Evaluation
### Answer Quality Metrics
#### Factual Consistency
Measures how well the generated answer aligns with retrieved chunks.
```python
from typing import List
from transformers import pipeline

class FactualConsistencyEvaluator:
    def __init__(self):
        # NLI classifier; labels are CONTRADICTION / NEUTRAL / ENTAILMENT
        self.nli_pipeline = pipeline("text-classification",
                                     model="roberta-large-mnli",
                                     top_k=None)
    def evaluate_consistency(self, answer: str, retrieved_chunks: List[str]) -> float:
        """
        Evaluate factual consistency between answer and retrieved context
        """
        if not retrieved_chunks:
            return 0.0
        # Combine retrieved chunks as context
        context = " ".join(retrieved_chunks[:3])  # Use top 3 chunks
        # NLI takes a (premise, hypothesis) sentence pair
        output = self.nli_pipeline({"text": context, "text_pair": answer})
        # Some transformers versions nest the per-label list one level deeper
        results = output[0] if isinstance(output[0], list) else output
        scores = {item['label']: item['score'] for item in results}
        top_label = max(scores, key=scores.get)
        if top_label == 'ENTAILMENT':
            return scores['ENTAILMENT']
        if top_label == 'CONTRADICTION':
            return 1.0 - scores['CONTRADICTION']
        return 0.5  # Neutral if NLI is inconclusive
```
#### Answer Completeness
Measures how completely the answer addresses the user's query.
```python
def evaluate_completeness(answer: str, query: str, reference_answer: str = None) -> float:
"""
Evaluate answer completeness
"""
# Extract key entities from query
query_entities = extract_entities(query)
answer_entities = extract_entities(answer)
# Calculate entity coverage
if not query_entities:
return 0.5 # Neutral if no entities in query
covered_entities = query_entities & answer_entities
entity_coverage = len(covered_entities) / len(query_entities)
# If reference answer is available, compare against it
if reference_answer:
reference_entities = extract_entities(reference_answer)
answer_reference_overlap = len(answer_entities & reference_entities) / max(len(reference_entities), 1)
return (entity_coverage + answer_reference_overlap) / 2
return entity_coverage
def extract_entities(text: str) -> set:
"""
Extract named entities from text (simplified)
"""
# This would use a proper NER model in practice
import re
# Simple noun phrase extraction as placeholder
noun_phrases = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
return set(noun_phrases)
```
#### Response Relevance
Measures how relevant the answer is to the original query.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class RelevanceEvaluator:
def __init__(self, model_name="all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def evaluate_relevance(self, query: str, answer: str) -> float:
"""
Evaluate semantic relevance between query and answer
"""
# Generate embeddings
query_embedding = self.model.encode([query])
answer_embedding = self.model.encode([answer])
# Calculate cosine similarity
similarity = cosine_similarity(query_embedding, answer_embedding)[0][0]
return float(similarity)
```
## Performance Metrics
### Processing Time
```python
import time
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class PerformanceMetrics:
total_time: float
chunking_time: float
embedding_time: float
search_time: float
generation_time: float
throughput: float # documents per second
class PerformanceProfiler:
def __init__(self):
self.timings = {}
self.start_times = {}
def start_timer(self, operation: str):
self.start_times[operation] = time.time()
def end_timer(self, operation: str):
if operation in self.start_times:
duration = time.time() - self.start_times[operation]
if operation not in self.timings:
self.timings[operation] = []
self.timings[operation].append(duration)
return duration
return 0.0
def get_performance_metrics(self, document_count: int) -> PerformanceMetrics:
total_time = sum(sum(times) for times in self.timings.values())
return PerformanceMetrics(
total_time=total_time,
chunking_time=sum(self.timings.get('chunking', [0])),
embedding_time=sum(self.timings.get('embedding', [0])),
search_time=sum(self.timings.get('search', [0])),
generation_time=sum(self.timings.get('generation', [0])),
throughput=document_count / total_time if total_time > 0 else 0
)
```
### Memory Usage
```python
import psutil
import os
import time
from typing import Dict, List
class MemoryProfiler:
def __init__(self):
self.process = psutil.Process(os.getpid())
self.memory_snapshots = []
def take_memory_snapshot(self, label: str):
"""Take a snapshot of current memory usage"""
memory_info = self.process.memory_info()
memory_mb = memory_info.rss / 1024 / 1024 # Convert to MB
self.memory_snapshots.append({
'label': label,
'memory_mb': memory_mb,
'timestamp': time.time()
})
def get_peak_memory_usage(self) -> float:
"""Get peak memory usage in MB"""
if not self.memory_snapshots:
return 0.0
return max(snapshot['memory_mb'] for snapshot in self.memory_snapshots)
def get_memory_usage_by_operation(self) -> Dict[str, float]:
"""Get memory usage breakdown by operation"""
if not self.memory_snapshots:
return {}
memory_by_op = {}
for i in range(1, len(self.memory_snapshots)):
prev_snapshot = self.memory_snapshots[i-1]
curr_snapshot = self.memory_snapshots[i]
operation = curr_snapshot['label']
memory_delta = curr_snapshot['memory_mb'] - prev_snapshot['memory_mb']
if operation not in memory_by_op:
memory_by_op[operation] = []
memory_by_op[operation].append(memory_delta)
return {op: sum(deltas) for op, deltas in memory_by_op.items()}
```
## Evaluation Datasets
### Standardized Test Sets
#### Question-Answer Pairs
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class EvaluationQuery:
id: str
question: str
reference_answer: Optional[str]
relevant_chunk_ids: List[str]
query_type: str # factoid, analytical, comparative
difficulty: str # easy, medium, hard
domain: str # finance, medical, legal, technical
class EvaluationDataset:
def __init__(self, name: str):
self.name = name
self.queries: List[EvaluationQuery] = []
self.documents: Dict[str, str] = {}
self.chunks: Dict[str, Dict] = {}
def add_query(self, query: EvaluationQuery):
self.queries.append(query)
def add_document(self, doc_id: str, content: str):
self.documents[doc_id] = content
def add_chunk(self, chunk_id: str, content: str, doc_id: str, metadata: Dict):
self.chunks[chunk_id] = {
'id': chunk_id,
'content': content,
'doc_id': doc_id,
'metadata': metadata
}
def save_to_file(self, filepath: str):
data = {
'name': self.name,
'queries': [
{
'id': q.id,
'question': q.question,
'reference_answer': q.reference_answer,
'relevant_chunk_ids': q.relevant_chunk_ids,
'query_type': q.query_type,
'difficulty': q.difficulty,
'domain': q.domain
}
for q in self.queries
],
'documents': self.documents,
'chunks': self.chunks
}
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
@classmethod
def load_from_file(cls, filepath: str):
with open(filepath, 'r') as f:
data = json.load(f)
dataset = cls(data['name'])
dataset.documents = data['documents']
dataset.chunks = data['chunks']
for q_data in data['queries']:
query = EvaluationQuery(
id=q_data['id'],
question=q_data['question'],
reference_answer=q_data.get('reference_answer'),
relevant_chunk_ids=q_data['relevant_chunk_ids'],
query_type=q_data['query_type'],
difficulty=q_data['difficulty'],
domain=q_data['domain']
)
dataset.add_query(query)
return dataset
```
### Dataset Generation
#### Synthetic Query Generation
```python
import random
from typing import List, Dict
class SyntheticQueryGenerator:
def __init__(self):
self.query_templates = {
'factoid': [
"What is {concept}?",
"When did {event} occur?",
"Who developed {technology}?",
"How many {items} are mentioned?",
"What is the value of {metric}?"
],
'analytical': [
"Compare and contrast {concept1} and {concept2}.",
"Analyze the impact of {concept} on {domain}.",
"What are the advantages and disadvantages of {technology}?",
"Explain the relationship between {concept1} and {concept2}.",
"Evaluate the effectiveness of {approach} for {problem}."
],
'comparative': [
"Which is better: {option1} or {option2}?",
"How does {method1} differ from {method2}?",
"Compare the performance of {system1} and {system2}.",
"What are the key differences between {approach1} and {approach2}?"
]
}
def generate_queries_from_chunks(self, chunks: List[Dict], num_queries: int = 100) -> List[EvaluationQuery]:
"""Generate synthetic queries from document chunks"""
queries = []
# Extract entities and concepts from chunks
entities = self._extract_entities_from_chunks(chunks)
for i in range(num_queries):
query_type = random.choice(['factoid', 'analytical', 'comparative'])
template = random.choice(self.query_templates[query_type])
# Fill template with extracted entities
query_text = self._fill_template(template, entities)
# Find relevant chunks for this query
relevant_chunks = self._find_relevant_chunks(query_text, chunks)
query = EvaluationQuery(
id=f"synthetic_{i}",
question=query_text,
reference_answer=None, # Would need generation model
relevant_chunk_ids=[chunk['id'] for chunk in relevant_chunks],
query_type=query_type,
difficulty=random.choice(['easy', 'medium', 'hard']),
domain='synthetic'
)
queries.append(query)
return queries
def _extract_entities_from_chunks(self, chunks: List[Dict]) -> Dict[str, List[str]]:
"""Extract entities, concepts, and relationships from chunks"""
# This would use proper NER in practice
entities = {
'concepts': [],
'technologies': [],
'methods': [],
'metrics': [],
'events': []
}
for chunk in chunks:
content = chunk['content']
# Simplified entity extraction
words = content.split()
entities['concepts'].extend([word for word in words if len(word) > 6])
entities['technologies'].extend([word for word in words if 'technology' in word.lower()])
entities['methods'].extend([word for word in words if 'method' in word.lower()])
entities['metrics'].extend([word for word in words if '%' in word or '$' in word])
# Remove duplicates and limit
for key in entities:
entities[key] = list(set(entities[key]))[:50]
return entities
def _fill_template(self, template: str, entities: Dict[str, List[str]]) -> str:
"""Fill query template with random entities"""
import re
def replace_placeholder(match):
placeholder = match.group(1)
# Map placeholders to entity types
entity_mapping = {
'concept': 'concepts',
'concept1': 'concepts',
'concept2': 'concepts',
'technology': 'technologies',
'method': 'methods',
'method1': 'methods',
'method2': 'methods',
'metric': 'metrics',
'event': 'events',
'items': 'concepts',
'option1': 'concepts',
'option2': 'concepts',
'approach': 'methods',
'problem': 'concepts',
'domain': 'concepts',
'system1': 'concepts',
'system2': 'concepts'
}
entity_type = entity_mapping.get(placeholder, 'concepts')
available_entities = entities.get(entity_type, ['something'])
if available_entities:
return random.choice(available_entities)
else:
return 'something'
return re.sub(r'\{(\w+)\}', replace_placeholder, template)
def _find_relevant_chunks(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
"""Find chunks most relevant to the query"""
# Simple keyword matching for synthetic generation
query_words = set(query.lower().split())
chunk_scores = []
for chunk in chunks:
chunk_words = set(chunk['content'].lower().split())
overlap = len(query_words & chunk_words)
chunk_scores.append((overlap, chunk))
# Sort by overlap and return top k
chunk_scores.sort(key=lambda x: x[0], reverse=True)
return [chunk for _, chunk in chunk_scores[:k]]
```
## A/B Testing Framework
### Statistical Significance Testing
```python
import numpy as np
from scipy import stats
from typing import List, Dict, Tuple
class ABTestAnalyzer:
def __init__(self):
self.significance_level = 0.05
def compare_metrics(self, control_metrics: List[float],
treatment_metrics: List[float],
metric_name: str) -> Dict:
"""
Compare metrics between control and treatment groups
"""
control_mean = np.mean(control_metrics)
treatment_mean = np.mean(treatment_metrics)
        # Sample standard deviation (ddof=1), matching the pooled-variance formula below
        control_std = np.std(control_metrics, ddof=1)
        treatment_std = np.std(treatment_metrics, ddof=1)
# Perform t-test
t_statistic, p_value = stats.ttest_ind(control_metrics, treatment_metrics)
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(control_metrics) - 1) * control_std**2 +
(len(treatment_metrics) - 1) * treatment_std**2) /
(len(control_metrics) + len(treatment_metrics) - 2))
cohens_d = (treatment_mean - control_mean) / pooled_std if pooled_std > 0 else 0
# Determine significance
is_significant = p_value < self.significance_level
return {
'metric_name': metric_name,
'control_mean': control_mean,
'treatment_mean': treatment_mean,
'absolute_difference': treatment_mean - control_mean,
'relative_difference': ((treatment_mean - control_mean) / control_mean * 100) if control_mean != 0 else 0,
'control_std': control_std,
'treatment_std': treatment_std,
't_statistic': t_statistic,
'p_value': p_value,
'is_significant': is_significant,
'effect_size': cohens_d,
'significance_level': self.significance_level
}
def analyze_ab_test_results(self,
control_results: Dict[str, List[float]],
treatment_results: Dict[str, List[float]]) -> Dict:
"""
Analyze A/B test results across multiple metrics
"""
analysis_results = {}
# Ensure both dictionaries have the same keys
all_metrics = set(control_results.keys()) & set(treatment_results.keys())
for metric in all_metrics:
if metric in control_results and metric in treatment_results:
analysis_results[metric] = self.compare_metrics(
control_results[metric],
treatment_results[metric],
metric
)
# Calculate overall summary
significant_improvements = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] > 0)
significant_degradations = sum(1 for result in analysis_results.values()
if result['is_significant'] and result['relative_difference'] < 0)
analysis_results['summary'] = {
'total_metrics_compared': len(analysis_results),
'significant_improvements': significant_improvements,
'significant_degradations': significant_degradations,
'no_significant_change': len(analysis_results) - significant_improvements - significant_degradations
}
return analysis_results
```
## Automated Evaluation Pipeline
### End-to-End Evaluation
```python
import random
from typing import Any, Dict, List

import numpy as np

class ChunkingEvaluationPipeline:
def __init__(self, strategies: Dict[str, Any], dataset: EvaluationDataset):
self.strategies = strategies
self.dataset = dataset
self.results = {}
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
def run_evaluation(self) -> Dict:
"""Run comprehensive evaluation of all strategies"""
evaluation_results = {}
for strategy_name, strategy in self.strategies.items():
print(f"Evaluating strategy: {strategy_name}")
# Reset profilers for each strategy
self.profiler = PerformanceProfiler()
self.memory_profiler = MemoryProfiler()
# Evaluate strategy
strategy_results = self._evaluate_strategy(strategy, strategy_name)
evaluation_results[strategy_name] = strategy_results
# Compare strategies
comparison_results = self._compare_strategies(evaluation_results)
return {
'individual_results': evaluation_results,
'comparison': comparison_results,
'recommendations': self._generate_recommendations(comparison_results)
}
def _evaluate_strategy(self, strategy: Any, strategy_name: str) -> Dict:
"""Evaluate a single chunking strategy"""
results = {
'strategy_name': strategy_name,
'retrieval_metrics': {},
'quality_metrics': {},
'performance_metrics': {}
}
# Track memory usage
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_start")
# Process all documents
self.profiler.start_timer('total_processing')
all_chunks = {}
for doc_id, content in self.dataset.documents.items():
self.profiler.start_timer('chunking')
chunks = strategy.chunk(content)
self.profiler.end_timer('chunking')
all_chunks[doc_id] = chunks
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_chunking")
# Generate embeddings for chunks
self.profiler.start_timer('embedding')
chunk_embeddings = self._generate_embeddings(all_chunks)
self.profiler.end_timer('embedding')
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_embedding")
# Evaluate retrieval performance
retrieval_results = self._evaluate_retrieval(all_chunks, chunk_embeddings)
results['retrieval_metrics'] = retrieval_results
# Evaluate chunk quality
quality_results = self._evaluate_chunk_quality(all_chunks)
results['quality_metrics'] = quality_results
# Get performance metrics
self.profiler.end_timer('total_processing')
performance_metrics = self.profiler.get_performance_metrics(len(self.dataset.documents))
results['performance_metrics'] = performance_metrics.__dict__
# Get memory metrics
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_end")
results['memory_metrics'] = {
'peak_memory_mb': self.memory_profiler.get_peak_memory_usage(),
'memory_by_operation': self.memory_profiler.get_memory_usage_by_operation()
}
return results
def _evaluate_retrieval(self, all_chunks: Dict, chunk_embeddings: Dict) -> Dict:
"""Evaluate retrieval performance"""
        # MRR and MAP could be added here via calculate_mrr/calculate_map
        retrieval_metrics = {
            'precision': [],
            'recall': [],
            'f1_score': []
        }
for query in self.dataset.queries:
# Perform retrieval
self.profiler.start_timer('search')
retrieved_chunks = self._retrieve_chunks(query.question, chunk_embeddings, k=10)
self.profiler.end_timer('search')
            # Build the full relevant set for this query; intersecting it with the
            # retrieved list here would make recall trivially 1.0
            relevant_chunks = [{'id': cid} for cid in query.relevant_chunk_ids]
# Calculate metrics
precision = calculate_precision(retrieved_chunks, relevant_chunks)
recall = calculate_recall(retrieved_chunks, relevant_chunks)
f1 = calculate_f1_score(precision, recall)
retrieval_metrics['precision'].append(precision)
retrieval_metrics['recall'].append(recall)
retrieval_metrics['f1_score'].append(f1)
# Calculate averages
return {metric: np.mean(values) for metric, values in retrieval_metrics.items()}
def _evaluate_chunk_quality(self, all_chunks: Dict) -> Dict:
"""Evaluate quality of generated chunks"""
quality_assessor = ChunkQualityAssessor()
quality_scores = []
for doc_id, chunks in all_chunks.items():
# Analyze document
content = self.dataset.documents[doc_id]
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze(content)
# Assess chunk quality
scores = quality_assessor.assess_chunks(chunks, analysis)
quality_scores.append(scores)
# Aggregate quality scores
if quality_scores:
avg_scores = {}
for metric in quality_scores[0].keys():
avg_scores[metric] = np.mean([scores[metric] for scores in quality_scores])
return avg_scores
return {}
def _compare_strategies(self, evaluation_results: Dict) -> Dict:
"""Compare performance across strategies"""
ab_analyzer = ABTestAnalyzer()
comparison = {}
# Compare each metric across strategies
strategy_names = list(evaluation_results.keys())
for i in range(len(strategy_names)):
for j in range(i + 1, len(strategy_names)):
strategy1 = strategy_names[i]
strategy2 = strategy_names[j]
comparison_key = f"{strategy1}_vs_{strategy2}"
comparison[comparison_key] = {}
                # Compare retrieval metrics. Note: these are single aggregate values;
                # for a meaningful t-test, pass per-query metric lists instead.
                for metric in ['precision', 'recall', 'f1_score']:
if (metric in evaluation_results[strategy1]['retrieval_metrics'] and
metric in evaluation_results[strategy2]['retrieval_metrics']):
comparison[comparison_key][f"retrieval_{metric}"] = ab_analyzer.compare_metrics(
[evaluation_results[strategy1]['retrieval_metrics'][metric]],
[evaluation_results[strategy2]['retrieval_metrics'][metric]],
f"retrieval_{metric}"
)
return comparison
def _generate_recommendations(self, comparison_results: Dict) -> Dict:
"""Generate recommendations based on evaluation results"""
recommendations = {
'best_overall': None,
'best_for_precision': None,
'best_for_recall': None,
'best_for_performance': None,
'trade_offs': []
}
# This would analyze the comparison results and generate specific recommendations
# Implementation depends on specific use case requirements
return recommendations
def _generate_embeddings(self, all_chunks: Dict) -> Dict:
"""Generate embeddings for all chunks"""
# This would use the actual embedding model
# Placeholder implementation
embeddings = {}
for doc_id, chunks in all_chunks.items():
embeddings[doc_id] = []
for chunk in chunks:
# Generate embedding for chunk content
embedding = np.random.rand(384) # Placeholder
embeddings[doc_id].append({
'chunk': chunk,
'embedding': embedding
})
return embeddings
def _retrieve_chunks(self, query: str, chunk_embeddings: Dict, k: int = 10) -> List[Dict]:
"""Retrieve most relevant chunks for a query"""
# This would use actual similarity search
# Placeholder implementation
all_chunks = []
for doc_embeddings in chunk_embeddings.values():
for chunk_data in doc_embeddings:
all_chunks.append(chunk_data['chunk'])
# Simple random selection as placeholder
selected = random.sample(all_chunks, min(k, len(all_chunks)))
return selected
```
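A sketch of driving the pipeline; `my_fixed_strategy` and `my_semantic_strategy` are hypothetical objects exposing the one-argument `chunk(content)` interface the pipeline calls:
```python
dataset = EvaluationDataset.load_from_file("eval_dataset.json")
pipeline = ChunkingEvaluationPipeline(
    strategies={'fixed_512': my_fixed_strategy, 'semantic': my_semantic_strategy},
    dataset=dataset,
)
report = pipeline.run_evaluation()
print(report['comparison'])
```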
This comprehensive evaluation framework provides the tools needed to thoroughly assess chunking strategies across multiple dimensions: retrieval effectiveness, answer quality, system performance, and statistical significance. The modular design allows for easy extension and customization based on specific requirements and use cases.


@@ -0,0 +1,709 @@
# Complete Implementation Guidelines
This document provides comprehensive implementation guidance for building effective chunking systems.
## System Architecture
### Core Components
```
Document Processor
├── Ingestion Layer
│ ├── Document Type Detection
│ ├── Format Parsing (PDF, HTML, Markdown, etc.)
│ └── Content Extraction
├── Analysis Layer
│ ├── Structure Analysis
│ ├── Content Type Identification
│ └── Complexity Assessment
├── Strategy Selection Layer
│ ├── Rule-based Selection
│ ├── ML-based Prediction
│ └── Adaptive Configuration
├── Chunking Layer
│ ├── Strategy Implementation
│ ├── Parameter Optimization
│ └── Quality Validation
└── Output Layer
├── Chunk Metadata Generation
├── Embedding Integration
└── Storage Preparation
```
## Pre-processing Pipeline
### Document Analysis Framework
```python
from dataclasses import dataclass
from typing import List, Dict, Any
import re
@dataclass
class DocumentAnalysis:
doc_type: str
structure_score: float # 0-1, higher means more structured
complexity_score: float # 0-1, higher means more complex
content_types: List[str]
language: str
estimated_tokens: int
has_multimodal: bool
class DocumentAnalyzer:
def __init__(self):
self.structure_patterns = {
'markdown': [r'^#+\s', r'^\*\*.*\*\*$', r'^\* ', r'^\d+\. '],
'html': [r'<h[1-6]>', r'<p>', r'<div>', r'<table>'],
'latex': [r'\\section', r'\\subsection', r'\\begin\{', r'\\end\{'],
'academic': [r'^\d+\.', r'^\d+\.\d+', r'^[A-Z]\.', r'^Figure \d+']
}
def analyze(self, content: str) -> DocumentAnalysis:
doc_type = self.detect_document_type(content)
structure_score = self.calculate_structure_score(content, doc_type)
complexity_score = self.calculate_complexity_score(content)
content_types = self.identify_content_types(content)
language = self.detect_language(content)
estimated_tokens = self.estimate_tokens(content)
has_multimodal = self.detect_multimodal_content(content)
return DocumentAnalysis(
doc_type=doc_type,
structure_score=structure_score,
complexity_score=complexity_score,
content_types=content_types,
language=language,
estimated_tokens=estimated_tokens,
has_multimodal=has_multimodal
)
def detect_document_type(self, content: str) -> str:
content_lower = content.lower()
if '<html' in content_lower or '<body' in content_lower:
return 'html'
elif '#' in content and '##' in content:
return 'markdown'
elif '\\documentclass' in content_lower or '\\begin{' in content_lower:
return 'latex'
elif any(keyword in content_lower for keyword in ['abstract', 'introduction', 'conclusion', 'references']):
return 'academic'
elif 'def ' in content or 'class ' in content or 'function ' in content_lower:
return 'code'
else:
return 'plain'
def calculate_structure_score(self, content: str, doc_type: str) -> float:
patterns = self.structure_patterns.get(doc_type, [])
if not patterns:
return 0.5 # Default for unstructured content
line_count = len(content.split('\n'))
structured_lines = 0
for line in content.split('\n'):
for pattern in patterns:
if re.search(pattern, line.strip()):
structured_lines += 1
break
return min(structured_lines / max(line_count, 1), 1.0)
def calculate_complexity_score(self, content: str) -> float:
# Factors that increase complexity
avg_sentence_length = self.calculate_avg_sentence_length(content)
vocabulary_richness = self.calculate_vocabulary_richness(content)
nested_structure = self.detect_nested_structure(content)
# Normalize and combine
complexity = (
min(avg_sentence_length / 30, 1.0) * 0.3 +
vocabulary_richness * 0.4 +
nested_structure * 0.3
)
return min(complexity, 1.0)
def identify_content_types(self, content: str) -> List[str]:
types = []
if '```' in content or 'def ' in content or 'function ' in content.lower():
types.append('code')
if '|' in content and '\n' in content:
types.append('tables')
if re.search(r'\!\[.*\]\(.*\)', content):
types.append('images')
if re.search(r'http[s]?://', content):
types.append('links')
if re.search(r'\d+\.\d+', content) or re.search(r'\$\d', content):
types.append('numbers')
return types if types else ['text']
def detect_language(self, content: str) -> str:
# Simple language detection - can be enhanced with proper language detection libraries
if re.search(r'[\u4e00-\u9fff]', content):
return 'chinese'
        elif re.search(r'[\u0600-\u06ff]', content):
            return 'arabic'
        elif re.search(r'[\u0400-\u04ff]', content):
return 'russian'
else:
return 'english' # Default assumption
def estimate_tokens(self, content: str) -> int:
# Rough estimation - actual tokenization varies by model
word_count = len(content.split())
return int(word_count * 1.3) # Average tokens per word
def detect_multimodal_content(self, content: str) -> bool:
multimodal_indicators = [
r'\!\[.*\]\(.*\)', # Images
r'<iframe', # Embedded content
r'<object', # Embedded objects
r'<embed', # Embedded media
]
return any(re.search(pattern, content) for pattern in multimodal_indicators)
def calculate_avg_sentence_length(self, content: str) -> float:
sentences = re.split(r'[.!?]+', content)
sentences = [s.strip() for s in sentences if s.strip()]
if not sentences:
return 0
return sum(len(s.split()) for s in sentences) / len(sentences)
def calculate_vocabulary_richness(self, content: str) -> float:
words = content.lower().split()
if not words:
return 0
unique_words = set(words)
return len(unique_words) / len(words)
def detect_nested_structure(self, content: str) -> float:
# Detect nested lists, indented content, etc.
lines = content.split('\n')
indented_lines = 0
for line in lines:
if line.strip() and line.startswith(' '):
indented_lines += 1
return indented_lines / max(len(lines), 1)
```
### Strategy Selection Engine
```python
from abc import ABC, abstractmethod
from typing import Dict, Any
class ChunkingStrategy(ABC):
@abstractmethod
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
pass
class StrategySelector:
def __init__(self):
        # RecursiveStrategy, StructureAwareStrategy, and SemanticStrategy are
        # assumed to be implemented alongside the examples below
        self.strategies = {
'fixed_size': FixedSizeStrategy(),
'recursive': RecursiveStrategy(),
'structure_aware': StructureAwareStrategy(),
'semantic': SemanticStrategy(),
'adaptive': AdaptiveStrategy()
}
def select_strategy(self, analysis: DocumentAnalysis) -> str:
# Rule-based selection logic
if analysis.structure_score > 0.8 and analysis.doc_type in ['markdown', 'html', 'latex']:
return 'structure_aware'
elif analysis.complexity_score > 0.7 and analysis.estimated_tokens < 10000:
return 'semantic'
elif analysis.doc_type == 'code':
return 'structure_aware'
elif analysis.structure_score < 0.3:
return 'fixed_size'
elif analysis.complexity_score > 0.5:
return 'recursive'
else:
return 'adaptive'
def get_strategy(self, analysis: DocumentAnalysis) -> ChunkingStrategy:
strategy_name = self.select_strategy(analysis)
return self.strategies[strategy_name]
# Example strategy implementations
class FixedSizeStrategy(ChunkingStrategy):
def __init__(self, default_size=512, default_overlap=50):
self.default_size = default_size
self.default_overlap = default_overlap
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Adjust parameters based on analysis
if analysis.complexity_score > 0.7:
chunk_size = 1024
elif analysis.complexity_score < 0.3:
chunk_size = 256
else:
chunk_size = self.default_size
overlap = int(chunk_size * 0.1) # 10% overlap
# Implementation here...
return self._fixed_size_chunk(content, chunk_size, overlap)
    def _fixed_size_chunk(self, content: str, chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
        # Simple word-based sliding window; a tokenizer-based splitter such as
        # RecursiveCharacterTextSplitter can be substituted here
        words = content.split()
        step = max(chunk_size - overlap, 1)
        return [
            {'content': ' '.join(words[i:i + chunk_size]),
             'metadata': {'start_word': i}}
            for i in range(0, len(words), step)
        ]
class AdaptiveStrategy(ChunkingStrategy):
    def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
        # Combine multiple strategies based on content characteristics;
        # initialize both lists so the merge below never sees an unbound name
        structured_chunks: List[Dict[str, Any]] = []
        unstructured_chunks: List[Dict[str, Any]] = []
        if analysis.structure_score > 0.6:
            # Use structure-aware for structured parts
            structured_chunks = self._chunk_structured_parts(content, analysis)
        else:
            # Use fixed-size for unstructured parts
            unstructured_chunks = self._chunk_unstructured_parts(content, analysis)
        # Merge and optimize
        return self._merge_chunks(structured_chunks + unstructured_chunks)
def _chunk_structured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for structured content
pass
def _chunk_unstructured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
# Implementation for unstructured content
pass
def _merge_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
# Implementation for merging and optimizing chunks
pass
```
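Putting the analyzer and selector together (a sketch; `content` is any loaded document string):
```python
analyzer = DocumentAnalyzer()
selector = StrategySelector()

analysis = analyzer.analyze(content)
strategy = selector.get_strategy(analysis)
chunks = strategy.chunk(content, analysis)
print(f"{analysis.doc_type}: {len(chunks)} chunks via {selector.select_strategy(analysis)}")
```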
## Quality Assurance Framework
### Chunk Quality Metrics
```python
import re
from typing import List, Dict, Any
import numpy as np
class ChunkQualityAssessor:
def __init__(self):
self.quality_weights = {
'coherence': 0.3,
'completeness': 0.25,
'size_appropriateness': 0.2,
'semantic_similarity': 0.15,
'boundary_quality': 0.1
}
def assess_chunks(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> Dict[str, float]:
scores = {}
# Coherence: Do chunks make sense on their own?
scores['coherence'] = self._assess_coherence(chunks)
# Completeness: Do chunks preserve important information?
scores['completeness'] = self._assess_completeness(chunks, analysis)
# Size appropriateness: Are chunks within optimal size range?
scores['size_appropriateness'] = self._assess_size(chunks)
# Semantic similarity: Are chunks thematically consistent?
scores['semantic_similarity'] = self._assess_semantic_consistency(chunks)
# Boundary quality: Are chunk boundaries placed well?
scores['boundary_quality'] = self._assess_boundary_quality(chunks)
# Calculate overall quality score
overall_score = sum(
score * self.quality_weights[metric]
for metric, score in scores.items()
)
scores['overall'] = overall_score
return scores
def _assess_coherence(self, chunks: List[Dict[str, Any]]) -> float:
# Simple heuristic-based coherence assessment
coherence_scores = []
for chunk in chunks:
content = chunk['content']
# Check for complete sentences
sentences = re.split(r'[.!?]+', content)
complete_sentences = sum(1 for s in sentences if s.strip())
coherence = complete_sentences / max(len(sentences), 1)
coherence_scores.append(coherence)
return np.mean(coherence_scores)
def _assess_completeness(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if important structural elements are preserved
if analysis.doc_type in ['markdown', 'html']:
return self._assess_structure_preservation(chunks, analysis)
else:
return self._assess_content_preservation(chunks)
def _assess_structure_preservation(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
# Check if headings, lists, and other structural elements are preserved
preserved_elements = 0
total_elements = 0
for chunk in chunks:
content = chunk['content']
# Count preserved structural elements
headings = len(re.findall(r'^#+\s', content, re.MULTILINE))
lists = len(re.findall(r'^\s*[-*+]\s', content, re.MULTILINE))
preserved_elements += headings + lists
total_elements += 1 # Simplified count
return preserved_elements / max(total_elements, 1)
def _assess_content_preservation(self, chunks: List[Dict[str, Any]]) -> float:
# Simple check based on content ratio
total_content = ''.join(chunk['content'] for chunk in chunks)
# This would need comparison with original content
return 0.8 # Placeholder
def _assess_size(self, chunks: List[Dict[str, Any]]) -> float:
optimal_min = 100 # tokens
optimal_max = 1000 # tokens
size_scores = []
for chunk in chunks:
token_count = self._estimate_tokens(chunk['content'])
if optimal_min <= token_count <= optimal_max:
score = 1.0
elif token_count < optimal_min:
score = token_count / optimal_min
else:
score = max(0, 1 - (token_count - optimal_max) / optimal_max)
size_scores.append(score)
return np.mean(size_scores)
def _assess_semantic_consistency(self, chunks: List[Dict[str, Any]]) -> float:
# This would require embedding models for actual implementation
# Placeholder implementation
return 0.7
def _assess_boundary_quality(self, chunks: List[Dict[str, Any]]) -> float:
# Check if boundaries don't split important content
boundary_scores = []
for i, chunk in enumerate(chunks):
content = chunk['content']
# Check for incomplete sentences at boundaries
if not content.strip().endswith(('.', '!', '?', '>', '}')):
boundary_scores.append(0.5)
else:
boundary_scores.append(1.0)
return np.mean(boundary_scores)
def _estimate_tokens(self, content: str) -> int:
# Simple token estimation
return len(content.split()) * 4 // 3 # Rough approximation
```
## Error Handling and Edge Cases
### Robust Error Handling
```python
import logging
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class ChunkingError:
error_type: str
message: str
chunk_index: Optional[int] = None
recovery_action: Optional[str] = None
class ChunkingErrorHandler:
def __init__(self):
self.logger = logging.getLogger(__name__)
self.error_handlers = {
'empty_content': self._handle_empty_content,
'oversized_chunk': self._handle_oversized_chunk,
'encoding_error': self._handle_encoding_error,
'memory_error': self._handle_memory_error,
'structure_parsing_error': self._handle_structure_parsing_error
}
def handle_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
error_type = self._classify_error(error)
handler = self.error_handlers.get(error_type, self._handle_generic_error)
return handler(error, context)
def _classify_error(self, error: Exception) -> str:
if isinstance(error, ValueError) and 'empty' in str(error).lower():
return 'empty_content'
elif isinstance(error, MemoryError):
return 'memory_error'
elif isinstance(error, UnicodeError):
return 'encoding_error'
elif 'too large' in str(error).lower():
return 'oversized_chunk'
elif 'parsing' in str(error).lower():
return 'structure_parsing_error'
else:
return 'generic_error'
def _handle_empty_content(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Empty content encountered: {error}")
return ChunkingError(
error_type='empty_content',
message=str(error),
recovery_action='skip_empty_content'
)
def _handle_oversized_chunk(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Oversized chunk detected: {error}")
return ChunkingError(
error_type='oversized_chunk',
message=str(error),
chunk_index=context.get('chunk_index'),
recovery_action='reduce_chunk_size'
)
def _handle_encoding_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Encoding error: {error}")
return ChunkingError(
error_type='encoding_error',
message=str(error),
recovery_action='fallback_encoding'
)
def _handle_memory_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Memory error during chunking: {error}")
return ChunkingError(
error_type='memory_error',
message=str(error),
recovery_action='process_in_batches'
)
def _handle_structure_parsing_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.warning(f"Structure parsing failed: {error}")
return ChunkingError(
error_type='structure_parsing_error',
message=str(error),
recovery_action='fallback_to_fixed_size'
)
def _handle_generic_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
self.logger.error(f"Unexpected error during chunking: {error}")
return ChunkingError(
error_type='generic_error',
message=str(error),
recovery_action='skip_and_continue'
)
```
## Performance Optimization
### Caching and Memoization
```python
import hashlib
import json
import logging
import pickle
from typing import Any, Dict, List, Optional

import redis
class ChunkingCache:
def __init__(self, redis_url: Optional[str] = None):
if redis_url:
self.redis_client = redis.from_url(redis_url)
else:
self.redis_client = None
self.local_cache = {}
def _generate_cache_key(self, content: str, strategy: str, params: Dict[str, Any]) -> str:
content_hash = hashlib.md5(content.encode()).hexdigest()
params_str = json.dumps(params, sort_keys=True)
params_hash = hashlib.md5(params_str.encode()).hexdigest()
return f"chunking:{strategy}:{content_hash}:{params_hash}"
def get(self, content: str, strategy: str, params: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
cache_key = self._generate_cache_key(content, strategy, params)
# Try local cache first
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Try Redis cache
if self.redis_client:
try:
cached_data = self.redis_client.get(cache_key)
if cached_data:
chunks = pickle.loads(cached_data)
self.local_cache[cache_key] = chunks # Cache locally too
return chunks
except Exception as e:
logging.warning(f"Redis cache error: {e}")
return None
def set(self, content: str, strategy: str, params: Dict[str, Any], chunks: List[Dict[str, Any]]) -> None:
cache_key = self._generate_cache_key(content, strategy, params)
# Store in local cache
self.local_cache[cache_key] = chunks
# Store in Redis cache
if self.redis_client:
try:
cached_data = pickle.dumps(chunks)
self.redis_client.setex(cache_key, 3600, cached_data) # 1 hour TTL
except Exception as e:
logging.warning(f"Redis cache set error: {e}")
def clear_local_cache(self):
self.local_cache.clear()
def clear_redis_cache(self):
if self.redis_client:
pattern = "chunking:*"
keys = self.redis_client.keys(pattern)
if keys:
self.redis_client.delete(*keys)
```
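Typical read-through usage (a sketch; `chunk_fn` stands in for any chunking callable):
```python
cache = ChunkingCache(redis_url="redis://localhost:6379/0")
params = {'chunk_size': 512, 'overlap': 50}

chunks = cache.get(content, 'fixed_size', params)
if chunks is None:  # cache miss: compute once, then store
    chunks = chunk_fn(content, **params)
    cache.set(content, 'fixed_size', params, chunks)
```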
### Batch Processing
```python
import asyncio
import concurrent.futures
import logging
from typing import Any, Callable, Dict, List
class BatchChunkingProcessor:
def __init__(self, max_workers: int = 4, batch_size: int = 10):
self.max_workers = max_workers
self.batch_size = batch_size
def process_documents_batch(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process multiple documents in parallel"""
results = []
# Process in batches to avoid memory issues
for i in range(0, len(documents), self.batch_size):
batch = documents[i:i + self.batch_size]
with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_doc = {
executor.submit(chunking_function, doc): doc
for doc in batch
}
batch_results = []
for future in concurrent.futures.as_completed(future_to_doc):
try:
chunks = future.result()
batch_results.append(chunks)
except Exception as e:
logging.error(f"Error processing document: {e}")
batch_results.append([]) # Empty result for failed processing
results.extend(batch_results)
return results
async def process_documents_async(self, documents: List[str],
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
"""Process documents asynchronously"""
semaphore = asyncio.Semaphore(self.max_workers)
async def process_single_document(doc: str) -> List[Dict[str, Any]]:
async with semaphore:
# Run the synchronous chunking function in an executor
loop = asyncio.get_event_loop()
return await loop.run_in_executor(None, chunking_function, doc)
tasks = [process_single_document(doc) for doc in documents]
return await asyncio.gather(*tasks, return_exceptions=True)
```
## Monitoring and Observability
### Metrics Collection
```python
import time
from dataclasses import dataclass
from typing import Dict, Any, List
from collections import defaultdict
@dataclass
class ChunkingMetrics:
total_documents: int
total_chunks: int
avg_chunk_size: float
processing_time: float
memory_usage: float
error_count: int
strategy_distribution: Dict[str, int]
class MetricsCollector:
    def __init__(self):
        self.metrics = defaultdict(list)
        # Separate counter: storing counts inside a defaultdict(list) would fail
        self.strategy_counts = defaultdict(int)
        self.start_time = None
def start_timing(self):
self.start_time = time.time()
def end_timing(self) -> float:
if self.start_time:
duration = time.time() - self.start_time
self.metrics['processing_time'].append(duration)
self.start_time = None
return duration
return 0.0
def record_chunk_count(self, count: int):
self.metrics['chunk_count'].append(count)
def record_chunk_size(self, size: int):
self.metrics['chunk_size'].append(size)
    def record_strategy_usage(self, strategy: str):
        self.strategy_counts[strategy] += 1
def record_error(self, error_type: str):
self.metrics['errors'].append(error_type)
def record_memory_usage(self, memory_mb: float):
self.metrics['memory_usage'].append(memory_mb)
def get_summary(self) -> ChunkingMetrics:
return ChunkingMetrics(
total_documents=len(self.metrics['processing_time']),
total_chunks=sum(self.metrics['chunk_count']),
avg_chunk_size=sum(self.metrics['chunk_size']) / max(len(self.metrics['chunk_size']), 1),
processing_time=sum(self.metrics['processing_time']),
memory_usage=sum(self.metrics['memory_usage']) / max(len(self.metrics['memory_usage']), 1),
error_count=len(self.metrics['errors']),
            strategy_distribution=dict(self.strategy_counts)
)
    def reset(self):
        self.metrics.clear()
        self.strategy_counts.clear()
        self.start_time = None
```
This implementation guide provides a comprehensive foundation for building robust, scalable chunking systems that can handle various document types and use cases while maintaining high quality and performance.


@@ -0,0 +1,366 @@
# Key Research Papers and Findings
This document summarizes important research papers and findings related to chunking strategies for RAG systems.
## Seminal Papers
### "Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG" (arXiv:2504.19754)
**Key Findings**:
- Page-level chunking achieved highest average accuracy (0.648) with lowest variance across different query types
- Optimal chunk size varies significantly by document type and query complexity
- Factoid queries perform better with smaller chunks (256-512 tokens)
- Complex analytical queries benefit from larger chunks (1024+ tokens)
**Methodology**:
- Evaluated 7 different chunking strategies across multiple document types
- Tested with both factoid and analytical queries
- Measured end-to-end RAG performance
**Practical Implications**:
- Start with page-level chunking for general-purpose RAG systems
- Adapt chunk size based on expected query patterns
- Consider hybrid approaches for mixed query types
### "Lost in the Middle: How Language Models Use Long Contexts"
**Key Findings**:
- Language models tend to pay more attention to information at the beginning and end of context
- Information in the middle of long contexts is often ignored
- Performance degradation is most severe for centrally located information
**Practical Implications**:
- Place most important information at chunk boundaries
- Consider chunk overlap to ensure important context appears multiple times
- Use ranking to prioritize relevant chunks for inclusion in context
### "Grounded Language Learning in a Simulated 3D World"
**Related Concepts**:
- Importance of grounding text in visual/contextual information
- Multi-modal learning approaches for better understanding
**Relevance to Chunking**:
- Supports contextual chunking approaches that preserve visual/contextual relationships
- Validates importance of maintaining document structure and relationships
## Industry Research
### NVIDIA Research: "Finding the Best Chunking Strategy for Accurate AI Responses"
**Key Findings**:
- Page-level chunking outperformed sentence and paragraph-level approaches
- Fixed-size chunking showed consistent but suboptimal performance
- Semantic chunking provided improvements for complex documents
**Technical Details**:
- Tested chunk sizes from 128 to 2048 tokens
- Evaluated across financial, technical, and legal documents
- Measured both retrieval accuracy and generation quality
**Recommendations**:
- Use 512-1024 token chunks as starting point
- Implement adaptive chunking based on document complexity
- Consider page boundaries as natural chunk separators
### Cohere Research: "Effective Chunking Strategies for RAG"
**Key Findings**:
- Recursive character splitting provides good balance of performance and simplicity
- Document structure awareness improves retrieval by 15-20%
- Overlap of 10-20% provides optimal context preservation
**Methodology**:
- Compared 12 chunking strategies across 6 document types
- Measured retrieval precision, recall, and F1-score
- Tested with both dense and sparse retrieval
**Best Practices Identified**:
- Start with recursive character splitting with 10-20% overlap
- Preserve document structure (headings, lists, tables)
- Customize chunk size based on embedding model context window
### Anthropic: "Contextual Retrieval"
**Key Innovation**:
- Enhance each chunk with LLM-generated contextual information before embedding
- Improves retrieval precision by 25-30% for complex documents
- Particularly effective for technical and academic content
**Implementation Approach** (sketched in code after this list):
1. Split document using traditional methods
2. For each chunk, generate contextual information using LLM
3. Prepend context to chunk before embedding
4. Use hybrid search (dense + sparse) with weighted ranking
**Trade-offs**:
- Significant computational overhead (2-3x processing time)
- Higher embedding storage requirements
- Improved retrieval precision justifies cost for high-value applications
## Algorithmic Advances
### Semantic Chunking Algorithms
#### "Semantic Segmentation of Text Documents"
**Core Idea**: Use cosine similarity between consecutive sentence embeddings to identify natural boundaries.
**Algorithm**:
1. Split document into sentences
2. Generate embeddings for each sentence
3. Calculate similarity between consecutive sentences
4. Create boundaries where similarity drops below threshold
5. Merge short segments with neighbors
**Performance**: 20-30% improvement in retrieval relevance over fixed-size chunking for technical documents.
#### "Hierarchical Semantic Chunking"
**Core Idea**: Multi-level semantic segmentation for document organization.
**Algorithm**:
1. Document-level semantic analysis
2. Section-level boundary detection
3. Paragraph-level segmentation
4. Sentence-level refinement
**Benefits**: Maintains document hierarchy while adapting to semantic structure.
### Advanced Embedding Techniques
#### "Late Chunking: Contextual Chunk Embeddings"
**Core Innovation**: Generate embeddings for entire document first, then create chunk embeddings from token-level embeddings.
**Advantages**:
- Preserves global document context
- Reduces context fragmentation
- Better for documents with complex inter-relationships
**Requirements**:
- Long-context embedding models (8k+ tokens)
- Significant computational resources
- Specialized implementation
#### "Hierarchical Embedding Retrieval"
**Approach**: Create embeddings at multiple granularities (document, section, paragraph, sentence).
**Implementation**:
1. Generate embeddings at each level
2. Store in hierarchical vector database
3. Query at appropriate granularity based on information needs
**Performance**: 15-25% improvement in precision for complex queries.
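A rough sketch of the multi-granularity idea, assuming a sentence-transformers model and a naive query-routing heuristic (both are illustrative choices, not prescriptions from the paper):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_hierarchical_index(document, sections):
    """Embed the same content at three granularities so queries can be
    answered at the level that matches their scope."""
    sentences = [s.strip() for sec in sections for s in sec.split(".") if s.strip()]
    return {
        "document": model.encode([document]),
        "section": model.encode(sections),
        "sentence": model.encode(sentences),
    }

def pick_level(query):
    # Assumed routing rule: short factoid queries -> sentences,
    # broader analytical queries -> sections
    return "sentence" if len(query.split()) <= 8 else "section"
```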
## Evaluation Methodologies
### Retrieval-Augmented Generation Assessment Frameworks
#### RAGAS Framework
**Metrics**:
- **Faithfulness**: Consistency between generated answer and retrieved context
- **Answer Relevancy**: Relevance of generated answer to the question
- **Context Relevancy**: Relevance of retrieved context to the question
- **Context Recall**: Coverage of relevant information in retrieved context
**Evaluation Process**:
1. Generate questions from document corpus
2. Retrieve relevant chunks using different strategies
3. Generate answers using retrieved chunks
4. Evaluate using automated metrics and human judgment
#### ARES Framework
**Innovation**: Automated evaluation using synthetic questions and LLM-based assessment.
**Key Features**:
- Generates diverse question types (factoid, analytical, comparative)
- Uses LLMs to evaluate answer quality
- Provides scalable evaluation without human annotation
### Benchmark Datasets
#### Natural Questions (NQ)
**Description**: Real user questions from Google Search with relevant Wikipedia passages.
**Relevance**: Natural language queries with authentic relevance judgments.
#### MS MARCO
**Description**: Large-scale passage ranking dataset with real search queries.
**Relevance**: High-quality relevance judgments for passage retrieval.
#### HotpotQA
**Description**: Multi-hop question answering requiring information from multiple documents.
**Relevance**: Tests ability to retrieve and synthesize information from multiple chunks.
## Domain-Specific Research
### Medical Documents
#### "Optimal Chunking for Medical Question Answering"
**Key Findings**:
- Medical terminology requires specialized handling
- Section-based chunking (History, Diagnosis, Treatment) most effective
- Preserving doctor-patient dialogue context crucial
**Recommendations**:
- Use medical-specific tokenizers
- Preserve section headers and structure (see the section-splitting sketch below)
- Maintain temporal relationships in medical histories
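As a concrete illustration of the section-based approach, this sketch keys on a few common clinical headings; the heading list and regex are assumptions for demonstration, not a validated clinical schema.
```python
import re

# Illustrative clinical headings - extend for your note format
MEDICAL_SECTIONS = r"(History|Diagnosis|Treatment|Medications|Plan)\s*:"

def split_medical_note(note):
    parts = re.split(MEDICAL_SECTIONS, note)
    # re.split keeps the captured headings at odd indices
    return [
        f"{parts[i]}: {parts[i + 1].strip()}"
        for i in range(1, len(parts) - 1, 2)
    ]

note = "History: fall from ladder. Diagnosis: wrist fracture. Treatment: cast for six weeks."
print(split_medical_note(note))
```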
### Legal Documents
#### "Chunking Strategies for Legal Document Analysis"
**Key Findings**:
- Legal citations and cross-references require special handling
- Contract clause boundaries serve as natural chunk separators
- Case law benefits from hierarchical chunking
**Best Practices**:
- Preserve legal citation structure
- Use clause and section boundaries
- Maintain context for legal definitions and references
### Financial Documents
#### "SEC Filing Chunking for Financial Analysis"
**Key Findings**:
- Table preservation critical for financial data
- XBRL tagging provides natural segmentation
- Risk factors sections benefit from specialized treatment
**Approach**:
- Preserve complete tables when possible
- Use XBRL tags for structured data
- Create specialized chunks for risk sections
## Emerging Trends
### Multi-Modal Chunking
#### "Integrating Text, Tables, and Images in RAG Systems"
**Innovation**: Unified chunking approach for mixed-modal content.
**Approach**:
- Extract and describe images using vision models
- Preserve table structure and relationships
- Create unified embeddings for mixed content
**Results**: 35% improvement in complex document understanding.
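A structural sketch of the unified approach; `describe_image` stands in for whatever vision-model wrapper is available and is entirely hypothetical here.
```python
def chunk_mixed_content(elements, describe_image=None):
    """Normalize text, table, and image elements into text chunks that can
    share a single embedding space."""
    chunks = []
    for el in elements:
        if el["type"] == "text":
            chunks.append(el["content"])
        elif el["type"] == "table":
            # Keep the table serialization intact rather than splitting rows
            chunks.append("TABLE:\n" + el["content"])
        elif el["type"] == "image" and describe_image is not None:
            chunks.append("IMAGE DESCRIPTION: " + describe_image(el["path"]))
    return chunks
```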
### Adaptive Chunking
#### "Machine Learning-Based Chunk Size Optimization"
**Core Idea**: Use ML models to predict optimal chunking parameters.
**Features**:
- Document length and complexity
- Query type distribution
- Embedding model characteristics
- Performance requirements
**Benefits**: Dynamic optimization based on use case and content.
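Absent a trained model, a heuristic stand-in shows the shape of the idea; every threshold below is an illustrative assumption, not a learned parameter.
```python
def suggest_chunk_params(doc_text, query_type="factoid"):
    """Pick chunking parameters from simple document features."""
    digit_ratio = sum(ch.isdigit() for ch in doc_text) / max(len(doc_text), 1)
    base = 256 if query_type == "factoid" else 1024
    if digit_ratio > 0.05:
        base = min(base, 512)  # keep numerically dense content tightly scoped
    return {
        "chunk_size": base,
        "chunk_overlap": int(base * 0.15),  # within the 10-20% guidance above
    }

print(suggest_chunk_params("Revenue grew 14% to $2.1B in Q3...", query_type="analytical"))
```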
### Real-time Chunking
#### "Streaming Chunking for Live Document Processing"
**Innovation**: Process documents as they become available.
**Techniques**:
- Incremental boundary detection
- Dynamic chunk size adjustment
- Context preservation across chunks
**Applications**: Live news feeds, social media analysis, meeting transcripts.
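A minimal incremental chunker, assuming the stream delivers sentence-sized pieces; the buffer size and carry-over count are illustrative defaults.
```python
class StreamingChunker:
    """Emit chunks as soon as enough streamed text accumulates, carrying
    trailing sentences forward so context survives each boundary."""
    def __init__(self, max_chars=800, carry=1):
        self.max_chars = max_chars
        self.carry = carry
        self.buffer = []

    def feed(self, sentence):
        self.buffer.append(sentence)
        if sum(len(s) for s in self.buffer) >= self.max_chars:
            chunk = " ".join(self.buffer)
            self.buffer = self.buffer[-self.carry:]  # overlap across chunks
            return chunk
        return None

chunker = StreamingChunker(max_chars=60)
for piece in ["First update.", "Second update arrives.", "Third, longer update text here."]:
    emitted = chunker.feed(piece)
    if emitted:
        print(emitted)
```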
## Implementation Challenges
### Computational Efficiency
#### "Scalable Chunking for Large Document Collections"
**Challenges**:
- Processing millions of documents efficiently
- Memory usage optimization
- Distributed processing requirements
**Solutions**:
- Batch processing with parallel execution (a minimal sketch follows this list)
- Streaming approaches for large documents
- Distributed chunking with load balancing
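A minimal parallelism sketch with the standard library; `chunk_one` is a placeholder for any chunker in this document, and the worker and batch counts are arbitrary.
```python
from concurrent.futures import ProcessPoolExecutor

def chunk_one(doc):
    # Placeholder: substitute any chunking strategy from this document
    return [doc[i:i + 512] for i in range(0, len(doc), 512)]

def chunk_collection(docs, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches documents per worker to cut IPC overhead
        return list(pool.map(chunk_one, docs, chunksize=16))
```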
### Quality Assurance
#### "Evaluating Chunk Quality at Scale"
**Challenges**:
- Automated quality assessment
- Detecting poor chunk boundaries
- Maintaining consistency across document types
**Approaches**:
- Heuristic-based quality metrics (sketched after this list)
- LLM-based evaluation
- Human-in-the-loop validation
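The heuristic route can be as simple as a few descriptive statistics; the signals below (sentence-final punctuation, a 50-character "tiny chunk" cutoff) are assumptions to tune per corpus.
```python
import statistics

def chunk_quality_report(chunks):
    """Cheap signals for flagging poor chunk boundaries at scale."""
    lengths = [len(c) for c in chunks]
    return {
        "n_chunks": len(chunks),
        "length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        # Chunks that stop mid-sentence often indicate a bad boundary
        "dangling_endings": sum(
            not c.rstrip().endswith((".", "!", "?")) for c in chunks
        ),
        "tiny_chunks": sum(length < 50 for length in lengths),
    }
```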
## Future Research Directions
### Context-Aware Chunking
**Open Questions**:
- How to optimally preserve cross-chunk relationships?
- Can we predict chunk quality without human evaluation?
- What is the optimal balance between size and context?
### Domain Adaptation
**Research Areas**:
- Automatic domain detection and adaptation
- Transfer learning across domains
- Zero-shot chunking for new document types
### Evaluation Standards
**Needs**:
- Standardized evaluation benchmarks
- Cross-paper comparison methodologies
- Real-world performance metrics
## Practical Recommendations Based on Research
### Starting Points
1. **For General RAG Systems**: Page-level or recursive character chunking with 512-1024 tokens and 10-20% overlap
2. **For Technical Documents**: Structure-aware chunking with semantic boundary detection
3. **For High-Value Applications**: Contextual retrieval with LLM-generated context
### Evolution Strategy
1. **Begin**: Simple fixed-size chunking (512 tokens)
2. **Improve**: Add document structure awareness
3. **Optimize**: Implement semantic boundaries
4. **Advanced**: Consider contextual retrieval for critical use cases
### Key Success Factors
1. **Match strategy to document type and query patterns**
2. **Preserve document structure when beneficial**
3. **Use overlap to maintain context across boundaries**
4. **Monitor both accuracy and computational costs**
5. **Iterate based on specific use case requirements**
This research foundation provides evidence-based guidance for implementing effective chunking strategies across various domains and use cases.

# Detailed Chunking Strategies
This document provides comprehensive implementation details for all chunking strategies mentioned in the main skill.
## Level 1: Fixed-Size Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class FixedSizeChunker:
def __init__(self, chunk_size=512, chunk_overlap=50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
def chunk(self, documents):
return self.splitter.split_documents(documents)
```
### Parameter Recommendations
| Use Case | Chunk Size | Overlap | Rationale |
|----------|------------|---------|-----------|
| Factoid Queries | 256 | 25 | Small chunks for precise answers |
| General Q&A | 512 | 50 | Balanced approach for most cases |
| Analytical Queries | 1024 | 100 | Larger context for complex analysis |
| Code Documentation | 300 | 30 | Preserve code context while maintaining focus |
### Best Practices
- Start with 512 tokens and 10-20% overlap
- Adjust based on embedding model context window
- Use overlap for queries where context might span boundaries
- Measure length in tokens rather than characters when the embedding model budgets in tokens (see the sketch below)
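When the budget is in tokens, the splitter can measure length with the model's own tokenizer. A minimal sketch using tiktoken (the model name is illustrative):
```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # now measured in tokens, not characters
    chunk_overlap=50,
    length_function=lambda text: len(enc.encode(text)),
    separators=["\n\n", "\n", " ", ""],
)
```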
## Level 2: Recursive Character Chunking
### Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
class RecursiveChunker:
def __init__(self, chunk_size=512, separators=None):
self.chunk_size = chunk_size
self.separators = separators or ["\n\n", "\n", " ", ""]
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=0,
length_function=len,
separators=self.separators
)
def chunk(self, text):
return self.splitter.create_documents([text])
# Document-specific configurations
def get_chunker_for_document_type(doc_type):
configurations = {
"markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
"html": ["</div>", "</p>", "\n\n", "\n", " ", ""],
"code": ["\n\n", "\n", " ", ""],
"plain": ["\n\n", "\n", " ", ""]
}
return RecursiveChunker(separators=configurations.get(doc_type, ["\n\n", "\n", " ", ""]))
```
### Customization Guidelines
- **Markdown**: Use headings as primary separators
- **HTML**: Use block-level tags as separators
- **Code**: Preserve function and class boundaries
- **Academic papers**: Prioritize paragraph and section breaks
## Level 3: Structure-Aware Chunking
### Markdown Documents
```python
import markdown
from bs4 import BeautifulSoup
class MarkdownChunker:
def __init__(self, max_chunk_size=512):
self.max_chunk_size = max_chunk_size
def chunk(self, markdown_text):
html = markdown.markdown(markdown_text)
soup = BeautifulSoup(html, 'html.parser')
chunks = []
current_chunk = ""
current_heading = "Introduction"
for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'pre', 'table']):
if element.name.startswith('h'):
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_heading = element.get_text().strip()
current_chunk = f"{element}\n"
elif element.name in ['pre', 'table']:
# Preserve code blocks and tables intact
if len(current_chunk) + len(str(element)) > self.max_chunk_size:
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
current_chunk = f"{element}\n"
else:
current_chunk += f"{element}\n"
            else:
                # Flush before appending so paragraphs also respect max_chunk_size
                if len(current_chunk) + len(str(element)) > self.max_chunk_size and current_chunk.strip():
                    chunks.append({
                        "content": current_chunk.strip(),
                        "heading": current_heading
                    })
                    current_chunk = ""
                current_chunk += f"{element}\n"
if current_chunk.strip():
chunks.append({
"content": current_chunk.strip(),
"heading": current_heading
})
return chunks
```
### Code Documents
```python
import ast
import re
class CodeChunker:
def __init__(self, language='python'):
self.language = language
def chunk_python(self, code):
tree = ast.parse(code)
chunks = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
start_line = node.lineno - 1
end_line = node.end_lineno if hasattr(node, 'end_lineno') else start_line + 10
lines = code.split('\n')
chunk_lines = lines[start_line:end_line]
chunks.append('\n'.join(chunk_lines))
return chunks
def chunk_javascript(self, code):
        # Regex fallback for languages without an AST parser; these naive
        # patterns do not handle nested braces and will truncate functions
        # that contain inner blocks
        function_pattern = r'(function\s+\w+\s*\([^)]*\)\s*\{[^}]*\})'
class_pattern = r'(class\s+\w+\s*\{[^}]*\})'
patterns = [function_pattern, class_pattern]
chunks = []
for pattern in patterns:
matches = re.finditer(pattern, code, re.MULTILINE | re.DOTALL)
for match in matches:
chunks.append(match.group(1))
return chunks
def chunk(self, code):
if self.language == 'python':
return self.chunk_python(code)
elif self.language == 'javascript':
return self.chunk_javascript(code)
else:
# Fallback to line-based chunking
return self.chunk_by_lines(code)
def chunk_by_lines(self, code, max_lines=50):
lines = code.split('\n')
chunks = []
for i in range(0, len(lines), max_lines):
chunk = '\n'.join(lines[i:i+max_lines])
chunks.append(chunk)
return chunks
```
### Tabular Data
```python
import pandas as pd
from io import StringIO  # needed when table_data arrives as a CSV string
class TableChunker:
def __init__(self, max_rows=100, summary_rows=5):
self.max_rows = max_rows
self.summary_rows = summary_rows
def chunk(self, table_data):
if isinstance(table_data, str):
df = pd.read_csv(StringIO(table_data))
else:
df = table_data
chunks = []
if len(df) <= self.max_rows:
# Small table - keep intact
chunks.append({
"type": "full_table",
"content": df.to_string(),
"metadata": {
"rows": len(df),
"columns": len(df.columns)
}
})
else:
# Large table - create summary + chunks
summary = df.head(self.summary_rows)
chunks.append({
"type": "table_summary",
"content": f"Table Summary ({len(df)} rows, {len(df.columns)} columns):\n{summary.to_string()}",
"metadata": {
"total_rows": len(df),
"summary_rows": self.summary_rows,
"columns": list(df.columns)
}
})
# Chunk the remaining data
for i in range(self.summary_rows, len(df), self.max_rows):
chunk_df = df.iloc[i:i+self.max_rows]
chunks.append({
"type": "table_chunk",
"content": f"Rows {i+1}-{min(i+self.max_rows, len(df))}:\n{chunk_df.to_string()}",
"metadata": {
"start_row": i + 1,
"end_row": min(i + self.max_rows, len(df)),
"columns": list(df.columns)
}
})
return chunks
```
## Level 4: Semantic Chunking
### Implementation
```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class SemanticChunker:
def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.8, buffer_size=3):
self.model = SentenceTransformer(model_name)
self.similarity_threshold = similarity_threshold
self.buffer_size = buffer_size
def split_into_sentences(self, text):
# Simple sentence splitting - can be enhanced with nltk/spacy
sentences = re.split(r'[.!?]+', text)
return [s.strip() for s in sentences if s.strip()]
def chunk(self, text):
sentences = self.split_into_sentences(text)
if len(sentences) <= self.buffer_size:
return [text]
# Create embeddings
embeddings = self.model.encode(sentences)
chunks = []
current_chunk_sentences = []
for i in range(len(sentences)):
current_chunk_sentences.append(sentences[i])
# Check if we should create a boundary
if i < len(sentences) - 1:
similarity = cosine_similarity(
[embeddings[i]],
[embeddings[i + 1]]
)[0][0]
if similarity < self.similarity_threshold and len(current_chunk_sentences) >= 2:
chunks.append(' '.join(current_chunk_sentences))
current_chunk_sentences = []
# Add remaining sentences
if current_chunk_sentences:
chunks.append(' '.join(current_chunk_sentences))
return chunks
```
### Parameter Tuning
| Parameter | Range | Effect |
|-----------|-------|--------|
| similarity_threshold | 0.5-0.9 | Higher values create more chunks |
| buffer_size | 1-10 | Larger buffers provide more context |
| model_name | Various | Different models for different domains |
### Optimization Tips
- Use domain-specific models for specialized content
- Adjust threshold based on content complexity
- Cache embeddings for repeated processing (see the wrapper sketch below)
- Consider batch processing for large documents
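For the caching tip, a small memoizing wrapper around any sentence-transformers-style encoder is often enough. This is a sketch; the SHA-1 keying is an arbitrary choice.
```python
import hashlib

class CachedEncoder:
    """Memoize sentence embeddings so repeated chunking passes skip
    redundant encode calls."""
    def __init__(self, model):
        self.model = model
        self._cache = {}

    def encode(self, sentences):
        missing = [s for s in sentences if self._key(s) not in self._cache]
        if missing:
            for s, emb in zip(missing, self.model.encode(missing)):
                self._cache[self._key(s)] = emb
        return [self._cache[self._key(s)] for s in sentences]

    @staticmethod
    def _key(sentence):
        return hashlib.sha1(sentence.encode("utf-8")).hexdigest()
```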
## Level 5: Advanced Contextual Methods
### Late Chunking
```python
import torch
from transformers import AutoTokenizer, AutoModel
class LateChunker:
def __init__(self, model_name="microsoft/DialoGPT-medium"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModel.from_pretrained(model_name)
def chunk(self, text, chunk_size=512):
# Tokenize entire document
tokens = self.tokenizer(text, return_tensors="pt", truncation=False)
# Get token-level embeddings
with torch.no_grad():
outputs = self.model(**tokens, output_hidden_states=True)
token_embeddings = outputs.last_hidden_state[0]
# Create chunk embeddings from token embeddings
chunks = []
for i in range(0, len(token_embeddings), chunk_size):
chunk_tokens = token_embeddings[i:i+chunk_size]
chunk_embedding = torch.mean(chunk_tokens, dim=0)
chunks.append({
"content": self.tokenizer.decode(tokens["input_ids"][0][i:i+chunk_size]),
"embedding": chunk_embedding.numpy()
})
return chunks
```
### Contextual Retrieval
```python
import openai
class ContextualChunker:
def __init__(self, api_key):
self.client = openai.OpenAI(api_key=api_key)
def generate_context(self, chunk, full_document):
prompt = f"""
Given the following document and a chunk from it, provide a brief context
that helps understand the chunk's meaning within the full document.
Document:
{full_document[:2000]}...
Chunk:
{chunk}
Context (max 50 words):
"""
response = self.client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
max_tokens=100,
temperature=0
)
return response.choices[0].message.content.strip()
def chunk_with_context(self, text, base_chunker):
# First create base chunks
base_chunks = base_chunker.chunk(text)
# Then add context to each chunk
contextualized_chunks = []
for chunk in base_chunks:
context = self.generate_context(chunk.page_content, text)
contextualized_content = f"Context: {context}\n\nContent: {chunk.page_content}"
contextualized_chunks.append({
"content": contextualized_content,
"original_content": chunk.page_content,
"context": context
})
return contextualized_chunks
```
## Performance Considerations
### Computational Cost Analysis
| Strategy | Time Complexity | Space Complexity | Relative Cost |
|----------|-----------------|------------------|---------------|
| Fixed-Size | O(n) | O(n) | Low |
| Recursive | O(n) | O(n) | Low |
| Structure-Aware | O(n log n) | O(n) | Medium |
| Semantic | O(n²) | O(n²) | High |
| Late Chunking | O(n) | O(n) | Very High |
| Contextual | O(n²) | O(n²) | Very High |
### Optimization Strategies
1. **Parallel Processing**: Process chunks concurrently when possible
2. **Caching**: Store embeddings and intermediate results
3. **Batch Operations**: Group similar operations together
4. **Progressive Loading**: Process large documents in streaming fashion
5. **Model Selection**: Choose appropriate models for task complexity

# Recommended Libraries and Frameworks
This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.
## Core Chunking Libraries
### LangChain
**Overview**: Comprehensive framework for building applications with large language models, includes robust text splitting utilities.
**Installation**:
```bash
pip install langchain langchain-text-splitters
```
**Key Features**:
- Multiple text splitting strategies
- Integration with various document loaders
- Support for different content types (code, markdown, etc.)
- Customizable separators and parameters
**Example Usage**:
```python
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
TokenTextSplitter,
MarkdownTextSplitter,
PythonCodeTextSplitter
)
# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(large_text)
# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
```
**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types
**Cons**:
- Can be heavy dependency for simple use cases
- Some advanced features require LangChain ecosystem
### LlamaIndex
**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.
**Installation**:
```bash
pip install llama-index
```
**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases
**Example Usage**:
```python
from llama_index.core.node_parser import (
SentenceSplitter,
SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
# Basic sentence splitting
splitter = SentenceSplitter(
chunk_size=1024,
chunk_overlap=20
)
# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
buffer_size=1,
breakpoint_percentile_threshold=95,
embed_model=embed_model
)
# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```
**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support
**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires embedding model setup
### Unstructured
**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.
**Installation**:
```bash
pip install "unstructured[pdf,png,jpg]"
```
**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing
**Example Usage**:
```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
# Partition document by type
elements = partition(filename="document.pdf")
# Chunk by title/heading structure
chunks = chunk_by_title(
elements,
combine_text_under_n_chars=2000,
max_characters=10000,
new_after_n_chars=1500,
multipage_sections=True
)
# Access chunked content
for chunk in chunks:
print(f"Category: {chunk.category}")
print(f"Content: {chunk.text[:200]}...")
```
**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities
**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats
## Text Processing Libraries
### NLTK (Natural Language Toolkit)
**Installation**:
```bash
pip install nltk
```
**Key Features**:
- Sentence tokenization
- Language detection
- Text preprocessing
- Linguistic analysis
**Example Usage**:
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
# Download required data
nltk.download('punkt')
nltk.download('stopwords')
# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)
# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```
### spaCy
**Installation**:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```
**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection
**Example Usage**:
```python
import spacy
# Load language model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp("This is a sample sentence. This is another sentence.")
# Extract sentences
sentences = [sent.text for sent in doc.sents]
# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
# Dependency parsing for better chunking
for token in doc:
print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```
### Sentence Transformers
**Installation**:
```bash
pip install sentence-transformers
```
**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training
**Example Usage**:
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)
# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])
# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)
boundaries = [0]
for i in range(1, len(sentences)):
similarity = util.cos_sim(embeddings[i-1], embeddings[i])
if similarity < threshold:
boundaries.append(i)
return boundaries
```
## Vector Databases and Search
### ChromaDB
**Installation**:
```bash
pip install chromadb
```
**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering
**Example Usage**:
```python
import chromadb
from chromadb.utils import embedding_functions
# Initialize client
client = chromadb.Client()
# Create collection
collection = client.create_collection(
name="document_chunks",
embedding_function=embedding_functions.DefaultEmbeddingFunction()
)
# Add chunks
collection.add(
documents=[chunk["content"] for chunk in chunks],
metadatas=[chunk.get("metadata", {}) for chunk in chunks],
ids=[chunk["id"] for chunk in chunks]
)
# Search
results = collection.query(
query_texts=["What is chunking?"],
n_results=5
)
```
### Pinecone
**Installation**:
```bash
pip install pinecone
```
**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure
**Example Usage**:
```python
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
# Initialize client (the v3+ SDK replaces the older pinecone.init pattern)
pc = Pinecone(api_key="your-api-key")
index_name = "document-chunks"
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Match embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)
# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
embedding = model.encode(chunk["content"])
index.upsert(
vectors=[{
"id": chunk["id"],
"values": embedding.tolist(),
"metadata": chunk.get("metadata", {})
}]
)
# Search
query_embedding = model.encode("search query")
results = index.query(
vector=query_embedding.tolist(),
top_k=5,
include_metadata=True
)
```
### Weaviate
**Installation**:
```bash
pip install weaviate-client
```
**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation
**Example Usage**:
```python
import weaviate
# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")
# Define schema
client.schema.create_class({
"class": "DocumentChunk",
"description": "A chunk of document content",
"properties": [
{
"name": "content",
"dataType": ["text"]
},
{
"name": "source",
"dataType": ["string"]
}
]
})
# Add data
for chunk in chunks:
client.data_object.create(
data_object={
"content": chunk["content"],
"source": chunk.get("source", "unknown")
},
class_name="DocumentChunk"
)
# Search
results = client.query.get(
"DocumentChunk",
["content", "source"]
).with_near_text({
"concepts": ["search query"]
}).with_limit(5).do()
```
## Evaluation and Testing
### RAGAS
**Installation**:
```bash
pip install ragas
```
**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation
**Example Usage**:
```python
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
)
from datasets import Dataset
# Prepare evaluation data
dataset = Dataset.from_dict({
"question": ["What is chunking?"],
"answer": ["Chunking is the process of breaking large documents into smaller segments"],
"contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
"ground_truth": ["Chunking is a document processing technique"]
})
# Evaluate
result = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
]
)
print(result)
```
### TruEra (TruLens)
**Installation**:
```bash
pip install trulens trulens-apps
```
**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring
**Example Usage**:
```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement
# Initialize session
session = TruSession()
# Define feedback functions
f_groundedness = GroundTruthAgreement(ground_truth)
# Evaluate chunks
@instrument
def chunk_and_query(text, query):
chunks = chunk_function(text)
relevant_chunks = search_function(chunks, query)
answer = generate_function(relevant_chunks, query)
return answer
# Record evaluation
with session:
chunk_and_query("large document text", "what is the main topic?")
```
## Document Processing
### PyPDF2
**Installation**:
```bash
pip install PyPDF2
```
**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing
**Example Usage**:
```python
import PyPDF2
def extract_text_from_pdf(pdf_path):
text = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for page in reader.pages:
text += page.extract_text()
return text
# Extract text by page for better chunking
def extract_pages(pdf_path):
pages = []
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for i, page in enumerate(reader.pages):
pages.append({
"page_number": i + 1,
"content": page.extract_text()
})
return pages
```
### python-docx
**Installation**:
```bash
pip install python-docx
```
**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access
**Example Usage**:
```python
from docx import Document
def extract_from_docx(docx_path):
doc = Document(docx_path)
content = []
for paragraph in doc.paragraphs:
if paragraph.text.strip():
content.append({
"type": "paragraph",
"text": paragraph.text,
"style": paragraph.style.name
})
for table in doc.tables:
table_text = []
for row in table.rows:
row_text = [cell.text for cell in row.cells]
table_text.append(" | ".join(row_text))
content.append({
"type": "table",
"text": "\n".join(table_text)
})
return content
```
## Specialized Libraries
### tiktoken (OpenAI)
**Installation**:
```bash
pip install tiktoken
```
**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Language model specific tokenization
**Example Usage**:
```python
import tiktoken
# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")
# Decode tokens
text = encoding.decode(tokens)
# Count tokens without full encoding
def count_tokens(text, model="gpt-3.5-turbo"):
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = encoding.encode(text)
chunks = []
for i in range(0, len(tokens), max_tokens):
chunk_tokens = tokens[i:i + max_tokens]
chunk_text = encoding.decode(chunk_tokens)
chunks.append(chunk_text)
return chunks
```
### PDFMiner
**Installation**:
```bash
pip install pdfminer.six
```
**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction
**Example Usage**:
```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
def extract_structured_text(pdf_path):
    structured_content = []
    for page_layout in extract_pages(pdf_path):
        page_content = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # Font details live on individual characters (LTChar), not on
                # the container, so sample the first character we encounter
                font_size, is_bold = None, False
                for line in element:
                    if isinstance(line, LTChar):
                        font_size, is_bold = line.size, "Bold" in line.fontname
                        break
                    if not isinstance(line, LTTextContainer):
                        continue
                    for char in line:
                        if isinstance(char, LTChar):
                            font_size = char.size
                            is_bold = "Bold" in char.fontname
                            break
                    if font_size is not None:
                        break
                page_content.append({
                    "text": element.get_text().strip(),
                    "font_info": {
                        "font_size": font_size,
                        "is_bold": is_bold,
                        "x0": element.x0,
                        "y0": element.y0
                    }
                })
        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })
    return structured_content
```
## Performance and Optimization
### Dask
**Installation**:
```bash
pip install dask[complete]
```
**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas
**Example Usage**:
```python
import dask.bag as db
from dask.distributed import Client
# Setup distributed client
client = Client(n_workers=4)
# Parallel chunking of multiple documents
def chunk_document(document):
# Your chunking logic here
return chunk_function(document)
# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...] # List of document contents
document_bag = db.from_sequence(documents)
# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)
# Compute results
results = chunked_documents.compute()
```
### Ray
**Installation**:
```bash
pip install ray
```
**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration
**Example Usage**:
```python
import ray
# Initialize Ray
ray.init()
@ray.remote
class ChunkingWorker:
def __init__(self, strategy):
self.strategy = strategy
def chunk_documents(self, documents):
results = []
for doc in documents:
chunks = self.strategy.chunk(doc)
results.append(chunks)
return results
# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]
# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
for worker, batch in zip(workers, documents_batch)]
# Get results
results = ray.get(futures)
```
## Development and Testing
### pytest
**Installation**:
```bash
pip install pytest pytest-asyncio
```
**Example Tests**:
```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker
class TestFixedSizeChunker:
    def test_chunk_size_respect(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 250 characters
        chunks = chunker.chunk(text)
        for chunk in chunks:
            # chunk_size is measured in characters for this chunker
            assert len(chunk) <= 100
    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = "word " * 30
        chunks = chunker.chunk(text)
        # Each chunk should start with text that also closes the previous one
        for i in range(1, len(chunks)):
            overlap_head = chunks[i][:10].strip()
            assert overlap_head and overlap_head in chunks[i - 1]
@pytest.mark.asyncio
async def test_semantic_chunker():
chunker = SemanticChunker()
text = "First topic sentence. Another sentence about first topic. " \
"Now switching to second topic. More about second topic."
chunks = await chunker.chunk_async(text)
# Should detect topic change and create boundary
assert len(chunks) >= 2
assert "first topic" in chunks[0].lower()
assert "second topic" in chunks[1].lower()
```
### Memory Profiler
**Installation**:
```bash
pip install memory-profiler
```
**Example Usage**:
```python
from memory_profiler import profile
@profile
def chunk_large_document():
chunker = FixedSizeChunker(chunk_size=1000)
large_text = "word " * 100000 # Large document
chunks = chunker.chunk(large_text)
return chunks
# Run with: python -m memory_profiler your_script.py
```
This comprehensive toolset provides everything needed to implement, test, and optimize chunking strategies for various use cases, from simple text processing to production-grade RAG systems.
