---
name: chunking-strategy
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
allowed-tools: Read, Write, Bash
category: artificial-intelligence
tags: [rag, chunking, vector-search, embeddings, document-processing]
version: 1.0.0
---

# Chunking Strategy for RAG Systems

## Overview

Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.

## When to Use

Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or tuning existing RAG systems that suffer from poor retrieval quality.

## Instructions

### Choose Chunking Strategy

Select an appropriate chunking strategy based on document type and use case:

1. **Fixed-Size Chunking** (Level 1)
   - Use for simple documents without clear structure
   - Start with 512 tokens and 10-20% overlap
   - Adjust size to the query type: 256 tokens for factoid queries, 1024 for analytical ones

2. **Recursive Character Chunking** (Level 2)
   - Use for documents with clear structural boundaries
   - Implement hierarchical separators: paragraphs → sentences → words
   - Customize separators for document types such as HTML or Markdown (see the first sketch after this list)

3. **Structure-Aware Chunking** (Level 3)
   - Use for structured documents (Markdown, code, tables, PDFs)
   - Preserve semantic units: functions, sections, table blocks (see the second sketch after this list)
   - Validate structure preservation after splitting

4. **Semantic Chunking** (Level 4)
   - Use for complex documents with thematic shifts
   - Implement embedding-based boundary detection
   - Configure the similarity threshold (e.g., 0.8) and buffer size (3-5 sentences)

5. **Advanced Methods** (Level 5)
   - Use Late Chunking for long-context embedding models
   - Apply Contextual Retrieval for high-precision requirements (a sketch follows the reference below)
   - Monitor computational costs vs. retrieval improvements
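
For Level 2, a minimal sketch of recursive chunking with separators customized for Markdown. It assumes the same LangChain splitter used in the Examples section; the separator list is an illustrative choice, not a prescription:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try the coarsest separator first and fall back to finer ones, so
# splits land on section boundaries before paragraph or word breaks.
markdown_splitter = RecursiveCharacterTextSplitter(
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap, within the 10-20% guideline
)

chunks = markdown_splitter.split_text(markdown_text)  # markdown_text: your document as a string
```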
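
For Level 3, a sketch of structure-aware splitting that keeps whole Markdown sections intact. It assumes LangChain's MarkdownHeaderTextSplitter; adjust the header levels to match your documents:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on heading boundaries so each chunk is a complete section, and
# record the heading path as metadata for retrieval-time filtering.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)

sections = header_splitter.split_text(markdown_text)  # documents with heading metadata
```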
Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
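
For Level 5, a sketch of Contextual Retrieval: before embedding, each chunk is prefixed with a short LLM-generated description of where it sits in the document. The `generate_context` callable below is a hypothetical stand-in for whatever LLM client you use:

```python
def contextualize_chunks(document, chunks, generate_context):
    """Prefix each chunk with situating context before embedding.

    generate_context(document, chunk) is a hypothetical helper that
    returns a 1-2 sentence description of how the chunk fits into
    the full document.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk)
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```

This trades one LLM call per chunk for higher retrieval precision, which is why the strategy list pairs it with a cost warning.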

### Implement Chunking Pipeline

Follow these steps to implement effective chunking:

1. **Pre-process documents**
   - Analyze document structure and content types
   - Identify multi-modal content (tables, images, code)
   - Assess information density and complexity

2. **Select strategy parameters**
   - Choose chunk size based on the embedding model's context window
   - Set overlap percentage (10-20% for most cases)
   - Configure strategy-specific parameters

3. **Process and validate**
   - Apply the chosen chunking strategy (see the sketch after this list)
   - Validate semantic coherence of chunks
   - Test with representative documents

4. **Evaluate and iterate**
   - Measure retrieval precision and recall
   - Monitor processing latency and resource usage
   - Optimize based on specific use case requirements
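
A minimal sketch of steps 2-3 under simple assumptions: character-based sizing, the LangChain splitter from the Examples section, and a crude validation pass that drops fragments too short to stand alone:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_and_validate(text, chunk_size=512, overlap_pct=0.15):
    """Chunk text, then filter out fragments with no standalone meaning."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_pct),  # step 2: 10-20% overlap
    )
    chunks = splitter.split_text(text)  # step 3: apply the strategy

    # Crude validation only: the 50-character floor is an arbitrary
    # illustration; real pipelines should also check semantic coherence.
    return [c for c in chunks if len(c.strip()) >= 50]
```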

Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).

### Evaluate Performance

Use these metrics to evaluate chunking effectiveness (a sketch of the first two follows the list):

- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
- **End-to-End Accuracy**: Quality of final RAG responses
- **Processing Time**: Latency impact on the overall system
- **Resource Usage**: Memory and computational costs
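
A minimal sketch of computing the first two metrics over a labeled query set; the data shapes (sets of chunk IDs per query) are illustrative assumptions:

```python
def retrieval_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def retrieval_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Average over a labeled query set: query -> (retrieved IDs, relevant IDs)
labels = {
    "q1": ({"c1", "c2", "c3"}, {"c1", "c4"}),
    "q2": ({"c5"}, {"c5"}),
}
precision = sum(retrieval_precision(r, g) for r, g in labels.values()) / len(labels)
recall = sum(retrieval_recall(r, g) for r, g in labels.values()) / len(labels)
```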

Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).

## Examples

### Basic Fixed-Size Chunking

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks with ~10% overlap.
# Note: length_function=len counts characters, not tokens; see the
# token-based variant below for token-accurate sizing.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)

chunks = splitter.split_documents(documents)
```
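
If sizes must be counted in tokens rather than characters (as the token guidance above assumes), LangChain provides a tiktoken-backed constructor; a sketch, assuming the `tiktoken` package is installed:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are now measured in tokens of the chosen encoding.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=256,
    chunk_overlap=25,
)
```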

### Structure-Aware Code Chunking

```python
import ast

def chunk_python_code(code):
    """Split Python code into top-level semantic chunks (functions and classes)."""
    tree = ast.parse(code)
    chunks = []

    # Iterate over top-level statements only; ast.walk would also visit
    # nested definitions and emit overlapping chunks.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))

    return chunks
```

### Semantic Chunking with Embeddings

```python
import numpy as np

def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text at semantic boundaries between adjacent sentences.

    split_into_sentences and generate_embeddings are assumed helpers:
    any sentence tokenizer and embedding model will do.
    """
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = generate_embeddings(sentences)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        # A drop in similarity between adjacent sentences marks a topic
        # shift, so close the current chunk and start a new one.
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks
```

## Best Practices

### Core Principles

- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial

### Implementation Guidelines

- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics

### Common Pitfalls to Avoid

- Over-chunking: creating too many small, context-poor chunks
- Under-chunking: burying relevant information inside oversized chunks
- Ignoring document structure and semantic boundaries
- Applying a one-size-fits-all approach to diverse content types
- Neglecting overlap, which loses information that crosses chunk boundaries

## Constraints

### Resource Considerations

- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing

### Quality Requirements

- Validate the semantic coherence of chunks after processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases

## References

Reference detailed documentation in the [references/](references/) folder:

- [strategies.md](references/strategies.md) - Detailed strategy implementations
- [implementation.md](references/implementation.md) - Complete implementation guidelines
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
- [tools.md](references/tools.md) - Recommended libraries and frameworks
- [research.md](references/research.md) - Key research papers and findings
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools