4.4 KiB
Document Chunking Strategies
Overview
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
Chunking Strategies
1. Recursive Character Text Splitter
Method: Split text based on character count, trying separators in order Use Case: General purpose text splitting Advantages: Preserves sentence and paragraph boundaries when possible
from langchain.text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""] # Try these in order
)
chunks = splitter.split_documents(documents)
2. Token-Based Splitting
Method: Split based on token count rather than characters Use Case: When working with token limits of language models Advantages: Better control over context window usage
from langchain.text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=512,
chunk_overlap=50
)
chunks = splitter.split_documents(documents)
3. Semantic Chunking
Method: Split based on semantic similarity Use Case: When maintaining semantic coherence is important Advantages: Chunks are more semantically meaningful
from langchain.text_splitters import SemanticChunker
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile"
)
chunks = splitter.split_documents(documents)
4. Markdown Header Splitter
Method: Split based on markdown headers Use Case: Structured documents with clear hierarchical organization Advantages: Maintains document structure and context
from langchain.text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_documents(documents)
5. HTML Splitter
Method: Split based on HTML tags Use Case: Web pages and HTML documents Advantages: Preserves HTML structure and metadata
from langchain.text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_documents(documents)
Parameter Tuning
Chunk Size
- Small chunks (200-400 tokens): More precise retrieval, but may lose context
- Medium chunks (500-1000 tokens): Good balance of precision and context
- Large chunks (1000-2000 tokens): More context, but less precise retrieval
Chunk Overlap
- Purpose: Preserve context at chunk boundaries
- Typical range: 10-20% of chunk size
- Higher overlap: Better context preservation, but more redundancy
- Lower overlap: Less redundancy, but may lose important context
Separators
- Hierarchical separators: Start with larger boundaries (paragraphs), then smaller (sentences)
- Custom separators: Add domain-specific separators for better results
- Language-specific: Adjust for different languages and writing styles
Best Practices
- Preserve Context: Ensure chunks contain enough surrounding context
- Maintain Coherence: Keep semantically related content together
- Respect Boundaries: Avoid breaking sentences or important phrases
- Consider Query Types: Adapt chunking strategy to typical user queries
- Test and Iterate: Evaluate different chunking strategies for your specific use case
Evaluation Metrics
- Retrieval Quality: How well chunks answer user queries
- Context Preservation: Whether important context is maintained
- Chunk Distribution: Evenness of chunk sizes
- Boundary Quality: How natural chunk boundaries are
- Retrieval Efficiency: Impact on retrieval speed and accuracy
Advanced Techniques
Adaptive Chunking
Adjust chunk size based on document structure and content density.
Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios.
Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns.
Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).