137 lines
4.4 KiB
Markdown
137 lines
4.4 KiB
Markdown
# Document Chunking Strategies
|
|
|
|
## Overview
|
|
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
|
|
|
|
## Chunking Strategies
|
|
|
|
### 1. Recursive Character Text Splitter
|
|
**Method**: Split text based on character count, trying separators in order
|
|
**Use Case**: General purpose text splitting
|
|
**Advantages**: Preserves sentence and paragraph boundaries when possible
|
|
|
|
```python
|
|
from langchain.text_splitters import RecursiveCharacterTextSplitter
|
|
|
|
splitter = RecursiveCharacterTextSplitter(
|
|
chunk_size=1000,
|
|
chunk_overlap=200,
|
|
length_function=len,
|
|
separators=["\n\n", "\n", " ", ""] # Try these in order
|
|
)
|
|
chunks = splitter.split_documents(documents)
|
|
```
|
|
|
|
### 2. Token-Based Splitting
|
|
**Method**: Split based on token count rather than characters
|
|
**Use Case**: When working with token limits of language models
|
|
**Advantages**: Better control over context window usage
|
|
|
|
```python
|
|
from langchain.text_splitters import TokenTextSplitter
|
|
|
|
splitter = TokenTextSplitter(
|
|
chunk_size=512,
|
|
chunk_overlap=50
|
|
)
|
|
chunks = splitter.split_documents(documents)
|
|
```
|
|
|
|
### 3. Semantic Chunking
|
|
**Method**: Split based on semantic similarity
|
|
**Use Case**: When maintaining semantic coherence is important
|
|
**Advantages**: Chunks are more semantically meaningful
|
|
|
|
```python
|
|
from langchain.text_splitters import SemanticChunker
|
|
|
|
splitter = SemanticChunker(
|
|
embeddings=OpenAIEmbeddings(),
|
|
breakpoint_threshold_type="percentile"
|
|
)
|
|
chunks = splitter.split_documents(documents)
|
|
```
|
|
|
|
### 4. Markdown Header Splitter
|
|
**Method**: Split based on markdown headers
|
|
**Use Case**: Structured documents with clear hierarchical organization
|
|
**Advantages**: Maintains document structure and context
|
|
|
|
```python
|
|
from langchain.text_splitters import MarkdownHeaderTextSplitter
|
|
|
|
headers_to_split_on = [
|
|
("#", "Header 1"),
|
|
("##", "Header 2"),
|
|
("###", "Header 3"),
|
|
]
|
|
|
|
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
|
|
chunks = splitter.split_documents(documents)
|
|
```
|
|
|
|
### 5. HTML Splitter
|
|
**Method**: Split based on HTML tags
|
|
**Use Case**: Web pages and HTML documents
|
|
**Advantages**: Preserves HTML structure and metadata
|
|
|
|
```python
|
|
from langchain.text_splitters import HTMLHeaderTextSplitter
|
|
|
|
headers_to_split_on = [
|
|
("h1", "Header 1"),
|
|
("h2", "Header 2"),
|
|
("h3", "Header 3"),
|
|
]
|
|
|
|
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
|
|
chunks = splitter.split_documents(documents)
|
|
```
|
|
|
|
## Parameter Tuning
|
|
|
|
### Chunk Size
|
|
- **Small chunks (200-400 tokens)**: More precise retrieval, but may lose context
|
|
- **Medium chunks (500-1000 tokens)**: Good balance of precision and context
|
|
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval
|
|
|
|
### Chunk Overlap
|
|
- **Purpose**: Preserve context at chunk boundaries
|
|
- **Typical range**: 10-20% of chunk size
|
|
- **Higher overlap**: Better context preservation, but more redundancy
|
|
- **Lower overlap**: Less redundancy, but may lose important context
|
|
|
|
### Separators
|
|
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then smaller (sentences)
|
|
- **Custom separators**: Add domain-specific separators for better results
|
|
- **Language-specific**: Adjust for different languages and writing styles
|
|
|
|
## Best Practices
|
|
|
|
1. **Preserve Context**: Ensure chunks contain enough surrounding context
|
|
2. **Maintain Coherence**: Keep semantically related content together
|
|
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
|
|
4. **Consider Query Types**: Adapt chunking strategy to typical user queries
|
|
5. **Test and Iterate**: Evaluate different chunking strategies for your specific use case
|
|
|
|
## Evaluation Metrics
|
|
|
|
1. **Retrieval Quality**: How well chunks answer user queries
|
|
2. **Context Preservation**: Whether important context is maintained
|
|
3. **Chunk Distribution**: Evenness of chunk sizes
|
|
4. **Boundary Quality**: How natural chunk boundaries are
|
|
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
|
|
|
|
## Advanced Techniques
|
|
|
|
### Adaptive Chunking
|
|
Adjust chunk size based on document structure and content density.
|
|
|
|
### Hierarchical Chunking
|
|
Create multiple levels of chunks for different retrieval scenarios.
|
|
|
|
### Query-Aware Chunking
|
|
Optimize chunk boundaries based on typical query patterns.
|
|
|
|
### Domain-Specific Splitting
|
|
Use specialized splitters for specific document types (legal, medical, technical). |