Files
gh-giuseppe-trisciuoglio-de…/skills/ai/rag/references/document-chunking.md
2025-11-29 18:28:30 +08:00

137 lines
4.4 KiB
Markdown

# Document Chunking Strategies
## Overview
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
## Chunking Strategies
### 1. Recursive Character Text Splitter
**Method**: Split text based on character count, trying separators in order
**Use Case**: General purpose text splitting
**Advantages**: Preserves sentence and paragraph boundaries when possible
```python
from langchain.text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""] # Try these in order
)
chunks = splitter.split_documents(documents)
```
### 2. Token-Based Splitting
**Method**: Split based on token count rather than characters
**Use Case**: When working with token limits of language models
**Advantages**: Better control over context window usage
```python
from langchain.text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(
chunk_size=512,
chunk_overlap=50
)
chunks = splitter.split_documents(documents)
```
### 3. Semantic Chunking
**Method**: Split based on semantic similarity
**Use Case**: When maintaining semantic coherence is important
**Advantages**: Chunks are more semantically meaningful
```python
from langchain.text_splitters import SemanticChunker
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile"
)
chunks = splitter.split_documents(documents)
```
### 4. Markdown Header Splitter
**Method**: Split based on markdown headers
**Use Case**: Structured documents with clear hierarchical organization
**Advantages**: Maintains document structure and context
```python
from langchain.text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_documents(documents)
```
### 5. HTML Splitter
**Method**: Split based on HTML tags
**Use Case**: Web pages and HTML documents
**Advantages**: Preserves HTML structure and metadata
```python
from langchain.text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_documents(documents)
```
## Parameter Tuning
### Chunk Size
- **Small chunks (200-400 tokens)**: More precise retrieval, but may lose context
- **Medium chunks (500-1000 tokens)**: Good balance of precision and context
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval
### Chunk Overlap
- **Purpose**: Preserve context at chunk boundaries
- **Typical range**: 10-20% of chunk size
- **Higher overlap**: Better context preservation, but more redundancy
- **Lower overlap**: Less redundancy, but may lose important context
### Separators
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then smaller (sentences)
- **Custom separators**: Add domain-specific separators for better results
- **Language-specific**: Adjust for different languages and writing styles
## Best Practices
1. **Preserve Context**: Ensure chunks contain enough surrounding context
2. **Maintain Coherence**: Keep semantically related content together
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
4. **Consider Query Types**: Adapt chunking strategy to typical user queries
5. **Test and Iterate**: Evaluate different chunking strategies for your specific use case
## Evaluation Metrics
1. **Retrieval Quality**: How well chunks answer user queries
2. **Context Preservation**: Whether important context is maintained
3. **Chunk Distribution**: Evenness of chunk sizes
4. **Boundary Quality**: How natural chunk boundaries are
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
## Advanced Techniques
### Adaptive Chunking
Adjust chunk size based on document structure and content density.
### Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios.
### Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns.
### Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).