# Document Chunking Strategies

## Overview

Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.

## Chunking Strategies

### 1. Recursive Character Text Splitter

**Method**: Split text by character count, trying a list of separators in order

**Use Case**: General-purpose text splitting

**Advantages**: Preserves paragraph and sentence boundaries when possible

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # try these in order
)
chunks = splitter.split_documents(documents)  # documents: previously loaded Document objects
```

### 2. Token-Based Splitting

**Method**: Split based on token count rather than character count

**Use Case**: Working within the token limits of language models

**Advantages**: Better control over context window usage

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
```

### 3. Semantic Chunking

**Method**: Split at points where the semantic similarity between adjacent sentences drops

**Use Case**: When maintaining semantic coherence is important

**Advantages**: Chunk boundaries follow meaning rather than a fixed length

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # break where similarity drops past a percentile threshold
)
chunks = splitter.split_documents(documents)
```

### 4. Markdown Header Splitter

**Method**: Split based on markdown headers

**Use Case**: Structured documents with a clear heading hierarchy

**Advantages**: Maintains document structure; header values are attached to each chunk as metadata

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)  # takes a markdown string, not Document objects
```

### 5. HTML Splitter

**Method**: Split based on HTML header tags

**Use Case**: Web pages and HTML documents

**Advantages**: Preserves HTML structure; header text is kept as chunk metadata

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_text)  # takes an HTML string, not Document objects
```

## Parameter Tuning

### Chunk Size
- **Small chunks (200-400 tokens)**: More precise retrieval, but each chunk may lack context
- **Medium chunks (500-1000 tokens)**: A good balance of precision and context
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval

### Chunk Overlap
- **Purpose**: Preserve context across chunk boundaries
- **Typical range**: 10-20% of chunk size
- **Higher overlap**: Better context preservation, but more redundancy in the index
- **Lower overlap**: Less redundancy, but important context may be lost at boundaries
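
One way to keep overlap in that range is to derive it from the chunk size rather than hard-coding both numbers. A minimal sketch; the 15% ratio and the helper name are illustrative choices, not library defaults:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def make_splitter(chunk_size: int, overlap_ratio: float = 0.15) -> RecursiveCharacterTextSplitter:
    """Build a splitter whose overlap is a fixed fraction of the chunk size."""
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_ratio),
    )

splitter = make_splitter(1000)  # chunk_overlap = 150, i.e. 15%
```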

### Separators
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then fall back to smaller ones (sentences, words)
- **Custom separators**: Add domain-specific separators for better results, as in the sketch below
- **Language-specific**: Adjust separators for different languages and writing styles
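
For example, a document with recurring structural markers can be split on those first. A hedged sketch; the `"\nSection "` marker is a hypothetical convention for a legal-style corpus, not a standard:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try hypothetical section boundaries first, then paragraphs,
# newlines, sentences, words, and finally individual characters.
legal_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\nSection ", "\n\n", "\n", ". ", " ", ""],
)
chunks = legal_splitter.split_documents(documents)
```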

## Best Practices

1. **Preserve Context**: Ensure chunks contain enough surrounding context to stand alone
2. **Maintain Coherence**: Keep semantically related content together
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
4. **Consider Query Types**: Adapt the chunking strategy to typical user queries
5. **Test and Iterate**: Evaluate different chunking strategies on your specific use case

## Evaluation Metrics

1. **Retrieval Quality**: How well retrieved chunks answer user queries
2. **Context Preservation**: Whether important context survives splitting
3. **Chunk Distribution**: How evenly chunk sizes are distributed (see the sketch below)
4. **Boundary Quality**: How natural chunk boundaries are
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
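
Chunk distribution is the easiest of these to check mechanically. A minimal sketch using only the standard library; it assumes `chunks` holds LangChain `Document` objects:

```python
from statistics import mean, stdev

def chunk_size_stats(chunks) -> dict:
    """Summarize the character-length distribution of a list of chunks."""
    sizes = [len(c.page_content) for c in chunks]  # Document objects assumed
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": round(mean(sizes), 1),
        "stdev": round(stdev(sizes), 1) if len(sizes) > 1 else 0.0,
    }

print(chunk_size_stats(chunks))
```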

## Advanced Techniques

### Adaptive Chunking
Adjust chunk size based on document structure and content density: dense, list-like material benefits from smaller chunks, while flowing narrative tolerates larger ones.
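
A minimal sketch of one possible heuristic; the paragraph-length density measure and both thresholds are illustrative assumptions, not an established algorithm:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def adaptive_splitter(text: str) -> RecursiveCharacterTextSplitter:
    """Choose a chunk size from a crude paragraph-density heuristic."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    avg_len = sum(len(p) for p in paragraphs) / max(len(paragraphs), 1)
    # Short paragraphs suggest dense, list-like content -> smaller chunks.
    chunk_size = 500 if avg_len < 300 else 1200
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.15),
    )

chunks = adaptive_splitter(text).split_text(text)
```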

### Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios: match queries against small, precise child chunks, then return the larger parent chunk for context.
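
A minimal parent/child sketch; the two sizes and the record layout are illustrative assumptions (LangChain's `ParentDocumentRetriever` automates this pattern):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

hierarchy = []
for parent_id, parent in enumerate(parent_splitter.split_text(text)):
    for child in child_splitter.split_text(parent):
        # Index the small child chunk for precise matching; keep a pointer
        # back to its parent so retrieval can return the fuller context.
        hierarchy.append({"parent_id": parent_id, "parent": parent, "child": child})
```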

### Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns, so the information a query needs tends to land in a single chunk.
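
One crude way to act on this is to score candidate chunkings against a sample of representative queries. A hedged sketch; the term-containment score is an illustrative proxy, not a standard metric, and `text` and `queries` are assumed inputs:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def boundary_score(chunks: list[str], queries: list[str]) -> float:
    """Fraction of queries whose terms all land inside a single chunk."""
    hits = 0
    for q in queries:
        terms = q.lower().split()
        if any(all(t in c.lower() for t in terms) for c in chunks):
            hits += 1
    return hits / len(queries)

# Compare candidate chunk sizes against representative queries.
for size in (400, 800, 1600):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 8)
    print(size, boundary_score(splitter.split_text(text), queries))
```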

### Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).
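
Legal and medical splitters are typically built from custom separators as shown in the Separators section; for technical content such as source code, LangChain ships language-aware presets. A minimal sketch, with `python_source` as an assumed input string:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-aware preset: prefers splitting at class/function boundaries.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100,
)
code_chunks = python_splitter.split_text(python_source)
```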