# Document Chunking Strategies

## Overview

Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.

## Chunking Strategies

### 1. Recursive Character Text Splitter

**Method**: Split text by character count, trying a list of separators in order

**Use Case**: General-purpose text splitting

**Advantages**: Preserves paragraph and sentence boundaries when possible

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # try these in order
)
chunks = splitter.split_documents(documents)  # documents: previously loaded Document objects
```

### 2. Token-Based Splitting

**Method**: Split based on token count rather than character count

**Use Case**: Working within the token limits of language models

**Advantages**: Better control over context window usage

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
```

### 3. Semantic Chunking

**Method**: Split at points where the semantic similarity between adjacent sentences drops

**Use Case**: When maintaining semantic coherence is important

**Advantages**: Chunk boundaries follow meaning rather than a fixed length

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # break where similarity drops past a percentile threshold
)
chunks = splitter.split_documents(documents)
```

### 4. Markdown Header Splitter

**Method**: Split based on markdown headers

**Use Case**: Structured documents with a clear heading hierarchy

**Advantages**: Maintains document structure; header values are attached to each chunk as metadata

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)  # takes a markdown string, not Document objects
```

### 5. HTML Splitter

**Method**: Split based on HTML header tags

**Use Case**: Web pages and HTML documents

**Advantages**: Preserves HTML structure; header text is kept as chunk metadata

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_text)  # takes an HTML string, not Document objects
```

## Parameter Tuning

### Chunk Size
- **Small chunks (200-400 tokens)**: More precise retrieval, but each chunk may lack context
- **Medium chunks (500-1000 tokens)**: A good balance of precision and context
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval

### Chunk Overlap
- **Purpose**: Preserve context across chunk boundaries
- **Typical range**: 10-20% of chunk size
- **Higher overlap**: Better context preservation, but more redundancy in the index
- **Lower overlap**: Less redundancy, but important context may be lost at boundaries
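
One way to keep overlap in that range is to derive it from the chunk size rather than hard-coding both numbers. A minimal sketch; the 15% ratio and the helper name are illustrative choices, not library defaults:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def make_splitter(chunk_size: int, overlap_ratio: float = 0.15) -> RecursiveCharacterTextSplitter:
    """Build a splitter whose overlap is a fixed fraction of the chunk size."""
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_ratio),
    )

splitter = make_splitter(1000)  # chunk_overlap = 150, i.e. 15%
```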

### Separators
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then fall back to smaller ones (sentences, words)
- **Custom separators**: Add domain-specific separators for better results, as in the sketch below
- **Language-specific**: Adjust separators for different languages and writing styles
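
For example, a document with recurring structural markers can be split on those first. A hedged sketch; the `"\nSection "` marker is a hypothetical convention for a legal-style corpus, not a standard:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try hypothetical section boundaries first, then paragraphs,
# newlines, sentences, words, and finally individual characters.
legal_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\nSection ", "\n\n", "\n", ". ", " ", ""],
)
chunks = legal_splitter.split_documents(documents)
```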

## Best Practices

1. **Preserve Context**: Ensure chunks contain enough surrounding context to stand alone
2. **Maintain Coherence**: Keep semantically related content together
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
4. **Consider Query Types**: Adapt the chunking strategy to typical user queries
5. **Test and Iterate**: Evaluate different chunking strategies on your specific use case

## Evaluation Metrics

1. **Retrieval Quality**: How well retrieved chunks answer user queries
2. **Context Preservation**: Whether important context survives splitting
3. **Chunk Distribution**: How evenly chunk sizes are distributed (see the sketch below)
4. **Boundary Quality**: How natural chunk boundaries are
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
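
Chunk distribution is the easiest of these to check mechanically. A minimal sketch using only the standard library; it assumes `chunks` holds LangChain `Document` objects:

```python
from statistics import mean, stdev

def chunk_size_stats(chunks) -> dict:
    """Summarize the character-length distribution of a list of chunks."""
    sizes = [len(c.page_content) for c in chunks]  # Document objects assumed
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": round(mean(sizes), 1),
        "stdev": round(stdev(sizes), 1) if len(sizes) > 1 else 0.0,
    }

print(chunk_size_stats(chunks))
```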

## Advanced Techniques

### Adaptive Chunking
Adjust chunk size based on document structure and content density: dense, list-like material benefits from smaller chunks, while flowing narrative tolerates larger ones.
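
A minimal sketch of one possible heuristic; the paragraph-length density measure and both thresholds are illustrative assumptions, not an established algorithm:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def adaptive_splitter(text: str) -> RecursiveCharacterTextSplitter:
    """Choose a chunk size from a crude paragraph-density heuristic."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    avg_len = sum(len(p) for p in paragraphs) / max(len(paragraphs), 1)
    # Short paragraphs suggest dense, list-like content -> smaller chunks.
    chunk_size = 500 if avg_len < 300 else 1200
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.15),
    )

chunks = adaptive_splitter(text).split_text(text)
```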

### Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios: match queries against small, precise child chunks, then return the larger parent chunk for context.
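
A minimal parent/child sketch; the two sizes and the record layout are illustrative assumptions (LangChain's `ParentDocumentRetriever` automates this pattern):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

hierarchy = []
for parent_id, parent in enumerate(parent_splitter.split_text(text)):
    for child in child_splitter.split_text(parent):
        # Index the small child chunk for precise matching; keep a pointer
        # back to its parent so retrieval can return the fuller context.
        hierarchy.append({"parent_id": parent_id, "parent": parent, "child": child})
```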

### Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns, so the information a query needs tends to land in a single chunk.
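
One crude way to act on this is to score candidate chunkings against a sample of representative queries. A hedged sketch; the term-containment score is an illustrative proxy, not a standard metric, and `text` and `queries` are assumed inputs:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def boundary_score(chunks: list[str], queries: list[str]) -> float:
    """Fraction of queries whose terms all land inside a single chunk."""
    hits = 0
    for q in queries:
        terms = q.lower().split()
        if any(all(t in c.lower() for t in terms) for c in chunks):
            hits += 1
    return hits / len(queries)

# Compare candidate chunk sizes against representative queries.
for size in (400, 800, 1600):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 8)
    print(size, boundary_score(splitter.split_text(text), queries))
```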

### Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).
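
Legal and medical splitters are typically built from custom separators as shown in the Separators section; for technical content such as source code, LangChain ships language-aware presets. A minimal sketch, with `python_source` as an assumed input string:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-aware preset: prefers splitting at class/function boundaries.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100,
)
code_chunks = python_splitter.split_text(python_source)
```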