Document Chunking Strategies

Overview

Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.

Chunking Strategies

1. Recursive Character Text Splitter

  • Method: Split text based on character count, trying separators in order
  • Use Case: General-purpose text splitting
  • Advantages: Preserves sentence and paragraph boundaries when possible

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
chunks = splitter.split_documents(documents)

2. Token-Based Splitting

  • Method: Split based on token count rather than characters
  • Use Case: When working with the token limits of language models
  • Advantages: Better control over context window usage

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

3. Semantic Chunking

  • Method: Split based on semantic similarity
  • Use Case: When maintaining semantic coherence is important
  • Advantages: Chunks are more semantically meaningful

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = splitter.split_documents(documents)

4. Markdown Header Splitter

  • Method: Split based on Markdown headers
  • Use Case: Structured documents with clear hierarchical organization
  • Advantages: Maintains document structure and context

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)  # split_text takes the raw Markdown string, not Document objects

5. HTML Splitter

  • Method: Split based on HTML tags
  • Use Case: Web pages and HTML documents
  • Advantages: Preserves HTML structure and metadata

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_string)  # split_text takes an HTML string; split_text_from_url and split_text_from_file also exist

Parameter Tuning

Chunk Size

  • Small chunks (200-400 tokens): More precise retrieval, but may lose context
  • Medium chunks (500-1000 tokens): Good balance of precision and context
  • Large chunks (1000-2000 tokens): More context, but less precise retrieval
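
As a rough point of comparison, the same documents can be split at two of these sizes and the resulting chunk counts inspected. This is a minimal sketch reusing the TokenTextSplitter from above; the exact values (300/30 and 800/100) are illustrative, not recommendations.

from langchain_text_splitters import TokenTextSplitter

# Hypothetical "small" and "medium" configurations (sizes in tokens).
small_splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=30)
medium_splitter = TokenTextSplitter(chunk_size=800, chunk_overlap=100)

small_chunks = small_splitter.split_documents(documents)
medium_chunks = medium_splitter.split_documents(documents)
print(len(small_chunks), len(medium_chunks))  # smaller chunks produce more pieces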

Chunk Overlap

  • Purpose: Preserve context at chunk boundaries
  • Typical range: 10-20% of chunk size
  • Higher overlap: Better context preservation, but more redundancy
  • Lower overlap: Less redundancy, but may lose important context
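
When overlap is expressed as a fraction of chunk size, it can be derived instead of hard-coded. A minimal sketch, assuming 15% as one point inside the typical 10-20% range:

from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 1000
overlap_ratio = 0.15  # assumed value within the 10-20% guideline

splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=int(chunk_size * overlap_ratio),  # 150 characters shared across boundaries
)
chunks = splitter.split_documents(documents)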

Separators

  • Hierarchical separators: Start with larger boundaries (paragraphs), then smaller (sentences)
  • Custom separators: Add domain-specific separators for better results
  • Language-specific: Adjust for different languages and writing styles
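
A sketch of a hierarchical separator list that also falls back to sentence boundaries; the added ". " separator is the only change from the earlier example:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # Largest boundaries first: paragraphs, then lines, sentences, words.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)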

Best Practices

  1. Preserve Context: Ensure chunks contain enough surrounding context
  2. Maintain Coherence: Keep semantically related content together
  3. Respect Boundaries: Avoid breaking sentences or important phrases
  4. Consider Query Types: Adapt chunking strategy to typical user queries
  5. Test and Iterate: Evaluate different chunking strategies for your specific use case (a small comparison sketch follows this list)
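
Practice 5 can be as small as a sweep over a few configurations, scored with whatever retrieval evaluation you already run; evaluate_retrieval below is a placeholder, not a real function.

from langchain_text_splitters import TokenTextSplitter

# Hypothetical sweep over chunk-size/overlap pairs.
for chunk_size, overlap in [(256, 32), (512, 64), (1024, 128)]:
    splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    chunks = splitter.split_documents(documents)
    score = evaluate_retrieval(chunks)  # placeholder: plug in your own metric
    print(f"size={chunk_size} overlap={overlap} chunks={len(chunks)} score={score}")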

Evaluation Metrics

  1. Retrieval Quality: How well chunks answer user queries
  2. Context Preservation: Whether important context is maintained
  3. Chunk Distribution: Evenness of chunk sizes (a measurement sketch follows this list)
  4. Boundary Quality: How natural chunk boundaries are
  5. Retrieval Efficiency: Impact on retrieval speed and accuracy
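
Chunk distribution (metric 3) can be measured directly on the output of any splitter. A minimal sketch over character lengths; swap in a tokenizer if you tune by tokens:

import statistics

lengths = [len(chunk.page_content) for chunk in chunks]
print("chunks:", len(lengths))
print("mean length:", statistics.mean(lengths))
print("stdev:", statistics.pstdev(lengths))   # lower stdev means more even chunk sizes
print("min/max:", min(lengths), max(lengths))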

Advanced Techniques

Adaptive Chunking

Adjust chunk size based on document structure and content density.
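
One possible reading, as a hedged sketch: choose the chunk size per document from a crude density signal (average paragraph length); both thresholds below are arbitrary.

from langchain_text_splitters import RecursiveCharacterTextSplitter

def adaptive_split(doc):
    # Crude density signal: average paragraph length in characters.
    paragraphs = [p for p in doc.page_content.split("\n\n") if p.strip()]
    avg_len = sum(len(p) for p in paragraphs) / max(len(paragraphs), 1)
    # Dense prose gets larger chunks; sparse or list-like text gets smaller ones.
    chunk_size = 1500 if avg_len > 800 else 700
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 10,
    )
    return splitter.split_documents([doc])

chunks = [chunk for doc in documents for chunk in adaptive_split(doc)]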

Hierarchical Chunking

Create multiple levels of chunks for different retrieval scenarios.
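
A minimal two-level sketch: large parent chunks keep context, small child chunks are what you embed and retrieve, and each child records its parent in metadata. LangChain's ParentDocumentRetriever packages a similar pattern.

from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

parent_chunks = parent_splitter.split_documents(documents)
child_chunks = []
for parent_id, parent in enumerate(parent_chunks):
    for child in child_splitter.split_documents([parent]):
        child.metadata["parent_id"] = parent_id  # link child back to its parent chunk
        child_chunks.append(child)
# Retrieve over child_chunks, then expand to the matching parent chunk for generation.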

Query-Aware Chunking

Optimize chunk boundaries based on typical query patterns.

Domain-Specific Splitting

Use specialized splitters for specific document types (legal, medical, technical).
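
For source code, LangChain's splitters ship language-aware presets; for domains like legal text, custom separators are a reasonable stand-in (the section-marker separator below is an assumption, not a built-in).

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Built-in preset: split Python source along function/class boundaries.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=80,
)

# Hypothetical legal-document splitter treating "§" section markers as boundaries.
legal_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    separators=["\n§", "\n\n", "\n", ". ", " ", ""],
)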