Recommended Libraries and Frameworks

This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.

Core Chunking Libraries

LangChain

Overview: Comprehensive framework for building applications with large language models, includes robust text splitting utilities.

Installation:

pip install langchain langchain-text-splitters

Key Features:

  • Multiple text splitting strategies
  • Integration with various document loaders
  • Support for different content types (code, markdown, etc.)
  • Customizable separators and parameters

Example Usage:

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter
)

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(large_text)

# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
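
The imported TokenTextSplitter sizes chunks by tokens rather than characters, which keeps chunks aligned with model context limits. A minimal sketch, assuming the same large_text input as above (tiktoken must be installed):

# Token-based splitting (sizes measured in tokens, not characters)
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50
)

token_chunks = token_splitter.split_text(large_text)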

Pros:

  • Well-maintained and actively developed
  • Extensive documentation and examples
  • Integrates well with other LangChain components
  • Supports multiple document types

Cons:

  • Can be heavy dependency for simple use cases
  • Some advanced features require LangChain ecosystem

LlamaIndex

Overview: Data framework for LLM applications with advanced indexing and retrieval capabilities.

Installation:

pip install llama-index

Key Features:

  • Advanced semantic chunking
  • Hierarchical indexing
  • Context-aware retrieval
  • Integration with vector databases

Example Usage:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Basic sentence splitting
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()  # requires an OpenAI API key
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
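
Each returned node carries its text and metadata, which can be passed straight to an index or vector store. A small sketch for inspecting the nodes produced above:

# Inspect the resulting nodes before indexing
for node in nodes:
    print(node.node_id)
    print(node.metadata)             # e.g. file name and path added by the reader
    print(node.get_content()[:200])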

Pros:

  • Excellent semantic chunking capabilities
  • Built for production RAG systems
  • Strong vector database integration
  • Active community support

Cons:

  • More complex setup for basic use cases
  • Semantic chunking requires embedding model setup

Unstructured

Overview: Open-source library for processing unstructured documents, especially strong with multi-modal content.

Installation:

pip install "unstructured[pdf,png,jpg]"

Key Features:

  • Multi-modal document processing
  • Support for PDFs, images, and various formats
  • Structure preservation
  • Table extraction and processing

Example Usage:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition document by type
elements = partition(filename="document.pdf")

# Chunk by title/heading structure
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=2000,
    max_characters=10000,
    new_after_n_chars=1500,
    multipage_sections=True
)

# Access chunked content
for chunk in chunks:
    print(f"Category: {chunk.category}")
    print(f"Content: {chunk.text[:200]}...")

Pros:

  • Excellent for PDF and image processing
  • Preserves document structure
  • Handles tables and figures well
  • Strong multi-modal capabilities

Cons:

  • Can be slower for large documents
  • Requires additional dependencies for some formats

Text Processing Libraries

NLTK (Natural Language Toolkit)

Installation:

pip install nltk

Key Features:

  • Sentence tokenization
  • Language detection
  • Text preprocessing
  • Linguistic analysis

Example Usage:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download required tokenizer and stop word data
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
nltk.download('stopwords')

# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
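
sent_tokenize is also a convenient building block for sentence-aware chunking: pack whole sentences into a chunk until a size budget is reached, so no sentence is cut mid-way. A minimal sketch (the character budget is an illustrative choice):

def chunk_by_sentences(text, max_chars=1000):
    sentences = sent_tokenize(text)
    chunks, current = [], ""

    for sentence in sentences:
        # Start a new chunk once adding the sentence would exceed the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + sentence

    if current.strip():
        chunks.append(current.strip())
    return chunks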

spaCy

Installation:

pip install spacy
python -m spacy download en_core_web_sm

Key Features:

  • Industrial-strength NLP
  • Named entity recognition
  • Dependency parsing
  • Sentence boundary detection

Example Usage:

import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("This is a sample sentence. This is another sentence.")

# Extract sentences
sentences = [sent.text for sent in doc.sents]

# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Dependency parsing for better chunking
for token in doc:
    print(f"{token.text}: {token.dep_} (head: {token.head.text})")

Sentence Transformers

Installation:

pip install sentence-transformers

Key Features:

  • Pre-trained sentence embeddings
  • Semantic similarity calculation
  • Multi-lingual support
  • Custom model training

Example Usage:

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)

# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])

# Find semantic boundaries for chunking
# (naive split on '.'; swap in sent_tokenize or spaCy for production use)
def find_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    embeddings = model.encode(sentences)

    boundaries = [0]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i-1], embeddings[i])
        # Start a new chunk where adjacent sentences drift apart semantically
        if similarity < threshold:
            boundaries.append(i)

    return boundaries
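
The boundary indices can then be turned into chunks by joining the sentences between consecutive boundaries, continuing the sketch above:

def chunk_by_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    boundaries = find_semantic_boundaries(text, model, threshold)

    chunks = []
    for start, end in zip(boundaries, boundaries[1:] + [len(sentences)]):
        chunks.append(". ".join(sentences[start:end]) + ".")
    return chunks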

Vector Databases

ChromaDB

Installation:

pip install chromadb

Key Features:

  • In-memory and persistent storage
  • Built-in embedding functions
  • Similarity search
  • Metadata filtering

Example Usage:

import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add chunks
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    metadatas=[chunk.get("metadata", {}) for chunk in chunks],
    ids=[chunk["id"] for chunk in chunks]
)

# Search
results = collection.query(
    query_texts=["What is chunking?"],
    n_results=5
)
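
The metadata filtering listed above is applied through the where parameter at query time, for example to restrict results to a single source document. A small sketch assuming chunks were stored with a "source" metadata field:

# Restrict the search to chunks from a specific source
filtered = collection.query(
    query_texts=["What is chunking?"],
    n_results=5,
    where={"source": "handbook.pdf"}
)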

Pinecone

Installation:

pip install pinecone

Key Features:

  • Managed vector database service
  • High-performance similarity search
  • Metadata filtering
  • Scalable infrastructure

Example Usage:

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

# Initialize client
pc = Pinecone(api_key="your-api-key")
index_name = "document-chunks"

# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Must match the embedding model's output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")  # pick a cloud/region available to your plan
    )

index = pc.Index(index_name)

# Generate embeddings and upsert chunks in one batch
model = SentenceTransformer('all-MiniLM-L6-v2')
index.upsert(
    vectors=[
        {
            "id": chunk["id"],
            "values": model.encode(chunk["content"]).tolist(),
            "metadata": chunk.get("metadata", {})
        }
        for chunk in chunks
    ]
)

# Search
query_embedding = model.encode("search query")
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)
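
Metadata filters narrow the search at query time using Pinecone's filter syntax. The sketch below assumes a "source" field was stored in the chunk metadata:

# Filtered search on chunk metadata
filtered_results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "handbook.pdf"}}
)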

Weaviate

Installation:

pip install "weaviate-client<4"  # the example below uses the v3 client API

Key Features:

  • GraphQL API
  • Hybrid search (dense + sparse)
  • Real-time updates
  • Schema validation

Example Usage:

import weaviate

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
client.schema.create_class({
    "class": "DocumentChunk",
    "description": "A chunk of document content",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
})

# Add data
for chunk in chunks:
    client.data_object.create(
        data_object={
            "content": chunk["content"],
            "source": chunk.get("source", "unknown")
        },
        class_name="DocumentChunk"
    )

# Search
results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_near_text({
    "concepts": ["search query"]
}).with_limit(5).do()
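
The hybrid search listed under key features combines dense vectors with BM25 keyword scoring; in the v3 client it is exposed through with_hybrid, where alpha balances the two signals (1.0 = pure vector, 0.0 = pure keyword). A brief sketch:

# Hybrid (dense + keyword) search
hybrid_results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_hybrid(
    query="search query",
    alpha=0.5
).with_limit(5).do()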

Evaluation and Testing

RAGAS

Installation:

pip install ragas

Key Features:

  • RAG evaluation metrics
  • Answer quality assessment
  • Context relevance measurement
  • Faithfulness evaluation

Example Usage:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
dataset = Dataset.from_dict({
    "question": ["What is chunking?"],
    "answer": ["Chunking is the process of breaking large documents into smaller segments"],
    "contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
    "ground_truth": ["Chunking is a document processing technique"]
})

# Evaluate
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)
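
In practice the evaluation rows come from running your own pipeline: for each test question, record the retrieved chunk texts as contexts and the generated answer. A small sketch with placeholder retrieve_chunks and generate_answer functions (note that the RAGAS metrics themselves call an LLM and embedding model, so the corresponding API credentials must be configured):

# Build evaluation rows from your own retrieval + generation pipeline
# (retrieve_chunks and generate_answer are placeholders for your code)
questions = ["What is chunking?"]
ground_truths = ["Chunking is a document processing technique"]

rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for question, truth in zip(questions, ground_truths):
    retrieved = retrieve_chunks(question)          # list of retrieved chunk texts
    answer = generate_answer(question, retrieved)  # generated response
    rows["question"].append(question)
    rows["answer"].append(answer)
    rows["contexts"].append(retrieved)
    rows["ground_truth"].append(truth)

eval_dataset = Dataset.from_dict(rows)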

TruLens (TruEra)

Installation:

pip install trulens

Key Features:

  • LLM application evaluation
  • Feedback functions
  • Hallucination detection
  • Performance monitoring

Example Usage:

from trulens.core import TruSession
from trulens.apps.custom import TruCustomApp, instrument
from trulens.feedback import GroundTruthAgreement

# Initialize session (stores records and feedback results)
session = TruSession()

# Ground-truth feedback; ground_truth is your golden set of
# question/expected-answer pairs
ground_truth_agreement = GroundTruthAgreement(ground_truth)

# Instrument the chunking pipeline
# (chunk_function, search_function and generate_function are placeholders
# for your own chunking, retrieval and generation logic)
class ChunkingPipeline:
    @instrument
    def chunk_and_query(self, text, query):
        chunks = chunk_function(text)
        relevant_chunks = search_function(chunks, query)
        answer = generate_function(relevant_chunks, query)
        return answer

# Wrap the pipeline so calls are recorded for evaluation
pipeline = ChunkingPipeline()
tru_app = TruCustomApp(pipeline, app_name="chunking_pipeline")

with tru_app as recording:
    pipeline.chunk_and_query("large document text", "what is the main topic?")

Document Processing

PyPDF2

Note: PyPDF2 is now maintained under the name pypdf; the PdfReader API shown below is the same in both packages.

Installation:

pip install PyPDF2

Key Features:

  • PDF text extraction
  • Page manipulation
  • Metadata extraction
  • Form field processing

Example Usage:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Extract text by page for better chunking
def extract_pages(pdf_path):
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            pages.append({
                "page_number": i + 1,
                "content": page.extract_text()
            })
    return pages

python-docx

Installation:

pip install python-docx

Key Features:

  • Microsoft Word document processing
  • Paragraph and table extraction
  • Style preservation
  • Metadata access

Example Usage:

from docx import Document

def extract_from_docx(docx_path):
    doc = Document(docx_path)
    content = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            content.append({
                "type": "paragraph",
                "text": paragraph.text,
                "style": paragraph.style.name
            })

    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_text = [cell.text for cell in row.cells]
            table_text.append(" | ".join(row_text))

        content.append({
            "type": "table",
            "text": "\n".join(table_text)
        })

    return content
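
Because paragraph styles are preserved, the extracted content can be grouped into section-level chunks by starting a new section whenever a heading style appears. A minimal sketch assuming the default "Heading N" style names:

def chunk_docx_by_heading(docx_path):
    content = extract_from_docx(docx_path)
    sections, current = [], {"heading": None, "text": []}

    for item in content:
        # Start a new section at every "Heading ..." styled paragraph
        if item["type"] == "paragraph" and item["style"].startswith("Heading"):
            if current["text"] or current["heading"]:
                sections.append(current)
            current = {"heading": item["text"], "text": []}
        else:
            current["text"].append(item["text"])

    sections.append(current)
    return sections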

Specialized Libraries

tiktoken (OpenAI)

Installation:

pip install tiktoken

Key Features:

  • Accurate token counting for OpenAI models
  • Fast encoding/decoding
  • Multiple model support
  • Language model specific tokenization

Example Usage:

import tiktoken

# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")

# Decode tokens
text = encoding.decode(tokens)

# Count tokens without full encoding
def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    return chunks
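
The same token-level approach extends to overlapping chunks, which retrieval pipelines usually want; the sketch below slides the window forward by max_tokens minus overlap:

def chunk_by_tokens_with_overlap(text, max_tokens=1000, overlap=100,
                                 model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
        if i + max_tokens >= len(tokens):
            break

    return chunks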

PDFMiner

Installation:

pip install pdfminer.six

Key Features:

  • Detailed PDF analysis
  • Layout preservation
  • Font and style information
  • High-precision text extraction

Example Usage:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def extract_structured_text(pdf_path):
    structured_content = []

    for page_layout in extract_pages(pdf_path):
        page_content = []

        for element in page_layout:
            if isinstance(element, LTTextContainer):
                text = element.get_text()

                # Font details live on the individual characters (LTChar),
                # not on the text container itself
                font_sizes = []
                font_names = []
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            font_sizes.append(char.size)
                            font_names.append(char.fontname)

                font_info = {
                    "font_size": max(font_sizes) if font_sizes else None,
                    "is_bold": any("Bold" in name for name in font_names),
                    "x0": element.x0,
                    "y0": element.y0
                }
                page_content.append({
                    "text": text.strip(),
                    "font_info": font_info
                })

        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })

    return structured_content

Performance and Optimization

Dask

Installation:

pip install dask[complete]

Key Features:

  • Parallel processing
  • Out-of-core computation
  • Distributed computing
  • Integration with pandas

Example Usage:

import dask.bag as db
from dask.distributed import Client

# Setup distributed client
client = Client(n_workers=4)

# Parallel chunking of multiple documents
def chunk_document(document):
    # Your chunking logic here
    return chunk_function(document)

# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...]  # List of document contents
document_bag = db.from_sequence(documents)

# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)

# Compute results
results = chunked_documents.compute()

Ray

Installation:

pip install ray

Key Features:

  • Distributed computing
  • Actor model
  • Autoscaling
  • ML pipeline integration

Example Usage:

import ray

# Initialize Ray
ray.init()

@ray.remote
class ChunkingWorker:
    def __init__(self, strategy):
        self.strategy = strategy

    def chunk_documents(self, documents):
        results = []
        for doc in documents:
            chunks = self.strategy.chunk(doc)
            results.append(chunks)
        return results

# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]

# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
           for worker, batch in zip(workers, documents_batch)]

# Get results
results = ray.get(futures)

Development and Testing

pytest

Installation:

pip install pytest pytest-asyncio

Example Tests:

import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker

class TestFixedSizeChunker:
    def test_chunk_size_respect(self):
        # Assumes FixedSizeChunker measures chunk_size/chunk_overlap in words
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = " ".join(f"word{i}" for i in range(500))  # 500 distinct words

        chunks = chunker.chunk(text)

        for chunk in chunks:
            assert len(chunk.split()) <= 100

    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = " ".join(f"word{i}" for i in range(200))

        chunks = chunker.chunk(text)

        # The last words of each chunk should reappear at the start of the next
        for i in range(1, len(chunks)):
            chunk1_words = set(chunks[i - 1].split()[-10:])
            chunk2_words = set(chunks[i].split()[:10])
            overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 5  # Allow some tolerance at chunk boundaries

@pytest.mark.asyncio
async def test_semantic_chunker():
    chunker = SemanticChunker()
    text = "First topic sentence. Another sentence about first topic. " \
           "Now switching to second topic. More about second topic."

    chunks = await chunker.chunk_async(text)

    # Should detect topic change and create boundary
    assert len(chunks) >= 2
    assert "first topic" in chunks[0].lower()
    assert "second topic" in chunks[1].lower()

Memory Profiler

Installation:

pip install memory-profiler

Example Usage:

from memory_profiler import profile

@profile
def chunk_large_document():
    chunker = FixedSizeChunker(chunk_size=1000)
    large_text = "word " * 100000  # Large document

    chunks = chunker.chunk(large_text)
    return chunks

# Run with: python -m memory_profiler your_script.py

Together, these tools cover the full workflow for implementing, testing, and optimizing chunking strategies, from simple text processing to production-grade RAG systems.