# Recommended Libraries and Frameworks

This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.

## Core Chunking Libraries

### LangChain

**Overview**: Comprehensive framework for building applications with large language models; it includes robust text splitting utilities.

**Installation**:

```bash
pip install langchain langchain-text-splitters
```

**Key Features**:
- Multiple text splitting strategies
- Integration with various document loaders
- Support for different content types (code, markdown, etc.)
- Customizable separators and parameters

**Example Usage**:

```python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter
)

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(large_text)  # large_text: the document string to split

# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
```

**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types

**Cons**:
- Can be a heavy dependency for simple use cases
- Some advanced features require the broader LangChain ecosystem

### LlamaIndex

**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.

**Installation**:

```bash
pip install llama-index
```

**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases

**Example Usage**:

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Basic sentence splitting
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```

**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support

**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires embedding model setup
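The nodes produced by the semantic splitter can be indexed directly for retrieval. Below is a minimal sketch, assuming the `nodes` and `embed_model` from the example above and an OpenAI API key in the environment; the query text is illustrative:

```python
from llama_index.core import VectorStoreIndex

# Build an in-memory vector index over the semantic chunks (nodes)
index = VectorStoreIndex(nodes, embed_model=embed_model)

# Retrieve the chunks most relevant to a query
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What is the main topic of these documents?")

for node_with_score in results:
    print(node_with_score.score, node_with_score.node.get_content()[:100])
```

Passing the same `embed_model` at index time keeps query embeddings consistent with the ones used to embed the chunks.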
### Unstructured

**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.

**Installation**:

```bash
pip install "unstructured[pdf,image]"
```

**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing

**Example Usage**:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition document by type
elements = partition(filename="document.pdf")

# Chunk by title/heading structure
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=2000,
    max_characters=10000,
    new_after_n_chars=1500,
    multipage_sections=True
)

# Access chunked content
for chunk in chunks:
    print(f"Category: {chunk.category}")
    print(f"Content: {chunk.text[:200]}...")
```

**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities

**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats

## Text Processing Libraries

### NLTK (Natural Language Toolkit)

**Installation**:

```bash
pip install nltk
```

**Key Features**:
- Sentence tokenization
- Stop word filtering
- Text preprocessing
- Linguistic analysis

**Example Usage**:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download required data
nltk.download('punkt')
nltk.download('stopwords')

# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```

### spaCy

**Installation**:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection

**Example Usage**:

```python
import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("This is a sample sentence. This is another sentence.")

# Extract sentences
sentences = [sent.text for sent in doc.sents]

# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Dependency parsing for better chunking
for token in doc:
    print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```
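spaCy's sentence boundaries make a simple sentence-aware chunker straightforward: pack whole sentences into chunks up to a size limit. A minimal sketch; the `chunk_by_sentences` helper and `max_chars` parameter are illustrative, not part of spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def chunk_by_sentences(text, max_chars=1000):
    """Group spaCy sentences into chunks of roughly max_chars characters."""
    doc = nlp(text)
    chunks, current = [], ""
    for sent in doc.sents:
        # Start a new chunk if adding this sentence would exceed the limit
        # (a single over-long sentence is kept intact rather than split)
        if current and len(current) + len(sent.text) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

chunks = chunk_by_sentences("First sentence. Second sentence. " * 100, max_chars=500)
print(len(chunks), len(chunks[0]))
```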
### Sentence Transformers

**Installation**:

```bash
pip install sentence-transformers
```

**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training

**Example Usage**:

```python
from sentence_transformers import SentenceTransformer, util

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)

# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])

# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    embeddings = model.encode(sentences)

    boundaries = [0]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            boundaries.append(i)

    return boundaries
```

## Vector Databases and Search

### ChromaDB

**Installation**:

```bash
pip install chromadb
```

**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering

**Example Usage**:

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add chunks (chunks: list of dicts with "id", "content", and optional "metadata")
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    metadatas=[chunk.get("metadata", {}) for chunk in chunks],
    ids=[chunk["id"] for chunk in chunks]
)

# Search
results = collection.query(
    query_texts=["What is chunking?"],
    n_results=5
)
```

### Pinecone

**Installation**:

```bash
pip install pinecone-client
```

**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure

**Example Usage**:

```python
import pinecone
from sentence_transformers import SentenceTransformer

# Initialize (legacy pinecone-client v2 API)
pinecone.init(api_key="your-api-key", environment="your-environment")

index_name = "document-chunks"

# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,  # Match embedding model
        metric="cosine"
    )

index = pinecone.Index(index_name)

# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')

for chunk in chunks:
    embedding = model.encode(chunk["content"])
    index.upsert(
        vectors=[{
            "id": chunk["id"],
            "values": embedding.tolist(),
            "metadata": chunk.get("metadata", {})
        }]
    )

# Search
query_embedding = model.encode("search query")
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)
```

### Weaviate

**Installation**:

```bash
pip install weaviate-client
```

**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation

**Example Usage**:

```python
import weaviate

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
client.schema.create_class({
    "class": "DocumentChunk",
    "description": "A chunk of document content",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
})

# Add data
for chunk in chunks:
    client.data_object.create(
        data_object={
            "content": chunk["content"],
            "source": chunk.get("source", "unknown")
        },
        class_name="DocumentChunk"
    )

# Search
results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_near_text({
    "concepts": ["search query"]
}).with_limit(5).do()
```

## Evaluation and Testing

### RAGAS

**Installation**:

```bash
pip install ragas
```

**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation

**Example Usage**:

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
dataset = Dataset.from_dict({
    "question": ["What is chunking?"],
    "answer": ["Chunking is the process of breaking large documents into smaller segments"],
    "contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
    "ground_truth": ["Chunking is a document processing technique"]
})

# Evaluate
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_relevancy,
        context_recall
    ]
)

print(result)
```

### TruEra (TruLens)

**Installation**:

```bash
pip install trulens
```

**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring

**Example Usage**:

```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement

# Initialize session
session = TruSession()

# Define feedback functions
# ground_truth: user-supplied list of expected question/answer pairs
f_groundedness = GroundTruthAgreement(ground_truth)

# Evaluate chunks
@instrument
def chunk_and_query(text, query):
    # chunk_function, search_function, generate_function are your own pipeline pieces
    chunks = chunk_function(text)
    relevant_chunks = search_function(chunks, query)
    answer = generate_function(relevant_chunks, query)
    return answer

# Record evaluation
with session:
    chunk_and_query("large document text", "what is the main topic?")
```

## Document Processing

### PyPDF2

**Installation**:

```bash
pip install PyPDF2
```

**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing

**Example Usage**:

```python
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Extract text by page for better chunking
def extract_pages(pdf_path):
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            pages.append({
                "page_number": i + 1,
                "content": page.extract_text()
            })
    return pages
```

### python-docx

**Installation**:

```bash
pip install python-docx
```

**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access

**Example Usage**:

```python
from docx import Document

def extract_from_docx(docx_path):
    doc = Document(docx_path)
    content = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            content.append({
                "type": "paragraph",
                "text": paragraph.text,
                "style": paragraph.style.name
            })

    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_text = [cell.text for cell in row.cells]
            table_text.append(" | ".join(row_text))
        content.append({
            "type": "table",
            "text": "\n".join(table_text)
        })

    return content
```
## Specialized Libraries

### tiktoken (OpenAI)

**Installation**:

```bash
pip install tiktoken
```

**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Language-model-specific tokenization

**Example Usage**:

```python
import tiktoken

# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")

# Decode tokens
text = encoding.decode(tokens)

# Count tokens for a given model
def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    return chunks
```

### PDFMiner

**Installation**:

```bash
pip install pdfminer.six
```

**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction

**Example Usage**:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_structured_text(pdf_path):
    structured_content = []

    for page_layout in extract_pages(pdf_path):
        page_content = []

        for element in page_layout:
            if isinstance(element, LTTextContainer):
                text = element.get_text()

                # Font details live on individual characters (LTChar),
                # so sample the first character of the element
                font_name, font_size = None, None
                for text_line in element:
                    if not isinstance(text_line, LTTextLine):
                        continue
                    for char in text_line:
                        if isinstance(char, LTChar):
                            font_name, font_size = char.fontname, char.size
                            break
                    if font_name is not None:
                        break

                font_info = {
                    "font_size": font_size,
                    "is_bold": bool(font_name and "Bold" in font_name),
                    "x0": element.x0,
                    "y0": element.y0
                }
                page_content.append({
                    "text": text.strip(),
                    "font_info": font_info
                })

        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })

    return structured_content
```
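The font information captured above can drive structure-aware chunking: text set noticeably larger than a page's typical font size is usually a heading. A rough sketch using a median-based heuristic; `split_page_on_headings`, the 1.2 factor, and the file name are illustrative assumptions:

```python
from statistics import median

def split_page_on_headings(page):
    """Split one page's items into sections at elements whose font is larger than usual."""
    sizes = [item["font_info"]["font_size"] for item in page["content"]
             if item["font_info"]["font_size"]]
    if not sizes:
        return [page["content"]]

    cutoff = median(sizes) * 1.2  # heuristic threshold, tune per document family
    sections, current = [], []
    for item in page["content"]:
        size = item["font_info"]["font_size"]
        if size and size > cutoff and current:
            sections.append(current)
            current = []
        current.append(item)
    if current:
        sections.append(current)
    return sections

pages = extract_structured_text("document.pdf")  # extract_structured_text from the example above
sections = split_page_on_headings(pages[0])
print(f"{len(sections)} sections on page 1")
```

The threshold usually needs tuning, since body and heading sizes vary between document families.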
## Performance and Optimization

### Dask

**Installation**:

```bash
pip install "dask[complete]"
```

**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas

**Example Usage**:

```python
import dask.bag as db
from dask.distributed import Client

# Setup distributed client
client = Client(n_workers=4)

# Parallel chunking of multiple documents
def chunk_document(document):
    # Your chunking logic here
    return chunk_function(document)

# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...]  # List of document contents
document_bag = db.from_sequence(documents)

# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)

# Compute results
results = chunked_documents.compute()
```

### Ray

**Installation**:

```bash
pip install ray
```

**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration

**Example Usage**:

```python
import ray

# Initialize Ray
ray.init()

@ray.remote
class ChunkingWorker:
    def __init__(self, strategy):
        self.strategy = strategy

    def chunk_documents(self, documents):
        results = []
        for doc in documents:
            chunks = self.strategy.chunk(doc)
            results.append(chunks)
        return results

# Create workers (strategy: any object with a .chunk(document) method)
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]

# Distribute work (documents: list of document contents)
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
           for worker, batch in zip(workers, documents_batch)]

# Get results
results = ray.get(futures)
```

## Development and Testing

### pytest

**Installation**:

```bash
pip install pytest pytest-asyncio
```

**Example Tests**:

```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker

class TestFixedSizeChunker:
    def test_respects_chunk_size(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 50 words
        chunks = chunker.chunk(text)

        for chunk in chunks:
            assert len(chunk.split()) <= 100  # Account for word boundaries

    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = " ".join(f"word{i}" for i in range(30))  # distinct words so overlap is measurable
        chunks = chunker.chunk(text)

        # Check overlap between consecutive chunks
        for i in range(1, len(chunks)):
            chunk1_words = set(chunks[i-1].split()[-10:])
            chunk2_words = set(chunks[i].split()[:10])
            overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 5  # Allow some tolerance

@pytest.mark.asyncio
async def test_semantic_chunker():
    chunker = SemanticChunker()
    text = "First topic sentence. Another sentence about first topic. " \
           "Now switching to second topic. More about second topic."

    chunks = await chunker.chunk_async(text)

    # Should detect topic change and create boundary
    assert len(chunks) >= 2
    assert "first topic" in chunks[0].lower()
    assert "second topic" in chunks[1].lower()
```

### Memory Profiler

**Installation**:

```bash
pip install memory-profiler
```

**Example Usage**:

```python
from memory_profiler import profile
from your_chunking_module import FixedSizeChunker

@profile
def chunk_large_document():
    chunker = FixedSizeChunker(chunk_size=1000)
    large_text = "word " * 100000  # Large document
    chunks = chunker.chunk(large_text)
    return chunks

# Run with: python -m memory_profiler your_script.py
```

This toolset covers what is needed to implement, test, and optimize chunking strategies across use cases, from simple text processing to production-grade RAG systems.