# Recommended Libraries and Frameworks

This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.

## Core Chunking Libraries

### LangChain

**Overview**: Comprehensive framework for building applications with large language models; it includes robust text-splitting utilities.

**Installation**:
```bash
pip install langchain langchain-text-splitters
```

**Key Features**:
- Multiple text-splitting strategies
- Integration with various document loaders
- Support for different content types (code, Markdown, etc.)
- Customizable separators and parameters

**Example Usage**:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownTextSplitter,
    PythonCodeTextSplitter
)

# Basic recursive splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(large_text)

# Markdown-specific splitting
markdown_splitter = MarkdownTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Code-specific splitting
code_splitter = PythonCodeTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
```
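
For token-budgeted chunks, the splitter can measure length in model tokens rather than characters. A minimal sketch, assuming the tiktoken package is installed and reusing the `large_text` placeholder from above; the 512/50 token sizes are arbitrary example values:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are now counted in tiktoken tokens,
# which keeps chunks aligned with model context limits.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50
)

token_chunks = token_splitter.split_text(large_text)
```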

**Pros**:
- Well-maintained and actively developed
- Extensive documentation and examples
- Integrates well with other LangChain components
- Supports multiple document types

**Cons**:
- Can be a heavy dependency for simple use cases
- Some advanced features require the broader LangChain ecosystem

### LlamaIndex

**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.

**Installation**:
```bash
pip install llama-index
```

**Key Features**:
- Advanced semantic chunking
- Hierarchical indexing
- Context-aware retrieval
- Integration with vector databases

**Example Usage**:

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser
)
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# Basic sentence splitting
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Semantic chunking with embeddings
embed_model = OpenAIEmbedding()
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)

# Load and process documents
documents = SimpleDirectoryReader("./data").load_data()
nodes = semantic_splitter.get_nodes_from_documents(documents)
```
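
Once nodes have been parsed they can be dropped straight into an index for retrieval. A minimal sketch, assuming the `nodes` variable from the example above and an OpenAI API key available in the environment:

```python
from llama_index.core import VectorStoreIndex

# Build an in-memory vector index over the parsed nodes and query it
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What does the corpus say about chunking?")
print(response)
```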

**Pros**:
- Excellent semantic chunking capabilities
- Built for production RAG systems
- Strong vector database integration
- Active community support

**Cons**:
- More complex setup for basic use cases
- Semantic chunking requires embedding model setup

### Unstructured

**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.

**Installation**:
```bash
pip install "unstructured[pdf,png,jpg]"
```

**Key Features**:
- Multi-modal document processing
- Support for PDFs, images, and various formats
- Structure preservation
- Table extraction and processing

**Example Usage**:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Partition document by type
elements = partition(filename="document.pdf")

# Chunk by title/heading structure
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=2000,
    max_characters=10000,
    new_after_n_chars=1500,
    multipage_sections=True
)

# Access chunked content
for chunk in chunks:
    print(f"Category: {chunk.category}")
    print(f"Content: {chunk.text[:200]}...")
```
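
The chunk elements can be flattened into plain dictionaries before they are embedded or stored. A minimal sketch over the `chunks` list from the example above; the metadata fields read here may be missing for some formats, hence the `getattr` guards:

```python
def elements_to_records(chunks):
    """Convert unstructured chunk elements into plain dicts for indexing."""
    records = []
    for i, chunk in enumerate(chunks):
        metadata = getattr(chunk, "metadata", None)
        records.append({
            "id": f"chunk-{i}",
            "content": chunk.text,
            "metadata": {
                "category": chunk.category,
                "page_number": getattr(metadata, "page_number", None),
                "filename": getattr(metadata, "filename", None),
            },
        })
    return records

records = elements_to_records(chunks)
```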

**Pros**:
- Excellent for PDF and image processing
- Preserves document structure
- Handles tables and figures well
- Strong multi-modal capabilities

**Cons**:
- Can be slower for large documents
- Requires additional dependencies for some formats

## Text Processing Libraries

### NLTK (Natural Language Toolkit)

**Installation**:
```bash
pip install nltk
```

**Key Features**:
- Sentence tokenization
- Language detection
- Text preprocessing
- Linguistic analysis

**Example Usage**:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download required data
nltk.download('punkt')
nltk.download('stopwords')

# Sentence and word tokenization
text = "This is a sample sentence. This is another sentence."
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
```
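
Sentence tokenization becomes a chunking strategy once sentences are packed into size-bounded chunks. A minimal sketch, assuming the punkt data above has been downloaded; the 1000-character budget is an arbitrary example value:

```python
from nltk.tokenize import sent_tokenize

def chunk_by_sentences(text, max_chars=1000):
    """Pack whole sentences into chunks no longer than max_chars characters."""
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```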

### spaCy

**Installation**:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```

**Key Features**:
- Industrial-strength NLP
- Named entity recognition
- Dependency parsing
- Sentence boundary detection

**Example Usage**:

```python
import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("This is a sample sentence. This is another sentence.")

# Extract sentences
sentences = [sent.text for sent in doc.sents]

# Named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Dependency parsing for better chunking
for token in doc:
    print(f"{token.text}: {token.dep_} (head: {token.head.text})")
```
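
The same sentence boundaries can drive a token-budgeted chunker, which is convenient when chunk limits are expressed in tokens rather than characters. A minimal sketch, assuming the `nlp` pipeline loaded above; the 200-token budget is an arbitrary example value:

```python
def chunk_with_spacy(text, nlp, max_tokens=200):
    """Group spaCy sentences into chunks of at most max_tokens tokens."""
    doc = nlp(text)
    chunks, current_sents, current_len = [], [], 0
    for sent in doc.sents:
        sent_len = len(sent)  # number of spaCy tokens in this sentence
        if current_sents and current_len + sent_len > max_tokens:
            chunks.append(" ".join(current_sents))
            current_sents, current_len = [], 0
        current_sents.append(sent.text.strip())
        current_len += sent_len
    if current_sents:
        chunks.append(" ".join(current_sents))
    return chunks
```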

### Sentence Transformers

**Installation**:
```bash
pip install sentence-transformers
```

**Key Features**:
- Pre-trained sentence embeddings
- Semantic similarity calculation
- Multi-lingual support
- Custom model training

**Example Usage**:

```python
from sentence_transformers import SentenceTransformer, util

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = ["This is a sentence.", "This is another sentence."]
embeddings = model.encode(sentences)

# Calculate semantic similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])

# Find semantic boundaries for chunking
def find_semantic_boundaries(text, model, threshold=0.8):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    embeddings = model.encode(sentences)

    boundaries = [0]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            boundaries.append(i)

    return boundaries
```
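
The boundary indices only become useful once they are turned back into text spans. A minimal sketch built on the `find_semantic_boundaries` helper above; sentences are re-joined with periods because that is how they were split:

```python
def boundaries_to_chunks(text, model, threshold=0.8):
    """Materialize semantic boundaries into chunk strings."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    boundaries = find_semantic_boundaries(text, model, threshold)
    boundaries = boundaries + [len(sentences)]  # close the final span

    chunks = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        chunks.append(". ".join(sentences[start:end]) + ".")
    return chunks
```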

## Vector Databases and Search

### ChromaDB

**Installation**:
```bash
pip install chromadb
```

**Key Features**:
- In-memory and persistent storage
- Built-in embedding functions
- Similarity search
- Metadata filtering

**Example Usage**:

```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="document_chunks",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

# Add chunks
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    metadatas=[chunk.get("metadata", {}) for chunk in chunks],
    ids=[chunk["id"] for chunk in chunks]
)

# Search
results = collection.query(
    query_texts=["What is chunking?"],
    n_results=5
)
```
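
Metadata filtering narrows a query to a subset of chunks, and a persistent client keeps the collection on disk between runs. A minimal sketch, assuming chunks were stored with a `source` metadata field and that `./chroma_db` is an acceptable storage path:

```python
import chromadb

# Persistent storage instead of the in-memory client
persistent_client = chromadb.PersistentClient(path="./chroma_db")
collection = persistent_client.get_or_create_collection(name="document_chunks")

# Restrict the similarity search to chunks from a single source document
filtered = collection.query(
    query_texts=["What is chunking?"],
    n_results=5,
    where={"source": "document.pdf"}
)
```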

### Pinecone

**Installation**:
```bash
pip install pinecone-client
```

**Key Features**:
- Managed vector database service
- High-performance similarity search
- Metadata filtering
- Scalable infrastructure

**Example Usage**:

```python
import pinecone
from sentence_transformers import SentenceTransformer

# Initialize
pinecone.init(api_key="your-api-key", environment="your-environment")
index_name = "document-chunks"

# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,  # Match embedding model
        metric="cosine"
    )

index = pinecone.Index(index_name)

# Generate embeddings and upsert
model = SentenceTransformer('all-MiniLM-L6-v2')
for chunk in chunks:
    embedding = model.encode(chunk["content"])
    index.upsert(
        vectors=[{
            "id": chunk["id"],
            "values": embedding.tolist(),
            "metadata": chunk.get("metadata", {})
        }]
    )

# Search
query_embedding = model.encode("search query")
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)
```
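
Upserting one vector per request is slow for large corpora; vectors can be sent in batches instead. A minimal sketch against the same `index` and `model` as above, reusing the chunk dictionaries used throughout this document; the batch size of 100 is an arbitrary example value:

```python
def batched_upsert(index, model, chunks, batch_size=100):
    """Encode chunks and upsert them to Pinecone in batches."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = [
            {
                "id": chunk["id"],
                "values": model.encode(chunk["content"]).tolist(),
                "metadata": chunk.get("metadata", {})
            }
            for chunk in batch
        ]
        index.upsert(vectors=vectors)

batched_upsert(index, model, chunks)
```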

### Weaviate

**Installation**:
```bash
pip install weaviate-client
```

**Key Features**:
- GraphQL API
- Hybrid search (dense + sparse)
- Real-time updates
- Schema validation

**Example Usage**:

```python
import weaviate

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
client.schema.create_class({
    "class": "DocumentChunk",
    "description": "A chunk of document content",
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
})

# Add data
for chunk in chunks:
    client.data_object.create(
        data_object={
            "content": chunk["content"],
            "source": chunk.get("source", "unknown")
        },
        class_name="DocumentChunk"
    )

# Search
results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_near_text({
    "concepts": ["search query"]
}).with_limit(5).do()
```
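
The hybrid search listed under key features combines dense-vector similarity with BM25-style keyword scoring. A minimal sketch using the same v3-style client as above; `alpha` weights the vector score against the keyword score, and the class needs a vectorizer configured for the dense half to work:

```python
# Hybrid search: alpha=0.5 weights vector and keyword relevance equally
hybrid_results = client.query.get(
    "DocumentChunk",
    ["content", "source"]
).with_hybrid(
    query="search query",
    alpha=0.5
).with_limit(5).do()
```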

## Evaluation and Testing

### RAGAS

**Installation**:
```bash
pip install ragas
```

**Key Features**:
- RAG evaluation metrics
- Answer quality assessment
- Context relevance measurement
- Faithfulness evaluation

**Example Usage**:

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
dataset = Dataset.from_dict({
    "question": ["What is chunking?"],
    "answer": ["Chunking is the process of breaking large documents into smaller segments"],
    "contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
    "ground_truth": ["Chunking is a document processing technique"]
})

# Evaluate
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_relevancy,
        context_recall
    ]
)

print(result)
```

### TruEra (TruLens)

**Installation**:
```bash
pip install trulens trulens-apps
```

**Key Features**:
- LLM application evaluation
- Feedback functions
- Hallucination detection
- Performance monitoring

**Example Usage**:

```python
from trulens.core import TruSession
from trulens.apps.custom import instrument
from trulens.feedback import GroundTruthAgreement

# Initialize session
session = TruSession()

# Define feedback functions
# ground_truth is a placeholder for your list of expected question/answer pairs
f_groundedness = GroundTruthAgreement(ground_truth)

# Evaluate chunks
# chunk_function, search_function, and generate_function are placeholders
# for your own chunking, retrieval, and generation steps
@instrument
def chunk_and_query(text, query):
    chunks = chunk_function(text)
    relevant_chunks = search_function(chunks, query)
    answer = generate_function(relevant_chunks, query)
    return answer

# Record evaluation
with session:
    chunk_and_query("large document text", "what is the main topic?")
```

## Document Processing

### PyPDF2

**Installation**:
```bash
pip install PyPDF2
```

**Key Features**:
- PDF text extraction
- Page manipulation
- Metadata extraction
- Form field processing

**Example Usage**:

```python
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Extract text by page for better chunking
def extract_pages(pdf_path):
    pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for i, page in enumerate(reader.pages):
            pages.append({
                "page_number": i + 1,
                "content": page.extract_text()
            })
    return pages
```
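
Extracting per page makes it easy to carry page numbers into chunk metadata, which helps with citations later. A minimal sketch built on the `extract_pages` helper just above (not to be confused with pdfminer's function of the same name); `chunk_by_sentences` is the NLTK-based helper sketched earlier in this document:

```python
def chunk_pdf_with_pages(pdf_path, max_chars=1000):
    """Chunk a PDF page by page, keeping the page number as metadata."""
    records = []
    for page in extract_pages(pdf_path):
        for i, chunk in enumerate(chunk_by_sentences(page["content"], max_chars)):
            records.append({
                "id": f"p{page['page_number']}-c{i}",
                "content": chunk,
                "metadata": {"page_number": page["page_number"]}
            })
    return records
```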

### python-docx

**Installation**:
```bash
pip install python-docx
```

**Key Features**:
- Microsoft Word document processing
- Paragraph and table extraction
- Style preservation
- Metadata access

**Example Usage**:

```python
from docx import Document

def extract_from_docx(docx_path):
    doc = Document(docx_path)
    content = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            content.append({
                "type": "paragraph",
                "text": paragraph.text,
                "style": paragraph.style.name
            })

    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_text = [cell.text for cell in row.cells]
            table_text.append(" | ".join(row_text))

        content.append({
            "type": "table",
            "text": "\n".join(table_text)
        })

    return content
```
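
Because Word styles survive extraction, heading paragraphs can serve as chunk boundaries so that each section becomes one unit. A minimal sketch over the `content` list returned above; it assumes headings use the built-in `Heading 1`, `Heading 2`, ... styles:

```python
def group_by_heading(content):
    """Group extracted docx items into sections that start at heading paragraphs."""
    sections, current = [], {"heading": None, "items": []}
    for item in content:
        is_heading = (
            item["type"] == "paragraph"
            and item.get("style", "").startswith("Heading")
        )
        if is_heading:
            if current["items"] or current["heading"]:
                sections.append(current)
            current = {"heading": item["text"], "items": []}
        else:
            current["items"].append(item["text"])
    if current["items"] or current["heading"]:
        sections.append(current)
    return sections
```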

## Specialized Libraries

### tiktoken (OpenAI)

**Installation**:
```bash
pip install tiktoken
```

**Key Features**:
- Accurate token counting for OpenAI models
- Fast encoding/decoding
- Multiple model support
- Model-specific tokenization

**Example Usage**:

```python
import tiktoken

# Get encoding for specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode text
tokens = encoding.encode("This is a sample text")
print(f"Token count: {len(tokens)}")

# Decode tokens
text = encoding.decode(tokens)

# Helper to count tokens for a given model
def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Use in chunking
def chunk_by_tokens(text, max_tokens=1000):
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)

    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

    return chunks
```
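
The `chunk_by_tokens` function above produces disjoint chunks; adding token overlap preserves context across chunk boundaries. A minimal sketch using the same encoding; `overlap` must be smaller than `max_tokens`:

```python
import tiktoken

def chunk_by_tokens_with_overlap(text, max_tokens=1000, overlap=100):
    """Token-based chunking where consecutive chunks share `overlap` tokens."""
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)
    step = max_tokens - overlap

    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(encoding.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```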

### PDFMiner

**Installation**:
```bash
pip install pdfminer.six
```

**Key Features**:
- Detailed PDF analysis
- Layout preservation
- Font and style information
- High-precision text extraction

**Example Usage**:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

def get_font_info(element):
    """Return the font name and size of the first character in a text container."""
    for obj in element:
        if isinstance(obj, LTChar):
            return {"font_name": obj.fontname, "font_size": obj.size}
        if hasattr(obj, "__iter__"):
            info = get_font_info(obj)
            if info["font_name"] is not None:
                return info
    return {"font_name": None, "font_size": None}

def extract_structured_text(pdf_path):
    structured_content = []

    for page_layout in extract_pages(pdf_path):
        page_content = []

        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # Font attributes live on LTChar objects, not on the container,
                # so sample the first character of the element
                font_info = get_font_info(element)
                page_content.append({
                    "text": element.get_text().strip(),
                    "font_info": {
                        "font_size": font_info["font_size"],
                        "is_bold": bool(font_info["font_name"] and "Bold" in font_info["font_name"]),
                        "x0": element.x0,
                        "y0": element.y0
                    }
                })

        structured_content.append({
            "page_number": page_layout.pageid,
            "content": page_content
        })

    return structured_content
```

## Performance and Optimization

### Dask

**Installation**:
```bash
pip install "dask[complete]"
```

**Key Features**:
- Parallel processing
- Out-of-core computation
- Distributed computing
- Integration with pandas

**Example Usage**:

```python
import dask.bag as db
from dask.distributed import Client

# Setup distributed client
client = Client(n_workers=4)

# Parallel chunking of multiple documents
def chunk_document(document):
    # Your chunking logic here
    return chunk_function(document)

# Process documents in parallel
documents = ["doc1", "doc2", "doc3", ...]  # List of document contents
document_bag = db.from_sequence(documents)

# Apply chunking function in parallel
chunked_documents = document_bag.map(chunk_document)

# Compute results
results = chunked_documents.compute()
```
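
For many files on disk, the bag can be built from file paths so documents are read lazily inside the workers rather than loaded up front. A minimal sketch, assuming plain-text documents under a hypothetical `data/` directory and the `chunk_document` function above:

```python
import glob

import dask.bag as db

def load_and_chunk(path):
    """Read one document from disk and chunk it."""
    with open(path, encoding="utf-8") as f:
        return chunk_document(f.read())

# One bag element per file path; files are opened only when compute() runs
paths = glob.glob("data/*.txt")
document_bag = db.from_sequence(paths, npartitions=8)
results = document_bag.map(load_and_chunk).compute()
```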

### Ray

**Installation**:
```bash
pip install ray
```

**Key Features**:
- Distributed computing
- Actor model
- Autoscaling
- ML pipeline integration

**Example Usage**:

```python
import ray

# Initialize Ray
ray.init()

@ray.remote
class ChunkingWorker:
    def __init__(self, strategy):
        self.strategy = strategy

    def chunk_documents(self, documents):
        results = []
        for doc in documents:
            chunks = self.strategy.chunk(doc)
            results.append(chunks)
        return results

# Create workers
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]

# Distribute work
documents_batch = [documents[i::4] for i in range(4)]
futures = [worker.chunk_documents.remote(batch)
           for worker, batch in zip(workers, documents_batch)]

# Get results
results = ray.get(futures)
```

## Development and Testing

### pytest

**Installation**:
```bash
pip install pytest pytest-asyncio
```

**Example Tests**:

```python
import pytest
from your_chunking_module import FixedSizeChunker, SemanticChunker

class TestFixedSizeChunker:
    def test_chunk_size_respect(self):
        chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
        text = "word " * 50  # 50 words

        chunks = chunker.chunk(text)

        for chunk in chunks:
            assert len(chunk.split()) <= 100  # Account for word boundaries

    def test_overlap_consistency(self):
        chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
        text = "word " * 30

        chunks = chunker.chunk(text)

        # Check overlap between consecutive chunks
        for i in range(1, len(chunks)):
            chunk1_words = set(chunks[i-1].split()[-10:])
            chunk2_words = set(chunks[i].split()[:10])
            overlap = len(chunk1_words & chunk2_words)
            assert overlap >= 5  # Allow some tolerance

@pytest.mark.asyncio
async def test_semantic_chunker():
    chunker = SemanticChunker()
    text = "First topic sentence. Another sentence about first topic. " \
           "Now switching to second topic. More about second topic."

    chunks = await chunker.chunk_async(text)

    # Should detect topic change and create boundary
    assert len(chunks) >= 2
    assert "first topic" in chunks[0].lower()
    assert "second topic" in chunks[1].lower()
```

### Memory Profiler

**Installation**:
```bash
pip install memory-profiler
```

**Example Usage**:

```python
from memory_profiler import profile

@profile
def chunk_large_document():
    chunker = FixedSizeChunker(chunk_size=1000)
    large_text = "word " * 100000  # Large document

    chunks = chunker.chunk(large_text)
    return chunks

# Run with: python -m memory_profiler your_script.py
```

This comprehensive toolset provides everything needed to implement, test, and optimize chunking strategies for various use cases, from simple text processing to production-grade RAG systems.