Initial commit

Zhongwei Li
2025-11-29 18:28:30 +08:00
commit 171acedaa4
220 changed files with 85967 additions and 0 deletions

@@ -0,0 +1,137 @@
# Document Chunking Strategies
## Overview
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
## Chunking Strategies
### 1. Recursive Character Text Splitter
**Method**: Split text based on character count, trying separators in order
**Use Case**: General purpose text splitting
**Advantages**: Preserves sentence and paragraph boundaries when possible
```python
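# Legacy import path; newer releases expose the same class from langchain_text_splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `documents` is assumed to be a list of Document objects loaded elsewhere
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum chunk length, measured by length_function
    chunk_overlap=200,    # characters shared by consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # Try these in order
)
chunks = splitter.split_documents(documents)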
```
### 2. Token-Based Splitting
**Method**: Split based on token count rather than characters
**Use Case**: When working with token limits of language models
**Advantages**: Better control over context window usage
```python
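# Legacy import path; newer releases expose the same class from langchain_text_splitters
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,    # measured in tokens, not characters
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)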
```
### 3. Semantic Chunking
**Method**: Split based on semantic similarity
**Use Case**: When maintaining semantic coherence is important
**Advantages**: Chunks are more semantically meaningful
```python
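# SemanticChunker ships in the separate langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split where the distance between adjacent sentences is unusually large
)
chunks = splitter.split_documents(documents)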
```
### 4. Markdown Header Splitter
**Method**: Split based on markdown headers
**Use Case**: Structured documents with clear hierarchical organization
**Advantages**: Maintains document structure and context
```python
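# Legacy import path; newer releases expose the same class from langchain_text_splitters
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# This splitter takes a raw markdown string rather than Document objects;
# `markdown_text` is assumed to hold the file contents
chunks = splitter.split_text(markdown_text)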
```
### 5. HTML Splitter
**Method**: Split based on HTML tags
**Use Case**: Web pages and HTML documents
**Advantages**: Preserves HTML structure and metadata
```python
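# Legacy import path; newer releases expose the same class from langchain_text_splitters
from langchain.text_splitter import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# This splitter takes a raw HTML string rather than Document objects;
# `html_string` is assumed to hold the page source
chunks = splitter.split_text(html_string)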
```
## Parameter Tuning
### Chunk Size
- **Small chunks (200-400 tokens)**: More precise retrieval, but may lose context
- **Medium chunks (500-1000 tokens)**: Good balance of precision and context
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval
### Chunk Overlap
- **Purpose**: Preserve context at chunk boundaries
- **Typical range**: 10-20% of chunk size
- **Higher overlap**: Better context preservation, but more redundancy
- **Lower overlap**: Less redundancy, but may lose important context
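The trade-off above is easier to see with a sliding window. The following is a minimal sketch, not any library's implementation: `chunk_with_overlap` is a hypothetical helper that works on an already tokenized list and advances by `chunk_size - overlap` each step.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each boundary token (3 and 6 here) appears in two chunks, which is the redundancy that a higher overlap increases.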
### Separators
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then smaller (sentences)
- **Custom separators**: Add domain-specific separators for better results
- **Language-specific**: Adjust for different languages and writing styles
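To illustrate hierarchical separators, here is a simplified sketch of the idea, not LangChain's actual algorithm. The hypothetical `recursive_split` tries the coarsest separator first and falls back to finer ones only for pieces that are still too long; it assumes every separator is a non-empty string and ignores overlap.

```python
def recursive_split(text, separators, max_len):
    """Split text on the first separator, recursing with finer ones for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators
            chunks.extend(recursive_split(piece, finer, max_len))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "para one is here.\n\nsecond paragraph is a bit longer than the first one."
parts = recursive_split(text, ["\n\n", " "], max_len=30)
print(parts)
```

The short first paragraph stays whole because the paragraph separator is enough; only the long second paragraph is broken at word boundaries.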
## Best Practices
1. **Preserve Context**: Ensure chunks contain enough surrounding context
2. **Maintain Coherence**: Keep semantically related content together
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
4. **Consider Query Types**: Adapt chunking strategy to typical user queries
5. **Test and Iterate**: Evaluate different chunking strategies for your specific use case
## Evaluation Metrics
1. **Retrieval Quality**: How well chunks answer user queries
2. **Context Preservation**: Whether important context is maintained
3. **Chunk Distribution**: Evenness of chunk sizes
4. **Boundary Quality**: How natural chunk boundaries are
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
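Chunk distribution is the easiest of these metrics to compute directly. The sketch below assumes chunks are plain strings and uses the coefficient of variation (standard deviation divided by mean) as one possible evenness measure; `chunk_size_stats` is a hypothetical helper name.

```python
import statistics


def chunk_size_stats(chunks):
    """Summarize chunk lengths; a lower 'cv' means more evenly sized chunks."""
    sizes = [len(c) for c in chunks]
    if not sizes:
        raise ValueError("chunks must not be empty")
    mean = statistics.mean(sizes)
    stdev = statistics.pstdev(sizes)
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean,
        "cv": stdev / mean if mean else 0.0,
    }


stats = chunk_size_stats(["aaaa", "bbbb", "cc"])
print(stats)
```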
## Advanced Techniques
### Adaptive Chunking
Adjust chunk size based on document structure and content density.
### Hierarchical Chunking
Create multiple levels of chunks for different retrieval scenarios.
### Query-Aware Chunking
Optimize chunk boundaries based on typical query patterns.
### Domain-Specific Splitting
Use specialized splitters for specific document types (legal, medical, technical).

@@ -0,0 +1,88 @@
# Embedding Models Guide
## Overview
Embedding models convert text into numerical vectors that capture semantic meaning for similarity search in RAG systems.
## Popular Embedding Models
### 1. text-embedding-ada-002 (OpenAI)
- **Dimensions**: 1536
- **Type**: General purpose
- **Use Case**: Most applications requiring high quality embeddings
- **Performance**: Excellent balance of quality and speed
### 2. all-MiniLM-L6-v2 (Sentence Transformers)
- **Dimensions**: 384
- **Type**: Lightweight
- **Use Case**: Applications requiring fast inference
- **Performance**: Good quality, very fast
### 3. e5-large-v2
- **Dimensions**: 1024
- **Type**: High quality
- **Use Case**: Applications needing superior performance
- **Performance**: Excellent quality on English text; for multilingual support, use the separate multilingual-e5 models
### 4. Instructor
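- **Dimensions**: 768
- **Type**: Task-specific (instruction-conditioned)
- **Use Case**: Domain-specific applications
- **Performance**: Embeddings are steered by a natural-language task instruction passed with each input, so one model can serve different tasks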
### 5. bge-large-en-v1.5
- **Dimensions**: 1024
- **Type**: State-of-the-art
- **Use Case**: Applications requiring best possible quality
- **Performance**: Ranked among the top open models on the MTEB benchmark at release; rankings shift, so check the current leaderboard
## Selection Criteria
1. **Quality vs Speed**: Balance between embedding quality and inference speed
2. **Dimension Size**: Impact on storage and retrieval performance
3. **Domain**: Specific language or domain requirements
4. **Cost**: API costs vs local deployment
5. **Batch Size**: Throughput requirements
6. **Language**: Multilingual support needs
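The storage side of the dimension-size criterion can be estimated with simple arithmetic. This sketch assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead and metadata; `index_size_bytes` is a hypothetical helper.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the stored vectors, ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value


# Dimensions taken from the model list above
models = [
    ("all-MiniLM-L6-v2", 384),
    ("bge-large-en-v1.5", 1024),
    ("text-embedding-ada-002", 1536),
]
for name, dims in models:
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{name}: {gib:.2f} GiB per million vectors")
```

Under these assumptions a 1536-dimension model needs four times the raw storage of a 384-dimension one for the same corpus.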
## Usage Examples
### OpenAI Embeddings
```python
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("Your text here")
```
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("Your text here")
```
### Hugging Face Models
```python
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```
## Optimization Tips
1. **Batch Processing**: Process multiple texts together for efficiency
2. **Model Quantization**: Reduce model size for faster inference
3. **Caching**: Cache embeddings for frequently used texts
4. **GPU Acceleration**: Use GPU for faster processing when available
5. **Model Selection**: Choose appropriate model size for your use case
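Batching and caching combine naturally. The following is a minimal in-memory sketch: `CachedEmbedder` is a hypothetical wrapper, and `embed_batch` stands for any function that maps a list of strings to a list of vectors (for example, a model's batch-encode call). The `fake_embed_batch` function exists only to make the example runnable.

```python
import hashlib


class CachedEmbedder:
    """Embed texts in one batch call, skipping texts that were embedded before."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # assumed signature: list[str] -> list[vector]
        self._cache = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Collect each uncached text once, preserving order
        missing, seen = [], set()
        for text in texts:
            key = self._key(text)
            if key not in self._cache and key not in seen:
                seen.add(key)
                missing.append(text)
        if missing:
            # One batched call for everything that is new
            for text, vector in zip(missing, self._embed_batch(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(text)] for text in texts]


calls = []


def fake_embed_batch(texts):
    """Stand-in model: records each batch and returns a one-dimensional 'embedding'."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder(fake_embed_batch)
first = embedder.embed(["a", "bb", "a"])
second = embedder.embed(["bb", "ccc"])
print(calls)  # the second call only embeds the text that was not cached
```

A production cache would usually be persistent and keyed by model name as well as text, since vectors from different models are not interchangeable.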
## Evaluation Metrics
1. **Semantic Similarity**: How well embeddings capture meaning
2. **Retrieval Performance**: Quality of retrieved documents
3. **Speed**: Inference time per document
4. **Memory Usage**: RAM requirements for the model
5. **Cost**: API costs or infrastructure requirements
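Retrieval performance is typically measured against a small set of queries with known relevant documents. The sketch below shows two common measures, recall@k and reciprocal rank; the document IDs and helper names are illustrative only.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2"]  # ranked output of a retriever
relevant = ["d1", "d2"]               # ground-truth labels for the query

print(recall_at_k(retrieved, relevant, k=3))   # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Averaging these values over many queries gives a single number for comparing embedding models on your own data.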

@@ -0,0 +1,94 @@
# LangChain4j RAG Implementation Guide
## Overview
RAG (Retrieval-Augmented Generation) extends an LLM's knowledge by finding relevant information in your data and injecting it into the prompt before the prompt is sent to the LLM.
## What is RAG?
RAG lets an LLM answer questions with domain-specific knowledge: it retrieves relevant information at query time and grounds the response in it, which reduces hallucinations.
## RAG Flavors in LangChain4j
### 1. Easy RAG
Simplest way to start with minimal setup. Handles document loading, splitting, and embedding automatically.
### 2. Core RAG APIs
Modular components including:
- Document
- TextSegment
- EmbeddingModel
- EmbeddingStore
- DocumentSplitter
### 3. Advanced RAG
Complex pipelines built from components such as QueryTransformer, ContentRetriever, and ContentAggregator, supporting:
- Query transformation
- Multi-source retrieval
- Re-ranking
## RAG Stages
### 1. Indexing
Pre-process documents ahead of time (load, split into segments, embed, and store) so they can be searched efficiently.
### 2. Retrieval
At query time, find the content most relevant to the user's query and add it to the prompt.
## Core Components
### Documents with metadata
Structured representation of your content with associated metadata for filtering and context.
### Text segments (chunks)
Smaller, manageable pieces of documents that are embedded and stored in vector databases.
### Embedding models
Convert text segments into numerical vectors for similarity search.
### Embedding stores (vector databases)
Store and efficiently retrieve embedded text segments.
### Content retrievers
Find relevant content based on user queries.
### Query transformers
Transform and optimize user queries for better retrieval.
### Content aggregators
Combine and rank retrieved content.
## Advanced Features
- Query transformation and routing
- Multiple retrievers for different data sources
- Re-ranking models for improved relevance
- Metadata filtering for targeted retrieval
- Parallel processing for performance
## Implementation Example (Easy RAG)
```java
// Load documents
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
// Create embedding store
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
// Ingest documents
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
// Create AI service (Assistant is your own interface; chatModel is a chat model configured elsewhere)
Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
        .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
        .build();
```
## Best Practices
1. **Document Preparation**: Clean and structure documents before ingestion
2. **Chunk Size**: Balance between context preservation and retrieval precision
3. **Metadata Strategy**: Include relevant metadata for filtering and context
4. **Embedding Model Selection**: Choose models appropriate for your domain
5. **Retrieval Strategy**: Select appropriate k values and filtering criteria
6. **Evaluation**: Continuously evaluate retrieval quality and answer accuracy

@@ -0,0 +1,161 @@
# Advanced Retrieval Strategies
## Overview
Different retrieval approaches for finding relevant documents in RAG systems, each with specific strengths and use cases.
## Retrieval Approaches
### 1. Dense Retrieval
**Method**: Semantic similarity via embeddings
**Use Case**: Understanding meaning and context
**Example**: Finding documents about "machine learning" when the query is "AI algorithms"
```python
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(chunks, embeddings)
results = vectorstore.similarity_search("query", k=5)
```
### 2. Sparse Retrieval
**Method**: Keyword matching (BM25, TF-IDF)
**Use Case**: Exact term matching and keyword-specific queries
**Example**: Finding documents containing specific technical terms
```python
from langchain.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
results = bm25_retriever.get_relevant_documents("query")
```
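For intuition about what a BM25 retriever computes, here is a minimal from-scratch sketch of the scoring formula. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; real libraries differ in tokenization and IDF details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents need more matches to score equally
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "vector index tuning guide",
    "cooking pasta at home",
    "index maintenance and vector search",
]
scores = bm25_scores("vector index", docs)
print(scores)
```

The second document shares no terms with the query and scores zero, which is exactly the case where dense retrieval can still find a match.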
### 3. Hybrid Search
**Method**: Combine dense + sparse retrieval
**Use Case**: Balance between semantic understanding and keyword matching
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, embedding_retriever],
    weights=[0.3, 0.7],
)
```
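One common way to merge ranked lists from a sparse and a dense retriever is weighted reciprocal rank fusion. The sketch below is a standalone illustration of that technique with the conventional constant `c=60`; the function name and document IDs are hypothetical.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


sparse_ranking = ["a", "b", "c"]  # e.g. from BM25
dense_ranking = ["b", "d", "a"]   # e.g. from embedding search
fused = weighted_rrf([sparse_ranking, dense_ranking], weights=[0.3, 0.7])
print(fused)  # ['b', 'a', 'd', 'c']
```

Because the fusion uses ranks rather than raw scores, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.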
### 4. Multi-Query Retrieval
**Method**: Generate multiple query variations
**Use Case**: Complex queries that can be interpreted in multiple ways
```python
from langchain.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI(),
)
# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
```
### 5. HyDE (Hypothetical Document Embeddings)
**Method**: Have an LLM write a hypothetical answer document for the query, then search with that document's embedding instead of the query's
**Use Case**: When queries are very different from document style
```python
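# `llm`, `query`, and `vectorstore` are assumed to be defined as in the earlier examples
# Generate hypothetical document based on query
# (invoke returns a string for plain LLMs; with a chat model, use the `.content` of the result)
hypothetical_doc = llm.invoke(f"Write a document about: {query}")
# Use hypothetical doc for retrieval
results = vectorstore.similarity_search(hypothetical_doc, k=5)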
```
## Advanced Retrieval Patterns
### Contextual Compression
Compress retrieved documents to only include relevant parts
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),
)
```
### Parent Document Retriever
Store small chunks for retrieval, return larger chunks for context
```python
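from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

store = InMemoryStore()  # holds the full parent chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # small chunks are embedded and searched
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # large chunks are returned as context
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
# Index the source documents before querying; `documents` is assumed to be loaded elsewhere
retriever.add_documents(documents)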
```
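The mechanism behind this pattern is a mapping from each small chunk back to its parent. The toy sketch below makes that mapping explicit; it uses fixed-width character slices as child chunks and substring matching as a stand-in for vector search, and all helper names are hypothetical.

```python
def build_parent_index(parents, child_size):
    """Return (child_text, parent_id) pairs so every child points back to its parent."""
    index = []
    for parent_id, text in enumerate(parents):
        for start in range(0, len(text), child_size):
            index.append((text[start:start + child_size], parent_id))
    return index


def retrieve_parents(query, index, parents):
    """Match on the small child chunks, but return the larger parent texts."""
    matched_ids = []
    for child_text, parent_id in index:
        # Substring match stands in for embedding similarity in this sketch
        if query.lower() in child_text.lower() and parent_id not in matched_ids:
            matched_ids.append(parent_id)
    return [parents[parent_id] for parent_id in matched_ids]


parents = ["alpha beta gamma delta", "epsilon zeta eta theta"]
index = build_parent_index(parents, child_size=11)
print(retrieve_parents("gamma", index, parents))
```

The query matches only the child chunk "gamma delta", yet the caller receives the whole first parent, which is the extra context this retriever is designed to supply.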
## Retrieval Optimization Techniques
### 1. Metadata Filtering
Filter results based on document metadata
```python
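# Filter syntax depends on the vector store; check your store's documentation for supported operators
results = vectorstore.similarity_search(
    "query",
    filter={"category": "technical", "date": {"$gte": "2023-01-01"}},
    k=5,
)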
```
### 2. Maximal Marginal Relevance (MMR)
Balance relevance with diversity
```python
results = vectorstore.max_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,
    lambda_mult=0.5,  # 0=max diversity, 1=max relevance
)
```
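To show what the `lambda_mult` trade-off does, here is a minimal from-scratch sketch of MMR selection over plain Python lists. It follows the usual greedy formulation (relevance to the query minus similarity to already selected items) and is not the vector store's own implementation; the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 1.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.15], [0.0, 1.0]]  # docs 0 and 1 are near duplicates

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [1, 0]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [1, 2]
```

With `lambda_mult=1.0` the two near-duplicate documents are both returned; at `0.5` the second slot goes to the less similar document instead.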
### 3. Reranking
Re-score the top candidates with a cross-encoder, which reads the query and document together
```python
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "query"
# Over-fetch candidates, then let the cross-encoder score each (query, document) pair
candidates = vectorstore.similarity_search(query, k=20)
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)
# Keep the five highest-scoring (document, score) pairs
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```
## Selection Guidelines
1. **Query Type**: Choose strategy based on typical query patterns
2. **Document Type**: Consider document structure and content
3. **Performance Requirements**: Balance quality vs speed
4. **Domain Knowledge**: Leverage domain-specific patterns
5. **User Expectations**: Match retrieval behavior to user expectations

@@ -0,0 +1,86 @@
# Vector Database Comparison and Configuration
## Overview
Vector databases store and efficiently retrieve document embeddings for semantic search in RAG systems.
## Popular Vector Database Options
### 1. Pinecone
- **Type**: Managed cloud service
- **Features**: Scalable, fast queries, managed infrastructure
- **Use Case**: Production applications requiring high availability
### 2. Weaviate
- **Type**: Open-source, hybrid search
- **Features**: Combines vector and keyword search, GraphQL API
- **Use Case**: Applications needing both semantic and traditional search
### 3. Milvus
- **Type**: High performance, on-premise
- **Features**: Distributed architecture, GPU acceleration
- **Use Case**: Large-scale deployments with custom infrastructure
### 4. Chroma
- **Type**: Lightweight, easy to use
- **Features**: Local deployment, simple API
- **Use Case**: Development and small-scale applications
### 5. Qdrant
- **Type**: Fast, filtered search
- **Features**: Advanced filtering, payload support
- **Use Case**: Applications requiring complex metadata filtering
### 6. FAISS
- **Type**: Meta's library, local deployment
- **Features**: High performance, CPU/GPU optimized
- **Use Case**: Research and applications needing full control
## Configuration Examples
### Pinecone Setup
```python
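# Written against pinecone-client v2; v3 and later replace pinecone.init()
# with a client object created from the Pinecone class
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")
# `embeddings` is an embedding model instance; "text" is the metadata key that stores the chunk text
vectorstore = Pinecone(index, embeddings.embed_query, "text")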
```
### Weaviate Setup
```python
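# weaviate.Client is the v3 Python client API; the v4 client connects differently
import weaviate
from langchain.vectorstores import Weaviate
client = weaviate.Client("http://localhost:8080")
# Arguments: client, index (class) name, text property, embedding model
vectorstore = Weaviate(client, "Document", "content", embeddings)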
```
### Chroma Local Setup
```python
from langchain.vectorstores import Chroma
vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
```
## Selection Criteria
1. **Scale**: Number of documents and expected query volume
2. **Performance**: Latency requirements and throughput needs
3. **Deployment**: Cloud vs on-premise preferences
4. **Features**: Filtering, hybrid search, metadata support
5. **Cost**: Budget constraints and operational overhead
6. **Maintenance**: Team expertise and available resources
## Best Practices
1. **Indexing Strategy**: Choose appropriate distance metrics (cosine, euclidean)
2. **Sharding**: Distribute data for large-scale deployments
3. **Monitoring**: Track query performance and system health
4. **Backups**: Implement regular backup procedures
5. **Security**: Secure access to sensitive data
6. **Optimization**: Tune parameters for your specific use case
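On the choice of distance metric: for unit-length vectors, cosine similarity and Euclidean distance produce the same ranking, because the squared distance equals 2 - 2 * cosine. The sketch below checks that identity numerically with made-up vectors; the helper names are illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos = cosine_similarity(a, b)
dist_sq = euclidean_distance(a, b) ** 2
# For unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity
print(round(cos, 6), round(dist_sq, 6), round(2 - 2 * cos, 6))
```

If your embeddings are not normalized, the two metrics can rank results differently, so match the metric to what the embedding model was trained with.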