Initial commit
This commit is contained in:
194
skills/ai/chunking-strategy/SKILL.md
Normal file
194
skills/ai/chunking-strategy/SKILL.md
Normal file
@@ -0,0 +1,194 @@
|
||||
---
|
||||
name: chunking-strategy
|
||||
description: Implement optimal chunking strategies in RAG systems and document processing pipelines. Use when building retrieval-augmented generation systems, vector databases, or processing large documents that require breaking into semantically meaningful segments for embeddings and search.
|
||||
allowed-tools: Read, Write, Bash
|
||||
category: artificial-intelligence
|
||||
tags: [rag, chunking, vector-search, embeddings, document-processing]
|
||||
version: 1.0.0
|
||||
---
|
||||
|
||||
# Chunking Strategy for RAG Systems
|
||||
|
||||
## Overview
|
||||
|
||||
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
|
||||
|
||||
## When to Use
|
||||
|
||||
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
|
||||
|
||||
## Instructions
|
||||
|
||||
### Choose Chunking Strategy
|
||||
|
||||
Select appropriate chunking strategy based on document type and use case:
|
||||
|
||||
1. **Fixed-Size Chunking** (Level 1)
|
||||
- Use for simple documents without clear structure
|
||||
- Start with 512 tokens and 10-20% overlap
|
||||
- Adjust size based on query type: 256 for factoid, 1024 for analytical
|
||||
|
||||
2. **Recursive Character Chunking** (Level 2)
|
||||
- Use for documents with clear structural boundaries
|
||||
- Implement hierarchical separators: paragraphs → sentences → words
|
||||
- Customize separators for document types (HTML, Markdown)
|
||||
|
||||
3. **Structure-Aware Chunking** (Level 3)
|
||||
- Use for structured documents (Markdown, code, tables, PDFs)
|
||||
- Preserve semantic units: functions, sections, table blocks
|
||||
- Validate structure preservation post-splitting
|
||||
|
||||
4. **Semantic Chunking** (Level 4)
|
||||
- Use for complex documents with thematic shifts
|
||||
- Implement embedding-based boundary detection
|
||||
- Configure similarity threshold (0.8) and buffer size (3-5 sentences)
|
||||
|
||||
5. **Advanced Methods** (Level 5)
|
||||
- Use Late Chunking for long-context embedding models
|
||||
- Apply Contextual Retrieval for high-precision requirements
|
||||
- Monitor computational costs vs. retrieval improvements
|
||||
|
||||
Reference detailed strategy implementations in [references/strategies.md](references/strategies.md).
|
||||
|
||||
### Implement Chunking Pipeline
|
||||
|
||||
Follow these steps to implement effective chunking:
|
||||
|
||||
1. **Pre-process documents**
|
||||
- Analyze document structure and content types
|
||||
- Identify multi-modal content (tables, images, code)
|
||||
- Assess information density and complexity
|
||||
|
||||
2. **Select strategy parameters**
|
||||
- Choose chunk size based on embedding model context window
|
||||
- Set overlap percentage (10-20% for most cases)
|
||||
- Configure strategy-specific parameters
|
||||
|
||||
3. **Process and validate**
|
||||
- Apply chosen chunking strategy
|
||||
- Validate semantic coherence of chunks
|
||||
- Test with representative documents
|
||||
|
||||
4. **Evaluate and iterate**
|
||||
- Measure retrieval precision and recall
|
||||
- Monitor processing latency and resource usage
|
||||
- Optimize based on specific use case requirements
|
||||
|
||||
Reference detailed implementation guidelines in [references/implementation.md](references/implementation.md).
|
||||
|
||||
### Evaluate Performance
|
||||
|
||||
Use these metrics to evaluate chunking effectiveness:
|
||||
|
||||
- **Retrieval Precision**: Fraction of retrieved chunks that are relevant
|
||||
- **Retrieval Recall**: Fraction of relevant chunks that are retrieved
|
||||
- **End-to-End Accuracy**: Quality of final RAG responses
|
||||
- **Processing Time**: Latency impact on overall system
|
||||
- **Resource Usage**: Memory and computational costs
|
||||
|
||||
Reference detailed evaluation framework in [references/evaluation.md](references/evaluation.md).
|
||||
|
||||
## Examples
|
||||
|
||||
### Basic Fixed-Size Chunking
|
||||
|
||||
```python
|
||||
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
||||
|
||||
# Configure for factoid queries
|
||||
splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=256,
|
||||
chunk_overlap=25,
|
||||
length_function=len
|
||||
)
|
||||
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### Structure-Aware Code Chunking
|
||||
|
||||
```python
|
||||
def chunk_python_code(code):
|
||||
"""Split Python code into semantic chunks"""
|
||||
import ast
|
||||
|
||||
tree = ast.parse(code)
|
||||
chunks = []
|
||||
|
||||
for node in ast.walk(tree):
|
||||
if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
|
||||
chunks.append(ast.get_source_segment(code, node))
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### Semantic Chunking with Embeddings
|
||||
|
||||
```python
|
||||
def semantic_chunk(text, similarity_threshold=0.8):
|
||||
"""Chunk text based on semantic boundaries"""
|
||||
sentences = split_into_sentences(text)
|
||||
embeddings = generate_embeddings(sentences)
|
||||
|
||||
chunks = []
|
||||
current_chunk = [sentences[0]]
|
||||
|
||||
for i in range(1, len(sentences)):
|
||||
similarity = cosine_similarity(embeddings[i-1], embeddings[i])
|
||||
|
||||
if similarity < similarity_threshold:
|
||||
chunks.append(" ".join(current_chunk))
|
||||
current_chunk = [sentences[i]]
|
||||
else:
|
||||
current_chunk.append(sentences[i])
|
||||
|
||||
chunks.append(" ".join(current_chunk))
|
||||
return chunks
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Core Principles
|
||||
- Balance context preservation with retrieval precision
|
||||
- Maintain semantic coherence within chunks
|
||||
- Optimize for embedding model constraints
|
||||
- Preserve document structure when beneficial
|
||||
|
||||
### Implementation Guidelines
|
||||
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
|
||||
- Test thoroughly with representative documents
|
||||
- Monitor both accuracy metrics and computational costs
|
||||
- Iterate based on specific document characteristics
|
||||
|
||||
### Common Pitfalls to Avoid
|
||||
- Over-chunking: Creating too many small, context-poor chunks
|
||||
- Under-chunking: Missing relevant information due to oversized chunks
|
||||
- Ignoring document structure and semantic boundaries
|
||||
- Using one-size-fits-all approach for diverse content types
|
||||
- Neglecting overlap for boundary-crossing information
|
||||
|
||||
## Constraints
|
||||
|
||||
### Resource Considerations
|
||||
- Semantic and contextual methods require significant computational resources
|
||||
- Late chunking needs long-context embedding models
|
||||
- Complex strategies increase processing latency
|
||||
- Monitor memory usage for large document processing
|
||||
|
||||
### Quality Requirements
|
||||
- Validate chunk semantic coherence post-processing
|
||||
- Test with domain-specific documents before deployment
|
||||
- Ensure chunks maintain standalone meaning where possible
|
||||
- Implement proper error handling for edge cases
|
||||
|
||||
## References
|
||||
|
||||
Reference detailed documentation in the [references/](references/) folder:
|
||||
- [strategies.md](references/strategies.md) - Detailed strategy implementations
|
||||
- [implementation.md](references/implementation.md) - Complete implementation guidelines
|
||||
- [evaluation.md](references/evaluation.md) - Performance evaluation framework
|
||||
- [tools.md](references/tools.md) - Recommended libraries and frameworks
|
||||
- [research.md](references/research.md) - Key research papers and findings
|
||||
- [advanced-strategies.md](references/advanced-strategies.md) - 11 comprehensive chunking methods
|
||||
- [semantic-methods.md](references/semantic-methods.md) - Semantic and contextual approaches
|
||||
- [visualization-tools.md](references/visualization-tools.md) - Evaluation and visualization tools
|
||||
1358
skills/ai/chunking-strategy/references/advanced-strategies.md
Normal file
1358
skills/ai/chunking-strategy/references/advanced-strategies.md
Normal file
File diff suppressed because it is too large
Load Diff
904
skills/ai/chunking-strategy/references/evaluation.md
Normal file
904
skills/ai/chunking-strategy/references/evaluation.md
Normal file
@@ -0,0 +1,904 @@
|
||||
# Performance Evaluation Framework
|
||||
|
||||
This document provides comprehensive methodologies for evaluating chunking strategy performance and effectiveness.
|
||||
|
||||
## Evaluation Metrics
|
||||
|
||||
### Core Retrieval Metrics
|
||||
|
||||
#### Retrieval Precision
|
||||
Measures the fraction of retrieved chunks that are relevant to the query.
|
||||
|
||||
```python
|
||||
def calculate_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
|
||||
"""
|
||||
Calculate retrieval precision
|
||||
Precision = |Relevant ∩ Retrieved| / |Retrieved|
|
||||
"""
|
||||
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
|
||||
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
|
||||
|
||||
intersection = retrieved_ids & relevant_ids
|
||||
|
||||
if not retrieved_ids:
|
||||
return 0.0
|
||||
|
||||
return len(intersection) / len(retrieved_ids)
|
||||
```
|
||||
|
||||
#### Retrieval Recall
|
||||
Measures the fraction of relevant chunks that are successfully retrieved.
|
||||
|
||||
```python
|
||||
def calculate_recall(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
|
||||
"""
|
||||
Calculate retrieval recall
|
||||
Recall = |Relevant ∩ Retrieved| / |Relevant|
|
||||
"""
|
||||
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
|
||||
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
|
||||
|
||||
intersection = retrieved_ids & relevant_ids
|
||||
|
||||
if not relevant_ids:
|
||||
return 0.0
|
||||
|
||||
return len(intersection) / len(relevant_ids)
|
||||
```
|
||||
|
||||
#### F1-Score
|
||||
Harmonic mean of precision and recall.
|
||||
|
||||
```python
|
||||
def calculate_f1_score(precision: float, recall: float) -> float:
|
||||
"""
|
||||
Calculate F1-score
|
||||
F1 = 2 * (Precision * Recall) / (Precision + Recall)
|
||||
"""
|
||||
if precision + recall == 0:
|
||||
return 0.0
|
||||
|
||||
return 2 * (precision * recall) / (precision + recall)
|
||||
```
|
||||
|
||||
### Mean Reciprocal Rank (MRR)
|
||||
Measures the rank of the first relevant result.
|
||||
|
||||
```python
|
||||
def calculate_mrr(queries: List[Dict], results: List[List[Dict]]) -> float:
|
||||
"""
|
||||
Calculate Mean Reciprocal Rank
|
||||
"""
|
||||
reciprocal_ranks = []
|
||||
|
||||
for query, query_results in zip(queries, results):
|
||||
relevant_found = False
|
||||
|
||||
for rank, result in enumerate(query_results, 1):
|
||||
if result.get('is_relevant', False):
|
||||
reciprocal_ranks.append(1.0 / rank)
|
||||
relevant_found = True
|
||||
break
|
||||
|
||||
if not relevant_found:
|
||||
reciprocal_ranks.append(0.0)
|
||||
|
||||
return sum(reciprocal_ranks) / len(reciprocal_ranks)
|
||||
```
|
||||
|
||||
### Mean Average Precision (MAP)
|
||||
Considers both precision and the ranking of relevant documents.
|
||||
|
||||
```python
|
||||
def calculate_average_precision(retrieved_chunks: List[Dict], relevant_chunks: List[Dict]) -> float:
|
||||
"""
|
||||
Calculate Average Precision for a single query
|
||||
"""
|
||||
retrieved_ids = {chunk.get('id') for chunk in retrieved_chunks}
|
||||
relevant_ids = {chunk.get('id') for chunk in relevant_chunks}
|
||||
|
||||
if not relevant_ids:
|
||||
return 0.0
|
||||
|
||||
precisions = []
|
||||
relevant_count = 0
|
||||
|
||||
for rank, chunk in enumerate(retrieved_chunks, 1):
|
||||
if chunk.get('id') in relevant_ids:
|
||||
relevant_count += 1
|
||||
precision_at_rank = relevant_count / rank
|
||||
precisions.append(precision_at_rank)
|
||||
|
||||
return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0
|
||||
|
||||
def calculate_map(queries: List[Dict], results: List[List[Dict]]) -> float:
|
||||
"""
|
||||
Calculate Mean Average Precision across multiple queries
|
||||
"""
|
||||
average_precisions = []
|
||||
|
||||
for query, query_results in zip(queries, results):
|
||||
ap = calculate_average_precision(query_results, query.get('relevant_chunks', []))
|
||||
average_precisions.append(ap)
|
||||
|
||||
return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
|
||||
```
|
||||
|
||||
### Normalized Discounted Cumulative Gain (NDCG)
|
||||
Measures ranking quality with emphasis on highly relevant results.
|
||||
|
||||
```python
|
||||
def calculate_dcg(retrieved_chunks: List[Dict]) -> float:
|
||||
"""
|
||||
Calculate Discounted Cumulative Gain
|
||||
"""
|
||||
dcg = 0.0
|
||||
|
||||
for rank, chunk in enumerate(retrieved_chunks, 1):
|
||||
relevance = chunk.get('relevance_score', 0)
|
||||
dcg += relevance / np.log2(rank + 1)
|
||||
|
||||
return dcg
|
||||
|
||||
def calculate_ndcg(retrieved_chunks: List[Dict], ideal_chunks: List[Dict]) -> float:
|
||||
"""
|
||||
Calculate Normalized Discounted Cumulative Gain
|
||||
"""
|
||||
dcg = calculate_dcg(retrieved_chunks)
|
||||
idcg = calculate_dcg(ideal_chunks)
|
||||
|
||||
if idcg == 0:
|
||||
return 0.0
|
||||
|
||||
return dcg / idcg
|
||||
```
|
||||
|
||||
## End-to-End RAG Evaluation
|
||||
|
||||
### Answer Quality Metrics
|
||||
|
||||
#### Factual Consistency
|
||||
Measures how well the generated answer aligns with retrieved chunks.
|
||||
|
||||
```python
|
||||
import spacy
|
||||
from transformers import pipeline
|
||||
|
||||
class FactualConsistencyEvaluator:
|
||||
def __init__(self):
|
||||
self.nlp = spacy.load("en_core_web_sm")
|
||||
self.nli_pipeline = pipeline("text-classification",
|
||||
model="roberta-large-mnli")
|
||||
|
||||
def evaluate_consistency(self, answer: str, retrieved_chunks: List[str]) -> float:
|
||||
"""
|
||||
Evaluate factual consistency between answer and retrieved context
|
||||
"""
|
||||
if not retrieved_chunks:
|
||||
return 0.0
|
||||
|
||||
# Combine retrieved chunks as context
|
||||
context = " ".join(retrieved_chunks[:3]) # Use top 3 chunks
|
||||
|
||||
# Use Natural Language Inference to check consistency
|
||||
result = self.nli_pipeline(f"premise: {context} hypothesis: {answer}")
|
||||
|
||||
# Extract consistency score (entailment probability)
|
||||
for item in result:
|
||||
if item['label'] == 'ENTAILMENT':
|
||||
return item['score']
|
||||
elif item['label'] == 'CONTRADICTION':
|
||||
return 1.0 - item['score']
|
||||
|
||||
return 0.5 # Neutral if NLI is inconclusive
|
||||
```
|
||||
|
||||
#### Answer Completeness
|
||||
Measures how completely the answer addresses the user's query.
|
||||
|
||||
```python
|
||||
def evaluate_completeness(answer: str, query: str, reference_answer: str = None) -> float:
|
||||
"""
|
||||
Evaluate answer completeness
|
||||
"""
|
||||
# Extract key entities from query
|
||||
query_entities = extract_entities(query)
|
||||
answer_entities = extract_entities(answer)
|
||||
|
||||
# Calculate entity coverage
|
||||
if not query_entities:
|
||||
return 0.5 # Neutral if no entities in query
|
||||
|
||||
covered_entities = query_entities & answer_entities
|
||||
entity_coverage = len(covered_entities) / len(query_entities)
|
||||
|
||||
# If reference answer is available, compare against it
|
||||
if reference_answer:
|
||||
reference_entities = extract_entities(reference_answer)
|
||||
answer_reference_overlap = len(answer_entities & reference_entities) / max(len(reference_entities), 1)
|
||||
return (entity_coverage + answer_reference_overlap) / 2
|
||||
|
||||
return entity_coverage
|
||||
|
||||
def extract_entities(text: str) -> set:
|
||||
"""
|
||||
Extract named entities from text (simplified)
|
||||
"""
|
||||
# This would use a proper NER model in practice
|
||||
import re
|
||||
|
||||
# Simple noun phrase extraction as placeholder
|
||||
noun_phrases = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text)
|
||||
return set(noun_phrases)
|
||||
```
|
||||
|
||||
#### Response Relevance
|
||||
Measures how relevant the answer is to the original query.
|
||||
|
||||
```python
|
||||
from sentence_transformers import SentenceTransformer
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
class RelevanceEvaluator:
|
||||
def __init__(self, model_name="all-MiniLM-L6-v2"):
|
||||
self.model = SentenceTransformer(model_name)
|
||||
|
||||
def evaluate_relevance(self, query: str, answer: str) -> float:
|
||||
"""
|
||||
Evaluate semantic relevance between query and answer
|
||||
"""
|
||||
# Generate embeddings
|
||||
query_embedding = self.model.encode([query])
|
||||
answer_embedding = self.model.encode([answer])
|
||||
|
||||
# Calculate cosine similarity
|
||||
similarity = cosine_similarity(query_embedding, answer_embedding)[0][0]
|
||||
|
||||
return float(similarity)
|
||||
```
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
### Processing Time
|
||||
|
||||
```python
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import List, Dict
|
||||
|
||||
@dataclass
|
||||
class PerformanceMetrics:
|
||||
total_time: float
|
||||
chunking_time: float
|
||||
embedding_time: float
|
||||
search_time: float
|
||||
generation_time: float
|
||||
throughput: float # documents per second
|
||||
|
||||
class PerformanceProfiler:
|
||||
def __init__(self):
|
||||
self.timings = {}
|
||||
self.start_times = {}
|
||||
|
||||
def start_timer(self, operation: str):
|
||||
self.start_times[operation] = time.time()
|
||||
|
||||
def end_timer(self, operation: str):
|
||||
if operation in self.start_times:
|
||||
duration = time.time() - self.start_times[operation]
|
||||
if operation not in self.timings:
|
||||
self.timings[operation] = []
|
||||
self.timings[operation].append(duration)
|
||||
return duration
|
||||
return 0.0
|
||||
|
||||
def get_performance_metrics(self, document_count: int) -> PerformanceMetrics:
|
||||
total_time = sum(sum(times) for times in self.timings.values())
|
||||
|
||||
return PerformanceMetrics(
|
||||
total_time=total_time,
|
||||
chunking_time=sum(self.timings.get('chunking', [0])),
|
||||
embedding_time=sum(self.timings.get('embedding', [0])),
|
||||
search_time=sum(self.timings.get('search', [0])),
|
||||
generation_time=sum(self.timings.get('generation', [0])),
|
||||
throughput=document_count / total_time if total_time > 0 else 0
|
||||
)
|
||||
```
|
||||
|
||||
### Memory Usage
|
||||
|
||||
```python
|
||||
import psutil
|
||||
import os
|
||||
from typing import Dict, List
|
||||
|
||||
class MemoryProfiler:
|
||||
def __init__(self):
|
||||
self.process = psutil.Process(os.getpid())
|
||||
self.memory_snapshots = []
|
||||
|
||||
def take_memory_snapshot(self, label: str):
|
||||
"""Take a snapshot of current memory usage"""
|
||||
memory_info = self.process.memory_info()
|
||||
memory_mb = memory_info.rss / 1024 / 1024 # Convert to MB
|
||||
|
||||
self.memory_snapshots.append({
|
||||
'label': label,
|
||||
'memory_mb': memory_mb,
|
||||
'timestamp': time.time()
|
||||
})
|
||||
|
||||
def get_peak_memory_usage(self) -> float:
|
||||
"""Get peak memory usage in MB"""
|
||||
if not self.memory_snapshots:
|
||||
return 0.0
|
||||
return max(snapshot['memory_mb'] for snapshot in self.memory_snapshots)
|
||||
|
||||
def get_memory_usage_by_operation(self) -> Dict[str, float]:
|
||||
"""Get memory usage breakdown by operation"""
|
||||
if not self.memory_snapshots:
|
||||
return {}
|
||||
|
||||
memory_by_op = {}
|
||||
for i in range(1, len(self.memory_snapshots)):
|
||||
prev_snapshot = self.memory_snapshots[i-1]
|
||||
curr_snapshot = self.memory_snapshots[i]
|
||||
|
||||
operation = curr_snapshot['label']
|
||||
memory_delta = curr_snapshot['memory_mb'] - prev_snapshot['memory_mb']
|
||||
|
||||
if operation not in memory_by_op:
|
||||
memory_by_op[operation] = []
|
||||
memory_by_op[operation].append(memory_delta)
|
||||
|
||||
return {op: sum(deltas) for op, deltas in memory_by_op.items()}
|
||||
```
|
||||
|
||||
## Evaluation Datasets
|
||||
|
||||
### Standardized Test Sets
|
||||
|
||||
#### Question-Answer Pairs
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass
|
||||
from typing import List, Optional
|
||||
import json
|
||||
|
||||
@dataclass
|
||||
class EvaluationQuery:
|
||||
id: str
|
||||
question: str
|
||||
reference_answer: Optional[str]
|
||||
relevant_chunk_ids: List[str]
|
||||
query_type: str # factoid, analytical, comparative
|
||||
difficulty: str # easy, medium, hard
|
||||
domain: str # finance, medical, legal, technical
|
||||
|
||||
class EvaluationDataset:
|
||||
def __init__(self, name: str):
|
||||
self.name = name
|
||||
self.queries: List[EvaluationQuery] = []
|
||||
self.documents: Dict[str, str] = {}
|
||||
self.chunks: Dict[str, Dict] = {}
|
||||
|
||||
def add_query(self, query: EvaluationQuery):
|
||||
self.queries.append(query)
|
||||
|
||||
def add_document(self, doc_id: str, content: str):
|
||||
self.documents[doc_id] = content
|
||||
|
||||
def add_chunk(self, chunk_id: str, content: str, doc_id: str, metadata: Dict):
|
||||
self.chunks[chunk_id] = {
|
||||
'id': chunk_id,
|
||||
'content': content,
|
||||
'doc_id': doc_id,
|
||||
'metadata': metadata
|
||||
}
|
||||
|
||||
def save_to_file(self, filepath: str):
|
||||
data = {
|
||||
'name': self.name,
|
||||
'queries': [
|
||||
{
|
||||
'id': q.id,
|
||||
'question': q.question,
|
||||
'reference_answer': q.reference_answer,
|
||||
'relevant_chunk_ids': q.relevant_chunk_ids,
|
||||
'query_type': q.query_type,
|
||||
'difficulty': q.difficulty,
|
||||
'domain': q.domain
|
||||
}
|
||||
for q in self.queries
|
||||
],
|
||||
'documents': self.documents,
|
||||
'chunks': self.chunks
|
||||
}
|
||||
|
||||
with open(filepath, 'w') as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
@classmethod
|
||||
def load_from_file(cls, filepath: str):
|
||||
with open(filepath, 'r') as f:
|
||||
data = json.load(f)
|
||||
|
||||
dataset = cls(data['name'])
|
||||
dataset.documents = data['documents']
|
||||
dataset.chunks = data['chunks']
|
||||
|
||||
for q_data in data['queries']:
|
||||
query = EvaluationQuery(
|
||||
id=q_data['id'],
|
||||
question=q_data['question'],
|
||||
reference_answer=q_data.get('reference_answer'),
|
||||
relevant_chunk_ids=q_data['relevant_chunk_ids'],
|
||||
query_type=q_data['query_type'],
|
||||
difficulty=q_data['difficulty'],
|
||||
domain=q_data['domain']
|
||||
)
|
||||
dataset.add_query(query)
|
||||
|
||||
return dataset
|
||||
```
|
||||
|
||||
### Dataset Generation
|
||||
|
||||
#### Synthetic Query Generation
|
||||
|
||||
```python
|
||||
import random
|
||||
from typing import List, Dict
|
||||
|
||||
class SyntheticQueryGenerator:
|
||||
def __init__(self):
|
||||
self.query_templates = {
|
||||
'factoid': [
|
||||
"What is {concept}?",
|
||||
"When did {event} occur?",
|
||||
"Who developed {technology}?",
|
||||
"How many {items} are mentioned?",
|
||||
"What is the value of {metric}?"
|
||||
],
|
||||
'analytical': [
|
||||
"Compare and contrast {concept1} and {concept2}.",
|
||||
"Analyze the impact of {concept} on {domain}.",
|
||||
"What are the advantages and disadvantages of {technology}?",
|
||||
"Explain the relationship between {concept1} and {concept2}.",
|
||||
"Evaluate the effectiveness of {approach} for {problem}."
|
||||
],
|
||||
'comparative': [
|
||||
"Which is better: {option1} or {option2}?",
|
||||
"How does {method1} differ from {method2}?",
|
||||
"Compare the performance of {system1} and {system2}.",
|
||||
"What are the key differences between {approach1} and {approach2}?"
|
||||
]
|
||||
}
|
||||
|
||||
def generate_queries_from_chunks(self, chunks: List[Dict], num_queries: int = 100) -> List[EvaluationQuery]:
|
||||
"""Generate synthetic queries from document chunks"""
|
||||
queries = []
|
||||
|
||||
# Extract entities and concepts from chunks
|
||||
entities = self._extract_entities_from_chunks(chunks)
|
||||
|
||||
for i in range(num_queries):
|
||||
query_type = random.choice(['factoid', 'analytical', 'comparative'])
|
||||
template = random.choice(self.query_templates[query_type])
|
||||
|
||||
# Fill template with extracted entities
|
||||
query_text = self._fill_template(template, entities)
|
||||
|
||||
# Find relevant chunks for this query
|
||||
relevant_chunks = self._find_relevant_chunks(query_text, chunks)
|
||||
|
||||
query = EvaluationQuery(
|
||||
id=f"synthetic_{i}",
|
||||
question=query_text,
|
||||
reference_answer=None, # Would need generation model
|
||||
relevant_chunk_ids=[chunk['id'] for chunk in relevant_chunks],
|
||||
query_type=query_type,
|
||||
difficulty=random.choice(['easy', 'medium', 'hard']),
|
||||
domain='synthetic'
|
||||
)
|
||||
|
||||
queries.append(query)
|
||||
|
||||
return queries
|
||||
|
||||
def _extract_entities_from_chunks(self, chunks: List[Dict]) -> Dict[str, List[str]]:
|
||||
"""Extract entities, concepts, and relationships from chunks"""
|
||||
# This would use proper NER in practice
|
||||
entities = {
|
||||
'concepts': [],
|
||||
'technologies': [],
|
||||
'methods': [],
|
||||
'metrics': [],
|
||||
'events': []
|
||||
}
|
||||
|
||||
for chunk in chunks:
|
||||
content = chunk['content']
|
||||
# Simplified entity extraction
|
||||
words = content.split()
|
||||
entities['concepts'].extend([word for word in words if len(word) > 6])
|
||||
entities['technologies'].extend([word for word in words if 'technology' in word.lower()])
|
||||
entities['methods'].extend([word for word in words if 'method' in word.lower()])
|
||||
entities['metrics'].extend([word for word in words if '%' in word or '$' in word])
|
||||
|
||||
# Remove duplicates and limit
|
||||
for key in entities:
|
||||
entities[key] = list(set(entities[key]))[:50]
|
||||
|
||||
return entities
|
||||
|
||||
def _fill_template(self, template: str, entities: Dict[str, List[str]]) -> str:
|
||||
"""Fill query template with random entities"""
|
||||
import re
|
||||
|
||||
def replace_placeholder(match):
|
||||
placeholder = match.group(1)
|
||||
|
||||
# Map placeholders to entity types
|
||||
entity_mapping = {
|
||||
'concept': 'concepts',
|
||||
'concept1': 'concepts',
|
||||
'concept2': 'concepts',
|
||||
'technology': 'technologies',
|
||||
'method': 'methods',
|
||||
'method1': 'methods',
|
||||
'method2': 'methods',
|
||||
'metric': 'metrics',
|
||||
'event': 'events',
|
||||
'items': 'concepts',
|
||||
'option1': 'concepts',
|
||||
'option2': 'concepts',
|
||||
'approach': 'methods',
|
||||
'problem': 'concepts',
|
||||
'domain': 'concepts',
|
||||
'system1': 'concepts',
|
||||
'system2': 'concepts'
|
||||
}
|
||||
|
||||
entity_type = entity_mapping.get(placeholder, 'concepts')
|
||||
available_entities = entities.get(entity_type, ['something'])
|
||||
|
||||
if available_entities:
|
||||
return random.choice(available_entities)
|
||||
else:
|
||||
return 'something'
|
||||
|
||||
return re.sub(r'\{(\w+)\}', replace_placeholder, template)
|
||||
|
||||
def _find_relevant_chunks(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
|
||||
"""Find chunks most relevant to the query"""
|
||||
# Simple keyword matching for synthetic generation
|
||||
query_words = set(query.lower().split())
|
||||
|
||||
chunk_scores = []
|
||||
for chunk in chunks:
|
||||
chunk_words = set(chunk['content'].lower().split())
|
||||
overlap = len(query_words & chunk_words)
|
||||
chunk_scores.append((overlap, chunk))
|
||||
|
||||
# Sort by overlap and return top k
|
||||
chunk_scores.sort(key=lambda x: x[0], reverse=True)
|
||||
return [chunk for _, chunk in chunk_scores[:k]]
|
||||
```
|
||||
|
||||
## A/B Testing Framework
|
||||
|
||||
### Statistical Significance Testing
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from scipy import stats
|
||||
from typing import List, Dict, Tuple
|
||||
|
||||
class ABTestAnalyzer:
|
||||
def __init__(self):
|
||||
self.significance_level = 0.05
|
||||
|
||||
def compare_metrics(self, control_metrics: List[float],
|
||||
treatment_metrics: List[float],
|
||||
metric_name: str) -> Dict:
|
||||
"""
|
||||
Compare metrics between control and treatment groups
|
||||
"""
|
||||
control_mean = np.mean(control_metrics)
|
||||
treatment_mean = np.mean(treatment_metrics)
|
||||
|
||||
control_std = np.std(control_metrics)
|
||||
treatment_std = np.std(treatment_metrics)
|
||||
|
||||
# Perform t-test
|
||||
t_statistic, p_value = stats.ttest_ind(control_metrics, treatment_metrics)
|
||||
|
||||
# Calculate effect size (Cohen's d)
|
||||
pooled_std = np.sqrt(((len(control_metrics) - 1) * control_std**2 +
|
||||
(len(treatment_metrics) - 1) * treatment_std**2) /
|
||||
(len(control_metrics) + len(treatment_metrics) - 2))
|
||||
|
||||
cohens_d = (treatment_mean - control_mean) / pooled_std if pooled_std > 0 else 0
|
||||
|
||||
# Determine significance
|
||||
is_significant = p_value < self.significance_level
|
||||
|
||||
return {
|
||||
'metric_name': metric_name,
|
||||
'control_mean': control_mean,
|
||||
'treatment_mean': treatment_mean,
|
||||
'absolute_difference': treatment_mean - control_mean,
|
||||
'relative_difference': ((treatment_mean - control_mean) / control_mean * 100) if control_mean != 0 else 0,
|
||||
'control_std': control_std,
|
||||
'treatment_std': treatment_std,
|
||||
't_statistic': t_statistic,
|
||||
'p_value': p_value,
|
||||
'is_significant': is_significant,
|
||||
'effect_size': cohens_d,
|
||||
'significance_level': self.significance_level
|
||||
}
|
||||
|
||||
def analyze_ab_test_results(self,
|
||||
control_results: Dict[str, List[float]],
|
||||
treatment_results: Dict[str, List[float]]) -> Dict:
|
||||
"""
|
||||
Analyze A/B test results across multiple metrics
|
||||
"""
|
||||
analysis_results = {}
|
||||
|
||||
# Ensure both dictionaries have the same keys
|
||||
all_metrics = set(control_results.keys()) & set(treatment_results.keys())
|
||||
|
||||
for metric in all_metrics:
|
||||
if metric in control_results and metric in treatment_results:
|
||||
analysis_results[metric] = self.compare_metrics(
|
||||
control_results[metric],
|
||||
treatment_results[metric],
|
||||
metric
|
||||
)
|
||||
|
||||
# Calculate overall summary
|
||||
significant_improvements = sum(1 for result in analysis_results.values()
|
||||
if result['is_significant'] and result['relative_difference'] > 0)
|
||||
significant_degradations = sum(1 for result in analysis_results.values()
|
||||
if result['is_significant'] and result['relative_difference'] < 0)
|
||||
|
||||
analysis_results['summary'] = {
|
||||
'total_metrics_compared': len(analysis_results),
|
||||
'significant_improvements': significant_improvements,
|
||||
'significant_degradations': significant_degradations,
|
||||
'no_significant_change': len(analysis_results) - significant_improvements - significant_degradations
|
||||
}
|
||||
|
||||
return analysis_results
|
||||
```
|
||||
|
||||
## Automated Evaluation Pipeline
|
||||
|
||||
### End-to-End Evaluation
|
||||
|
||||
```python
|
||||
class ChunkingEvaluationPipeline:
|
||||
def __init__(self, strategies: Dict[str, Any], dataset: EvaluationDataset):
|
||||
self.strategies = strategies
|
||||
self.dataset = dataset
|
||||
self.results = {}
|
||||
self.profiler = PerformanceProfiler()
|
||||
self.memory_profiler = MemoryProfiler()
|
||||
|
||||
def run_evaluation(self) -> Dict:
|
||||
"""Run comprehensive evaluation of all strategies"""
|
||||
evaluation_results = {}
|
||||
|
||||
for strategy_name, strategy in self.strategies.items():
|
||||
print(f"Evaluating strategy: {strategy_name}")
|
||||
|
||||
# Reset profilers for each strategy
|
||||
self.profiler = PerformanceProfiler()
|
||||
self.memory_profiler = MemoryProfiler()
|
||||
|
||||
# Evaluate strategy
|
||||
strategy_results = self._evaluate_strategy(strategy, strategy_name)
|
||||
evaluation_results[strategy_name] = strategy_results
|
||||
|
||||
# Compare strategies
|
||||
comparison_results = self._compare_strategies(evaluation_results)
|
||||
|
||||
return {
|
||||
'individual_results': evaluation_results,
|
||||
'comparison': comparison_results,
|
||||
'recommendations': self._generate_recommendations(comparison_results)
|
||||
}
|
||||
|
||||
def _evaluate_strategy(self, strategy: Any, strategy_name: str) -> Dict:
|
||||
"""Evaluate a single chunking strategy"""
|
||||
results = {
|
||||
'strategy_name': strategy_name,
|
||||
'retrieval_metrics': {},
|
||||
'quality_metrics': {},
|
||||
'performance_metrics': {}
|
||||
}
|
||||
|
||||
# Track memory usage
|
||||
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_start")
|
||||
|
||||
# Process all documents
|
||||
self.profiler.start_timer('total_processing')
|
||||
|
||||
all_chunks = {}
|
||||
for doc_id, content in self.dataset.documents.items():
|
||||
self.profiler.start_timer('chunking')
|
||||
chunks = strategy.chunk(content)
|
||||
self.profiler.end_timer('chunking')
|
||||
|
||||
all_chunks[doc_id] = chunks
|
||||
|
||||
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_chunking")
|
||||
|
||||
# Generate embeddings for chunks
|
||||
self.profiler.start_timer('embedding')
|
||||
chunk_embeddings = self._generate_embeddings(all_chunks)
|
||||
self.profiler.end_timer('embedding')
|
||||
|
||||
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_after_embedding")
|
||||
|
||||
# Evaluate retrieval performance
|
||||
retrieval_results = self._evaluate_retrieval(all_chunks, chunk_embeddings)
|
||||
results['retrieval_metrics'] = retrieval_results
|
||||
|
||||
# Evaluate chunk quality
|
||||
quality_results = self._evaluate_chunk_quality(all_chunks)
|
||||
results['quality_metrics'] = quality_results
|
||||
|
||||
# Get performance metrics
|
||||
self.profiler.end_timer('total_processing')
|
||||
performance_metrics = self.profiler.get_performance_metrics(len(self.dataset.documents))
|
||||
results['performance_metrics'] = performance_metrics.__dict__
|
||||
|
||||
# Get memory metrics
|
||||
self.memory_profiler.take_memory_snapshot(f"{strategy_name}_end")
|
||||
results['memory_metrics'] = {
|
||||
'peak_memory_mb': self.memory_profiler.get_peak_memory_usage(),
|
||||
'memory_by_operation': self.memory_profiler.get_memory_usage_by_operation()
|
||||
}
|
||||
|
||||
return results
|
||||
|
||||
def _evaluate_retrieval(self, all_chunks: Dict, chunk_embeddings: Dict) -> Dict:
|
||||
"""Evaluate retrieval performance"""
|
||||
retrieval_metrics = {
|
||||
'precision': [],
|
||||
'recall': [],
|
||||
'f1_score': [],
|
||||
'mrr': [],
|
||||
'map': []
|
||||
}
|
||||
|
||||
for query in self.dataset.queries:
|
||||
# Perform retrieval
|
||||
self.profiler.start_timer('search')
|
||||
retrieved_chunks = self._retrieve_chunks(query.question, chunk_embeddings, k=10)
|
||||
self.profiler.end_timer('search')
|
||||
|
||||
# Get relevant chunks for this query
|
||||
relevant_chunk_ids = set(query.relevant_chunk_ids)
|
||||
relevant_chunks = [chunk for chunk in retrieved_chunks
|
||||
if chunk.get('id') in relevant_chunk_ids]
|
||||
|
||||
# Calculate metrics
|
||||
precision = calculate_precision(retrieved_chunks, relevant_chunks)
|
||||
recall = calculate_recall(retrieved_chunks, relevant_chunks)
|
||||
f1 = calculate_f1_score(precision, recall)
|
||||
|
||||
retrieval_metrics['precision'].append(precision)
|
||||
retrieval_metrics['recall'].append(recall)
|
||||
retrieval_metrics['f1_score'].append(f1)
|
||||
|
||||
# Calculate averages
|
||||
return {metric: np.mean(values) for metric, values in retrieval_metrics.items()}
|
||||
|
||||
def _evaluate_chunk_quality(self, all_chunks: Dict) -> Dict:
|
||||
"""Evaluate quality of generated chunks"""
|
||||
quality_assessor = ChunkQualityAssessor()
|
||||
quality_scores = []
|
||||
|
||||
for doc_id, chunks in all_chunks.items():
|
||||
# Analyze document
|
||||
content = self.dataset.documents[doc_id]
|
||||
analyzer = DocumentAnalyzer()
|
||||
analysis = analyzer.analyze(content)
|
||||
|
||||
# Assess chunk quality
|
||||
scores = quality_assessor.assess_chunks(chunks, analysis)
|
||||
quality_scores.append(scores)
|
||||
|
||||
# Aggregate quality scores
|
||||
if quality_scores:
|
||||
avg_scores = {}
|
||||
for metric in quality_scores[0].keys():
|
||||
avg_scores[metric] = np.mean([scores[metric] for scores in quality_scores])
|
||||
return avg_scores
|
||||
|
||||
return {}
|
||||
|
||||
def _compare_strategies(self, evaluation_results: Dict) -> Dict:
|
||||
"""Compare performance across strategies"""
|
||||
ab_analyzer = ABTestAnalyzer()
|
||||
|
||||
comparison = {}
|
||||
|
||||
# Compare each metric across strategies
|
||||
strategy_names = list(evaluation_results.keys())
|
||||
|
||||
for i in range(len(strategy_names)):
|
||||
for j in range(i + 1, len(strategy_names)):
|
||||
strategy1 = strategy_names[i]
|
||||
strategy2 = strategy_names[j]
|
||||
|
||||
comparison_key = f"{strategy1}_vs_{strategy2}"
|
||||
comparison[comparison_key] = {}
|
||||
|
||||
# Compare retrieval metrics
|
||||
for metric in ['precision', 'recall', 'f1_score']:
|
||||
if (metric in evaluation_results[strategy1]['retrieval_metrics'] and
|
||||
metric in evaluation_results[strategy2]['retrieval_metrics']):
|
||||
|
||||
comparison[comparison_key][f"retrieval_{metric}"] = ab_analyzer.compare_metrics(
|
||||
[evaluation_results[strategy1]['retrieval_metrics'][metric]],
|
||||
[evaluation_results[strategy2]['retrieval_metrics'][metric]],
|
||||
f"retrieval_{metric}"
|
||||
)
|
||||
|
||||
return comparison
|
||||
|
||||
def _generate_recommendations(self, comparison_results: Dict) -> Dict:
|
||||
"""Generate recommendations based on evaluation results"""
|
||||
recommendations = {
|
||||
'best_overall': None,
|
||||
'best_for_precision': None,
|
||||
'best_for_recall': None,
|
||||
'best_for_performance': None,
|
||||
'trade_offs': []
|
||||
}
|
||||
|
||||
# This would analyze the comparison results and generate specific recommendations
|
||||
# Implementation depends on specific use case requirements
|
||||
|
||||
return recommendations
|
||||
|
||||
def _generate_embeddings(self, all_chunks: Dict) -> Dict:
|
||||
"""Generate embeddings for all chunks"""
|
||||
# This would use the actual embedding model
|
||||
# Placeholder implementation
|
||||
embeddings = {}
|
||||
|
||||
for doc_id, chunks in all_chunks.items():
|
||||
embeddings[doc_id] = []
|
||||
for chunk in chunks:
|
||||
# Generate embedding for chunk content
|
||||
embedding = np.random.rand(384) # Placeholder
|
||||
embeddings[doc_id].append({
|
||||
'chunk': chunk,
|
||||
'embedding': embedding
|
||||
})
|
||||
|
||||
return embeddings
|
||||
|
||||
def _retrieve_chunks(self, query: str, chunk_embeddings: Dict, k: int = 10) -> List[Dict]:
|
||||
"""Retrieve most relevant chunks for a query"""
|
||||
# This would use actual similarity search
|
||||
# Placeholder implementation
|
||||
all_chunks = []
|
||||
|
||||
for doc_embeddings in chunk_embeddings.values():
|
||||
for chunk_data in doc_embeddings:
|
||||
all_chunks.append(chunk_data['chunk'])
|
||||
|
||||
# Simple random selection as placeholder
|
||||
selected = random.sample(all_chunks, min(k, len(all_chunks)))
|
||||
|
||||
return selected
|
||||
```
|
||||
|
||||
This comprehensive evaluation framework provides the tools needed to thoroughly assess chunking strategies across multiple dimensions: retrieval effectiveness, answer quality, system performance, and statistical significance. The modular design allows for easy extension and customization based on specific requirements and use cases.
|
||||
709
skills/ai/chunking-strategy/references/implementation.md
Normal file
709
skills/ai/chunking-strategy/references/implementation.md
Normal file
@@ -0,0 +1,709 @@
|
||||
# Complete Implementation Guidelines
|
||||
|
||||
This document provides comprehensive implementation guidance for building effective chunking systems.
|
||||
|
||||
## System Architecture
|
||||
|
||||
### Core Components
|
||||
|
||||
```
|
||||
Document Processor
|
||||
├── Ingestion Layer
|
||||
│ ├── Document Type Detection
|
||||
│ ├── Format Parsing (PDF, HTML, Markdown, etc.)
|
||||
│ └── Content Extraction
|
||||
├── Analysis Layer
|
||||
│ ├── Structure Analysis
|
||||
│ ├── Content Type Identification
|
||||
│ └── Complexity Assessment
|
||||
├── Strategy Selection Layer
|
||||
│ ├── Rule-based Selection
|
||||
│ ├── ML-based Prediction
|
||||
│ └── Adaptive Configuration
|
||||
├── Chunking Layer
|
||||
│ ├── Strategy Implementation
|
||||
│ ├── Parameter Optimization
|
||||
│ └── Quality Validation
|
||||
└── Output Layer
|
||||
├── Chunk Metadata Generation
|
||||
├── Embedding Integration
|
||||
└── Storage Preparation
|
||||
```
|
||||
|
||||
## Pre-processing Pipeline
|
||||
|
||||
### Document Analysis Framework
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass
|
||||
from typing import List, Dict, Any
|
||||
import re
|
||||
|
||||
@dataclass
|
||||
class DocumentAnalysis:
|
||||
doc_type: str
|
||||
structure_score: float # 0-1, higher means more structured
|
||||
complexity_score: float # 0-1, higher means more complex
|
||||
content_types: List[str]
|
||||
language: str
|
||||
estimated_tokens: int
|
||||
has_multimodal: bool
|
||||
|
||||
class DocumentAnalyzer:
|
||||
def __init__(self):
|
||||
self.structure_patterns = {
|
||||
'markdown': [r'^#+\s', r'^\*\*.*\*\*$', r'^\* ', r'^\d+\. '],
|
||||
'html': [r'<h[1-6]>', r'<p>', r'<div>', r'<table>'],
|
||||
'latex': [r'\\section', r'\\subsection', r'\\begin\{', r'\\end\{'],
|
||||
'academic': [r'^\d+\.', r'^\d+\.\d+', r'^[A-Z]\.', r'^Figure \d+']
|
||||
}
|
||||
|
||||
def analyze(self, content: str) -> DocumentAnalysis:
|
||||
doc_type = self.detect_document_type(content)
|
||||
structure_score = self.calculate_structure_score(content, doc_type)
|
||||
complexity_score = self.calculate_complexity_score(content)
|
||||
content_types = self.identify_content_types(content)
|
||||
language = self.detect_language(content)
|
||||
estimated_tokens = self.estimate_tokens(content)
|
||||
has_multimodal = self.detect_multimodal_content(content)
|
||||
|
||||
return DocumentAnalysis(
|
||||
doc_type=doc_type,
|
||||
structure_score=structure_score,
|
||||
complexity_score=complexity_score,
|
||||
content_types=content_types,
|
||||
language=language,
|
||||
estimated_tokens=estimated_tokens,
|
||||
has_multimodal=has_multimodal
|
||||
)
|
||||
|
||||
def detect_document_type(self, content: str) -> str:
|
||||
content_lower = content.lower()
|
||||
|
||||
if '<html' in content_lower or '<body' in content_lower:
|
||||
return 'html'
|
||||
elif '#' in content and '##' in content:
|
||||
return 'markdown'
|
||||
elif '\\documentclass' in content_lower or '\\begin{' in content_lower:
|
||||
return 'latex'
|
||||
elif any(keyword in content_lower for keyword in ['abstract', 'introduction', 'conclusion', 'references']):
|
||||
return 'academic'
|
||||
elif 'def ' in content or 'class ' in content or 'function ' in content_lower:
|
||||
return 'code'
|
||||
else:
|
||||
return 'plain'
|
||||
|
||||
def calculate_structure_score(self, content: str, doc_type: str) -> float:
|
||||
patterns = self.structure_patterns.get(doc_type, [])
|
||||
if not patterns:
|
||||
return 0.5 # Default for unstructured content
|
||||
|
||||
line_count = len(content.split('\n'))
|
||||
structured_lines = 0
|
||||
|
||||
for line in content.split('\n'):
|
||||
for pattern in patterns:
|
||||
if re.search(pattern, line.strip()):
|
||||
structured_lines += 1
|
||||
break
|
||||
|
||||
return min(structured_lines / max(line_count, 1), 1.0)
|
||||
|
||||
def calculate_complexity_score(self, content: str) -> float:
|
||||
# Factors that increase complexity
|
||||
avg_sentence_length = self.calculate_avg_sentence_length(content)
|
||||
vocabulary_richness = self.calculate_vocabulary_richness(content)
|
||||
nested_structure = self.detect_nested_structure(content)
|
||||
|
||||
# Normalize and combine
|
||||
complexity = (
|
||||
min(avg_sentence_length / 30, 1.0) * 0.3 +
|
||||
vocabulary_richness * 0.4 +
|
||||
nested_structure * 0.3
|
||||
)
|
||||
|
||||
return min(complexity, 1.0)
|
||||
|
||||
def identify_content_types(self, content: str) -> List[str]:
|
||||
types = []
|
||||
|
||||
if '```' in content or 'def ' in content or 'function ' in content.lower():
|
||||
types.append('code')
|
||||
if '|' in content and '\n' in content:
|
||||
types.append('tables')
|
||||
if re.search(r'\!\[.*\]\(.*\)', content):
|
||||
types.append('images')
|
||||
if re.search(r'http[s]?://', content):
|
||||
types.append('links')
|
||||
if re.search(r'\d+\.\d+', content) or re.search(r'\$\d', content):
|
||||
types.append('numbers')
|
||||
|
||||
return types if types else ['text']
|
||||
|
||||
def detect_language(self, content: str) -> str:
|
||||
# Simple language detection - can be enhanced with proper language detection libraries
|
||||
if re.search(r'[\u4e00-\u9fff]', content):
|
||||
return 'chinese'
|
||||
elif re.search(r'[u0600-\u06ff]', content):
|
||||
return 'arabic'
|
||||
elif re.search(r'[u0400-\u04ff]', content):
|
||||
return 'russian'
|
||||
else:
|
||||
return 'english' # Default assumption
|
||||
|
||||
def estimate_tokens(self, content: str) -> int:
|
||||
# Rough estimation - actual tokenization varies by model
|
||||
word_count = len(content.split())
|
||||
return int(word_count * 1.3) # Average tokens per word
|
||||
|
||||
def detect_multimodal_content(self, content: str) -> bool:
|
||||
multimodal_indicators = [
|
||||
r'\!\[.*\]\(.*\)', # Images
|
||||
r'<iframe', # Embedded content
|
||||
r'<object', # Embedded objects
|
||||
r'<embed', # Embedded media
|
||||
]
|
||||
|
||||
return any(re.search(pattern, content) for pattern in multimodal_indicators)
|
||||
|
||||
def calculate_avg_sentence_length(self, content: str) -> float:
|
||||
sentences = re.split(r'[.!?]+', content)
|
||||
sentences = [s.strip() for s in sentences if s.strip()]
|
||||
if not sentences:
|
||||
return 0
|
||||
return sum(len(s.split()) for s in sentences) / len(sentences)
|
||||
|
||||
def calculate_vocabulary_richness(self, content: str) -> float:
|
||||
words = content.lower().split()
|
||||
if not words:
|
||||
return 0
|
||||
unique_words = set(words)
|
||||
return len(unique_words) / len(words)
|
||||
|
||||
def detect_nested_structure(self, content: str) -> float:
|
||||
# Detect nested lists, indented content, etc.
|
||||
lines = content.split('\n')
|
||||
indented_lines = 0
|
||||
|
||||
for line in lines:
|
||||
if line.strip() and line.startswith(' '):
|
||||
indented_lines += 1
|
||||
|
||||
return indented_lines / max(len(lines), 1)
|
||||
```
|
||||
|
||||
### Strategy Selection Engine
|
||||
|
||||
```python
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any
|
||||
|
||||
class ChunkingStrategy(ABC):
|
||||
@abstractmethod
|
||||
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
|
||||
pass
|
||||
|
||||
class StrategySelector:
|
||||
def __init__(self):
|
||||
self.strategies = {
|
||||
'fixed_size': FixedSizeStrategy(),
|
||||
'recursive': RecursiveStrategy(),
|
||||
'structure_aware': StructureAwareStrategy(),
|
||||
'semantic': SemanticStrategy(),
|
||||
'adaptive': AdaptiveStrategy()
|
||||
}
|
||||
|
||||
def select_strategy(self, analysis: DocumentAnalysis) -> str:
|
||||
# Rule-based selection logic
|
||||
if analysis.structure_score > 0.8 and analysis.doc_type in ['markdown', 'html', 'latex']:
|
||||
return 'structure_aware'
|
||||
elif analysis.complexity_score > 0.7 and analysis.estimated_tokens < 10000:
|
||||
return 'semantic'
|
||||
elif analysis.doc_type == 'code':
|
||||
return 'structure_aware'
|
||||
elif analysis.structure_score < 0.3:
|
||||
return 'fixed_size'
|
||||
elif analysis.complexity_score > 0.5:
|
||||
return 'recursive'
|
||||
else:
|
||||
return 'adaptive'
|
||||
|
||||
def get_strategy(self, analysis: DocumentAnalysis) -> ChunkingStrategy:
|
||||
strategy_name = self.select_strategy(analysis)
|
||||
return self.strategies[strategy_name]
|
||||
|
||||
# Example strategy implementations
|
||||
class FixedSizeStrategy(ChunkingStrategy):
|
||||
def __init__(self, default_size=512, default_overlap=50):
|
||||
self.default_size = default_size
|
||||
self.default_overlap = default_overlap
|
||||
|
||||
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
|
||||
# Adjust parameters based on analysis
|
||||
if analysis.complexity_score > 0.7:
|
||||
chunk_size = 1024
|
||||
elif analysis.complexity_score < 0.3:
|
||||
chunk_size = 256
|
||||
else:
|
||||
chunk_size = self.default_size
|
||||
|
||||
overlap = int(chunk_size * 0.1) # 10% overlap
|
||||
|
||||
# Implementation here...
|
||||
return self._fixed_size_chunk(content, chunk_size, overlap)
|
||||
|
||||
def _fixed_size_chunk(self, content: str, chunk_size: int, overlap: int) -> List[Dict[str, Any]]:
|
||||
# Implementation using RecursiveCharacterTextSplitter or custom logic
|
||||
pass
|
||||
|
||||
class AdaptiveStrategy(ChunkingStrategy):
|
||||
def chunk(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
|
||||
# Combine multiple strategies based on content characteristics
|
||||
if analysis.structure_score > 0.6:
|
||||
# Use structure-aware for structured parts
|
||||
structured_chunks = self._chunk_structured_parts(content, analysis)
|
||||
else:
|
||||
# Use fixed-size for unstructured parts
|
||||
unstructured_chunks = self._chunk_unstructured_parts(content, analysis)
|
||||
|
||||
# Merge and optimize
|
||||
return self._merge_chunks(structured_chunks + unstructured_chunks)
|
||||
|
||||
def _chunk_structured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
|
||||
# Implementation for structured content
|
||||
pass
|
||||
|
||||
def _chunk_unstructured_parts(self, content: str, analysis: DocumentAnalysis) -> List[Dict[str, Any]]:
|
||||
# Implementation for unstructured content
|
||||
pass
|
||||
|
||||
def _merge_chunks(self, chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||
# Implementation for merging and optimizing chunks
|
||||
pass
|
||||
```
|
||||
|
||||
## Quality Assurance Framework
|
||||
|
||||
### Chunk Quality Metrics
|
||||
|
||||
```python
|
||||
from typing import List, Dict, Any
|
||||
import numpy as np
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
class ChunkQualityAssessor:
|
||||
def __init__(self):
|
||||
self.quality_weights = {
|
||||
'coherence': 0.3,
|
||||
'completeness': 0.25,
|
||||
'size_appropriateness': 0.2,
|
||||
'semantic_similarity': 0.15,
|
||||
'boundary_quality': 0.1
|
||||
}
|
||||
|
||||
def assess_chunks(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> Dict[str, float]:
|
||||
scores = {}
|
||||
|
||||
# Coherence: Do chunks make sense on their own?
|
||||
scores['coherence'] = self._assess_coherence(chunks)
|
||||
|
||||
# Completeness: Do chunks preserve important information?
|
||||
scores['completeness'] = self._assess_completeness(chunks, analysis)
|
||||
|
||||
# Size appropriateness: Are chunks within optimal size range?
|
||||
scores['size_appropriateness'] = self._assess_size(chunks)
|
||||
|
||||
# Semantic similarity: Are chunks thematically consistent?
|
||||
scores['semantic_similarity'] = self._assess_semantic_consistency(chunks)
|
||||
|
||||
# Boundary quality: Are chunk boundaries placed well?
|
||||
scores['boundary_quality'] = self._assess_boundary_quality(chunks)
|
||||
|
||||
# Calculate overall quality score
|
||||
overall_score = sum(
|
||||
score * self.quality_weights[metric]
|
||||
for metric, score in scores.items()
|
||||
)
|
||||
|
||||
scores['overall'] = overall_score
|
||||
return scores
|
||||
|
||||
def _assess_coherence(self, chunks: List[Dict[str, Any]]) -> float:
|
||||
# Simple heuristic-based coherence assessment
|
||||
coherence_scores = []
|
||||
|
||||
for chunk in chunks:
|
||||
content = chunk['content']
|
||||
|
||||
# Check for complete sentences
|
||||
sentences = re.split(r'[.!?]+', content)
|
||||
complete_sentences = sum(1 for s in sentences if s.strip())
|
||||
coherence = complete_sentences / max(len(sentences), 1)
|
||||
|
||||
coherence_scores.append(coherence)
|
||||
|
||||
return np.mean(coherence_scores)
|
||||
|
||||
def _assess_completeness(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
|
||||
# Check if important structural elements are preserved
|
||||
if analysis.doc_type in ['markdown', 'html']:
|
||||
return self._assess_structure_preservation(chunks, analysis)
|
||||
else:
|
||||
return self._assess_content_preservation(chunks)
|
||||
|
||||
def _assess_structure_preservation(self, chunks: List[Dict[str, Any]], analysis: DocumentAnalysis) -> float:
|
||||
# Check if headings, lists, and other structural elements are preserved
|
||||
preserved_elements = 0
|
||||
total_elements = 0
|
||||
|
||||
for chunk in chunks:
|
||||
content = chunk['content']
|
||||
|
||||
# Count preserved structural elements
|
||||
headings = len(re.findall(r'^#+\s', content, re.MULTILINE))
|
||||
lists = len(re.findall(r'^\s*[-*+]\s', content, re.MULTILINE))
|
||||
|
||||
preserved_elements += headings + lists
|
||||
total_elements += 1 # Simplified count
|
||||
|
||||
return preserved_elements / max(total_elements, 1)
|
||||
|
||||
def _assess_content_preservation(self, chunks: List[Dict[str, Any]]) -> float:
|
||||
# Simple check based on content ratio
|
||||
total_content = ''.join(chunk['content'] for chunk in chunks)
|
||||
# This would need comparison with original content
|
||||
return 0.8 # Placeholder
|
||||
|
||||
def _assess_size(self, chunks: List[Dict[str, Any]]) -> float:
|
||||
optimal_min = 100 # tokens
|
||||
optimal_max = 1000 # tokens
|
||||
|
||||
size_scores = []
|
||||
for chunk in chunks:
|
||||
token_count = self._estimate_tokens(chunk['content'])
|
||||
if optimal_min <= token_count <= optimal_max:
|
||||
score = 1.0
|
||||
elif token_count < optimal_min:
|
||||
score = token_count / optimal_min
|
||||
else:
|
||||
score = max(0, 1 - (token_count - optimal_max) / optimal_max)
|
||||
|
||||
size_scores.append(score)
|
||||
|
||||
return np.mean(size_scores)
|
||||
|
||||
def _assess_semantic_consistency(self, chunks: List[Dict[str, Any]]) -> float:
|
||||
# This would require embedding models for actual implementation
|
||||
# Placeholder implementation
|
||||
return 0.7
|
||||
|
||||
def _assess_boundary_quality(self, chunks: List[Dict[str, Any]]) -> float:
|
||||
# Check if boundaries don't split important content
|
||||
boundary_scores = []
|
||||
|
||||
for i, chunk in enumerate(chunks):
|
||||
content = chunk['content']
|
||||
|
||||
# Check for incomplete sentences at boundaries
|
||||
if not content.strip().endswith(('.', '!', '?', '>', '}')):
|
||||
boundary_scores.append(0.5)
|
||||
else:
|
||||
boundary_scores.append(1.0)
|
||||
|
||||
return np.mean(boundary_scores)
|
||||
|
||||
def _estimate_tokens(self, content: str) -> int:
|
||||
# Simple token estimation
|
||||
return len(content.split()) * 4 // 3 # Rough approximation
|
||||
```
|
||||
|
||||
## Error Handling and Edge Cases
|
||||
|
||||
### Robust Error Handling
|
||||
|
||||
```python
|
||||
import logging
|
||||
from typing import Optional, List
|
||||
from dataclasses import dataclass
|
||||
|
||||
@dataclass
|
||||
class ChunkingError:
|
||||
error_type: str
|
||||
message: str
|
||||
chunk_index: Optional[int] = None
|
||||
recovery_action: Optional[str] = None
|
||||
|
||||
class ChunkingErrorHandler:
|
||||
def __init__(self):
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.error_handlers = {
|
||||
'empty_content': self._handle_empty_content,
|
||||
'oversized_chunk': self._handle_oversized_chunk,
|
||||
'encoding_error': self._handle_encoding_error,
|
||||
'memory_error': self._handle_memory_error,
|
||||
'structure_parsing_error': self._handle_structure_parsing_error
|
||||
}
|
||||
|
||||
def handle_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
error_type = self._classify_error(error)
|
||||
handler = self.error_handlers.get(error_type, self._handle_generic_error)
|
||||
return handler(error, context)
|
||||
|
||||
def _classify_error(self, error: Exception) -> str:
|
||||
if isinstance(error, ValueError) and 'empty' in str(error).lower():
|
||||
return 'empty_content'
|
||||
elif isinstance(error, MemoryError):
|
||||
return 'memory_error'
|
||||
elif isinstance(error, UnicodeError):
|
||||
return 'encoding_error'
|
||||
elif 'too large' in str(error).lower():
|
||||
return 'oversized_chunk'
|
||||
elif 'parsing' in str(error).lower():
|
||||
return 'structure_parsing_error'
|
||||
else:
|
||||
return 'generic_error'
|
||||
|
||||
def _handle_empty_content(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Empty content encountered: {error}")
|
||||
return ChunkingError(
|
||||
error_type='empty_content',
|
||||
message=str(error),
|
||||
recovery_action='skip_empty_content'
|
||||
)
|
||||
|
||||
def _handle_oversized_chunk(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Oversized chunk detected: {error}")
|
||||
return ChunkingError(
|
||||
error_type='oversized_chunk',
|
||||
message=str(error),
|
||||
chunk_index=context.get('chunk_index'),
|
||||
recovery_action='reduce_chunk_size'
|
||||
)
|
||||
|
||||
def _handle_encoding_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Encoding error: {error}")
|
||||
return ChunkingError(
|
||||
error_type='encoding_error',
|
||||
message=str(error),
|
||||
recovery_action='fallback_encoding'
|
||||
)
|
||||
|
||||
def _handle_memory_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Memory error during chunking: {error}")
|
||||
return ChunkingError(
|
||||
error_type='memory_error',
|
||||
message=str(error),
|
||||
recovery_action='process_in_batches'
|
||||
)
|
||||
|
||||
def _handle_structure_parsing_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.warning(f"Structure parsing failed: {error}")
|
||||
return ChunkingError(
|
||||
error_type='structure_parsing_error',
|
||||
message=str(error),
|
||||
recovery_action='fallback_to_fixed_size'
|
||||
)
|
||||
|
||||
def _handle_generic_error(self, error: Exception, context: Dict[str, Any]) -> ChunkingError:
|
||||
self.logger.error(f"Unexpected error during chunking: {error}")
|
||||
return ChunkingError(
|
||||
error_type='generic_error',
|
||||
message=str(error),
|
||||
recovery_action='skip_and_continue'
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Caching and Memoization
|
||||
|
||||
```python
|
||||
import hashlib
|
||||
import pickle
|
||||
from functools import lru_cache
|
||||
from typing import Dict, Any, Optional
|
||||
import redis
|
||||
import json
|
||||
|
||||
class ChunkingCache:
|
||||
def __init__(self, redis_url: Optional[str] = None):
|
||||
if redis_url:
|
||||
self.redis_client = redis.from_url(redis_url)
|
||||
else:
|
||||
self.redis_client = None
|
||||
self.local_cache = {}
|
||||
|
||||
def _generate_cache_key(self, content: str, strategy: str, params: Dict[str, Any]) -> str:
|
||||
content_hash = hashlib.md5(content.encode()).hexdigest()
|
||||
params_str = json.dumps(params, sort_keys=True)
|
||||
params_hash = hashlib.md5(params_str.encode()).hexdigest()
|
||||
return f"chunking:{strategy}:{content_hash}:{params_hash}"
|
||||
|
||||
def get(self, content: str, strategy: str, params: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
|
||||
cache_key = self._generate_cache_key(content, strategy, params)
|
||||
|
||||
# Try local cache first
|
||||
if cache_key in self.local_cache:
|
||||
return self.local_cache[cache_key]
|
||||
|
||||
# Try Redis cache
|
||||
if self.redis_client:
|
||||
try:
|
||||
cached_data = self.redis_client.get(cache_key)
|
||||
if cached_data:
|
||||
chunks = pickle.loads(cached_data)
|
||||
self.local_cache[cache_key] = chunks # Cache locally too
|
||||
return chunks
|
||||
except Exception as e:
|
||||
logging.warning(f"Redis cache error: {e}")
|
||||
|
||||
return None
|
||||
|
||||
def set(self, content: str, strategy: str, params: Dict[str, Any], chunks: List[Dict[str, Any]]) -> None:
|
||||
cache_key = self._generate_cache_key(content, strategy, params)
|
||||
|
||||
# Store in local cache
|
||||
self.local_cache[cache_key] = chunks
|
||||
|
||||
# Store in Redis cache
|
||||
if self.redis_client:
|
||||
try:
|
||||
cached_data = pickle.dumps(chunks)
|
||||
self.redis_client.setex(cache_key, 3600, cached_data) # 1 hour TTL
|
||||
except Exception as e:
|
||||
logging.warning(f"Redis cache set error: {e}")
|
||||
|
||||
def clear_local_cache(self):
|
||||
self.local_cache.clear()
|
||||
|
||||
def clear_redis_cache(self):
|
||||
if self.redis_client:
|
||||
pattern = "chunking:*"
|
||||
keys = self.redis_client.keys(pattern)
|
||||
if keys:
|
||||
self.redis_client.delete(*keys)
|
||||
```
|
||||
|
||||
### Batch Processing
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import concurrent.futures
|
||||
from typing import List, Callable, Any
|
||||
|
||||
class BatchChunkingProcessor:
|
||||
def __init__(self, max_workers: int = 4, batch_size: int = 10):
|
||||
self.max_workers = max_workers
|
||||
self.batch_size = batch_size
|
||||
|
||||
def process_documents_batch(self, documents: List[str],
|
||||
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
|
||||
"""Process multiple documents in parallel"""
|
||||
results = []
|
||||
|
||||
# Process in batches to avoid memory issues
|
||||
for i in range(0, len(documents), self.batch_size):
|
||||
batch = documents[i:i + self.batch_size]
|
||||
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
|
||||
future_to_doc = {
|
||||
executor.submit(chunking_function, doc): doc
|
||||
for doc in batch
|
||||
}
|
||||
|
||||
batch_results = []
|
||||
for future in concurrent.futures.as_completed(future_to_doc):
|
||||
try:
|
||||
chunks = future.result()
|
||||
batch_results.append(chunks)
|
||||
except Exception as e:
|
||||
logging.error(f"Error processing document: {e}")
|
||||
batch_results.append([]) # Empty result for failed processing
|
||||
|
||||
results.extend(batch_results)
|
||||
|
||||
return results
|
||||
|
||||
async def process_documents_async(self, documents: List[str],
|
||||
chunking_function: Callable[[str], List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
|
||||
"""Process documents asynchronously"""
|
||||
semaphore = asyncio.Semaphore(self.max_workers)
|
||||
|
||||
async def process_single_document(doc: str) -> List[Dict[str, Any]]:
|
||||
async with semaphore:
|
||||
# Run the synchronous chunking function in an executor
|
||||
loop = asyncio.get_event_loop()
|
||||
return await loop.run_in_executor(None, chunking_function, doc)
|
||||
|
||||
tasks = [process_single_document(doc) for doc in documents]
|
||||
return await asyncio.gather(*tasks, return_exceptions=True)
|
||||
```
|
||||
|
||||
## Monitoring and Observability
|
||||
|
||||
### Metrics Collection
|
||||
|
||||
```python
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Dict, Any, List
|
||||
from collections import defaultdict
|
||||
|
||||
@dataclass
|
||||
class ChunkingMetrics:
|
||||
total_documents: int
|
||||
total_chunks: int
|
||||
avg_chunk_size: float
|
||||
processing_time: float
|
||||
memory_usage: float
|
||||
error_count: int
|
||||
strategy_distribution: Dict[str, int]
|
||||
|
||||
class MetricsCollector:
|
||||
def __init__(self):
|
||||
self.metrics = defaultdict(list)
|
||||
self.start_time = None
|
||||
|
||||
def start_timing(self):
|
||||
self.start_time = time.time()
|
||||
|
||||
def end_timing(self) -> float:
|
||||
if self.start_time:
|
||||
duration = time.time() - self.start_time
|
||||
self.metrics['processing_time'].append(duration)
|
||||
self.start_time = None
|
||||
return duration
|
||||
return 0.0
|
||||
|
||||
def record_chunk_count(self, count: int):
|
||||
self.metrics['chunk_count'].append(count)
|
||||
|
||||
def record_chunk_size(self, size: int):
|
||||
self.metrics['chunk_size'].append(size)
|
||||
|
||||
def record_strategy_usage(self, strategy: str):
|
||||
self.metrics['strategy'][strategy] = self.metrics['strategy'].get(strategy, 0) + 1
|
||||
|
||||
def record_error(self, error_type: str):
|
||||
self.metrics['errors'].append(error_type)
|
||||
|
||||
def record_memory_usage(self, memory_mb: float):
|
||||
self.metrics['memory_usage'].append(memory_mb)
|
||||
|
||||
def get_summary(self) -> ChunkingMetrics:
|
||||
return ChunkingMetrics(
|
||||
total_documents=len(self.metrics['processing_time']),
|
||||
total_chunks=sum(self.metrics['chunk_count']),
|
||||
avg_chunk_size=sum(self.metrics['chunk_size']) / max(len(self.metrics['chunk_size']), 1),
|
||||
processing_time=sum(self.metrics['processing_time']),
|
||||
memory_usage=sum(self.metrics['memory_usage']) / max(len(self.metrics['memory_usage']), 1),
|
||||
error_count=len(self.metrics['errors']),
|
||||
strategy_distribution=dict(self.metrics['strategy'])
|
||||
)
|
||||
|
||||
def reset(self):
|
||||
self.metrics.clear()
|
||||
self.start_time = None
|
||||
```
|
||||
|
||||
This implementation guide provides a comprehensive foundation for building robust, scalable chunking systems that can handle various document types and use cases while maintaining high quality and performance.
|
||||
366
skills/ai/chunking-strategy/references/research.md
Normal file
366
skills/ai/chunking-strategy/references/research.md
Normal file
@@ -0,0 +1,366 @@
|
||||
# Key Research Papers and Findings
|
||||
|
||||
This document summarizes important research papers and findings related to chunking strategies for RAG systems.
|
||||
|
||||
## Seminal Papers
|
||||
|
||||
### "Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG" (arXiv:2504.19754)
|
||||
|
||||
**Key Findings**:
|
||||
- Page-level chunking achieved highest average accuracy (0.648) with lowest variance across different query types
|
||||
- Optimal chunk size varies significantly by document type and query complexity
|
||||
- Factoid queries perform better with smaller chunks (256-512 tokens)
|
||||
- Complex analytical queries benefit from larger chunks (1024+ tokens)
|
||||
|
||||
**Methodology**:
|
||||
- Evaluated 7 different chunking strategies across multiple document types
|
||||
- Tested with both factoid and analytical queries
|
||||
- Measured end-to-end RAG performance
|
||||
|
||||
**Practical Implications**:
|
||||
- Start with page-level chunking for general-purpose RAG systems
|
||||
- Adapt chunk size based on expected query patterns
|
||||
- Consider hybrid approaches for mixed query types
|
||||
|
||||
### "Lost in the Middle: How Language Models Use Long Contexts"
|
||||
|
||||
**Key Findings**:
|
||||
- Language models tend to pay more attention to information at the beginning and end of context
|
||||
- Information in the middle of long contexts is often ignored
|
||||
- Performance degradation is most severe for centrally located information
|
||||
|
||||
**Practical Implications**:
|
||||
- Place most important information at chunk boundaries
|
||||
- Consider chunk overlap to ensure important context appears multiple times
|
||||
- Use ranking to prioritize relevant chunks for inclusion in context
|
||||
|
||||
### "Grounded Language Learning in a Simulated 3D World"
|
||||
|
||||
**Related Concepts**:
|
||||
- Importance of grounding text in visual/contextual information
|
||||
- Multi-modal learning approaches for better understanding
|
||||
|
||||
**Relevance to Chunking**:
|
||||
- Supports contextual chunking approaches that preserve visual/contextual relationships
|
||||
- Validates importance of maintaining document structure and relationships
|
||||
|
||||
## Industry Research
|
||||
|
||||
### NVIDIA Research: "Finding the Best Chunking Strategy for Accurate AI Responses"
|
||||
|
||||
**Key Findings**:
|
||||
- Page-level chunking outperformed sentence and paragraph-level approaches
|
||||
- Fixed-size chunking showed consistent but suboptimal performance
|
||||
- Semantic chunking provided improvements for complex documents
|
||||
|
||||
**Technical Details**:
|
||||
- Tested chunk sizes from 128 to 2048 tokens
|
||||
- Evaluated across financial, technical, and legal documents
|
||||
- Measured both retrieval accuracy and generation quality
|
||||
|
||||
**Recommendations**:
|
||||
- Use 512-1024 token chunks as starting point
|
||||
- Implement adaptive chunking based on document complexity
|
||||
- Consider page boundaries as natural chunk separators
|
||||
|
||||
### Cohere Research: "Effective Chunking Strategies for RAG"
|
||||
|
||||
**Key Findings**:
|
||||
- Recursive character splitting provides good balance of performance and simplicity
|
||||
- Document structure awareness improves retrieval by 15-20%
|
||||
- Overlap of 10-20% provides optimal context preservation
|
||||
|
||||
**Methodology**:
|
||||
- Compared 12 chunking strategies across 6 document types
|
||||
- Measured retrieval precision, recall, and F1-score
|
||||
- Tested with both dense and sparse retrieval
|
||||
|
||||
**Best Practices Identified**:
|
||||
- Start with recursive character splitting with 10-20% overlap
|
||||
- Preserve document structure (headings, lists, tables)
|
||||
- Customize chunk size based on embedding model context window
|
||||
|
||||
### Anthropic: "Contextual Retrieval"
|
||||
|
||||
**Key Innovation**:
|
||||
- Enhance each chunk with LLM-generated contextual information before embedding
|
||||
- Improves retrieval precision by 25-30% for complex documents
|
||||
- Particularly effective for technical and academic content
|
||||
|
||||
**Implementation Approach**:
|
||||
1. Split document using traditional methods
|
||||
2. For each chunk, generate contextual information using LLM
|
||||
3. Prepend context to chunk before embedding
|
||||
4. Use hybrid search (dense + sparse) with weighted ranking
|
||||
|
||||
**Trade-offs**:
|
||||
- Significant computational overhead (2-3x processing time)
|
||||
- Higher embedding storage requirements
|
||||
- Improved retrieval precision justifies cost for high-value applications
|
||||
|
||||
## Algorithmic Advances
|
||||
|
||||
### Semantic Chunking Algorithms
|
||||
|
||||
#### "Semantic Segmentation of Text Documents"
|
||||
|
||||
**Core Idea**: Use cosine similarity between consecutive sentence embeddings to identify natural boundaries.
|
||||
|
||||
**Algorithm**:
|
||||
1. Split document into sentences
|
||||
2. Generate embeddings for each sentence
|
||||
3. Calculate similarity between consecutive sentences
|
||||
4. Create boundaries where similarity drops below threshold
|
||||
5. Merge short segments with neighbors
|
||||
|
||||
**Performance**: 20-30% improvement in retrieval relevance over fixed-size chunking for technical documents.
|
||||
|
||||
#### "Hierarchical Semantic Chunking"
|
||||
|
||||
**Core Idea**: Multi-level semantic segmentation for document organization.
|
||||
|
||||
**Algorithm**:
|
||||
1. Document-level semantic analysis
|
||||
2. Section-level boundary detection
|
||||
3. Paragraph-level segmentation
|
||||
4. Sentence-level refinement
|
||||
|
||||
**Benefits**: Maintains document hierarchy while adapting to semantic structure.
|
||||
|
||||
### Advanced Embedding Techniques
|
||||
|
||||
#### "Late Chunking: Contextual Chunk Embeddings"
|
||||
|
||||
**Core Innovation**: Generate embeddings for entire document first, then create chunk embeddings from token-level embeddings.
|
||||
|
||||
**Advantages**:
|
||||
- Preserves global document context
|
||||
- Reduces context fragmentation
|
||||
- Better for documents with complex inter-relationships
|
||||
|
||||
**Requirements**:
|
||||
- Long-context embedding models (8k+ tokens)
|
||||
- Significant computational resources
|
||||
- Specialized implementation
|
||||
|
||||
#### "Hierarchical Embedding Retrieval"
|
||||
|
||||
**Approach**: Create embeddings at multiple granularities (document, section, paragraph, sentence).
|
||||
|
||||
**Implementation**:
|
||||
1. Generate embeddings at each level
|
||||
2. Store in hierarchical vector database
|
||||
3. Query at appropriate granularity based on information needs
|
||||
|
||||
**Performance**: 15-25% improvement in precision for complex queries.
|
||||
|
||||
## Evaluation Methodologies
|
||||
|
||||
### Retrieval-Augmented Generation Assessment Frameworks
|
||||
|
||||
#### RAGAS Framework
|
||||
|
||||
**Metrics**:
|
||||
- **Faithfulness**: Consistency between generated answer and retrieved context
|
||||
- **Answer Relevancy**: Relevance of generated answer to the question
|
||||
- **Context Relevancy**: Relevance of retrieved context to the question
|
||||
- **Context Recall**: Coverage of relevant information in retrieved context
|
||||
|
||||
**Evaluation Process**:
|
||||
1. Generate questions from document corpus
|
||||
2. Retrieve relevant chunks using different strategies
|
||||
3. Generate answers using retrieved chunks
|
||||
4. Evaluate using automated metrics and human judgment
|
||||
|
||||
#### ARES Framework
|
||||
|
||||
**Innovation**: Automated evaluation using synthetic questions and LLM-based assessment.
|
||||
|
||||
**Key Features**:
|
||||
- Generates diverse question types (factoid, analytical, comparative)
|
||||
- Uses LLMs to evaluate answer quality
|
||||
- Provides scalable evaluation without human annotation
|
||||
|
||||
### Benchmark Datasets
|
||||
|
||||
#### Natural Questions (NQ)
|
||||
|
||||
**Description**: Real user questions from Google Search with relevant Wikipedia passages.
|
||||
|
||||
**Relevance**: Natural language queries with authentic relevance judgments.
|
||||
|
||||
#### MS MARCO
|
||||
|
||||
**Description**: Large-scale passage ranking dataset with real search queries.
|
||||
|
||||
**Relevance**: High-quality relevance judgments for passage retrieval.
|
||||
|
||||
#### HotpotQA
|
||||
|
||||
**Description**: Multi-hop question answering requiring information from multiple documents.
|
||||
|
||||
**Relevance**: Tests ability to retrieve and synthesize information from multiple chunks.
|
||||
|
||||
## Domain-Specific Research
|
||||
|
||||
### Medical Documents
|
||||
|
||||
#### "Optimal Chunking for Medical Question Answering"
|
||||
|
||||
**Key Findings**:
|
||||
- Medical terminology requires specialized handling
|
||||
- Section-based chunking (History, Diagnosis, Treatment) most effective
|
||||
- Preserving doctor-patient dialogue context crucial
|
||||
|
||||
**Recommendations**:
|
||||
- Use medical-specific tokenizers
|
||||
- Preserve section headers and structure
|
||||
- Maintain temporal relationships in medical histories
|
||||
|
||||
### Legal Documents
|
||||
|
||||
#### "Chunking Strategies for Legal Document Analysis"
|
||||
|
||||
**Key Findings**:
|
||||
- Legal citations and cross-references require special handling
|
||||
- Contract clause boundaries serve as natural chunk separators
|
||||
- Case law benefits from hierarchical chunking
|
||||
|
||||
**Best Practices**:
|
||||
- Preserve legal citation structure
|
||||
- Use clause and section boundaries
|
||||
- Maintain context for legal definitions and references
|
||||
|
||||
### Financial Documents
|
||||
|
||||
#### "SEC Filing Chunking for Financial Analysis"
|
||||
|
||||
**Key Findings**:
|
||||
- Table preservation critical for financial data
|
||||
- XBRL tagging provides natural segmentation
|
||||
- Risk factors sections benefit from specialized treatment
|
||||
|
||||
**Approach**:
|
||||
- Preserve complete tables when possible
|
||||
- Use XBRL tags for structured data
|
||||
- Create specialized chunks for risk sections
|
||||
|
||||
## Emerging Trends
|
||||
|
||||
### Multi-Modal Chunking
|
||||
|
||||
#### "Integrating Text, Tables, and Images in RAG Systems"
|
||||
|
||||
**Innovation**: Unified chunking approach for mixed-modal content.
|
||||
|
||||
**Approach**:
|
||||
- Extract and describe images using vision models
|
||||
- Preserve table structure and relationships
|
||||
- Create unified embeddings for mixed content
|
||||
|
||||
**Results**: 35% improvement in complex document understanding.
|
||||
|
||||
### Adaptive Chunking
|
||||
|
||||
#### "Machine Learning-Based Chunk Size Optimization"
|
||||
|
||||
**Core Idea**: Use ML models to predict optimal chunking parameters.
|
||||
|
||||
**Features**:
|
||||
- Document length and complexity
|
||||
- Query type distribution
|
||||
- Embedding model characteristics
|
||||
- Performance requirements
|
||||
|
||||
**Benefits**: Dynamic optimization based on use case and content.
|
||||
|
||||
### Real-time Chunking
|
||||
|
||||
#### "Streaming Chunking for Live Document Processing"
|
||||
|
||||
**Innovation**: Process documents as they become available.
|
||||
|
||||
**Techniques**:
|
||||
- Incremental boundary detection
|
||||
- Dynamic chunk size adjustment
|
||||
- Context preservation across chunks
|
||||
|
||||
**Applications**: Live news feeds, social media analysis, meeting transcripts.
|
||||
|
||||
## Implementation Challenges
|
||||
|
||||
### Computational Efficiency
|
||||
|
||||
#### "Scalable Chunking for Large Document Collections"
|
||||
|
||||
**Challenges**:
|
||||
- Processing millions of documents efficiently
|
||||
- Memory usage optimization
|
||||
- Distributed processing requirements
|
||||
|
||||
**Solutions**:
|
||||
- Batch processing with parallel execution
|
||||
- Streaming approaches for large documents
|
||||
- Distributed chunking with load balancing
|
||||
|
||||
### Quality Assurance
|
||||
|
||||
#### "Evaluating Chunk Quality at Scale"
|
||||
|
||||
**Challenges**:
|
||||
- Automated quality assessment
|
||||
- Detecting poor chunk boundaries
|
||||
- Maintaining consistency across document types
|
||||
|
||||
**Approaches**:
|
||||
- Heuristic-based quality metrics
|
||||
- LLM-based evaluation
|
||||
- Human-in-the-loop validation
|
||||
|
||||
## Future Research Directions
|
||||
|
||||
### Context-Aware Chunking
|
||||
|
||||
**Open Questions**:
|
||||
- How to optimally preserve cross-chunk relationships?
|
||||
- Can we predict chunk quality without human evaluation?
|
||||
- What is the optimal balance between size and context?
|
||||
|
||||
### Domain Adaptation
|
||||
|
||||
**Research Areas**:
|
||||
- Automatic domain detection and adaptation
|
||||
- Transfer learning across domains
|
||||
- Zero-shot chunking for new document types
|
||||
|
||||
### Evaluation Standards
|
||||
|
||||
**Needs**:
|
||||
- Standardized evaluation benchmarks
|
||||
- Cross-paper comparison methodologies
|
||||
- Real-world performance metrics
|
||||
|
||||
## Practical Recommendations Based on Research
|
||||
|
||||
### Starting Points
|
||||
|
||||
1. **For General RAG Systems**: Page-level or recursive character chunking with 512-1024 tokens and 10-20% overlap
|
||||
2. **For Technical Documents**: Structure-aware chunking with semantic boundary detection
|
||||
3. **For High-Value Applications**: Contextual retrieval with LLM-generated context
|
||||
|
||||
### Evolution Strategy
|
||||
|
||||
1. **Begin**: Simple fixed-size chunking (512 tokens)
|
||||
2. **Improve**: Add document structure awareness
|
||||
3. **Optimize**: Implement semantic boundaries
|
||||
4. **Advanced**: Consider contextual retrieval for critical use cases
|
||||
|
||||
### Key Success Factors
|
||||
|
||||
1. **Match strategy to document type and query patterns**
|
||||
2. **Preserve document structure when beneficial**
|
||||
3. **Use overlap to maintain context across boundaries**
|
||||
4. **Monitor both accuracy and computational costs**
|
||||
5. **Iterate based on specific use case requirements**
|
||||
|
||||
This research foundation provides evidence-based guidance for implementing effective chunking strategies across various domains and use cases.
|
||||
1315
skills/ai/chunking-strategy/references/semantic-methods.md
Normal file
1315
skills/ai/chunking-strategy/references/semantic-methods.md
Normal file
File diff suppressed because it is too large
Load Diff
423
skills/ai/chunking-strategy/references/strategies.md
Normal file
423
skills/ai/chunking-strategy/references/strategies.md
Normal file
@@ -0,0 +1,423 @@
|
||||
# Detailed Chunking Strategies
|
||||
|
||||
This document provides comprehensive implementation details for all chunking strategies mentioned in the main skill.
|
||||
|
||||
## Level 1: Fixed-Size Chunking
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
||||
|
||||
class FixedSizeChunker:
|
||||
def __init__(self, chunk_size=512, chunk_overlap=50):
|
||||
self.chunk_size = chunk_size
|
||||
self.chunk_overlap = chunk_overlap
|
||||
self.splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=chunk_size,
|
||||
chunk_overlap=chunk_overlap,
|
||||
length_function=len,
|
||||
separators=["\n\n", "\n", " ", ""]
|
||||
)
|
||||
|
||||
def chunk(self, documents):
|
||||
return self.splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### Parameter Recommendations
|
||||
|
||||
| Use Case | Chunk Size | Overlap | Rationale |
|
||||
|----------|------------|---------|-----------|
|
||||
| Factoid Queries | 256 | 25 | Small chunks for precise answers |
|
||||
| General Q&A | 512 | 50 | Balanced approach for most cases |
|
||||
| Analytical Queries | 1024 | 100 | Larger context for complex analysis |
|
||||
| Code Documentation | 300 | 30 | Preserve code context while maintaining focus |
|
||||
|
||||
### Best Practices
|
||||
|
||||
- Start with 512 tokens and 10-20% overlap
|
||||
- Adjust based on embedding model context window
|
||||
- Use overlap for queries where context might span boundaries
|
||||
- Monitor token count vs. character count based on model
|
||||
|
||||
## Level 2: Recursive Character Chunking
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
||||
|
||||
class RecursiveChunker:
|
||||
def __init__(self, chunk_size=512, separators=None):
|
||||
self.chunk_size = chunk_size
|
||||
self.separators = separators or ["\n\n", "\n", " ", ""]
|
||||
self.splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=chunk_size,
|
||||
chunk_overlap=0,
|
||||
length_function=len,
|
||||
separators=self.separators
|
||||
)
|
||||
|
||||
def chunk(self, text):
|
||||
return self.splitter.create_documents([text])
|
||||
|
||||
# Document-specific configurations
|
||||
def get_chunker_for_document_type(doc_type):
|
||||
configurations = {
|
||||
"markdown": ["\n## ", "\n### ", "\n\n", "\n", " ", ""],
|
||||
"html": ["</div>", "</p>", "\n\n", "\n", " ", ""],
|
||||
"code": ["\n\n", "\n", " ", ""],
|
||||
"plain": ["\n\n", "\n", " ", ""]
|
||||
}
|
||||
return RecursiveChunker(separators=configurations.get(doc_type, ["\n\n", "\n", " ", ""]))
|
||||
```
|
||||
|
||||
### Customization Guidelines
|
||||
|
||||
- **Markdown**: Use headings as primary separators
|
||||
- **HTML**: Use block-level tags as separators
|
||||
- **Code**: Preserve function and class boundaries
|
||||
- **Academic papers**: Prioritize paragraph and section breaks
|
||||
|
||||
## Level 3: Structure-Aware Chunking
|
||||
|
||||
### Markdown Documents
|
||||
|
||||
```python
|
||||
import markdown
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
class MarkdownChunker:
|
||||
def __init__(self, max_chunk_size=512):
|
||||
self.max_chunk_size = max_chunk_size
|
||||
|
||||
def chunk(self, markdown_text):
|
||||
html = markdown.markdown(markdown_text)
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
chunks = []
|
||||
current_chunk = ""
|
||||
current_heading = "Introduction"
|
||||
|
||||
for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'pre', 'table']):
|
||||
if element.name.startswith('h'):
|
||||
if current_chunk.strip():
|
||||
chunks.append({
|
||||
"content": current_chunk.strip(),
|
||||
"heading": current_heading
|
||||
})
|
||||
current_heading = element.get_text().strip()
|
||||
current_chunk = f"{element}\n"
|
||||
elif element.name in ['pre', 'table']:
|
||||
# Preserve code blocks and tables intact
|
||||
if len(current_chunk) + len(str(element)) > self.max_chunk_size:
|
||||
if current_chunk.strip():
|
||||
chunks.append({
|
||||
"content": current_chunk.strip(),
|
||||
"heading": current_heading
|
||||
})
|
||||
current_chunk = f"{element}\n"
|
||||
else:
|
||||
current_chunk += f"{element}\n"
|
||||
else:
|
||||
current_chunk += str(element)
|
||||
|
||||
if current_chunk.strip():
|
||||
chunks.append({
|
||||
"content": current_chunk.strip(),
|
||||
"heading": current_heading
|
||||
})
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### Code Documents
|
||||
|
||||
```python
|
||||
import ast
|
||||
import re
|
||||
|
||||
class CodeChunker:
|
||||
def __init__(self, language='python'):
|
||||
self.language = language
|
||||
|
||||
def chunk_python(self, code):
|
||||
tree = ast.parse(code)
|
||||
chunks = []
|
||||
|
||||
for node in ast.walk(tree):
|
||||
if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
|
||||
start_line = node.lineno - 1
|
||||
end_line = node.end_lineno if hasattr(node, 'end_lineno') else start_line + 10
|
||||
lines = code.split('\n')
|
||||
chunk_lines = lines[start_line:end_line]
|
||||
chunks.append('\n'.join(chunk_lines))
|
||||
|
||||
return chunks
|
||||
|
||||
def chunk_javascript(self, code):
|
||||
# Use regex for languages without AST parsers
|
||||
function_pattern = r'(function\s+\w+\s*\([^)]*\)\s*\{[^}]*\})'
|
||||
class_pattern = r'(class\s+\w+\s*\{[^}]*\})'
|
||||
|
||||
patterns = [function_pattern, class_pattern]
|
||||
chunks = []
|
||||
|
||||
for pattern in patterns:
|
||||
matches = re.finditer(pattern, code, re.MULTILINE | re.DOTALL)
|
||||
for match in matches:
|
||||
chunks.append(match.group(1))
|
||||
|
||||
return chunks
|
||||
|
||||
def chunk(self, code):
|
||||
if self.language == 'python':
|
||||
return self.chunk_python(code)
|
||||
elif self.language == 'javascript':
|
||||
return self.chunk_javascript(code)
|
||||
else:
|
||||
# Fallback to line-based chunking
|
||||
return self.chunk_by_lines(code)
|
||||
|
||||
def chunk_by_lines(self, code, max_lines=50):
|
||||
lines = code.split('\n')
|
||||
chunks = []
|
||||
|
||||
for i in range(0, len(lines), max_lines):
|
||||
chunk = '\n'.join(lines[i:i+max_lines])
|
||||
chunks.append(chunk)
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### Tabular Data
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
class TableChunker:
|
||||
def __init__(self, max_rows=100, summary_rows=5):
|
||||
self.max_rows = max_rows
|
||||
self.summary_rows = summary_rows
|
||||
|
||||
def chunk(self, table_data):
|
||||
if isinstance(table_data, str):
|
||||
df = pd.read_csv(StringIO(table_data))
|
||||
else:
|
||||
df = table_data
|
||||
|
||||
chunks = []
|
||||
|
||||
if len(df) <= self.max_rows:
|
||||
# Small table - keep intact
|
||||
chunks.append({
|
||||
"type": "full_table",
|
||||
"content": df.to_string(),
|
||||
"metadata": {
|
||||
"rows": len(df),
|
||||
"columns": len(df.columns)
|
||||
}
|
||||
})
|
||||
else:
|
||||
# Large table - create summary + chunks
|
||||
summary = df.head(self.summary_rows)
|
||||
chunks.append({
|
||||
"type": "table_summary",
|
||||
"content": f"Table Summary ({len(df)} rows, {len(df.columns)} columns):\n{summary.to_string()}",
|
||||
"metadata": {
|
||||
"total_rows": len(df),
|
||||
"summary_rows": self.summary_rows,
|
||||
"columns": list(df.columns)
|
||||
}
|
||||
})
|
||||
|
||||
# Chunk the remaining data
|
||||
for i in range(self.summary_rows, len(df), self.max_rows):
|
||||
chunk_df = df.iloc[i:i+self.max_rows]
|
||||
chunks.append({
|
||||
"type": "table_chunk",
|
||||
"content": f"Rows {i+1}-{min(i+self.max_rows, len(df))}:\n{chunk_df.to_string()}",
|
||||
"metadata": {
|
||||
"start_row": i + 1,
|
||||
"end_row": min(i + self.max_rows, len(df)),
|
||||
"columns": list(df.columns)
|
||||
}
|
||||
})
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
## Level 4: Semantic Chunking
|
||||
|
||||
### Implementation
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from sentence_transformers import SentenceTransformer
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
class SemanticChunker:
|
||||
def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.8, buffer_size=3):
|
||||
self.model = SentenceTransformer(model_name)
|
||||
self.similarity_threshold = similarity_threshold
|
||||
self.buffer_size = buffer_size
|
||||
|
||||
def split_into_sentences(self, text):
|
||||
# Simple sentence splitting - can be enhanced with nltk/spacy
|
||||
sentences = re.split(r'[.!?]+', text)
|
||||
return [s.strip() for s in sentences if s.strip()]
|
||||
|
||||
def chunk(self, text):
|
||||
sentences = self.split_into_sentences(text)
|
||||
|
||||
if len(sentences) <= self.buffer_size:
|
||||
return [text]
|
||||
|
||||
# Create embeddings
|
||||
embeddings = self.model.encode(sentences)
|
||||
|
||||
chunks = []
|
||||
current_chunk_sentences = []
|
||||
|
||||
for i in range(len(sentences)):
|
||||
current_chunk_sentences.append(sentences[i])
|
||||
|
||||
# Check if we should create a boundary
|
||||
if i < len(sentences) - 1:
|
||||
similarity = cosine_similarity(
|
||||
[embeddings[i]],
|
||||
[embeddings[i + 1]]
|
||||
)[0][0]
|
||||
|
||||
if similarity < self.similarity_threshold and len(current_chunk_sentences) >= 2:
|
||||
chunks.append(' '.join(current_chunk_sentences))
|
||||
current_chunk_sentences = []
|
||||
|
||||
# Add remaining sentences
|
||||
if current_chunk_sentences:
|
||||
chunks.append(' '.join(current_chunk_sentences))
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### Parameter Tuning
|
||||
|
||||
| Parameter | Range | Effect |
|
||||
|-----------|-------|--------|
|
||||
| similarity_threshold | 0.5-0.9 | Higher values create more chunks |
|
||||
| buffer_size | 1-10 | Larger buffers provide more context |
|
||||
| model_name | Various | Different models for different domains |
|
||||
|
||||
### Optimization Tips
|
||||
|
||||
- Use domain-specific models for specialized content
|
||||
- Adjust threshold based on content complexity
|
||||
- Cache embeddings for repeated processing
|
||||
- Consider batch processing for large documents
|
||||
|
||||
## Level 5: Advanced Contextual Methods
|
||||
|
||||
### Late Chunking
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
|
||||
class LateChunker:
|
||||
def __init__(self, model_name="microsoft/DialoGPT-medium"):
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
self.model = AutoModel.from_pretrained(model_name)
|
||||
|
||||
def chunk(self, text, chunk_size=512):
|
||||
# Tokenize entire document
|
||||
tokens = self.tokenizer(text, return_tensors="pt", truncation=False)
|
||||
|
||||
# Get token-level embeddings
|
||||
with torch.no_grad():
|
||||
outputs = self.model(**tokens, output_hidden_states=True)
|
||||
token_embeddings = outputs.last_hidden_state[0]
|
||||
|
||||
# Create chunk embeddings from token embeddings
|
||||
chunks = []
|
||||
for i in range(0, len(token_embeddings), chunk_size):
|
||||
chunk_tokens = token_embeddings[i:i+chunk_size]
|
||||
chunk_embedding = torch.mean(chunk_tokens, dim=0)
|
||||
chunks.append({
|
||||
"content": self.tokenizer.decode(tokens["input_ids"][0][i:i+chunk_size]),
|
||||
"embedding": chunk_embedding.numpy()
|
||||
})
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### Contextual Retrieval
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
class ContextualChunker:
|
||||
def __init__(self, api_key):
|
||||
self.client = openai.OpenAI(api_key=api_key)
|
||||
|
||||
def generate_context(self, chunk, full_document):
|
||||
prompt = f"""
|
||||
Given the following document and a chunk from it, provide a brief context
|
||||
that helps understand the chunk's meaning within the full document.
|
||||
|
||||
Document:
|
||||
{full_document[:2000]}...
|
||||
|
||||
Chunk:
|
||||
{chunk}
|
||||
|
||||
Context (max 50 words):
|
||||
"""
|
||||
|
||||
response = self.client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
max_tokens=100,
|
||||
temperature=0
|
||||
)
|
||||
|
||||
return response.choices[0].message.content.strip()
|
||||
|
||||
def chunk_with_context(self, text, base_chunker):
|
||||
# First create base chunks
|
||||
base_chunks = base_chunker.chunk(text)
|
||||
|
||||
# Then add context to each chunk
|
||||
contextualized_chunks = []
|
||||
for chunk in base_chunks:
|
||||
context = self.generate_context(chunk.page_content, text)
|
||||
contextualized_content = f"Context: {context}\n\nContent: {chunk.page_content}"
|
||||
|
||||
contextualized_chunks.append({
|
||||
"content": contextualized_content,
|
||||
"original_content": chunk.page_content,
|
||||
"context": context
|
||||
})
|
||||
|
||||
return contextualized_chunks
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Computational Cost Analysis
|
||||
|
||||
| Strategy | Time Complexity | Space Complexity | Relative Cost |
|
||||
|----------|-----------------|------------------|---------------|
|
||||
| Fixed-Size | O(n) | O(n) | Low |
|
||||
| Recursive | O(n) | O(n) | Low |
|
||||
| Structure-Aware | O(n log n) | O(n) | Medium |
|
||||
| Semantic | O(n²) | O(n²) | High |
|
||||
| Late Chunking | O(n) | O(n) | Very High |
|
||||
| Contextual | O(n²) | O(n²) | Very High |
|
||||
|
||||
### Optimization Strategies
|
||||
|
||||
1. **Parallel Processing**: Process chunks concurrently when possible
|
||||
2. **Caching**: Store embeddings and intermediate results
|
||||
3. **Batch Operations**: Group similar operations together
|
||||
4. **Progressive Loading**: Process large documents in streaming fashion
|
||||
5. **Model Selection**: Choose appropriate models for task complexity
|
||||
867
skills/ai/chunking-strategy/references/tools.md
Normal file
867
skills/ai/chunking-strategy/references/tools.md
Normal file
@@ -0,0 +1,867 @@
|
||||
# Recommended Libraries and Frameworks
|
||||
|
||||
This document provides a comprehensive guide to tools, libraries, and frameworks for implementing chunking strategies.
|
||||
|
||||
## Core Chunking Libraries
|
||||
|
||||
### LangChain
|
||||
|
||||
**Overview**: Comprehensive framework for building applications with large language models, includes robust text splitting utilities.
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install langchain langchain-text-splitters
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Multiple text splitting strategies
|
||||
- Integration with various document loaders
|
||||
- Support for different content types (code, markdown, etc.)
|
||||
- Customizable separators and parameters
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from langchain.text_splitter import (
|
||||
RecursiveCharacterTextSplitter,
|
||||
CharacterTextSplitter,
|
||||
TokenTextSplitter,
|
||||
MarkdownTextSplitter,
|
||||
PythonCodeTextSplitter
|
||||
)
|
||||
|
||||
# Basic recursive splitting
|
||||
splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=1000,
|
||||
chunk_overlap=200,
|
||||
length_function=len,
|
||||
separators=["\n\n", "\n", " ", ""]
|
||||
)
|
||||
|
||||
chunks = splitter.split_text(large_text)
|
||||
|
||||
# Markdown-specific splitting
|
||||
markdown_splitter = MarkdownTextSplitter(
|
||||
chunk_size=1000,
|
||||
chunk_overlap=100
|
||||
)
|
||||
|
||||
# Code-specific splitting
|
||||
code_splitter = PythonCodeTextSplitter(
|
||||
chunk_size=1000,
|
||||
chunk_overlap=100
|
||||
)
|
||||
```
|
||||
|
||||
**Pros**:
|
||||
- Well-maintained and actively developed
|
||||
- Extensive documentation and examples
|
||||
- Integrates well with other LangChain components
|
||||
- Supports multiple document types
|
||||
|
||||
**Cons**:
|
||||
- Can be heavy dependency for simple use cases
|
||||
- Some advanced features require LangChain ecosystem
|
||||
|
||||
### LlamaIndex
|
||||
|
||||
**Overview**: Data framework for LLM applications with advanced indexing and retrieval capabilities.
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install llama-index
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Advanced semantic chunking
|
||||
- Hierarchical indexing
|
||||
- Context-aware retrieval
|
||||
- Integration with vector databases
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from llama_index.core.node_parser import (
|
||||
SentenceSplitter,
|
||||
SemanticSplitterNodeParser
|
||||
)
|
||||
from llama_index.core import SimpleDirectoryReader
|
||||
from llama_index.embeddings.openai import OpenAIEmbedding
|
||||
|
||||
# Basic sentence splitting
|
||||
splitter = SentenceSplitter(
|
||||
chunk_size=1024,
|
||||
chunk_overlap=20
|
||||
)
|
||||
|
||||
# Semantic chunking with embeddings
|
||||
embed_model = OpenAIEmbedding()
|
||||
semantic_splitter = SemanticSplitterNodeParser(
|
||||
buffer_size=1,
|
||||
breakpoint_percentile_threshold=95,
|
||||
embed_model=embed_model
|
||||
)
|
||||
|
||||
# Load and process documents
|
||||
documents = SimpleDirectoryReader("./data").load_data()
|
||||
nodes = semantic_splitter.get_nodes_from_documents(documents)
|
||||
```
|
||||
|
||||
**Pros**:
|
||||
- Excellent semantic chunking capabilities
|
||||
- Built for production RAG systems
|
||||
- Strong vector database integration
|
||||
- Active community support
|
||||
|
||||
**Cons**:
|
||||
- More complex setup for basic use cases
|
||||
- Semantic chunking requires embedding model setup
|
||||
|
||||
### Unstructured
|
||||
|
||||
**Overview**: Open-source library for processing unstructured documents, especially strong with multi-modal content.
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install "unstructured[pdf,png,jpg]"
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Multi-modal document processing
|
||||
- Support for PDFs, images, and various formats
|
||||
- Structure preservation
|
||||
- Table extraction and processing
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from unstructured.partition.auto import partition
|
||||
from unstructured.chunking.title import chunk_by_title
|
||||
|
||||
# Partition document by type
|
||||
elements = partition(filename="document.pdf")
|
||||
|
||||
# Chunk by title/heading structure
|
||||
chunks = chunk_by_title(
|
||||
elements,
|
||||
combine_text_under_n_chars=2000,
|
||||
max_characters=10000,
|
||||
new_after_n_chars=1500,
|
||||
multipage_sections=True
|
||||
)
|
||||
|
||||
# Access chunked content
|
||||
for chunk in chunks:
|
||||
print(f"Category: {chunk.category}")
|
||||
print(f"Content: {chunk.text[:200]}...")
|
||||
```
|
||||
|
||||
**Pros**:
|
||||
- Excellent for PDF and image processing
|
||||
- Preserves document structure
|
||||
- Handles tables and figures well
|
||||
- Strong multi-modal capabilities
|
||||
|
||||
**Cons**:
|
||||
- Can be slower for large documents
|
||||
- Requires additional dependencies for some formats
|
||||
|
||||
## Text Processing Libraries
|
||||
|
||||
### NLTK (Natural Language Toolkit)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install nltk
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Sentence tokenization
|
||||
- Language detection
|
||||
- Text preprocessing
|
||||
- Linguistic analysis
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import nltk
|
||||
from nltk.tokenize import sent_tokenize, word_tokenize
|
||||
from nltk.corpus import stopwords
|
||||
|
||||
# Download required data
|
||||
nltk.download('punkt')
|
||||
nltk.download('stopwords')
|
||||
|
||||
# Sentence and word tokenization
|
||||
text = "This is a sample sentence. This is another sentence."
|
||||
sentences = sent_tokenize(text)
|
||||
words = word_tokenize(text)
|
||||
|
||||
# Stop words removal
|
||||
stop_words = set(stopwords.words('english'))
|
||||
filtered_words = [word for word in words if word.lower() not in stop_words]
|
||||
```
|
||||
|
||||
### spaCy
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install spacy
|
||||
python -m spacy download en_core_web_sm
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Industrial-strength NLP
|
||||
- Named entity recognition
|
||||
- Dependency parsing
|
||||
- Sentence boundary detection
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import spacy
|
||||
|
||||
# Load language model
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
|
||||
# Process text
|
||||
doc = nlp("This is a sample sentence. This is another sentence.")
|
||||
|
||||
# Extract sentences
|
||||
sentences = [sent.text for sent in doc.sents]
|
||||
|
||||
# Named entities
|
||||
entities = [(ent.text, ent.label_) for ent in doc.ents]
|
||||
|
||||
# Dependency parsing for better chunking
|
||||
for token in doc:
|
||||
print(f"{token.text}: {token.dep_} (head: {token.head.text})")
|
||||
```
|
||||
|
||||
### Sentence Transformers
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install sentence-transformers
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Pre-trained sentence embeddings
|
||||
- Semantic similarity calculation
|
||||
- Multi-lingual support
|
||||
- Custom model training
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from sentence_transformers import SentenceTransformer, util
|
||||
import numpy as np
|
||||
|
||||
# Load pre-trained model
|
||||
model = SentenceTransformer('all-MiniLM-L6-v2')
|
||||
|
||||
# Generate embeddings
|
||||
sentences = ["This is a sentence.", "This is another sentence."]
|
||||
embeddings = model.encode(sentences)
|
||||
|
||||
# Calculate semantic similarity
|
||||
similarity = util.cos_sim(embeddings[0], embeddings[1])
|
||||
|
||||
# Find semantic boundaries for chunking
|
||||
def find_semantic_boundaries(text, model, threshold=0.8):
|
||||
sentences = [s.strip() for s in text.split('.') if s.strip()]
|
||||
embeddings = model.encode(sentences)
|
||||
|
||||
boundaries = [0]
|
||||
for i in range(1, len(sentences)):
|
||||
similarity = util.cos_sim(embeddings[i-1], embeddings[i])
|
||||
if similarity < threshold:
|
||||
boundaries.append(i)
|
||||
|
||||
return boundaries
|
||||
```
|
||||
|
||||
## Vector Databases and Search
|
||||
|
||||
### ChromaDB
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install chromadb
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- In-memory and persistent storage
|
||||
- Built-in embedding functions
|
||||
- Similarity search
|
||||
- Metadata filtering
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import chromadb
|
||||
from chromadb.utils import embedding_functions
|
||||
|
||||
# Initialize client
|
||||
client = chromadb.Client()
|
||||
|
||||
# Create collection
|
||||
collection = client.create_collection(
|
||||
name="document_chunks",
|
||||
embedding_function=embedding_functions.DefaultEmbeddingFunction()
|
||||
)
|
||||
|
||||
# Add chunks
|
||||
collection.add(
|
||||
documents=[chunk["content"] for chunk in chunks],
|
||||
metadatas=[chunk.get("metadata", {}) for chunk in chunks],
|
||||
ids=[chunk["id"] for chunk in chunks]
|
||||
)
|
||||
|
||||
# Search
|
||||
results = collection.query(
|
||||
query_texts=["What is chunking?"],
|
||||
n_results=5
|
||||
)
|
||||
```
|
||||
|
||||
### Pinecone
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install pinecone-client
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Managed vector database service
|
||||
- High-performance similarity search
|
||||
- Metadata filtering
|
||||
- Scalable infrastructure
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import pinecone
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
# Initialize
|
||||
pinecone.init(api_key="your-api-key", environment="your-environment")
|
||||
index_name = "document-chunks"
|
||||
|
||||
# Create index if it doesn't exist
|
||||
if index_name not in pinecone.list_indexes():
|
||||
pinecone.create_index(
|
||||
name=index_name,
|
||||
dimension=384, # Match embedding model
|
||||
metric="cosine"
|
||||
)
|
||||
|
||||
index = pinecone.Index(index_name)
|
||||
|
||||
# Generate embeddings and upsert
|
||||
model = SentenceTransformer('all-MiniLM-L6-v2')
|
||||
for chunk in chunks:
|
||||
embedding = model.encode(chunk["content"])
|
||||
index.upsert(
|
||||
vectors=[{
|
||||
"id": chunk["id"],
|
||||
"values": embedding.tolist(),
|
||||
"metadata": chunk.get("metadata", {})
|
||||
}]
|
||||
)
|
||||
|
||||
# Search
|
||||
query_embedding = model.encode("search query")
|
||||
results = index.query(
|
||||
vector=query_embedding.tolist(),
|
||||
top_k=5,
|
||||
include_metadata=True
|
||||
)
|
||||
```
|
||||
|
||||
### Weaviate
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install weaviate-client
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- GraphQL API
|
||||
- Hybrid search (dense + sparse)
|
||||
- Real-time updates
|
||||
- Schema validation
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import weaviate
|
||||
|
||||
# Connect to Weaviate
|
||||
client = weaviate.Client("http://localhost:8080")
|
||||
|
||||
# Define schema
|
||||
client.schema.create_class({
|
||||
"class": "DocumentChunk",
|
||||
"description": "A chunk of document content",
|
||||
"properties": [
|
||||
{
|
||||
"name": "content",
|
||||
"dataType": ["text"]
|
||||
},
|
||||
{
|
||||
"name": "source",
|
||||
"dataType": ["string"]
|
||||
}
|
||||
]
|
||||
})
|
||||
|
||||
# Add data
|
||||
for chunk in chunks:
|
||||
client.data_object.create(
|
||||
data_object={
|
||||
"content": chunk["content"],
|
||||
"source": chunk.get("source", "unknown")
|
||||
},
|
||||
class_name="DocumentChunk"
|
||||
)
|
||||
|
||||
# Search
|
||||
results = client.query.get(
|
||||
"DocumentChunk",
|
||||
["content", "source"]
|
||||
).with_near_text({
|
||||
"concepts": ["search query"]
|
||||
}).with_limit(5).do()
|
||||
```
|
||||
|
||||
## Evaluation and Testing
|
||||
|
||||
### RAGAS
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install ragas
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- RAG evaluation metrics
|
||||
- Answer quality assessment
|
||||
- Context relevance measurement
|
||||
- Faithfulness evaluation
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from ragas import evaluate
|
||||
from ragas.metrics import (
|
||||
faithfulness,
|
||||
answer_relevancy,
|
||||
context_relevancy,
|
||||
context_recall
|
||||
)
|
||||
from datasets import Dataset
|
||||
|
||||
# Prepare evaluation data
|
||||
dataset = Dataset.from_dict({
|
||||
"question": ["What is chunking?"],
|
||||
"answer": ["Chunking is the process of breaking large documents into smaller segments"],
|
||||
"contexts": [["Chunking involves dividing text into manageable pieces for better processing"]],
|
||||
"ground_truth": ["Chunking is a document processing technique"]
|
||||
})
|
||||
|
||||
# Evaluate
|
||||
result = evaluate(
|
||||
dataset=dataset,
|
||||
metrics=[
|
||||
faithfulness,
|
||||
answer_relevancy,
|
||||
context_relevancy,
|
||||
context_recall
|
||||
]
|
||||
)
|
||||
|
||||
print(result)
|
||||
```
|
||||
|
||||
### TruEra (TruLens)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install trulens trulens-apps
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- LLM application evaluation
|
||||
- Feedback functions
|
||||
- Hallucination detection
|
||||
- Performance monitoring
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from trulens.core import TruSession
|
||||
from trulens.apps.custom import instrument
|
||||
from trulens.feedback import GroundTruthAgreement
|
||||
|
||||
# Initialize session
|
||||
session = TruSession()
|
||||
|
||||
# Define feedback functions
|
||||
f_groundedness = GroundTruthAgreement(ground_truth)
|
||||
|
||||
# Evaluate chunks
|
||||
@instrument
|
||||
def chunk_and_query(text, query):
|
||||
chunks = chunk_function(text)
|
||||
relevant_chunks = search_function(chunks, query)
|
||||
answer = generate_function(relevant_chunks, query)
|
||||
return answer
|
||||
|
||||
# Record evaluation
|
||||
with session:
|
||||
chunk_and_query("large document text", "what is the main topic?")
|
||||
```
|
||||
|
||||
## Document Processing
|
||||
|
||||
### PyPDF2
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install PyPDF2
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- PDF text extraction
|
||||
- Page manipulation
|
||||
- Metadata extraction
|
||||
- Form field processing
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import PyPDF2
|
||||
|
||||
def extract_text_from_pdf(pdf_path):
|
||||
text = ""
|
||||
with open(pdf_path, 'rb') as file:
|
||||
reader = PyPDF2.PdfReader(file)
|
||||
for page in reader.pages:
|
||||
text += page.extract_text()
|
||||
return text
|
||||
|
||||
# Extract text by page for better chunking
|
||||
def extract_pages(pdf_path):
|
||||
pages = []
|
||||
with open(pdf_path, 'rb') as file:
|
||||
reader = PyPDF2.PdfReader(file)
|
||||
for i, page in enumerate(reader.pages):
|
||||
pages.append({
|
||||
"page_number": i + 1,
|
||||
"content": page.extract_text()
|
||||
})
|
||||
return pages
|
||||
```
|
||||
|
||||
### python-docx
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install python-docx
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Microsoft Word document processing
|
||||
- Paragraph and table extraction
|
||||
- Style preservation
|
||||
- Metadata access
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from docx import Document
|
||||
|
||||
def extract_from_docx(docx_path):
|
||||
doc = Document(docx_path)
|
||||
content = []
|
||||
|
||||
for paragraph in doc.paragraphs:
|
||||
if paragraph.text.strip():
|
||||
content.append({
|
||||
"type": "paragraph",
|
||||
"text": paragraph.text,
|
||||
"style": paragraph.style.name
|
||||
})
|
||||
|
||||
for table in doc.tables:
|
||||
table_text = []
|
||||
for row in table.rows:
|
||||
row_text = [cell.text for cell in row.cells]
|
||||
table_text.append(" | ".join(row_text))
|
||||
|
||||
content.append({
|
||||
"type": "table",
|
||||
"text": "\n".join(table_text)
|
||||
})
|
||||
|
||||
return content
|
||||
```
|
||||
|
||||
## Specialized Libraries
|
||||
|
||||
### tiktoken (OpenAI)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install tiktoken
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Accurate token counting for OpenAI models
|
||||
- Fast encoding/decoding
|
||||
- Multiple model support
|
||||
- Language model specific tokenization
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import tiktoken
|
||||
|
||||
# Get encoding for specific model
|
||||
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
|
||||
|
||||
# Encode text
|
||||
tokens = encoding.encode("This is a sample text")
|
||||
print(f"Token count: {len(tokens)}")
|
||||
|
||||
# Decode tokens
|
||||
text = encoding.decode(tokens)
|
||||
|
||||
# Count tokens without full encoding
|
||||
def count_tokens(text, model="gpt-3.5-turbo"):
|
||||
encoding = tiktoken.encoding_for_model(model)
|
||||
return len(encoding.encode(text))
|
||||
|
||||
# Use in chunking
|
||||
def chunk_by_tokens(text, max_tokens=1000):
|
||||
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
|
||||
tokens = encoding.encode(text)
|
||||
|
||||
chunks = []
|
||||
for i in range(0, len(tokens), max_tokens):
|
||||
chunk_tokens = tokens[i:i + max_tokens]
|
||||
chunk_text = encoding.decode(chunk_tokens)
|
||||
chunks.append(chunk_text)
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
### PDFMiner
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install pdfminer.six
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Detailed PDF analysis
|
||||
- Layout preservation
|
||||
- Font and style information
|
||||
- High-precision text extraction
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from pdfminer.high_level import extract_pages
|
||||
from pdfminer.layout import LTTextContainer
|
||||
|
||||
def extract_structured_text(pdf_path):
|
||||
structured_content = []
|
||||
|
||||
for page_layout in extract_pages(pdf_path):
|
||||
page_content = []
|
||||
|
||||
for element in page_layout:
|
||||
if isinstance(element, LTTextContainer):
|
||||
text = element.get_text()
|
||||
font_info = {
|
||||
"font_size": element.height,
|
||||
"is_bold": "Bold" in element.fontname,
|
||||
"x0": element.x0,
|
||||
"y0": element.y0
|
||||
}
|
||||
page_content.append({
|
||||
"text": text.strip(),
|
||||
"font_info": font_info
|
||||
})
|
||||
|
||||
structured_content.append({
|
||||
"page_number": page_layout.pageid,
|
||||
"content": page_content
|
||||
})
|
||||
|
||||
return structured_content
|
||||
```
|
||||
|
||||
## Performance and Optimization
|
||||
|
||||
### Dask
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install dask[complete]
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Parallel processing
|
||||
- Out-of-core computation
|
||||
- Distributed computing
|
||||
- Integration with pandas
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import dask.bag as db
|
||||
from dask.distributed import Client
|
||||
|
||||
# Setup distributed client
|
||||
client = Client(n_workers=4)
|
||||
|
||||
# Parallel chunking of multiple documents
|
||||
def chunk_document(document):
|
||||
# Your chunking logic here
|
||||
return chunk_function(document)
|
||||
|
||||
# Process documents in parallel
|
||||
documents = ["doc1", "doc2", "doc3", ...] # List of document contents
|
||||
document_bag = db.from_sequence(documents)
|
||||
|
||||
# Apply chunking function in parallel
|
||||
chunked_documents = document_bag.map(chunk_document)
|
||||
|
||||
# Compute results
|
||||
results = chunked_documents.compute()
|
||||
```
|
||||
|
||||
### Ray
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install ray
|
||||
```
|
||||
|
||||
**Key Features**:
|
||||
- Distributed computing
|
||||
- Actor model
|
||||
- Autoscaling
|
||||
- ML pipeline integration
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
import ray
|
||||
|
||||
# Initialize Ray
|
||||
ray.init()
|
||||
|
||||
@ray.remote
|
||||
class ChunkingWorker:
|
||||
def __init__(self, strategy):
|
||||
self.strategy = strategy
|
||||
|
||||
def chunk_documents(self, documents):
|
||||
results = []
|
||||
for doc in documents:
|
||||
chunks = self.strategy.chunk(doc)
|
||||
results.append(chunks)
|
||||
return results
|
||||
|
||||
# Create workers
|
||||
workers = [ChunkingWorker.remote(strategy) for _ in range(4)]
|
||||
|
||||
# Distribute work
|
||||
documents_batch = [documents[i::4] for i in range(4)]
|
||||
futures = [worker.chunk_documents.remote(batch)
|
||||
for worker, batch in zip(workers, documents_batch)]
|
||||
|
||||
# Get results
|
||||
results = ray.get(futures)
|
||||
```
|
||||
|
||||
## Development and Testing
|
||||
|
||||
### pytest
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install pytest pytest-asyncio
|
||||
```
|
||||
|
||||
**Example Tests**:
|
||||
|
||||
```python
|
||||
import pytest
|
||||
from your_chunking_module import FixedSizeChunker, SemanticChunker
|
||||
|
||||
class TestFixedSizeChunker:
|
||||
def test_chunk_size_respect(self):
|
||||
chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=10)
|
||||
text = "word " * 50 # 50 words
|
||||
|
||||
chunks = chunker.chunk(text)
|
||||
|
||||
for chunk in chunks:
|
||||
assert len(chunk.split()) <= 100 # Account for word boundaries
|
||||
|
||||
def test_overlap_consistency(self):
|
||||
chunker = FixedSizeChunker(chunk_size=50, chunk_overlap=10)
|
||||
text = "word " * 30
|
||||
|
||||
chunks = chunker.chunk(text)
|
||||
|
||||
# Check overlap between consecutive chunks
|
||||
for i in range(1, len(chunks)):
|
||||
chunk1_words = set(chunks[i-1].split()[-10:])
|
||||
chunk2_words = set(chunks[i].split()[:10])
|
||||
overlap = len(chunk1_words & chunk2_words)
|
||||
assert overlap >= 5 # Allow some tolerance
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_semantic_chunker():
|
||||
chunker = SemanticChunker()
|
||||
text = "First topic sentence. Another sentence about first topic. " \
|
||||
"Now switching to second topic. More about second topic."
|
||||
|
||||
chunks = await chunker.chunk_async(text)
|
||||
|
||||
# Should detect topic change and create boundary
|
||||
assert len(chunks) >= 2
|
||||
assert "first topic" in chunks[0].lower()
|
||||
assert "second topic" in chunks[1].lower()
|
||||
```
|
||||
|
||||
### Memory Profiler
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install memory-profiler
|
||||
```
|
||||
|
||||
**Example Usage**:
|
||||
|
||||
```python
|
||||
from memory_profiler import profile
|
||||
|
||||
@profile
|
||||
def chunk_large_document():
|
||||
chunker = FixedSizeChunker(chunk_size=1000)
|
||||
large_text = "word " * 100000 # Large document
|
||||
|
||||
chunks = chunker.chunk(large_text)
|
||||
return chunks
|
||||
|
||||
# Run with: python -m memory_profiler your_script.py
|
||||
```
|
||||
|
||||
This comprehensive toolset provides everything needed to implement, test, and optimize chunking strategies for various use cases, from simple text processing to production-grade RAG systems.
|
||||
1403
skills/ai/chunking-strategy/references/visualization-tools.md
Normal file
1403
skills/ai/chunking-strategy/references/visualization-tools.md
Normal file
File diff suppressed because it is too large
Load Diff
302
skills/ai/prompt-engineering/SKILL.md
Normal file
302
skills/ai/prompt-engineering/SKILL.md
Normal file
@@ -0,0 +1,302 @@
|
||||
---
|
||||
name: prompt-engineering
|
||||
category: backend
|
||||
tags: [prompt-engineering, few-shot-learning, chain-of-thought, optimization, templates, system-prompts, llm-performance, ai-patterns]
|
||||
version: 1.0.0
|
||||
description: This skill should be used when creating, optimizing, or implementing advanced prompt patterns including few-shot learning, chain-of-thought reasoning, prompt optimization workflows, template systems, and system prompt design. It provides comprehensive frameworks for building production-ready prompts with measurable performance improvements.
|
||||
---
|
||||
|
||||
# Prompt Engineering
|
||||
|
||||
This skill provides comprehensive frameworks for creating, optimizing, and implementing advanced prompt patterns that significantly improve LLM performance across various tasks and models.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Creating new prompts for complex reasoning or analytical tasks
|
||||
- Optimizing existing prompts for better accuracy or efficiency
|
||||
- Implementing few-shot learning with strategic example selection
|
||||
- Designing chain-of-thought reasoning for multi-step problems
|
||||
- Building reusable prompt templates and systems
|
||||
- Developing system prompts for consistent model behavior
|
||||
- Troubleshooting poor prompt performance or failure modes
|
||||
- Scaling prompt systems for production use cases
|
||||
|
||||
## Core Prompt Engineering Patterns
|
||||
|
||||
### 1. Few-Shot Learning Implementation
|
||||
|
||||
Select examples using semantic similarity and diversity sampling to maximize learning within context window constraints.
|
||||
|
||||
#### Example Selection Strategy
|
||||
- Use `references/few-shot-patterns.md` for comprehensive selection frameworks
|
||||
- Balance example count (3-5 optimal) with context window limitations
|
||||
- Include edge cases and boundary conditions in example sets
|
||||
- Prioritize diverse examples that cover problem space variations
|
||||
- Order examples from simple to complex for progressive learning
|
||||
|
||||
#### Few-Shot Template Structure
|
||||
```
|
||||
Example 1 (Basic case):
|
||||
Input: {representative_input}
|
||||
Output: {expected_output}
|
||||
|
||||
Example 2 (Edge case):
|
||||
Input: {challenging_input}
|
||||
Output: {robust_output}
|
||||
|
||||
Example 3 (Error case):
|
||||
Input: {problematic_input}
|
||||
Output: {corrected_output}
|
||||
|
||||
Now handle: {target_input}
|
||||
```
|
||||
|
||||
### 2. Chain-of-Thought Reasoning
|
||||
|
||||
Elicit step-by-step reasoning for complex problem-solving through structured thinking patterns.
|
||||
|
||||
#### Implementation Patterns
|
||||
- Reference `references/cot-patterns.md` for detailed reasoning frameworks
|
||||
- Use "Let's think step by step" for zero-shot CoT initiation
|
||||
- Provide complete reasoning traces for few-shot CoT demonstrations
|
||||
- Implement self-consistency by sampling multiple reasoning paths
|
||||
- Include verification and validation steps in reasoning chains
|
||||
|
||||
#### CoT Template Structure
|
||||
```
|
||||
Let's approach this step-by-step:
|
||||
|
||||
Step 1: {break_down_the_problem}
|
||||
Analysis: {detailed_reasoning}
|
||||
|
||||
Step 2: {identify_key_components}
|
||||
Analysis: {component_analysis}
|
||||
|
||||
Step 3: {synthesize_solution}
|
||||
Analysis: {solution_justification}
|
||||
|
||||
Final Answer: {conclusion_with_confidence}
|
||||
```
|
||||
|
||||
### 3. Prompt Optimization Workflows
|
||||
|
||||
Implement iterative refinement processes with measurable performance metrics and systematic A/B testing.
|
||||
|
||||
#### Optimization Process
|
||||
- Use `references/optimization-frameworks.md` for comprehensive optimization strategies
|
||||
- Measure baseline performance before optimization attempts
|
||||
- Implement single-variable changes for accurate attribution
|
||||
- Track metrics: accuracy, consistency, latency, token efficiency
|
||||
- Use statistical significance testing for A/B validation
|
||||
- Document optimization iterations and their impacts
|
||||
|
||||
#### Performance Metrics Framework
|
||||
- **Accuracy**: Task completion rate and output correctness
|
||||
- **Consistency**: Response stability across multiple runs
|
||||
- **Efficiency**: Token usage and response time optimization
|
||||
- **Robustness**: Performance across edge cases and variations
|
||||
- **Safety**: Adherence to guidelines and harm prevention
|
||||
|
||||
### 4. Template Systems Architecture
|
||||
|
||||
Build modular, reusable prompt components with variable interpolation and conditional sections.
|
||||
|
||||
#### Template Design Principles
|
||||
- Reference `references/template-systems.md` for modular template frameworks
|
||||
- Use clear variable naming conventions (e.g., `{user_input}`, `{context}`)
|
||||
- Implement conditional sections for different scenario handling
|
||||
- Design role-based templates for specific use cases
|
||||
- Create hierarchical template composition patterns
|
||||
|
||||
#### Template Structure Example
|
||||
```
|
||||
# System Context
|
||||
You are a {role} with {expertise_level} expertise in {domain}.
|
||||
|
||||
# Task Context
|
||||
{if background_information}
|
||||
Background: {background_information}
|
||||
{endif}
|
||||
|
||||
# Instructions
|
||||
{task_instructions}
|
||||
|
||||
# Examples
|
||||
{example_count}
|
||||
|
||||
# Output Format
|
||||
{output_specification}
|
||||
|
||||
# Input
|
||||
{user_query}
|
||||
```
|
||||
|
||||
### 5. System Prompt Design
|
||||
|
||||
Design comprehensive system prompts that establish consistent model behavior, output formats, and safety constraints.
|
||||
|
||||
#### System Prompt Components
|
||||
- Use `references/system-prompt-design.md` for detailed design guidelines
|
||||
- Define clear role specification and expertise boundaries
|
||||
- Establish output format requirements and structural constraints
|
||||
- Include safety guidelines and content policy adherence
|
||||
- Set context for background information and domain knowledge
|
||||
|
||||
#### System Prompt Framework
|
||||
```
|
||||
You are an expert {role} specializing in {domain} with {experience_level} of experience.
|
||||
|
||||
## Core Capabilities
|
||||
- List specific capabilities and expertise areas
|
||||
- Define scope of knowledge and limitations
|
||||
|
||||
## Behavioral Guidelines
|
||||
- Specify interaction style and communication approach
|
||||
- Define error handling and uncertainty protocols
|
||||
- Establish quality standards and verification requirements
|
||||
|
||||
## Output Requirements
|
||||
- Specify format expectations and structural requirements
|
||||
- Define content inclusion and exclusion criteria
|
||||
- Establish consistency and validation requirements
|
||||
|
||||
## Safety and Ethics
|
||||
- Include content policy adherence
|
||||
- Specify bias mitigation requirements
|
||||
- Define harm prevention protocols
|
||||
```
|
||||
|
||||
## Implementation Workflows
|
||||
|
||||
### Workflow 1: Create New Prompt from Requirements
|
||||
|
||||
1. **Analyze Requirements**
|
||||
- Identify task complexity and reasoning requirements
|
||||
- Determine target model capabilities and limitations
|
||||
- Define success criteria and evaluation metrics
|
||||
- Assess need for few-shot learning or CoT reasoning
|
||||
|
||||
2. **Select Pattern Strategy**
|
||||
- Use few-shot learning for classification or transformation tasks
|
||||
- Apply CoT for complex reasoning or multi-step problems
|
||||
- Implement template systems for reusable prompt architecture
|
||||
- Design system prompts for consistent behavior requirements
|
||||
|
||||
3. **Draft Initial Prompt**
|
||||
- Structure prompt with clear sections and logical flow
|
||||
- Include relevant examples or reasoning demonstrations
|
||||
- Specify output format and quality requirements
|
||||
- Incorporate safety guidelines and constraints
|
||||
|
||||
4. **Validate and Test**
|
||||
- Test with diverse input scenarios including edge cases
|
||||
- Measure performance against defined success criteria
|
||||
- Iterate refinement based on testing results
|
||||
- Document optimization decisions and their rationale
|
||||
|
||||
### Workflow 2: Optimize Existing Prompt
|
||||
|
||||
1. **Performance Analysis**
|
||||
- Measure current prompt performance metrics
|
||||
- Identify failure modes and error patterns
|
||||
- Analyze token efficiency and response latency
|
||||
- Assess consistency across multiple runs
|
||||
|
||||
2. **Optimization Strategy**
|
||||
- Apply systematic A/B testing with single-variable changes
|
||||
- Use few-shot learning to improve task adherence
|
||||
- Implement CoT reasoning for complex task components
|
||||
- Refine template structure for better clarity
|
||||
|
||||
3. **Implementation and Testing**
|
||||
- Deploy optimized prompts with controlled rollout
|
||||
- Monitor performance metrics in production environment
|
||||
- Compare against baseline using statistical significance
|
||||
- Document improvements and lessons learned
|
||||
|
||||
### Workflow 3: Scale Prompt Systems
|
||||
|
||||
1. **Modular Architecture Design**
|
||||
- Decompose complex prompts into reusable components
|
||||
- Create template inheritance hierarchies
|
||||
- Implement dynamic example selection systems
|
||||
- Build automated quality assurance frameworks
|
||||
|
||||
2. **Production Integration**
|
||||
- Implement prompt versioning and rollback capabilities
|
||||
- Create performance monitoring and alerting systems
|
||||
- Build automated testing frameworks for prompt validation
|
||||
- Establish update and deployment workflows
|
||||
|
||||
## Quality Assurance
|
||||
|
||||
### Validation Requirements
|
||||
- Test prompts with at least 10 diverse scenarios
|
||||
- Include edge cases, boundary conditions, and failure modes
|
||||
- Verify output format compliance and structural consistency
|
||||
- Validate safety guideline adherence and harm prevention
|
||||
- Measure performance across multiple model runs
|
||||
|
||||
### Performance Standards
|
||||
- Achieve >90% task completion for well-defined use cases
|
||||
- Maintain <5% variance across multiple runs for consistency
|
||||
- Optimize token usage without sacrificing accuracy
|
||||
- Ensure response latency meets application requirements
|
||||
- Demonstrate robust handling of edge cases and unexpected inputs
|
||||
|
||||
## Integration with Other Skills
|
||||
|
||||
This skill integrates seamlessly with:
|
||||
- **langchain4j-ai-services-patterns**: Interface-based prompt design
|
||||
- **langchain4j-rag-implementation-patterns**: Context-enhanced prompting
|
||||
- **langchain4j-testing-strategies**: Prompt validation frameworks
|
||||
- **unit-test-parameterized**: Systematic prompt testing approaches
|
||||
|
||||
## Resources and References
|
||||
|
||||
- `references/few-shot-patterns.md`: Comprehensive few-shot learning frameworks
|
||||
- `references/cot-patterns.md`: Chain-of-thought reasoning patterns and examples
|
||||
- `references/optimization-frameworks.md`: Systematic prompt optimization methodologies
|
||||
- `references/template-systems.md`: Modular template design and implementation
|
||||
- `references/system-prompt-design.md`: System prompt architecture and best practices
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Example 1: Classification Task with Few-Shot Learning
|
||||
```
|
||||
Classify customer feedback into categories using semantic similarity for example selection and diversity sampling for edge case coverage.
|
||||
```
|
||||
|
||||
### Example 2: Complex Reasoning with Chain-of-Thought
|
||||
```
|
||||
Implement step-by-step reasoning for financial analysis with verification steps and confidence scoring.
|
||||
```
|
||||
|
||||
### Example 3: Template System for Customer Service
|
||||
```
|
||||
Create modular templates with role-based components and conditional sections for different inquiry types.
|
||||
```
|
||||
|
||||
### Example 4: System Prompt for Code Generation
|
||||
```
|
||||
Design comprehensive system prompt with behavioral guidelines, output requirements, and safety constraints.
|
||||
```
|
||||
|
||||
## Common Pitfalls and Solutions
|
||||
|
||||
- **Overfitting examples**: Use diverse example sets with semantic variety
|
||||
- **Context window overflow**: Implement strategic example selection and compression
|
||||
- **Inconsistent outputs**: Specify clear output formats and validation requirements
|
||||
- **Poor generalization**: Include edge cases and boundary conditions in training examples
|
||||
- **Safety violations**: Incorporate comprehensive content policies and harm prevention
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
- Monitor token usage and implement compression strategies
|
||||
- Use caching for repeated prompt components
|
||||
- Optimize example selection for maximum learning efficiency
|
||||
- Implement progressive disclosure for complex prompt systems
|
||||
- Balance prompt complexity with response quality requirements
|
||||
|
||||
This skill provides the foundational patterns and methodologies for building production-ready prompt systems that consistently deliver high performance across diverse use cases and model types.
|
||||
426
skills/ai/prompt-engineering/references/cot-patterns.md
Normal file
426
skills/ai/prompt-engineering/references/cot-patterns.md
Normal file
@@ -0,0 +1,426 @@
|
||||
# Chain-of-Thought Reasoning Patterns
|
||||
|
||||
This reference provides comprehensive frameworks for implementing effective chain-of-thought (CoT) reasoning that improves model performance on complex, multi-step problems.
|
||||
|
||||
## Core Principles
|
||||
|
||||
### Step-by-Step Reasoning Elicitation
|
||||
|
||||
#### Problem Decomposition Strategy
|
||||
- Break complex problems into manageable sub-problems
|
||||
- Identify dependencies and relationships between components
|
||||
- Establish logical flow and sequence of reasoning steps
|
||||
- Define clear decision points and validation criteria
|
||||
|
||||
#### Verification and Validation Integration
|
||||
- Include self-checking mechanisms at critical junctures
|
||||
- Implement consistency checks across reasoning steps
|
||||
- Add confidence scoring for uncertain conclusions
|
||||
- Provide fallback strategies for ambiguous situations
|
||||
|
||||
## Zero-Shot Chain-of-Thought Patterns
|
||||
|
||||
### Basic CoT Initiation
|
||||
```
|
||||
Let's think step by step to solve this problem:
|
||||
|
||||
1. First, I need to understand what the question is asking for
|
||||
2. Then, I'll identify the key information and constraints
|
||||
3. Next, I'll consider different approaches to solve it
|
||||
4. I'll work through the solution methodically
|
||||
5. Finally, I'll verify my answer makes sense
|
||||
|
||||
Problem: {problem_statement}
|
||||
|
||||
Step 1: Understanding the question
|
||||
{analysis}
|
||||
|
||||
Step 2: Key information and constraints
|
||||
{information_analysis}
|
||||
|
||||
Step 3: Solution approach
|
||||
{approach_analysis}
|
||||
|
||||
Step 4: Working through the solution
|
||||
{detailed_solution}
|
||||
|
||||
Step 5: Verification
|
||||
{verification}
|
||||
|
||||
Final Answer: {conclusion}
|
||||
```
|
||||
|
||||
### Enhanced CoT with Confidence
|
||||
```
|
||||
Let me think through this systematically, breaking down the problem and checking my reasoning at each step.
|
||||
|
||||
**Problem**: {problem_description}
|
||||
|
||||
**Step 1: Problem Analysis**
|
||||
- What am I being asked to solve?
|
||||
- What information is provided?
|
||||
- What are the constraints?
|
||||
- My confidence in understanding: {score}/10
|
||||
|
||||
**Step 2: Strategy Selection**
|
||||
- Possible approaches:
|
||||
1. {approach_1}
|
||||
2. {approach_2}
|
||||
3. {approach_3}
|
||||
- Selected approach: {chosen_approach}
|
||||
- Rationale: {reasoning_for_choice}
|
||||
|
||||
**Step 3: Execution**
|
||||
- {detailed_step_by_step_solution}
|
||||
|
||||
**Step 4: Verification**
|
||||
- Does the answer make sense?
|
||||
- Have I addressed all parts of the question?
|
||||
- Confidence in final answer: {score}/10
|
||||
|
||||
**Final Answer**: {solution_with_confidence_score}
|
||||
```
|
||||
|
||||
## Few-Shot Chain-of-Thought Patterns
|
||||
|
||||
### Mathematical Reasoning Template
|
||||
```
|
||||
Solve the following math problem step by step.
|
||||
|
||||
Example 1:
|
||||
Problem: A store sells apples for $2 each and oranges for $3 each. If John buys 4 apples and 2 oranges, and spends exactly $14, how much does each fruit cost?
|
||||
|
||||
Step 1: Set up the equation
|
||||
Let a = cost of apples, o = cost of oranges
|
||||
4a + 2o = 14
|
||||
|
||||
Step 2: Simplify the equation
|
||||
Divide both sides by 2: 2a + o = 7
|
||||
|
||||
Step 3: Test integer solutions
|
||||
If a = 2, then 2(2) + o = 7 → o = 3
|
||||
If a = 3, then 2(3) + o = 7 → o = 1
|
||||
|
||||
Step 4: Verify the original problem
|
||||
For a=2, o=3: 4(2) + 2(3) = 8 + 6 = 14 ✓
|
||||
For a=3, o=1: 4(3) + 2(1) = 12 + 2 = 14 ✓
|
||||
|
||||
Both solutions work, so apples could cost $2 and oranges $3, OR apples $3 and oranges $1.
|
||||
|
||||
Example 2:
|
||||
Problem: Sarah is twice as old as Tom was 3 years ago. In 5 years, Sarah will be 1.5 times Tom's current age. How old are they now?
|
||||
|
||||
Step 1: Define variables
|
||||
Let S = Sarah's current age, T = Tom's current age
|
||||
|
||||
Step 2: Set up equations
|
||||
"Sarah is twice as old as Tom was 3 years ago": S = 2(T - 3)
|
||||
"In 5 years, Sarah will be 1.5 times Tom's current age": S + 5 = 1.5T
|
||||
|
||||
Step 3: Solve the system
|
||||
From first equation: S = 2T - 6
|
||||
Substitute into second: (2T - 6) + 5 = 1.5T
|
||||
2T - 1 = 1.5T
|
||||
0.5T = 1
|
||||
T = 2
|
||||
|
||||
Step 4: Find Sarah's age
|
||||
S = 2(2) - 6 = 4 - 6 = -2
|
||||
|
||||
Step 5: Check for validity
|
||||
Negative age is impossible, so I must have made an error.
|
||||
Let me re-examine the problem...
|
||||
|
||||
Actually, let me try again:
|
||||
S = 2(T - 3) = 2T - 6
|
||||
S + 5 = 1.5T
|
||||
Substitute: (2T - 6) + 5 = 1.5T
|
||||
2T - 1 = 1.5T
|
||||
0.5T = 1
|
||||
T = 2 years old
|
||||
S = 2(2 - 3) = 2(-1) = -2
|
||||
|
||||
This gives a negative age. The problem might have no solution with integer ages.
|
||||
|
||||
Now solve:
|
||||
Problem: {math_problem}
|
||||
|
||||
Step 1: Define variables
|
||||
{variable_definitions}
|
||||
|
||||
Step 2: Set up equations
|
||||
{equation_setup}
|
||||
|
||||
Step 3: Solve the system
|
||||
{solution_process}
|
||||
|
||||
Step 4: Verify the solution
|
||||
{verification}
|
||||
|
||||
Final Answer: {answer}
|
||||
```
|
||||
|
||||
### Logical Reasoning Template
|
||||
```
|
||||
Analyze the logical argument and determine if it's valid.
|
||||
|
||||
Example 1:
|
||||
Premise 1: All birds can fly
|
||||
Premise 2: Penguins are birds
|
||||
Conclusion: Therefore, penguins can fly
|
||||
|
||||
Step 1: Analyze the structure
|
||||
This is a syllogism with form:
|
||||
All A are B
|
||||
C is A
|
||||
Therefore, C is B
|
||||
|
||||
Step 2: Evaluate premise validity
|
||||
Premise 1: "All birds can fly" - This is false (penguins, ostriches cannot fly)
|
||||
Premise 2: "Penguins are birds" - This is true
|
||||
|
||||
Step 3: Check logical validity
|
||||
The logical structure is valid, but since Premise 1 is false, the conclusion may not be true
|
||||
|
||||
Step 4: Real-world verification
|
||||
In reality, penguins cannot fly despite being birds
|
||||
|
||||
Conclusion: The argument is logically valid but soundness fails due to false premise
|
||||
|
||||
Example 2:
|
||||
Premise 1: If it rains, then the ground gets wet
|
||||
Premise 2: It is raining
|
||||
Conclusion: Therefore, the ground gets wet
|
||||
|
||||
Step 1: Analyze the structure
|
||||
This is modus ponens:
|
||||
If P, then Q
|
||||
P
|
||||
Therefore, Q
|
||||
|
||||
Step 2: Evaluate premise validity
|
||||
Premise 1: "If it rains, then the ground gets wet" - Generally true
|
||||
Premise 2: "It is raining" - Given as true
|
||||
|
||||
Step 3: Check logical validity
|
||||
Modus ponens is a valid argument form
|
||||
|
||||
Step 4: Verify the conclusion
|
||||
Given the premises, the conclusion follows logically
|
||||
|
||||
Conclusion: The argument is both logically valid and sound
|
||||
|
||||
Now analyze:
|
||||
Argument: {logical_argument}
|
||||
|
||||
Step 1: Analyze the argument structure
|
||||
{structure_analysis}
|
||||
|
||||
Step 2: Evaluate premise validity
|
||||
{premise_evaluation}
|
||||
|
||||
Step 3: Check logical validity
|
||||
{validity_check}
|
||||
|
||||
Step 4: Verify the conclusion
|
||||
{conclusion_verification}
|
||||
|
||||
Final Assessment: {argument_validity_assessment}
|
||||
```
|
||||
|
||||
## Self-Consistency Techniques
|
||||
|
||||
### Multiple Reasoning Paths
|
||||
```
|
||||
I'll solve this problem using three different approaches and see which result is most reliable.
|
||||
|
||||
**Problem**: {complex_problem}
|
||||
|
||||
**Approach 1: Direct Calculation**
|
||||
{first_approach_reasoning}
|
||||
Result 1: {result_1}
|
||||
|
||||
**Approach 2: Logical Deduction**
|
||||
{second_approach_reasoning}
|
||||
Result 2: {result_2}
|
||||
|
||||
**Approach 3: Pattern Recognition**
|
||||
{third_approach_reasoning}
|
||||
Result 3: {result_3}
|
||||
|
||||
**Consistency Analysis:**
|
||||
- Approach 1 and 2 agree: {yes/no}
|
||||
- Approach 1 and 3 agree: {yes/no}
|
||||
- Approach 2 and 3 agree: {yes/no}
|
||||
|
||||
**Final Decision:**
|
||||
{majority_result} appears in {count} out of 3 approaches.
|
||||
Confidence: {high/medium/low}
|
||||
|
||||
Most Likely Answer: {final_answer_with_confidence}
|
||||
```
|
||||
|
||||
### Verification Loop Pattern
|
||||
```
|
||||
Let me solve this step by step and verify each step.
|
||||
|
||||
**Problem**: {problem_description}
|
||||
|
||||
**Step 1: Initial Analysis**
|
||||
{initial_analysis}
|
||||
|
||||
Verification: Does this make sense? {verification_1}
|
||||
|
||||
**Step 2: Solution Development**
|
||||
{solution_development}
|
||||
|
||||
Verification: Does this logically follow from step 1? {verification_2}
|
||||
|
||||
**Step 3: Result Calculation**
|
||||
{result_calculation}
|
||||
|
||||
Verification: Does this answer the original question? {verification_3}
|
||||
|
||||
**Step 4: Cross-Check**
|
||||
Let me try a different approach to confirm:
|
||||
{alternative_approach}
|
||||
|
||||
Results comparison: {comparison_analysis}
|
||||
|
||||
**Final Answer:**
|
||||
{conclusion_with_verification_status}
|
||||
```
|
||||
|
||||
## Specialized CoT Patterns
|
||||
|
||||
### Code Debugging CoT
|
||||
```
|
||||
Debug the following code by analyzing it step by step.
|
||||
|
||||
**Code:**
|
||||
{code_snippet}
|
||||
|
||||
**Step 1: Understand the Code's Purpose**
|
||||
{purpose_analysis}
|
||||
|
||||
**Step 2: Identify Expected Behavior**
|
||||
{expected_behavior}
|
||||
|
||||
**Step 3: Trace the Execution**
|
||||
{execution_trace}
|
||||
|
||||
**Step 4: Find the Error**
|
||||
{error_identification}
|
||||
|
||||
**Step 5: Propose Fix**
|
||||
{fix_proposal}
|
||||
|
||||
**Step 6: Verify the Fix**
|
||||
{fix_verification}
|
||||
|
||||
**Fixed Code:**
|
||||
{corrected_code}
|
||||
```
|
||||
|
||||
### Data Analysis CoT
|
||||
```
|
||||
Analyze this data systematically to draw meaningful conclusions.
|
||||
|
||||
**Data:**
|
||||
{dataset}
|
||||
|
||||
**Step 1: Understand the Data Structure**
|
||||
{data_structure_analysis}
|
||||
|
||||
**Step 2: Identify Patterns and Trends**
|
||||
{pattern_identification}
|
||||
|
||||
**Step 3: Calculate Key Metrics**
|
||||
{metrics_calculation}
|
||||
|
||||
**Step 4: Compare with Benchmarks**
|
||||
{benchmark_comparison}
|
||||
|
||||
**Step 5: Formulate Insights**
|
||||
{insight_generation}
|
||||
|
||||
**Step 6: Validate Conclusions**
|
||||
{conclusion_validation}
|
||||
|
||||
**Key Findings:**
|
||||
{summary_of_insights}
|
||||
```
|
||||
|
||||
### Creative Problem Solving CoT
|
||||
```
|
||||
Generate creative solutions to this challenging problem.
|
||||
|
||||
**Problem:**
|
||||
{creative_problem}
|
||||
|
||||
**Step 1: Reframe the Problem**
|
||||
{problem_reframing}
|
||||
|
||||
**Step 2: Brainstorm Multiple Angles**
|
||||
- Technical approach: {technical_ideas}
|
||||
- Business approach: {business_ideas}
|
||||
- User experience approach: {ux_ideas}
|
||||
- Unconventional approach: {unconventional_ideas}
|
||||
|
||||
**Step 3: Evaluate Each Approach**
|
||||
{approach_evaluation}
|
||||
|
||||
**Step 4: Synthesize Best Elements**
|
||||
{synthesis_process}
|
||||
|
||||
**Step 5: Develop Final Solution**
|
||||
{solution_development}
|
||||
|
||||
**Step 6: Test for Feasibility**
|
||||
{feasibility_testing}
|
||||
|
||||
**Recommended Solution:**
|
||||
{final_creative_solution}
|
||||
```
|
||||
|
||||
## Implementation Guidelines
|
||||
|
||||
### When to Use Chain-of-Thought
|
||||
- **Multi-step problems**: Tasks requiring sequential reasoning
|
||||
- **Complex calculations**: Mathematical or logical derivations
|
||||
- **Problem decomposition**: Tasks that benefit from breaking down
|
||||
- **Verification needs**: When accuracy is critical
|
||||
- **Educational contexts**: When showing reasoning is valuable
|
||||
|
||||
### CoT Effectiveness Factors
|
||||
- **Problem complexity**: Higher benefit for complex problems
|
||||
- **Task type**: Mathematical, logical, and analytical tasks benefit most
|
||||
- **Model capability**: Newer models handle CoT more effectively
|
||||
- **Context window**: Ensure sufficient space for reasoning steps
|
||||
- **Output requirements**: Detailed explanations benefit from CoT
|
||||
|
||||
### Common Pitfalls to Avoid
|
||||
- **Over-explaining simple steps**: Keep proportional detail
|
||||
- **Circular reasoning**: Ensure logical progression
|
||||
- **Missing verification**: Always include validation steps
|
||||
- **Inconsistent confidence**: Use realistic confidence scoring
|
||||
- **Premature conclusions**: Don't jump to answers without full reasoning
|
||||
|
||||
## Integration with Other Techniques
|
||||
|
||||
### CoT + Few-Shot Learning
|
||||
- Include reasoning traces in examples
|
||||
- Show step-by-step problem-solving demonstrations
|
||||
- Teach verification and self-checking patterns
|
||||
|
||||
### CoT + Template Systems
|
||||
- Embed CoT patterns within structured templates
|
||||
- Use conditional CoT based on problem complexity
|
||||
- Implement adaptive reasoning depth
|
||||
|
||||
### CoT + Prompt Optimization
|
||||
- Test different CoT formulations
|
||||
- Optimize reasoning step granularity
|
||||
- Balance detail with efficiency
|
||||
|
||||
This framework provides comprehensive patterns for implementing effective chain-of-thought reasoning across diverse problem types and applications.
|
||||
273
skills/ai/prompt-engineering/references/few-shot-patterns.md
Normal file
273
skills/ai/prompt-engineering/references/few-shot-patterns.md
Normal file
@@ -0,0 +1,273 @@
|
||||
# Few-Shot Learning Patterns
|
||||
|
||||
This reference provides comprehensive frameworks for implementing effective few-shot learning strategies that maximize model performance within context window constraints.
|
||||
|
||||
## Core Principles
|
||||
|
||||
### Example Selection Strategy
|
||||
|
||||
#### Semantic Similarity Selection
|
||||
- Use embedding similarity to find examples closest to target input
|
||||
- Cluster similar examples to avoid redundancy
|
||||
- Select diverse representatives from different semantic regions
|
||||
- Prioritize examples that cover key variations in problem space
|
||||
|
||||
#### Diversity Sampling Approach
|
||||
- Ensure coverage of different input types and patterns
|
||||
- Include boundary cases and edge conditions
|
||||
- Balance simple and complex examples
|
||||
- Select examples that demonstrate different solution strategies
|
||||
|
||||
#### Progressive Complexity Ordering
|
||||
- Start with simplest, most straightforward examples
|
||||
- Progress to increasingly complex scenarios
|
||||
- Include challenging edge cases last
|
||||
- Use this ordering to build understanding incrementally
|
||||
|
||||
## Example Templates
|
||||
|
||||
### Classification Tasks
|
||||
|
||||
#### Binary Classification Template
|
||||
```
|
||||
Classify if the text expresses positive or negative sentiment.
|
||||
|
||||
Example 1:
|
||||
Text: "I love this product! It works exactly as advertised and exceeded my expectations."
|
||||
Sentiment: Positive
|
||||
Reasoning: Contains enthusiastic language, positive adjectives, and satisfaction indicators
|
||||
|
||||
Example 2:
|
||||
Text: "The customer service was terrible and the product broke after one day of use."
|
||||
Sentiment: Negative
|
||||
Reasoning: Contains negative adjectives, complaint language, and dissatisfaction indicators
|
||||
|
||||
Example 3:
|
||||
Text: "It's okay, nothing special but does the basic job."
|
||||
Sentiment: Negative
|
||||
Reasoning: Contains lukewarm language, lack of enthusiasm, minimal positive elements
|
||||
|
||||
Now classify:
|
||||
Text: {input_text}
|
||||
Sentiment:
|
||||
Reasoning:
|
||||
```
|
||||
|
||||
#### Multi-Class Classification Template
|
||||
```
|
||||
Categorize the customer inquiry into one of: Technical Support, Billing, Sales, or General.
|
||||
|
||||
Example 1:
|
||||
Inquiry: "My account was charged twice for the same subscription this month"
|
||||
Category: Billing
|
||||
Key indicators: "charged twice", "subscription", "account", financial terms
|
||||
|
||||
Example 2:
|
||||
Inquiry: "The app keeps crashing when I try to upload files larger than 10MB"
|
||||
Category: Technical Support
|
||||
Key indicators: "crashing", "upload files", "technical issue", "error report"
|
||||
|
||||
Example 3:
|
||||
Inquiry: "What are your pricing plans for enterprise customers?"
|
||||
Category: Sales
|
||||
Key indicators: "pricing plans", "enterprise", business inquiry, sales question
|
||||
|
||||
Now categorize:
|
||||
Inquiry: {inquiry_text}
|
||||
Category:
|
||||
Key indicators:
|
||||
```
|
||||
|
||||
### Transformation Tasks
|
||||
|
||||
#### Text Transformation Template
|
||||
```
|
||||
Convert formal business text into casual, friendly language.
|
||||
|
||||
Example 1:
|
||||
Formal: "We regret to inform you that your request cannot be processed at this time due to insufficient documentation."
|
||||
Casual: "Sorry, but we can't process your request right now because some documents are missing."
|
||||
|
||||
Example 2:
|
||||
Formal: "The aforementioned individual has demonstrated exceptional proficiency in the designated responsibilities."
|
||||
Casual: "They've done a great job with their tasks and really know what they're doing."
|
||||
|
||||
Example 3:
|
||||
Formal: "Please be advised that the scheduled meeting has been postponed pending further notice."
|
||||
Casual: "Hey, just letting you know that we've put off the meeting for now and will let you know when it's rescheduled."
|
||||
|
||||
Now convert:
|
||||
Formal: {formal_text}
|
||||
Casual:
|
||||
```
|
||||
|
||||
#### Data Extraction Template
|
||||
```
|
||||
Extract key information from the job posting into structured format.
|
||||
|
||||
Example 1:
|
||||
Job Posting: "We are seeking a Senior Software Engineer with 5+ years of experience in Python and cloud technologies. This is a remote position offering $120k-$150k salary plus equity."
|
||||
|
||||
Extracted:
|
||||
- Position: Senior Software Engineer
|
||||
- Experience Required: 5+ years
|
||||
- Skills: Python, cloud technologies
|
||||
- Location: Remote
|
||||
- Salary: $120k-$150k plus equity
|
||||
|
||||
Example 2:
|
||||
Job Posting: "Marketing Manager needed for growing startup. Must have 3 years experience in digital marketing, social media management, and content creation. San Francisco office, competitive compensation."
|
||||
|
||||
Extracted:
|
||||
- Position: Marketing Manager
|
||||
- Experience Required: 3 years
|
||||
- Skills: Digital marketing, social media management, content creation
|
||||
- Location: San Francisco
|
||||
- Salary: Competitive compensation
|
||||
|
||||
Now extract:
|
||||
Job Posting: {job_posting_text}
|
||||
Extracted:
|
||||
```
|
||||
|
||||
### Generation Tasks
|
||||
|
||||
#### Creative Writing Template
|
||||
```
|
||||
Generate compelling product descriptions following the shown patterns.
|
||||
|
||||
Example 1:
|
||||
Product: Wireless headphones with noise cancellation
|
||||
Description: "Immerse yourself in crystal-clear audio with our premium wireless headphones. Advanced noise cancellation technology blocks out distractions while 30-hour battery life keeps you connected all day long."
|
||||
|
||||
Example 2:
|
||||
Product: Smart home security camera
|
||||
Description: "Protect what matters most with intelligent monitoring that alerts you to activity instantly. AI-powered detection distinguishes between people, pets, and vehicles for truly smart security."
|
||||
|
||||
Example 3:
|
||||
Product: Portable espresso maker
|
||||
Description: "Barista-quality espresso anywhere, anytime. Compact design meets professional-grade extraction in this revolutionary portable machine that delivers perfect shots in under 30 seconds."
|
||||
|
||||
Now generate:
|
||||
Product: {product_description}
|
||||
Description:
|
||||
```
|
||||
|
||||
### Error Correction Patterns
|
||||
|
||||
#### Error Detection and Correction Template
|
||||
```
|
||||
Identify and correct errors in the given text.
|
||||
|
||||
Example 1:
|
||||
Text with errors: "Their going to the park to play there new game with they're friends."
|
||||
Correction: "They're going to the park to play their new game with their friends."
|
||||
Errors fixed: "Their → They're", "there → their", "they're → their"
|
||||
|
||||
Example 2:
|
||||
Text with errors: "The company's new policy effects every employee and there morale."
|
||||
Correction: "The company's new policy affects every employee and their morale."
|
||||
Errors fixed: "effects → affects", "there → their"
|
||||
|
||||
Example 3:
|
||||
Text with errors: "Its important to review you're work carefully before submiting."
|
||||
Correction: "It's important to review your work carefully before submitting."
|
||||
Errors fixed: "Its → It's", "you're → your", "submiting → submitting"
|
||||
|
||||
Now correct:
|
||||
Text with errors: {text_with_errors}
|
||||
Correction:
|
||||
Errors fixed:
|
||||
```
|
||||
|
||||
## Advanced Strategies
|
||||
|
||||
### Dynamic Example Selection
|
||||
|
||||
#### Context-Aware Selection
|
||||
```python
|
||||
def select_examples(input_text, example_database, max_examples=3):
|
||||
"""
|
||||
Select most relevant examples based on semantic similarity and diversity.
|
||||
"""
|
||||
# 1. Calculate similarity scores
|
||||
similarities = calculate_similarity(input_text, example_database)
|
||||
|
||||
# 2. Sort by similarity
|
||||
sorted_examples = sort_by_similarity(similarities)
|
||||
|
||||
# 3. Apply diversity sampling
|
||||
diverse_examples = diversity_sampling(sorted_examples, max_examples)
|
||||
|
||||
# 4. Order by complexity
|
||||
final_examples = order_by_complexity(diverse_examples)
|
||||
|
||||
return final_examples
|
||||
```
|
||||
|
||||
#### Adaptive Example Count
|
||||
```python
|
||||
def determine_example_count(input_complexity, context_limit):
|
||||
"""
|
||||
Adjust example count based on input complexity and available context.
|
||||
"""
|
||||
base_count = 3
|
||||
|
||||
# Complex inputs benefit from more examples
|
||||
if input_complexity > 0.8:
|
||||
return min(base_count + 2, context_limit)
|
||||
elif input_complexity > 0.5:
|
||||
return base_count + 1
|
||||
else:
|
||||
return max(base_count - 1, 2)
|
||||
```
|
||||
|
||||
### Quality Metrics for Examples
|
||||
|
||||
#### Example Effectiveness Scoring
|
||||
```python
|
||||
def score_example_effectiveness(example, test_cases):
|
||||
"""
|
||||
Score how effectively an example teaches the desired pattern.
|
||||
"""
|
||||
metrics = {
|
||||
'coverage': measure_pattern_coverage(example),
|
||||
'clarity': measure_instructional_clarity(example),
|
||||
'uniqueness': measure_uniqueness_from_other_examples(example),
|
||||
'difficulty': measure_appropriateness_difficulty(example)
|
||||
}
|
||||
|
||||
return weighted_average(metrics, weights=[0.3, 0.3, 0.2, 0.2])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Example Quality Guidelines
|
||||
- **Clarity**: Examples should clearly demonstrate the desired pattern
|
||||
- **Accuracy**: Input-output pairs must be correct and consistent
|
||||
- **Relevance**: Examples should be representative of target task
|
||||
- **Diversity**: Include variation in input types and complexity levels
|
||||
- **Completeness**: Cover edge cases and boundary conditions
|
||||
|
||||
### Context Management
|
||||
- **Token Efficiency**: Optimize example length while maintaining clarity
|
||||
- **Progressive Disclosure**: Start simple, increase complexity gradually
|
||||
- **Redundancy Elimination**: Remove overlapping or duplicate examples
|
||||
- **Compression**: Use concise representations where possible
|
||||
|
||||
### Common Pitfalls to Avoid
|
||||
- **Overfitting**: Don't include too many examples from same pattern
|
||||
- **Under-representation**: Ensure coverage of important variations
|
||||
- **Ambiguity**: Examples should have clear, unambiguous solutions
|
||||
- **Context Overflow**: Balance example count with window limitations
|
||||
- **Poor Ordering**: Place examples in logical progression order
|
||||
|
||||
## Integration with Other Patterns
|
||||
|
||||
Few-shot learning combines effectively with:
|
||||
- **Chain-of-Thought**: Add reasoning steps to examples
|
||||
- **Template Systems**: Use few-shot within structured templates
|
||||
- **Prompt Optimization**: Test different example selections
|
||||
- **System Prompts**: Establish few-shot learning expectations in system prompts
|
||||
|
||||
This framework provides the foundation for implementing effective few-shot learning across diverse tasks and model types.
|
||||
@@ -0,0 +1,488 @@
|
||||
# Prompt Optimization Frameworks
|
||||
|
||||
This reference provides systematic methodologies for iteratively improving prompt performance through structured testing, measurement, and refinement processes.
|
||||
|
||||
## Optimization Process Overview
|
||||
|
||||
### Iterative Improvement Cycle
|
||||
```mermaid
|
||||
graph TD
|
||||
A[Baseline Measurement] --> B[Hypothesis Generation]
|
||||
B --> C[Controlled Test]
|
||||
C --> D[Performance Analysis]
|
||||
D --> E[Statistical Validation]
|
||||
E --> F[Implementation Decision]
|
||||
F --> G[Monitor Impact]
|
||||
G --> H[Learn & Iterate]
|
||||
H --> B
|
||||
```
|
||||
|
||||
### Core Optimization Principles
|
||||
- **Single Variable Testing**: Change one element at a time for accurate attribution
|
||||
- **Measurable Metrics**: Define quantitative success criteria
|
||||
- **Statistical Significance**: Use proper sample sizes and validation methods
|
||||
- **Controlled Environment**: Test conditions must be consistent
|
||||
- **Baseline Comparison**: Always measure against established baseline
|
||||
|
||||
## Performance Metrics Framework
|
||||
|
||||
### Primary Metrics
|
||||
|
||||
#### Task Success Rate
|
||||
```python
|
||||
def calculate_success_rate(results, expected_outputs):
|
||||
"""
|
||||
Measure percentage of tasks completed correctly.
|
||||
"""
|
||||
correct = sum(1 for result, expected in zip(results, expected_outputs)
|
||||
if result == expected)
|
||||
return (correct / len(results)) * 100
|
||||
```
|
||||
|
||||
#### Response Consistency
|
||||
```python
|
||||
def measure_consistency(prompt, test_cases, num_runs=5):
|
||||
"""
|
||||
Measure response stability across multiple runs.
|
||||
"""
|
||||
responses = {}
|
||||
for test_case in test_cases:
|
||||
test_responses = []
|
||||
for _ in range(num_runs):
|
||||
response = execute_prompt(prompt, test_case)
|
||||
test_responses.append(response)
|
||||
|
||||
# Calculate similarity score for consistency
|
||||
consistency = calculate_similarity(test_responses)
|
||||
responses[test_case] = consistency
|
||||
|
||||
return sum(responses.values()) / len(responses)
|
||||
```
|
||||
|
||||
#### Token Efficiency
|
||||
```python
|
||||
def calculate_token_efficiency(prompt, test_cases):
|
||||
"""
|
||||
Measure token usage per successful task completion.
|
||||
"""
|
||||
total_tokens = 0
|
||||
successful_tasks = 0
|
||||
|
||||
for test_case in test_cases:
|
||||
response = execute_prompt_with_metrics(prompt, test_case)
|
||||
total_tokens += response.token_count
|
||||
if response.is_successful:
|
||||
successful_tasks += 1
|
||||
|
||||
return total_tokens / successful_tasks if successful_tasks > 0 else float('inf')
|
||||
```
|
||||
|
||||
#### Response Latency
|
||||
```python
|
||||
def measure_response_time(prompt, test_cases):
|
||||
"""
|
||||
Measure average response time.
|
||||
"""
|
||||
times = []
|
||||
for test_case in test_cases:
|
||||
start_time = time.time()
|
||||
execute_prompt(prompt, test_case)
|
||||
end_time = time.time()
|
||||
times.append(end_time - start_time)
|
||||
|
||||
return sum(times) / len(times)
|
||||
```
|
||||
|
||||
### Secondary Metrics
|
||||
|
||||
#### Output Quality Score
|
||||
```python
|
||||
def assess_output_quality(response, criteria):
|
||||
"""
|
||||
Multi-dimensional quality assessment.
|
||||
"""
|
||||
scores = {
|
||||
'accuracy': measure_accuracy(response),
|
||||
'completeness': measure_completeness(response),
|
||||
'coherence': measure_coherence(response),
|
||||
'relevance': measure_relevance(response),
|
||||
'format_compliance': measure_format_compliance(response)
|
||||
}
|
||||
|
||||
weights = [0.3, 0.2, 0.2, 0.2, 0.1]
|
||||
return sum(score * weight for score, weight in zip(scores.values(), weights))
|
||||
```
|
||||
|
||||
#### Safety Compliance
|
||||
```python
|
||||
def check_safety_compliance(response):
|
||||
"""
|
||||
Measure adherence to safety guidelines.
|
||||
"""
|
||||
violations = []
|
||||
|
||||
# Check for various safety issues
|
||||
if contains_harmful_content(response):
|
||||
violations.append('harmful_content')
|
||||
if has_bias(response):
|
||||
violations.append('bias')
|
||||
if violates_privacy(response):
|
||||
violations.append('privacy_violation')
|
||||
|
||||
safety_score = max(0, 100 - len(violations) * 25)
|
||||
return safety_score, violations
|
||||
```
|
||||
|
||||
## A/B Testing Methodology
|
||||
|
||||
### Controlled Test Design
|
||||
```python
|
||||
def design_ab_test(baseline_prompt, variant_prompt, test_cases):
|
||||
"""
|
||||
Design controlled A/B test with proper statistical power.
|
||||
"""
|
||||
# Calculate required sample size
|
||||
effect_size = estimate_effect_size(baseline_prompt, variant_prompt)
|
||||
sample_size = calculate_sample_size(effect_size, power=0.8, alpha=0.05)
|
||||
|
||||
# Random assignment
|
||||
randomized_cases = random.sample(test_cases, sample_size)
|
||||
split_point = len(randomized_cases) // 2
|
||||
|
||||
group_a = randomized_cases[:split_point]
|
||||
group_b = randomized_cases[split_point:]
|
||||
|
||||
return {
|
||||
'baseline_group': group_a,
|
||||
'variant_group': group_b,
|
||||
'sample_size': sample_size,
|
||||
'statistical_power': 0.8,
|
||||
'significance_level': 0.05
|
||||
}
|
||||
```
|
||||
|
||||
### Statistical Analysis
|
||||
```python
|
||||
def analyze_ab_results(baseline_results, variant_results):
|
||||
"""
|
||||
Perform statistical analysis of A/B test results.
|
||||
"""
|
||||
# Calculate means and standard deviations
|
||||
baseline_mean = np.mean(baseline_results)
|
||||
variant_mean = np.mean(variant_results)
|
||||
baseline_std = np.std(baseline_results)
|
||||
variant_std = np.std(variant_results)
|
||||
|
||||
# Perform t-test
|
||||
t_statistic, p_value = stats.ttest_ind(baseline_results, variant_results)
|
||||
|
||||
# Calculate effect size (Cohen's d)
|
||||
pooled_std = np.sqrt(((len(baseline_results) - 1) * baseline_std**2 +
|
||||
(len(variant_results) - 1) * variant_std**2) /
|
||||
(len(baseline_results) + len(variant_results) - 2))
|
||||
cohens_d = (variant_mean - baseline_mean) / pooled_std
|
||||
|
||||
return {
|
||||
'baseline_mean': baseline_mean,
|
||||
'variant_mean': variant_mean,
|
||||
'improvement': ((variant_mean - baseline_mean) / baseline_mean) * 100,
|
||||
'p_value': p_value,
|
||||
'statistical_significance': p_value < 0.05,
|
||||
'effect_size': cohens_d,
|
||||
'recommendation': 'implement_variant' if p_value < 0.05 and cohens_d > 0.2 else 'keep_baseline'
|
||||
}
|
||||
```
|
||||
|
||||
## Optimization Strategies
|
||||
|
||||
### Strategy 1: Progressive Enhancement
|
||||
|
||||
#### Stepwise Improvement Process
|
||||
```python
|
||||
def progressive_optimization(base_prompt, test_cases, max_iterations=10):
|
||||
"""
|
||||
Incrementally improve prompt through systematic testing.
|
||||
"""
|
||||
current_prompt = base_prompt
|
||||
current_performance = evaluate_prompt(current_prompt, test_cases)
|
||||
optimization_history = []
|
||||
|
||||
for iteration in range(max_iterations):
|
||||
# Generate improvement hypotheses
|
||||
hypotheses = generate_improvement_hypotheses(current_prompt, current_performance)
|
||||
|
||||
best_improvement = None
|
||||
best_performance = current_performance
|
||||
|
||||
for hypothesis in hypotheses:
|
||||
# Test hypothesis
|
||||
test_prompt = apply_hypothesis(current_prompt, hypothesis)
|
||||
test_performance = evaluate_prompt(test_prompt, test_cases)
|
||||
|
||||
# Validate improvement
|
||||
if is_statistically_significant(current_performance, test_performance):
|
||||
if test_performance.overall_score > best_performance.overall_score:
|
||||
best_improvement = hypothesis
|
||||
best_performance = test_performance
|
||||
|
||||
# Apply best improvement if found
|
||||
if best_improvement:
|
||||
current_prompt = apply_hypothesis(current_prompt, best_improvement)
|
||||
optimization_history.append({
|
||||
'iteration': iteration,
|
||||
'hypothesis': best_improvement,
|
||||
'performance_before': current_performance,
|
||||
'performance_after': best_performance,
|
||||
'improvement': best_performance.overall_score - current_performance.overall_score
|
||||
})
|
||||
current_performance = best_performance
|
||||
else:
|
||||
break # No further improvements found
|
||||
|
||||
return current_prompt, optimization_history
|
||||
```
|
||||
|
||||
### Strategy 2: Multi-Objective Optimization
|
||||
|
||||
#### Pareto Optimization Framework
|
||||
```python
|
||||
def multi_objective_optimization(prompt_variants, objectives):
|
||||
"""
|
||||
Optimize for multiple competing objectives using Pareto efficiency.
|
||||
"""
|
||||
results = []
|
||||
|
||||
for variant in prompt_variants:
|
||||
scores = {}
|
||||
for objective in objectives:
|
||||
scores[objective] = evaluate_objective(variant, objective)
|
||||
|
||||
results.append({
|
||||
'prompt': variant,
|
||||
'scores': scores,
|
||||
'dominates': []
|
||||
})
|
||||
|
||||
# Find Pareto optimal solutions
|
||||
pareto_optimal = []
|
||||
for i, result_i in enumerate(results):
|
||||
is_dominated = False
|
||||
for j, result_j in enumerate(results):
|
||||
if i != j and dominates(result_j, result_i):
|
||||
is_dominated = True
|
||||
break
|
||||
|
||||
if not is_dominated:
|
||||
pareto_optimal.append(result_i)
|
||||
|
||||
return pareto_optimal
|
||||
|
||||
def dominates(result_a, result_b):
|
||||
"""
|
||||
Check if result_a dominates result_b in all objectives.
|
||||
"""
|
||||
return all(result_a['scores'][obj] >= result_b['scores'][obj]
|
||||
for obj in result_a['scores'])
|
||||
```
|
||||
|
||||
### Strategy 3: Adaptive Testing
|
||||
|
||||
#### Dynamic Test Allocation
|
||||
```python
|
||||
def adaptive_testing(prompt_variants, initial_budget):
|
||||
"""
|
||||
Dynamically allocate testing budget to promising variants.
|
||||
"""
|
||||
# Initial exploration phase
|
||||
exploration_results = {}
|
||||
budget分配 = initial_budget // len(prompt_variants)
|
||||
|
||||
for variant in prompt_variants:
|
||||
exploration_results[variant] = test_prompt(variant, budget分配)
|
||||
|
||||
# Exploitation phase - allocate more budget to promising variants
|
||||
total_budget_spent = len(prompt_variants) * budget分配
|
||||
remaining_budget = initial_budget - total_budget_spent
|
||||
|
||||
# Sort by performance
|
||||
sorted_variants = sorted(exploration_results.items(),
|
||||
key=lambda x: x[1].overall_score, reverse=True)
|
||||
|
||||
# Allocate remaining budget proportionally to performance
|
||||
final_results = {}
|
||||
for i, (variant, initial_result) in enumerate(sorted_variants):
|
||||
if remaining_budget > 0:
|
||||
additional_budget = max(1, remaining_budget // (len(sorted_variants) - i))
|
||||
final_results[variant] = test_prompt(variant, additional_budget)
|
||||
remaining_budget -= additional_budget
|
||||
else:
|
||||
final_results[variant] = initial_result
|
||||
|
||||
return final_results
|
||||
```
|
||||
|
||||
## Optimization Hypotheses
|
||||
|
||||
### Common Optimization Areas
|
||||
|
||||
#### Instruction Clarity
|
||||
```python
|
||||
instruction_clarity_hypotheses = [
|
||||
"Add numbered steps to instructions",
|
||||
"Include specific output format examples",
|
||||
"Clarify role and expertise level",
|
||||
"Add context and background information",
|
||||
"Specify constraints and boundaries",
|
||||
"Include success criteria and evaluation standards"
|
||||
]
|
||||
```
|
||||
|
||||
#### Example Quality
|
||||
```python
|
||||
example_optimization_hypotheses = [
|
||||
"Increase number of examples from 3 to 5",
|
||||
"Add edge case examples",
|
||||
"Reorder examples by complexity",
|
||||
"Include negative examples",
|
||||
"Add reasoning traces to examples",
|
||||
"Improve example diversity and coverage"
|
||||
]
|
||||
```
|
||||
|
||||
#### Structure Optimization
|
||||
```python
|
||||
structure_hypotheses = [
|
||||
"Add clear section headings",
|
||||
"Reorganize content flow",
|
||||
"Include summary at the beginning",
|
||||
"Add checklist for verification",
|
||||
"Separate instructions from examples",
|
||||
"Add troubleshooting section"
|
||||
]
|
||||
```
|
||||
|
||||
#### Model-Specific Optimization
|
||||
```python
|
||||
model_specific_hypotheses = {
|
||||
'claude': [
|
||||
"Use XML tags for structure",
|
||||
"Add <thinking> sections for reasoning",
|
||||
"Include constitutional AI principles",
|
||||
"Use system message format",
|
||||
"Add safety guidelines and constraints"
|
||||
],
|
||||
'gpt-4': [
|
||||
"Use numbered sections with ### headers",
|
||||
"Include JSON format specifications",
|
||||
"Add function calling patterns",
|
||||
"Use bullet points for clarity",
|
||||
"Include error handling instructions"
|
||||
],
|
||||
'gemini': [
|
||||
"Use bold headers with ** formatting",
|
||||
"Include step-by-step process descriptions",
|
||||
"Add validation checkpoints",
|
||||
"Use conversational tone",
|
||||
"Include confidence scoring"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Continuous Monitoring
|
||||
|
||||
### Production Performance Tracking
|
||||
```python
|
||||
def setup_monitoring(prompt, alert_thresholds):
|
||||
"""
|
||||
Set up continuous monitoring for deployed prompts.
|
||||
"""
|
||||
monitors = {
|
||||
'success_rate': MetricMonitor('success_rate', alert_thresholds['success_rate']),
|
||||
'response_time': MetricMonitor('response_time', alert_thresholds['response_time']),
|
||||
'token_cost': MetricMonitor('token_cost', alert_thresholds['token_cost']),
|
||||
'safety_score': MetricMonitor('safety_score', alert_thresholds['safety_score'])
|
||||
}
|
||||
|
||||
def monitor_performance():
|
||||
recent_data = collect_recent_performance(prompt)
|
||||
alerts = []
|
||||
|
||||
for metric_name, monitor in monitors.items():
|
||||
if metric_name in recent_data:
|
||||
alert = monitor.check(recent_data[metric_name])
|
||||
if alert:
|
||||
alerts.append(alert)
|
||||
|
||||
return alerts
|
||||
|
||||
return monitor_performance
|
||||
```
|
||||
|
||||
### Automated Rollback System
|
||||
```python
|
||||
def automated_rollback_system(prompts, monitoring_data):
|
||||
"""
|
||||
Automatically rollback to previous version if performance degrades.
|
||||
"""
|
||||
def check_and_rollback(current_prompt, baseline_prompt):
|
||||
current_metrics = monitoring_data.get_metrics(current_prompt)
|
||||
baseline_metrics = monitoring_data.get_metrics(baseline_prompt)
|
||||
|
||||
# Check if performance degradation exceeds threshold
|
||||
degradation_threshold = 0.1 # 10% degradation
|
||||
|
||||
for metric in current_metrics:
|
||||
if current_metrics[metric] < baseline_metrics[metric] * (1 - degradation_threshold):
|
||||
return True, f"Performance degradation in {metric}"
|
||||
|
||||
return False, "Performance acceptable"
|
||||
|
||||
return check_and_rollback
|
||||
```
|
||||
|
||||
## Optimization Tools and Utilities
|
||||
|
||||
### Prompt Variation Generator
|
||||
```python
|
||||
def generate_prompt_variations(base_prompt):
|
||||
"""
|
||||
Generate systematic variations for testing.
|
||||
"""
|
||||
variations = {}
|
||||
|
||||
# Instruction variations
|
||||
variations['more_detailed'] = add_detailed_instructions(base_prompt)
|
||||
variations['simplified'] = simplify_instructions(base_prompt)
|
||||
variations['structured'] = add_structured_format(base_prompt)
|
||||
|
||||
# Example variations
|
||||
variations['more_examples'] = add_examples(base_prompt)
|
||||
variations['better_examples'] = improve_example_quality(base_prompt)
|
||||
variations['diverse_examples'] = add_example_diversity(base_prompt)
|
||||
|
||||
# Format variations
|
||||
variations['numbered_steps'] = add_numbered_steps(base_prompt)
|
||||
variations['bullet_points'] = use_bullet_points(base_prompt)
|
||||
variations['sections'] = add_section_headers(base_prompt)
|
||||
|
||||
return variations
|
||||
```
|
||||
|
||||
### Performance Dashboard
|
||||
```python
|
||||
def create_performance_dashboard(optimization_history):
|
||||
"""
|
||||
Create visualization of optimization progress.
|
||||
"""
|
||||
# Generate performance metrics over time
|
||||
metrics_over_time = {
|
||||
'iterations': [h['iteration'] for h in optimization_history],
|
||||
'success_rates': [h['performance_after'].success_rate for h in optimization_history],
|
||||
'token_efficiency': [h['performance_after'].token_efficiency for h in optimization_history],
|
||||
'response_times': [h['performance_after'].response_time for h in optimization_history]
|
||||
}
|
||||
|
||||
return PerformanceDashboard(metrics_over_time)
|
||||
```
|
||||
|
||||
This comprehensive framework provides systematic methodologies for continuous prompt improvement through data-driven optimization and rigorous testing processes.
|
||||
494
skills/ai/prompt-engineering/references/system-prompt-design.md
Normal file
494
skills/ai/prompt-engineering/references/system-prompt-design.md
Normal file
@@ -0,0 +1,494 @@
|
||||
# System Prompt Design
|
||||
|
||||
This reference provides comprehensive frameworks for designing effective system prompts that establish consistent model behavior, define clear boundaries, and ensure reliable performance across diverse applications.
|
||||
|
||||
## System Prompt Architecture
|
||||
|
||||
### Core Components Structure
|
||||
```
|
||||
1. Role Definition & Expertise
|
||||
2. Behavioral Guidelines & Constraints
|
||||
3. Interaction Protocols
|
||||
4. Output Format Specifications
|
||||
5. Safety & Ethical Guidelines
|
||||
6. Context & Background Information
|
||||
7. Quality Standards & Verification
|
||||
8. Error Handling & Uncertainty Protocols
|
||||
```
|
||||
|
||||
## Component Design Patterns
|
||||
|
||||
### 1. Role Definition Framework
|
||||
|
||||
#### Comprehensive Role Specification
|
||||
```markdown
|
||||
## Role Definition
|
||||
You are an expert {role} with {experience_level} of specialized experience in {domain}. Your expertise includes:
|
||||
|
||||
### Core Competencies
|
||||
- {competency_1}
|
||||
- {competency_2}
|
||||
- {competency_3}
|
||||
- {competency_4}
|
||||
|
||||
### Knowledge Boundaries
|
||||
- You have deep knowledge of {strength_area_1} and {strength_area_2}
|
||||
- Your knowledge is current as of {knowledge_cutoff_date}
|
||||
- You should acknowledge limitations in {limitation_area}
|
||||
- When uncertain about recent developments, state this explicitly
|
||||
|
||||
### Professional Standards
|
||||
- Adhere to {industry_standard_1} guidelines
|
||||
- Follow {industry_standard_2} best practices
|
||||
- Maintain {professional_attribute} in all interactions
|
||||
- Ensure compliance with {regulatory_framework}
|
||||
```
|
||||
|
||||
#### Specialized Role Templates
|
||||
|
||||
##### Technical Expert Role
|
||||
```markdown
|
||||
## Technical Expert Role
|
||||
You are a Senior {domain} Engineer with {years} years of experience in {specialization}. Your expertise encompasses:
|
||||
|
||||
### Technical Proficiency
|
||||
- Deep understanding of {technology_stack}
|
||||
- Experience with {specific_frameworks} and {tools}
|
||||
- Knowledge of {design_patterns} and {architectures}
|
||||
- Proficiency in {programming_languages} and {development_methodologies}
|
||||
|
||||
### Problem-Solving Approach
|
||||
- Analyze problems systematically using {methodology}
|
||||
- Consider multiple solution approaches before recommending
|
||||
- Evaluate trade-offs between {criteria_1}, {criteria_2}, and {criteria_3}
|
||||
- Provide scalable and maintainable solutions
|
||||
|
||||
### Communication Style
|
||||
- Explain technical concepts clearly to both technical and non-technical audiences
|
||||
- Use precise terminology when appropriate
|
||||
- Provide concrete examples and code snippets when helpful
|
||||
- Structure responses with clear sections and logical flow
|
||||
```
|
||||
|
||||
##### Analyst Role
|
||||
```markdown
|
||||
## Analyst Role
|
||||
You are a professional {analysis_type} Analyst with expertise in {data_domain} and {methodology}. Your analytical approach includes:
|
||||
|
||||
### Analytical Framework
|
||||
- Apply {analytical_methodology} for systematic analysis
|
||||
- Use {statistical_techniques} for data interpretation
|
||||
- Consider {contextual_factors} in your analysis
|
||||
- Validate findings through {verification_methods}
|
||||
|
||||
### Critical Thinking Process
|
||||
- Question assumptions and identify potential biases
|
||||
- Evaluate evidence quality and source reliability
|
||||
- Consider alternative explanations and perspectives
|
||||
- Synthesize information from multiple sources
|
||||
|
||||
### Reporting Standards
|
||||
- Present findings with appropriate confidence levels
|
||||
- Distinguish between facts, interpretations, and recommendations
|
||||
- Provide evidence-based conclusions
|
||||
- Acknowledge limitations and uncertainties
|
||||
```
|
||||
|
||||
### 2. Behavioral Guidelines Design
|
||||
|
||||
#### Comprehensive Behavior Framework
|
||||
```markdown
|
||||
## Behavioral Guidelines
|
||||
|
||||
### Interaction Style
|
||||
- Maintain {tone} tone throughout all interactions
|
||||
- Use {communication_approach} when explaining complex concepts
|
||||
- Be {responsiveness_level} in addressing user questions
|
||||
- Demonstrate {empathy_level} when dealing with user challenges
|
||||
|
||||
### Response Standards
|
||||
- Provide responses that are {length_preference} and {detail_preference}
|
||||
- Structure information using {organization_pattern}
|
||||
- Include {frequency} examples and illustrations
|
||||
- Use {format_preference} formatting for clarity
|
||||
|
||||
### Quality Expectations
|
||||
- Ensure all information is {accuracy_standard}
|
||||
- Provide citations for {information_type} when available
|
||||
- Cross-verify information using {verification_method}
|
||||
- Update knowledge based on {update_criteria}
|
||||
```
|
||||
|
||||
#### Model-Specific Behavior Patterns
|
||||
|
||||
##### Claude 3.5/4 Specific Guidelines
|
||||
```markdown
|
||||
## Claude-Specific Behavioral Guidelines
|
||||
|
||||
### Constitutional Alignment
|
||||
- Follow constitutional AI principles in all responses
|
||||
- Prioritize helpfulness while maintaining safety
|
||||
- Consider multiple perspectives before concluding
|
||||
- Avoid harmful content while remaining useful
|
||||
|
||||
### Output Formatting
|
||||
- Use XML tags for structured information: <tag>content</tag>
|
||||
- Include thinking blocks for complex reasoning: <thinking>...</thinking>
|
||||
- Provide clear section headers with proper hierarchy
|
||||
- Use markdown formatting for improved readability
|
||||
|
||||
### Safety Protocols
|
||||
- Apply content policies consistently
|
||||
- Identify and flag potentially harmful requests
|
||||
- Provide safe alternatives when appropriate
|
||||
- Maintain transparency about limitations
|
||||
```
|
||||
|
||||
##### GPT-4 Specific Guidelines
|
||||
```markdown
|
||||
## GPT-4 Specific Behavioral Guidelines
|
||||
|
||||
### Structured Response Patterns
|
||||
- Use numbered lists for step-by-step processes
|
||||
- Implement clear section boundaries with ### headers
|
||||
- Provide JSON formatted outputs when specified
|
||||
- Use consistent indentation and formatting
|
||||
|
||||
### Function Calling Integration
|
||||
- Recognize when function calling would be appropriate
|
||||
- Structure responses to facilitate tool usage
|
||||
- Provide clear parameter specifications
|
||||
- Handle function results systematically
|
||||
|
||||
### Optimization Behaviors
|
||||
- Balance conciseness with comprehensiveness
|
||||
- Prioritize information relevance and importance
|
||||
- Use efficient language patterns
|
||||
- Minimize redundancy while maintaining clarity
|
||||
```
|
||||
|
||||
### 3. Output Format Specifications
|
||||
|
||||
#### Comprehensive Format Framework
|
||||
```markdown
|
||||
## Output Format Requirements
|
||||
|
||||
### Structure Standards
|
||||
- Begin responses with {opening_pattern}
|
||||
- Use {section_pattern} for major sections
|
||||
- Implement {hierarchy_pattern} for information organization
|
||||
- Include {closing_pattern} for response completion
|
||||
|
||||
### Content Organization
|
||||
- Present information in {presentation_order}
|
||||
- Group related information using {grouping_method}
|
||||
- Use {transition_pattern} between sections
|
||||
- Include {summary_element} for complex responses
|
||||
|
||||
### Format Specifications
|
||||
{if json_format_required}
|
||||
- Provide responses in valid JSON format
|
||||
- Use consistent key naming conventions
|
||||
- Include all required fields
|
||||
- Validate JSON syntax before output
|
||||
{endif}
|
||||
|
||||
{if markdown_format_required}
|
||||
- Use markdown for formatting and emphasis
|
||||
- Include appropriate heading levels
|
||||
- Use code blocks for technical content
|
||||
- Implement tables for structured data
|
||||
{endif}
|
||||
```
|
||||
|
||||
### 4. Safety and Ethical Guidelines
|
||||
|
||||
#### Comprehensive Safety Framework
|
||||
```markdown
|
||||
## Safety and Ethical Guidelines
|
||||
|
||||
### Content Policies
|
||||
- Avoid generating {prohibited_content_1}
|
||||
- Do not provide {prohibited_content_2}
|
||||
- Flag {sensitive_topics} for human review
|
||||
- Provide {safe_alternatives} when appropriate
|
||||
|
||||
### Ethical Considerations
|
||||
- Consider {ethical_principle_1} in all responses
|
||||
- Evaluate potential {ethical_impact} of provided information
|
||||
- Balance helpfulness with {safety_consideration}
|
||||
- Maintain {transparency_standard} about limitations
|
||||
|
||||
### Bias Mitigation
|
||||
- Actively identify and mitigate {bias_type_1}
|
||||
- Present information {neutrality_standard}
|
||||
- Include {diverse_perspectives} when appropriate
|
||||
- Avoid {stereotype_patterns}
|
||||
|
||||
### Harm Prevention
|
||||
- Identify potential {harm_type_1} in responses
|
||||
- Implement {prevention_mechanism} for harmful content
|
||||
- Provide {warning_system} for sensitive topics
|
||||
- Include {escalation_protocol} for concerning requests
|
||||
```
|
||||
|
||||
### 5. Error Handling and Uncertainty
|
||||
|
||||
#### Comprehensive Error Management
|
||||
```markdown
|
||||
## Error Handling and Uncertainty Protocols
|
||||
|
||||
### Uncertainty Management
|
||||
- Explicitly state confidence levels for uncertain information
|
||||
- Use phrases like "I believe," "It appears that," "Based on available information"
|
||||
- Acknowledge when information may be {uncertainty_type}
|
||||
- Provide {verification_method} for uncertain claims
|
||||
|
||||
### Error Recognition
|
||||
- Identify when {error_pattern} might have occurred
|
||||
- Implement {self_checking_mechanism} for accuracy
|
||||
- Use {validation_process} for important information
|
||||
- Provide {correction_protocol} when errors are identified
|
||||
|
||||
### Limitation Acknowledgment
|
||||
- Clearly state {knowledge_limitation} when relevant
|
||||
- Explain {limitation_reason} when unable to provide complete information
|
||||
- Suggest {alternative_approach} when direct assistance isn't possible
|
||||
- Provide {escalation_option} for complex scenarios
|
||||
|
||||
### Correction Procedures
|
||||
- Implement {correction_workflow} for identified errors
|
||||
- Provide {explanation_format} for corrections
|
||||
- Use {acknowledgment_pattern} for mistakes
|
||||
- Include {improvement_commitment} for future accuracy
|
||||
```
|
||||
|
||||
## Specialized System Prompt Templates
|
||||
|
||||
### 1. Educational Assistant System Prompt
|
||||
```markdown
|
||||
# Educational Assistant System Prompt
|
||||
|
||||
## Role Definition
|
||||
You are an expert educational assistant specializing in {subject_area} with {experience_level} of teaching experience. Your pedagogical approach emphasizes {teaching_philosophy} and adapts to different learning styles.
|
||||
|
||||
## Educational Philosophy
|
||||
- Create inclusive and supportive learning environments
|
||||
- Adapt explanations to match learner's comprehension level
|
||||
- Use scaffolding techniques to build understanding progressively
|
||||
- Encourage critical thinking and independent learning
|
||||
|
||||
## Teaching Standards
|
||||
- Provide accurate, up-to-date information verified through {verification_sources}
|
||||
- Use clear, accessible language appropriate for the target audience
|
||||
- Include relevant examples and analogies to enhance understanding
|
||||
- Structure learning objectives with clear progression
|
||||
|
||||
## Interaction Protocols
|
||||
- Assess learner's current understanding before providing explanations
|
||||
- Ask clarifying questions to tailor responses appropriately
|
||||
- Provide opportunities for learner questions and feedback
|
||||
- Offer additional resources for extended learning
|
||||
|
||||
## Output Format
|
||||
- Begin with brief assessment of learner's needs
|
||||
- Use clear headings and organized structure
|
||||
- Include summary points for key takeaways
|
||||
- Provide practice exercises when appropriate
|
||||
- End with suggestions for further learning
|
||||
|
||||
## Safety Guidelines
|
||||
- Create psychologically safe learning environments
|
||||
- Avoid language that might discourage or intimidate learners
|
||||
- Be patient and supportive when learners struggle with concepts
|
||||
- Respect diverse backgrounds and learning abilities
|
||||
|
||||
## Uncertainty Handling
|
||||
- Acknowledge when topics are beyond current expertise
|
||||
- Suggest reliable resources for additional information
|
||||
- Be transparent about the limits of available knowledge
|
||||
- Encourage critical thinking and independent verification
|
||||
```
|
||||
|
||||
### 2. Technical Documentation Generator System Prompt
|
||||
```markdown
|
||||
# Technical Documentation System Prompt
|
||||
|
||||
## Role Definition
|
||||
You are a Senior Technical Writer with {years} of experience creating documentation for {technology_domain}. Your expertise encompasses {documentation_types} and you follow {industry_standards} for technical communication.
|
||||
|
||||
## Documentation Standards
|
||||
- Follow {style_guide} for consistent formatting and terminology
|
||||
- Ensure clarity and accuracy in all technical explanations
|
||||
- Include practical examples and code snippets when helpful
|
||||
- Structure content with clear hierarchy and logical flow
|
||||
|
||||
## Quality Requirements
|
||||
- Maintain technical accuracy verified through {review_process}
|
||||
- Use consistent terminology throughout documentation
|
||||
- Provide comprehensive coverage of topics without overwhelming detail
|
||||
- Include troubleshooting information for common issues
|
||||
|
||||
## Audience Considerations
|
||||
- Target documentation at {audience_level} technical proficiency
|
||||
- Define technical terms and concepts appropriately
|
||||
- Provide progressive disclosure of complex information
|
||||
- Include context and motivation for technical decisions
|
||||
|
||||
## Format Specifications
|
||||
- Use markdown formatting for clear structure and readability
|
||||
- Include code blocks with syntax highlighting
|
||||
- Implement consistent section headings and numbering
|
||||
- Provide navigation aids and cross-references
|
||||
|
||||
## Review Process
|
||||
- Verify technical accuracy through {verification_method}
|
||||
- Test all code examples and procedures
|
||||
- Ensure completeness of coverage for documented features
|
||||
- Validate clarity and comprehensibility with target audience
|
||||
|
||||
## Safety and Compliance
|
||||
- Include security considerations where relevant
|
||||
- Document potential risks and mitigation strategies
|
||||
- Follow industry compliance requirements
|
||||
- Maintain confidentiality for sensitive information
|
||||
```
|
||||
|
||||
### 3. Data Analysis System Prompt
|
||||
```markdown
|
||||
# Data Analysis System Prompt
|
||||
|
||||
## Role Definition
|
||||
You are an expert Data Analyst specializing in {data_domain} with {years} of experience in {analysis_methodologies}. Your analytical approach combines {technical_skills} with {business_acumen} to deliver actionable insights.
|
||||
|
||||
## Analytical Framework
|
||||
- Apply {statistical_methods} for rigorous data analysis
|
||||
- Use {visualization_techniques} for effective data communication
|
||||
- Implement {quality_assurance} processes for data validation
|
||||
- Follow {ethical_guidelines} for responsible data handling
|
||||
|
||||
## Analysis Standards
|
||||
- Ensure methodological soundness in all analyses
|
||||
- Provide clear documentation of analytical processes
|
||||
- Include appropriate statistical measures and confidence intervals
|
||||
- Validate findings through {validation_methods}
|
||||
|
||||
## Communication Requirements
|
||||
- Present findings with appropriate technical depth for the audience
|
||||
- Use clear visualizations and narrative explanations
|
||||
- Highlight actionable insights and recommendations
|
||||
- Acknowledge limitations and uncertainties in analyses
|
||||
|
||||
## Output Structure
|
||||
```json
|
||||
{
|
||||
"executive_summary": "High-level overview of key findings",
|
||||
"methodology": "Description of analytical approach and methods used",
|
||||
"data_overview": "Summary of data sources, quality, and limitations",
|
||||
"key_findings": [
|
||||
{
|
||||
"finding": "Specific discovery or insight",
|
||||
"evidence": "Supporting data and statistical measures",
|
||||
"confidence": "Confidence level in the finding",
|
||||
"implications": "Business or operational implications"
|
||||
}
|
||||
],
|
||||
"recommendations": [
|
||||
{
|
||||
"action": "Recommended action",
|
||||
"priority": "High/Medium/Low",
|
||||
"expected_impact": "Anticipated outcome",
|
||||
"implementation_considerations": "Factors to consider"
|
||||
}
|
||||
],
|
||||
"limitations": "Constraints and limitations of the analysis",
|
||||
"next_steps": "Suggested follow-up analyses or actions"
|
||||
}
|
||||
```
|
||||
|
||||
## Ethical Considerations
|
||||
- Protect privacy and confidentiality of data subjects
|
||||
- Ensure unbiased analysis and interpretation
|
||||
- Consider potential impact of findings on stakeholders
|
||||
- Maintain transparency about analytical limitations
|
||||
```
|
||||
|
||||
## System Prompt Testing and Validation
|
||||
|
||||
### Validation Framework
|
||||
```python
|
||||
class SystemPromptValidator:
|
||||
def __init__(self):
|
||||
self.validation_criteria = {
|
||||
'role_clarity': 0.2,
|
||||
'instruction_specificity': 0.2,
|
||||
'safety_completeness': 0.15,
|
||||
'output_format_clarity': 0.15,
|
||||
'error_handling_coverage': 0.1,
|
||||
'behavioral_consistency': 0.1,
|
||||
'ethical_considerations': 0.1
|
||||
}
|
||||
|
||||
def validate_prompt(self, system_prompt):
|
||||
"""Validate system prompt against quality criteria."""
|
||||
scores = {}
|
||||
|
||||
scores['role_clarity'] = self.assess_role_clarity(system_prompt)
|
||||
scores['instruction_specificity'] = self.assess_instruction_specificity(system_prompt)
|
||||
scores['safety_completeness'] = self.assess_safety_completeness(system_prompt)
|
||||
scores['output_format_clarity'] = self.assess_output_format_clarity(system_prompt)
|
||||
scores['error_handling_coverage'] = self.assess_error_handling(system_prompt)
|
||||
scores['behavioral_consistency'] = self.assess_behavioral_consistency(system_prompt)
|
||||
scores['ethical_considerations'] = self.assess_ethical_considerations(system_prompt)
|
||||
|
||||
# Calculate overall score
|
||||
overall_score = sum(score * weight for score, weight in
|
||||
zip(scores.values(), self.validation_criteria.values()))
|
||||
|
||||
return {
|
||||
'overall_score': overall_score,
|
||||
'individual_scores': scores,
|
||||
'recommendations': self.generate_recommendations(scores)
|
||||
}
|
||||
|
||||
def test_prompt_consistency(self, system_prompt, test_scenarios):
|
||||
"""Test prompt behavior consistency across different scenarios."""
|
||||
results = []
|
||||
|
||||
for scenario in test_scenarios:
|
||||
response = execute_with_system_prompt(system_prompt, scenario)
|
||||
|
||||
# Analyze response consistency
|
||||
consistency_score = self.analyze_response_consistency(response, system_prompt)
|
||||
results.append({
|
||||
'scenario': scenario,
|
||||
'response': response,
|
||||
'consistency_score': consistency_score
|
||||
})
|
||||
|
||||
average_consistency = sum(r['consistency_score'] for r in results) / len(results)
|
||||
|
||||
return {
|
||||
'average_consistency': average_consistency,
|
||||
'scenario_results': results,
|
||||
'recommendations': self.generate_consistency_recommendations(results)
|
||||
}
|
||||
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
### Design Principles
|
||||
- **Clarity First**: Ensure role and instructions are unambiguous
|
||||
- **Comprehensive Coverage**: Address all aspects of model behavior
|
||||
- **Consistency Focus**: Maintain consistent behavior across scenarios
|
||||
- **Safety Priority**: Include robust safety guidelines and constraints
|
||||
- **Flexibility Built-in**: Allow for adaptation to different contexts
|
||||
|
||||
### Common Pitfalls to Avoid
|
||||
- **Vague Instructions**: Be specific about expected behaviors
|
||||
- **Over-constraining**: Allow room for intelligent adaptation
|
||||
- **Missing Safety Guidelines**: Always include comprehensive safety measures
|
||||
- **Inconsistent Formatting**: Use consistent structure throughout
|
||||
- **Ignoring Model Capabilities**: Design prompts that leverage model strengths
|
||||
|
||||
This comprehensive system prompt design framework provides the foundation for creating effective, reliable, and safe AI system behaviors across diverse applications and use cases.
|
||||
599
skills/ai/prompt-engineering/references/template-systems.md
Normal file
599
skills/ai/prompt-engineering/references/template-systems.md
Normal file
@@ -0,0 +1,599 @@
|
||||
# Template Systems Architecture
|
||||
|
||||
This reference provides comprehensive frameworks for building modular, reusable prompt templates with variable interpolation, conditional sections, and hierarchical composition.
|
||||
|
||||
## Template Design Principles
|
||||
|
||||
### Modularity and Reusability
|
||||
- **Single Responsibility**: Each template handles one specific type of task
|
||||
- **Composability**: Templates can be combined to create complex prompts
|
||||
- **Parameterization**: Variables allow customization without core changes
|
||||
- **Inheritance**: Base templates can be extended for specific use cases
|
||||
|
||||
### Clear Variable Naming Conventions
|
||||
```
|
||||
{user_input} - Direct input from user
|
||||
{context} - Background information
|
||||
{examples} - Few-shot learning examples
|
||||
{constraints} - Task limitations and requirements
|
||||
{output_format} - Desired output structure
|
||||
{role} - AI role or persona
|
||||
{expertise_level} - Level of expertise for the role
|
||||
{domain} - Specific domain or field
|
||||
{difficulty} - Task complexity level
|
||||
{language} - Output language specification
|
||||
```
|
||||
|
||||
## Core Template Components
|
||||
|
||||
### 1. Base Template Structure
|
||||
```
|
||||
# Template: Universal Task Framework
|
||||
# Purpose: Base template for most task types
|
||||
# Variables: {role}, {task_description}, {context}, {examples}, {output_format}
|
||||
|
||||
## System Instructions
|
||||
You are a {role} with {expertise_level} expertise in {domain}.
|
||||
|
||||
## Context Information
|
||||
{if context}
|
||||
Background and relevant context:
|
||||
{context}
|
||||
{endif}
|
||||
|
||||
## Task Description
|
||||
{task_description}
|
||||
|
||||
## Examples
|
||||
{if examples}
|
||||
Here are some examples to guide your response:
|
||||
|
||||
{examples}
|
||||
{endif}
|
||||
|
||||
## Output Requirements
|
||||
{output_format}
|
||||
|
||||
## Constraints and Guidelines
|
||||
{constraints}
|
||||
|
||||
## User Input
|
||||
{user_input}
|
||||
```
|
||||
|
||||
### 2. Conditional Sections Framework
|
||||
```python
|
||||
def process_conditional_template(template, variables):
|
||||
"""
|
||||
Process template with conditional sections.
|
||||
"""
|
||||
# Process if/endif blocks
|
||||
while '{if ' in template:
|
||||
start = template.find('{if ')
|
||||
end_condition = template.find('}', start)
|
||||
condition = template[start+4:end_condition].strip()
|
||||
|
||||
start_endif = template.find('{endif}', end_condition)
|
||||
if_content = template[end_condition+1:start_endif].strip()
|
||||
|
||||
# Evaluate condition
|
||||
if evaluate_condition(condition, variables):
|
||||
template = template[:start] + if_content + template[start_endif+6:]
|
||||
else:
|
||||
template = template[:start] + template[start_endif+6:]
|
||||
|
||||
# Replace variables
|
||||
for key, value in variables.items():
|
||||
template = template.replace(f'{{{key}}}', str(value))
|
||||
|
||||
return template
|
||||
```
|
||||
|
||||
### 3. Variable Interpolation System
|
||||
```python
|
||||
class TemplateEngine:
|
||||
def __init__(self):
|
||||
self.variables = {}
|
||||
self.functions = {
|
||||
'upper': str.upper,
|
||||
'lower': str.lower,
|
||||
'capitalize': str.capitalize,
|
||||
'pluralize': self.pluralize,
|
||||
'format_date': self.format_date,
|
||||
'truncate': self.truncate
|
||||
}
|
||||
|
||||
def set_variable(self, name, value):
|
||||
"""Set a template variable."""
|
||||
self.variables[name] = value
|
||||
|
||||
def render(self, template):
|
||||
"""Render template with variable substitution."""
|
||||
# Process function calls {variable|function}
|
||||
template = self.process_functions(template)
|
||||
|
||||
# Replace variables
|
||||
for key, value in self.variables.items():
|
||||
template = template.replace(f'{{{key}}}', str(value))
|
||||
|
||||
return template
|
||||
|
||||
def process_functions(self, template):
|
||||
"""Process template functions."""
|
||||
import re
|
||||
pattern = r'\{(\w+)\|(\w+)\}'
|
||||
|
||||
def replace_function(match):
|
||||
var_name, func_name = match.groups()
|
||||
value = self.variables.get(var_name, '')
|
||||
if func_name in self.functions:
|
||||
return self.functions[func_name](str(value))
|
||||
return value
|
||||
|
||||
return re.sub(pattern, replace_function, template)
|
||||
```
|
||||
|
||||
## Specialized Template Types
|
||||
|
||||
### 1. Classification Template
|
||||
```
|
||||
# Template: Multi-Class Classification
|
||||
# Purpose: Classify inputs into predefined categories
|
||||
# Required Variables: {input_text}, {categories}, {role}
|
||||
|
||||
## Classification Framework
|
||||
You are a {role} specializing in accurate text classification.
|
||||
|
||||
## Classification Categories
|
||||
{categories}
|
||||
|
||||
## Classification Process
|
||||
1. Analyze the input text carefully
|
||||
2. Identify key indicators and features
|
||||
3. Match against category definitions
|
||||
4. Select the most appropriate category
|
||||
5. Provide confidence score
|
||||
|
||||
## Input to Classify
|
||||
{input_text}
|
||||
|
||||
## Output Format
|
||||
```json
|
||||
{{
|
||||
"category": "selected_category",
|
||||
"confidence": 0.95,
|
||||
"reasoning": "Brief explanation of classification logic",
|
||||
"key_indicators": ["indicator1", "indicator2"]
|
||||
}}
|
||||
```
|
||||
```
|
||||
|
||||
### 2. Transformation Template
|
||||
```
|
||||
# Template: Text Transformation
|
||||
# Purpose: Transform text from one format/style to another
|
||||
# Required Variables: {source_text}, {target_format}, {transformation_rules}
|
||||
|
||||
## Transformation Task
|
||||
Transform the given {source_format} text into {target_format} following these rules:
|
||||
{transformation_rules}
|
||||
|
||||
## Source Text
|
||||
{source_text}
|
||||
|
||||
## Transformation Process
|
||||
1. Analyze the structure and content of the source text
|
||||
2. Apply the specified transformation rules
|
||||
3. Maintain the core meaning and intent
|
||||
4. Ensure proper {target_format} formatting
|
||||
5. Verify completeness and accuracy
|
||||
|
||||
## Transformed Output
|
||||
```
|
||||
|
||||
### 3. Generation Template
|
||||
```
|
||||
# Template: Creative Generation
|
||||
# Purpose: Generate creative content based on specifications
|
||||
# Required Variables: {content_type}, {specifications}, {style_guidelines}
|
||||
|
||||
## Creative Generation Task
|
||||
Generate {content_type} that meets the following specifications:
|
||||
|
||||
## Content Specifications
|
||||
{specifications}
|
||||
|
||||
## Style Guidelines
|
||||
{style_guidelines}
|
||||
|
||||
## Quality Requirements
|
||||
- Originality and creativity
|
||||
- Adherence to specifications
|
||||
- Appropriate tone and style
|
||||
- Clear structure and coherence
|
||||
- Audience-appropriate language
|
||||
|
||||
## Generated Content
|
||||
```
|
||||
|
||||
### 4. Analysis Template
|
||||
```
|
||||
# Template: Comprehensive Analysis
|
||||
# Purpose: Perform detailed analysis of given input
|
||||
# Required Variables: {input_data}, {analysis_framework}, {focus_areas}
|
||||
|
||||
## Analysis Framework
|
||||
You are an expert analyst with deep expertise in {domain}.
|
||||
|
||||
## Analysis Scope
|
||||
Focus on these key areas:
|
||||
{focus_areas}
|
||||
|
||||
## Analysis Methodology
|
||||
{analysis_framework}
|
||||
|
||||
## Input Data for Analysis
|
||||
{input_data}
|
||||
|
||||
## Analysis Process
|
||||
1. Initial assessment and context understanding
|
||||
2. Detailed examination of each focus area
|
||||
3. Pattern and trend identification
|
||||
4. Comparative analysis with benchmarks
|
||||
5. Insight generation and recommendation formulation
|
||||
|
||||
## Analysis Output Structure
|
||||
```yaml
|
||||
executive_summary:
|
||||
key_findings: []
|
||||
overall_assessment: ""
|
||||
|
||||
detailed_analysis:
|
||||
{focus_area_1}:
|
||||
observations: []
|
||||
patterns: []
|
||||
insights: []
|
||||
{focus_area_2}:
|
||||
observations: []
|
||||
patterns: []
|
||||
insights: []
|
||||
|
||||
recommendations:
|
||||
immediate: []
|
||||
short_term: []
|
||||
long_term: []
|
||||
```
|
||||
|
||||
## Advanced Template Patterns
|
||||
|
||||
### 1. Hierarchical Template Composition
|
||||
```python
|
||||
class HierarchicalTemplate:
|
||||
def __init__(self, name, content, parent=None):
|
||||
self.name = name
|
||||
self.content = content
|
||||
self.parent = parent
|
||||
self.children = []
|
||||
self.variables = {}
|
||||
|
||||
def add_child(self, child_template):
|
||||
"""Add a child template."""
|
||||
child_template.parent = self
|
||||
self.children.append(child_template)
|
||||
|
||||
def render(self, variables=None):
|
||||
"""Render template with inherited variables."""
|
||||
# Combine variables from parent hierarchy
|
||||
combined_vars = {}
|
||||
|
||||
# Collect variables from parents
|
||||
current = self.parent
|
||||
while current:
|
||||
combined_vars.update(current.variables)
|
||||
current = current.parent
|
||||
|
||||
# Add current variables
|
||||
combined_vars.update(self.variables)
|
||||
|
||||
# Override with provided variables
|
||||
if variables:
|
||||
combined_vars.update(variables)
|
||||
|
||||
# Render content
|
||||
rendered_content = self.render_content(self.content, combined_vars)
|
||||
|
||||
# Render children
|
||||
for child in self.children:
|
||||
child_rendered = child.render(combined_vars)
|
||||
rendered_content = rendered_content.replace(
|
||||
f'{{child:{child.name}}}', child_rendered
|
||||
)
|
||||
|
||||
return rendered_content
|
||||
```
|
||||
|
||||
### 2. Role-Based Template System
|
||||
```python
|
||||
class RoleBasedTemplate:
|
||||
def __init__(self):
|
||||
self.roles = {
|
||||
'analyst': {
|
||||
'persona': 'You are a professional analyst with expertise in data interpretation and pattern recognition.',
|
||||
'approach': 'systematic',
|
||||
'output_style': 'detailed and evidence-based',
|
||||
'verification': 'Always cross-check findings and cite sources'
|
||||
},
|
||||
'creative_writer': {
|
||||
'persona': 'You are a creative writer with a talent for engaging storytelling and vivid descriptions.',
|
||||
'approach': 'imaginative',
|
||||
'output_style': 'descriptive and engaging',
|
||||
'verification': 'Ensure narrative consistency and flow'
|
||||
},
|
||||
'technical_expert': {
|
||||
'persona': 'You are a technical expert with deep knowledge of {domain} and practical implementation experience.',
|
||||
'approach': 'methodical',
|
||||
'output_style': 'precise and technical',
|
||||
'verification': 'Include technical accuracy and best practices'
|
||||
}
|
||||
}
|
||||
|
||||
def create_prompt(self, role, task, domain=None):
|
||||
"""Create role-specific prompt template."""
|
||||
role_config = self.roles.get(role, self.roles['analyst'])
|
||||
|
||||
template = f"""
|
||||
## Role Definition
|
||||
{role_config['persona']}
|
||||
|
||||
## Approach
|
||||
Use a {role_config['approach']} approach to this task.
|
||||
|
||||
## Task
|
||||
{task}
|
||||
|
||||
## Output Style
|
||||
{role_config['output_style']}
|
||||
|
||||
## Verification
|
||||
{role_config['verification']}
|
||||
"""
|
||||
|
||||
if domain and '{domain}' in role_config['persona']:
|
||||
template = template.replace('{domain}', domain)
|
||||
|
||||
return template
|
||||
```
|
||||
|
||||
### 3. Dynamic Template Selection
|
||||
```python
|
||||
class DynamicTemplateSelector:
|
||||
def __init__(self):
|
||||
self.templates = {}
|
||||
self.selection_rules = {}
|
||||
|
||||
def register_template(self, name, template, selection_criteria):
|
||||
"""Register a template with selection criteria."""
|
||||
self.templates[name] = template
|
||||
self.selection_rules[name] = selection_criteria
|
||||
|
||||
def select_template(self, task_characteristics):
|
||||
"""Select the most appropriate template based on task characteristics."""
|
||||
best_template = None
|
||||
best_score = 0
|
||||
|
||||
for name, criteria in self.selection_rules.items():
|
||||
score = self.calculate_match_score(task_characteristics, criteria)
|
||||
if score > best_score:
|
||||
best_score = score
|
||||
best_template = name
|
||||
|
||||
return self.templates[best_template] if best_template else None
|
||||
|
||||
def calculate_match_score(self, task_characteristics, criteria):
|
||||
"""Calculate how well task matches template criteria."""
|
||||
score = 0
|
||||
total_weight = 0
|
||||
|
||||
for characteristic, weight in criteria.items():
|
||||
if characteristic in task_characteristics:
|
||||
if task_characteristics[characteristic] == weight['value']:
|
||||
score += weight['weight']
|
||||
total_weight += weight['weight']
|
||||
|
||||
return score / total_weight if total_weight > 0 else 0
|
||||
```
|
||||
|
||||
## Template Implementation Examples
|
||||
|
||||
### Example 1: Customer Service Template
|
||||
```python
|
||||
customer_service_template = """
|
||||
# Customer Service Response Template
|
||||
|
||||
## Role Definition
|
||||
You are a {customer_service_role} with {experience_level} of customer service experience in {industry}.
|
||||
|
||||
## Context
|
||||
{if customer_history}
|
||||
Customer History:
|
||||
{customer_history}
|
||||
{endif}
|
||||
|
||||
{if issue_context}
|
||||
Issue Context:
|
||||
{issue_context}
|
||||
{endif}
|
||||
|
||||
## Response Guidelines
|
||||
- Maintain {tone} tone throughout
|
||||
- Address all aspects of the customer's inquiry
|
||||
- Provide {level_of_detail} explanation
|
||||
- Include {additional_elements}
|
||||
- Follow company {communication_style} style
|
||||
|
||||
## Customer Inquiry
|
||||
{customer_inquiry}
|
||||
|
||||
## Response Structure
|
||||
1. Greeting and acknowledgment
|
||||
2. Understanding and empathy
|
||||
3. Solution or explanation
|
||||
4. Additional assistance offered
|
||||
5. Professional closing
|
||||
|
||||
## Response
|
||||
"""
|
||||
```
|
||||
|
||||
### Example 2: Technical Documentation Template
|
||||
```python
|
||||
documentation_template = """
|
||||
# Technical Documentation Generator
|
||||
|
||||
## Role Definition
|
||||
You are a {technical_writer_role} specializing in {technology} documentation with {experience_level} of experience.
|
||||
|
||||
## Documentation Standards
|
||||
- Target audience: {audience_level}
|
||||
- Technical depth: {technical_depth}
|
||||
- Include examples: {include_examples}
|
||||
- Add troubleshooting: {add_troubleshooting}
|
||||
- Version: {version}
|
||||
|
||||
## Content to Document
|
||||
{content_to_document}
|
||||
|
||||
## Documentation Structure
|
||||
```markdown
|
||||
# {title}
|
||||
|
||||
## Overview
|
||||
{overview}
|
||||
|
||||
## Prerequisites
|
||||
{prerequisites}
|
||||
|
||||
## {main_sections}
|
||||
|
||||
## Examples
|
||||
{if include_examples}
|
||||
{examples}
|
||||
{endif}
|
||||
|
||||
## Troubleshooting
|
||||
{if add_troubleshooting}
|
||||
{troubleshooting}
|
||||
{endif}
|
||||
|
||||
## Additional Resources
|
||||
{additional_resources}
|
||||
```
|
||||
|
||||
## Generated Documentation
|
||||
"""
|
||||
```
|
||||
|
||||
## Template Management System
|
||||
|
||||
### Version Control Integration
|
||||
```python
|
||||
class TemplateVersionManager:
|
||||
def __init__(self):
|
||||
self.versions = {}
|
||||
self.current_versions = {}
|
||||
|
||||
def create_version(self, template_name, template_content, author, description):
|
||||
"""Create a new version of a template."""
|
||||
import datetime
|
||||
import hashlib
|
||||
|
||||
version_id = hashlib.md5(template_content.encode()).hexdigest()[:8]
|
||||
timestamp = datetime.datetime.now().isoformat()
|
||||
|
||||
version_info = {
|
||||
'version_id': version_id,
|
||||
'content': template_content,
|
||||
'author': author,
|
||||
'description': description,
|
||||
'timestamp': timestamp,
|
||||
'parent_version': self.current_versions.get(template_name)
|
||||
}
|
||||
|
||||
if template_name not in self.versions:
|
||||
self.versions[template_name] = []
|
||||
|
||||
self.versions[template_name].append(version_info)
|
||||
self.current_versions[template_name] = version_id
|
||||
|
||||
return version_id
|
||||
|
||||
def rollback(self, template_name, version_id):
|
||||
"""Rollback to a specific version."""
|
||||
if template_name in self.versions:
|
||||
for version in self.versions[template_name]:
|
||||
if version['version_id'] == version_id:
|
||||
self.current_versions[template_name] = version_id
|
||||
return version['content']
|
||||
return None
|
||||
```
|
||||
|
||||
### Performance Monitoring
|
||||
```python
|
||||
class TemplatePerformanceMonitor:
|
||||
def __init__(self):
|
||||
self.usage_stats = {}
|
||||
self.performance_metrics = {}
|
||||
|
||||
def track_usage(self, template_name, execution_time, success):
|
||||
"""Track template usage and performance."""
|
||||
if template_name not in self.usage_stats:
|
||||
self.usage_stats[template_name] = {
|
||||
'usage_count': 0,
|
||||
'total_time': 0,
|
||||
'success_count': 0,
|
||||
'failure_count': 0
|
||||
}
|
||||
|
||||
stats = self.usage_stats[template_name]
|
||||
stats['usage_count'] += 1
|
||||
stats['total_time'] += execution_time
|
||||
|
||||
if success:
|
||||
stats['success_count'] += 1
|
||||
else:
|
||||
stats['failure_count'] += 1
|
||||
|
||||
def get_performance_report(self, template_name):
|
||||
"""Generate performance report for a template."""
|
||||
if template_name not in self.usage_stats:
|
||||
return None
|
||||
|
||||
stats = self.usage_stats[template_name]
|
||||
avg_time = stats['total_time'] / stats['usage_count']
|
||||
success_rate = stats['success_count'] / stats['usage_count']
|
||||
|
||||
return {
|
||||
'template_name': template_name,
|
||||
'total_usage': stats['usage_count'],
|
||||
'average_execution_time': avg_time,
|
||||
'success_rate': success_rate,
|
||||
'failure_rate': 1 - success_rate
|
||||
}
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Template Quality Guidelines
|
||||
- **Clear Documentation**: Include purpose, variables, and usage examples
|
||||
- **Consistent Naming**: Use standardized variable naming conventions
|
||||
- **Error Handling**: Include fallback mechanisms for missing variables
|
||||
- **Performance Optimization**: Minimize template complexity and rendering time
|
||||
- **Testing**: Implement comprehensive template testing frameworks
|
||||
|
||||
### Security Considerations
|
||||
- **Input Validation**: Sanitize all template variables
|
||||
- **Injection Prevention**: Prevent code injection in template rendering
|
||||
- **Access Control**: Implement proper authorization for template modifications
|
||||
- **Audit Trail**: Track template changes and usage
|
||||
|
||||
This comprehensive template system architecture provides the foundation for building scalable, maintainable prompt templates that can be efficiently managed and optimized across diverse use cases.
|
||||
286
skills/ai/rag/SKILL.md
Normal file
286
skills/ai/rag/SKILL.md
Normal file
@@ -0,0 +1,286 @@
|
||||
---
|
||||
name: rag-implementation
|
||||
description: Build Retrieval-Augmented Generation (RAG) systems for AI applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
|
||||
allowed-tools: Read, Write, Bash
|
||||
category: ai-engineering
|
||||
tags: [rag, vector-databases, embeddings, retrieval, semantic-search]
|
||||
version: 1.0.0
|
||||
---
|
||||
|
||||
# RAG Implementation
|
||||
|
||||
Build Retrieval-Augmented Generation systems that extend AI capabilities with external knowledge sources.
|
||||
|
||||
## Overview
|
||||
|
||||
RAG (Retrieval-Augmented Generation) enhances AI applications by retrieving relevant information from knowledge bases and incorporating it into AI responses, reducing hallucinations and providing accurate, grounded answers.
|
||||
|
||||
## When to Use
|
||||
|
||||
Use this skill when:
|
||||
|
||||
- Building Q&A systems over proprietary documents
|
||||
- Creating chatbots with current, factual information
|
||||
- Implementing semantic search with natural language queries
|
||||
- Reducing hallucinations with grounded responses
|
||||
- Enabling AI systems to access domain-specific knowledge
|
||||
- Building documentation assistants
|
||||
- Creating research tools with source citation
|
||||
- Developing knowledge management systems
|
||||
|
||||
## Core Components
|
||||
|
||||
### Vector Databases
|
||||
Store and efficiently retrieve document embeddings for semantic search.
|
||||
|
||||
**Key Options:**
|
||||
- **Pinecone**: Managed, scalable, production-ready
|
||||
- **Weaviate**: Open-source, hybrid search capabilities
|
||||
- **Milvus**: High performance, on-premise deployment
|
||||
- **Chroma**: Lightweight, easy local development
|
||||
- **Qdrant**: Fast, advanced filtering
|
||||
- **FAISS**: Meta's library, full control
|
||||
|
||||
### Embedding Models
|
||||
Convert text to numerical vectors for similarity search.
|
||||
|
||||
**Popular Models:**
|
||||
- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dimensions
|
||||
- **all-MiniLM-L6-v2**: Fast, lightweight, 384 dimensions
|
||||
- **e5-large-v2**: High quality, multilingual
|
||||
- **bge-large-en-v1.5**: State-of-the-art performance
|
||||
|
||||
### Retrieval Strategies
|
||||
Find relevant content based on user queries.
|
||||
|
||||
**Approaches:**
|
||||
- **Dense Retrieval**: Semantic similarity via embeddings
|
||||
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
|
||||
- **Hybrid Search**: Combine dense + sparse for best results
|
||||
- **Multi-Query**: Generate multiple query variations
|
||||
- **Contextual Compression**: Extract only relevant parts
|
||||
|
||||
## Quick Implementation
|
||||
|
||||
### Basic RAG Setup
|
||||
|
||||
```java
|
||||
// Load documents from file system
|
||||
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
|
||||
|
||||
// Create embedding store
|
||||
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
|
||||
|
||||
// Ingest documents into the store
|
||||
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
|
||||
|
||||
// Create AI service with RAG capability
|
||||
Assistant assistant = AiServices.builder(Assistant.class)
|
||||
.chatModel(chatModel)
|
||||
.chatMemory(MessageWindowChatMemory.withMaxMessages(10))
|
||||
.contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
|
||||
.build();
|
||||
```
|
||||
|
||||
### Document Processing Pipeline
|
||||
|
||||
```java
|
||||
// Split documents into chunks
|
||||
DocumentSplitter splitter = new RecursiveCharacterTextSplitter(
|
||||
500, // chunk size
|
||||
100 // overlap
|
||||
);
|
||||
|
||||
// Create embedding model
|
||||
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
|
||||
.apiKey("your-api-key")
|
||||
.build();
|
||||
|
||||
// Create embedding store
|
||||
EmbeddingStore<TextSegment> embeddingStore = PgVectorEmbeddingStore.builder()
|
||||
.host("localhost")
|
||||
.database("postgres")
|
||||
.user("postgres")
|
||||
.password("password")
|
||||
.table("embeddings")
|
||||
.dimension(1536)
|
||||
.build();
|
||||
|
||||
// Process and store documents
|
||||
for (Document document : documents) {
|
||||
List<TextSegment> segments = splitter.split(document);
|
||||
for (TextSegment segment : segments) {
|
||||
Embedding embedding = embeddingModel.embed(segment).content();
|
||||
embeddingStore.add(embedding, segment);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Implementation Patterns
|
||||
|
||||
### Pattern 1: Simple Document Q&A
|
||||
|
||||
Create a basic Q&A system over your documents.
|
||||
|
||||
```java
|
||||
public interface DocumentAssistant {
|
||||
String answer(String question);
|
||||
}
|
||||
|
||||
DocumentAssistant assistant = AiServices.builder(DocumentAssistant.class)
|
||||
.chatModel(chatModel)
|
||||
.contentRetriever(retriever)
|
||||
.build();
|
||||
```
|
||||
|
||||
### Pattern 2: Metadata-Filtered Retrieval
|
||||
|
||||
Filter results based on document metadata.
|
||||
|
||||
```java
|
||||
// Add metadata during document loading
|
||||
Document document = Document.builder()
|
||||
.text("Content here")
|
||||
.metadata("source", "technical-manual.pdf")
|
||||
.metadata("category", "technical")
|
||||
.metadata("date", "2024-01-15")
|
||||
.build();
|
||||
|
||||
// Filter during retrieval
|
||||
EmbeddingStoreContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
|
||||
.embeddingStore(embeddingStore)
|
||||
.embeddingModel(embeddingModel)
|
||||
.maxResults(5)
|
||||
.minScore(0.7)
|
||||
.filter(metadataKey("category").isEqualTo("technical"))
|
||||
.build();
|
||||
```
|
||||
|
||||
### Pattern 3: Multi-Source Retrieval
|
||||
|
||||
Combine results from multiple knowledge sources.
|
||||
|
||||
```java
|
||||
ContentRetriever webRetriever = EmbeddingStoreContentRetriever.from(webStore);
|
||||
ContentRetriever documentRetriever = EmbeddingStoreContentRetriever.from(documentStore);
|
||||
ContentRetriever databaseRetriever = EmbeddingStoreContentRetriever.from(databaseStore);
|
||||
|
||||
// Combine results
|
||||
List<Content> allResults = new ArrayList<>();
|
||||
allResults.addAll(webRetriever.retrieve(query));
|
||||
allResults.addAll(documentRetriever.retrieve(query));
|
||||
allResults.addAll(databaseRetriever.retrieve(query));
|
||||
|
||||
// Rerank combined results
|
||||
List<Content> rerankedResults = reranker.reorder(query, allResults);
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Document Preparation
|
||||
- Clean and preprocess documents before ingestion
|
||||
- Remove irrelevant content and formatting artifacts
|
||||
- Standardize document structure for consistent processing
|
||||
- Add relevant metadata for filtering and context
|
||||
|
||||
### Chunking Strategy
|
||||
- Use 500-1000 tokens per chunk for optimal balance
|
||||
- Include 10-20% overlap to preserve context at boundaries
|
||||
- Consider document structure when determining chunk boundaries
|
||||
- Test different chunk sizes for your specific use case
|
||||
|
||||
### Retrieval Optimization
|
||||
- Start with high k values (10-20) then filter/rerank
|
||||
- Use metadata filtering to improve relevance
|
||||
- Combine multiple retrieval strategies for better coverage
|
||||
- Monitor retrieval quality and user feedback
|
||||
|
||||
### Performance Considerations
|
||||
- Cache embeddings for frequently accessed content
|
||||
- Use batch processing for document ingestion
|
||||
- Optimize vector store configuration for your scale
|
||||
- Monitor query performance and system resources
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Poor Retrieval Quality
|
||||
**Problem**: Retrieved documents don't match user queries
|
||||
**Solutions**:
|
||||
- Improve document preprocessing and cleaning
|
||||
- Adjust chunk size and overlap parameters
|
||||
- Try different embedding models
|
||||
- Use hybrid search combining semantic and keyword matching
|
||||
|
||||
### Irrelevant Results
|
||||
**Problem**: Retrieved documents contain relevant information but are not specific enough
|
||||
**Solutions**:
|
||||
- Add metadata filtering for domain-specific constraints
|
||||
- Implement reranking with cross-encoder models
|
||||
- Use contextual compression to extract relevant parts
|
||||
- Fine-tune retrieval parameters (k values, similarity thresholds)
|
||||
|
||||
### Performance Issues
|
||||
**Problem**: Slow response times during retrieval
|
||||
**Solutions**:
|
||||
- Optimize vector store configuration and indexing
|
||||
- Implement caching for frequently retrieved content
|
||||
- Use smaller embedding models for faster inference
|
||||
- Consider approximate nearest neighbor algorithms
|
||||
|
||||
### Hallucination Prevention
|
||||
**Problem**: AI generates information not present in retrieved documents
|
||||
**Solutions**:
|
||||
- Improve prompt engineering to emphasize grounding
|
||||
- Add verification steps to check answer alignment
|
||||
- Include confidence scoring for responses
|
||||
- Implement fact-checking mechanisms
|
||||
|
||||
## Evaluation Framework
|
||||
|
||||
### Retrieval Metrics
|
||||
- **Precision@k**: Percentage of relevant documents in top-k results
|
||||
- **Recall@k**: Percentage of all relevant documents found in top-k results
|
||||
- **Mean Reciprocal Rank (MRR)**: Average rank of first relevant result
|
||||
- **Normalized Discounted Cumulative Gain (nDCG)**: Ranking quality metric
|
||||
|
||||
### Answer Quality Metrics
|
||||
- **Faithfulness**: Degree to which answers are grounded in retrieved documents
|
||||
- **Answer Relevance**: How well answers address user questions
|
||||
- **Context Recall**: Percentage of relevant context used in answers
|
||||
- **Context Precision**: Percentage of retrieved context that is relevant
|
||||
|
||||
### User Experience Metrics
|
||||
- **Response Time**: Time from query to answer
|
||||
- **User Satisfaction**: Feedback ratings on answer quality
|
||||
- **Task Completion**: Rate of successful task completion
|
||||
- **Engagement**: User interaction patterns with the system
|
||||
|
||||
## Resources
|
||||
|
||||
### Reference Documentation
|
||||
- [Vector Database Comparison](references/vector-databases.md) - Detailed comparison of vector database options
|
||||
- [Embedding Models Guide](references/embedding-models.md) - Model selection and optimization
|
||||
- [Retrieval Strategies](references/retrieval-strategies.md) - Advanced retrieval techniques
|
||||
- [Document Chunking](references/document-chunking.md) - Chunking strategies and best practices
|
||||
- [LangChain4j RAG Guide](references/langchain4j-rag-guide.md) - Official implementation patterns
|
||||
|
||||
### Assets
|
||||
- `assets/vector-store-config.yaml` - Configuration templates for different vector stores
|
||||
- `assets/retriever-pipeline.java` - Complete RAG pipeline implementation
|
||||
- `assets/evaluation-metrics.java` - Evaluation framework code
|
||||
|
||||
## Constraints and Limitations
|
||||
|
||||
1. **Token Limits**: Respect model context window limitations
|
||||
2. **API Rate Limits**: Manage external API rate limits and costs
|
||||
3. **Data Privacy**: Ensure compliance with data protection regulations
|
||||
4. **Resource Requirements**: Consider memory and computational requirements
|
||||
5. **Maintenance**: Plan for regular updates and system monitoring
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Secure access to vector databases and embedding services
|
||||
- Implement proper authentication and authorization
|
||||
- Validate and sanitize user inputs
|
||||
- Monitor for abuse and unusual usage patterns
|
||||
- Regular security audits and penetration testing
|
||||
307
skills/ai/rag/assets/retriever-pipeline.java
Normal file
307
skills/ai/rag/assets/retriever-pipeline.java
Normal file
@@ -0,0 +1,307 @@
|
||||
package com.example.rag;
|
||||
|
||||
import dev.langchain4j.data.document.Document;
|
||||
import dev.langchain4j.data.document.DocumentSplitter;
|
||||
import dev.langchain4j.data.document.parser.TextDocumentParser;
|
||||
import dev.langchain4j.data.document.splitter.RecursiveCharacterTextSplitter;
|
||||
import dev.langchain4j.data.embedding.Embedding;
|
||||
import dev.langchain4j.data.segment.TextSegment;
|
||||
import dev.langchain4j.model.embedding.EmbeddingModel;
|
||||
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
|
||||
import dev.langchain4j.store.embedding.EmbeddingStore;
|
||||
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
|
||||
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
|
||||
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
|
||||
import dev.langchain4j.store.embedding.chroma.ChromaEmbeddingStore;
|
||||
import dev.langchain4j.store.embedding.qdrant.QdrantEmbeddingStore;
|
||||
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
|
||||
import dev.langchain4j.store.embedding.filter.Filter;
|
||||
import dev.langchain4j.store.embedding.filter.MetadataFilterBuilder;
|
||||
|
||||
import java.nio.file.Path;
|
||||
import java.nio.file.Paths;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.HashMap;
|
||||
|
||||
/**
|
||||
* Complete RAG Pipeline Implementation
|
||||
*
|
||||
* This class provides a comprehensive implementation of a RAG (Retrieval-Augmented Generation)
|
||||
* system with support for multiple vector stores and advanced retrieval strategies.
|
||||
*/
|
||||
public class RAGPipeline {
|
||||
|
||||
private final EmbeddingModel embeddingModel;
|
||||
private final EmbeddingStore<TextSegment> embeddingStore;
|
||||
private final DocumentSplitter documentSplitter;
|
||||
private final RAGConfig config;
|
||||
|
||||
/**
|
||||
* Configuration class for RAG pipeline
|
||||
*/
|
||||
public static class RAGConfig {
|
||||
private String vectorStoreType = "chroma";
|
||||
private String openAiApiKey;
|
||||
private String pineconeApiKey;
|
||||
private String pineconeEnvironment;
|
||||
private String pineconeIndex = "rag-documents";
|
||||
private String chromaCollection = "rag-documents";
|
||||
private String chromaPersistPath = "./chroma_db";
|
||||
private String qdrantHost = "localhost";
|
||||
private int qdrantPort = 6333;
|
||||
private String qdrantCollection = "rag-documents";
|
||||
private int chunkSize = 1000;
|
||||
private int chunkOverlap = 200;
|
||||
private int embeddingDimension = 1536;
|
||||
|
||||
// Getters and setters
|
||||
public String getVectorStoreType() { return vectorStoreType; }
|
||||
public void setVectorStoreType(String vectorStoreType) { this.vectorStoreType = vectorStoreType; }
|
||||
public String getOpenAiApiKey() { return openAiApiKey; }
|
||||
public void setOpenAiApiKey(String openAiApiKey) { this.openAiApiKey = openAiApiKey; }
|
||||
public String getPineconeApiKey() { return pineconeApiKey; }
|
||||
public void setPineconeApiKey(String pineconeApiKey) { this.pineconeApiKey = pineconeApiKey; }
|
||||
public String getPineconeEnvironment() { return pineconeEnvironment; }
|
||||
public void setPineconeEnvironment(String pineconeEnvironment) { this.pineconeEnvironment = pineconeEnvironment; }
|
||||
public String getPineconeIndex() { return pineconeIndex; }
|
||||
public void setPineconeIndex(String pineconeIndex) { this.pineconeIndex = pineconeIndex; }
|
||||
public String getChromaCollection() { return chromaCollection; }
|
||||
public void setChromaCollection(String chromaCollection) { this.chromaCollection = chromaCollection; }
|
||||
public String getChromaPersistPath() { return chromaPersistPath; }
|
||||
public void setChromaPersistPath(String chromaPersistPath) { this.chromaPersistPath = chromaPersistPath; }
|
||||
public String getQdrantHost() { return qdrantHost; }
|
||||
public void setQdrantHost(String qdrantHost) { this.qdrantHost = qdrantHost; }
|
||||
public int getQdrantPort() { return qdrantPort; }
|
||||
public void setQdrantPort(int qdrantPort) { this.qdrantPort = qdrantPort; }
|
||||
public String getQdrantCollection() { return qdrantCollection; }
|
||||
public void setQdrantCollection(String qdrantCollection) { this.qdrantCollection = qdrantCollection; }
|
||||
public int getChunkSize() { return chunkSize; }
|
||||
public void setChunkSize(int chunkSize) { this.chunkSize = chunkSize; }
|
||||
public int getChunkOverlap() { return chunkOverlap; }
|
||||
public void setChunkOverlap(int chunkOverlap) { this.chunkOverlap = chunkOverlap; }
|
||||
public int getEmbeddingDimension() { return embeddingDimension; }
|
||||
public void setEmbeddingDimension(int embeddingDimension) { this.embeddingDimension = embeddingDimension; }
|
||||
}
|
||||
|
||||
/**
|
||||
* Constructor
|
||||
*/
|
||||
public RAGPipeline(RAGConfig config) {
|
||||
this.config = config;
|
||||
this.embeddingModel = createEmbeddingModel();
|
||||
this.embeddingStore = createEmbeddingStore();
|
||||
this.documentSplitter = createDocumentSplitter();
|
||||
}
|
||||
|
||||
/**
|
||||
* Create embedding model based on configuration
|
||||
*/
|
||||
private EmbeddingModel createEmbeddingModel() {
|
||||
return OpenAiEmbeddingModel.builder()
|
||||
.apiKey(config.getOpenAiApiKey())
|
||||
.modelName("text-embedding-ada-002")
|
||||
.build();
|
||||
}
|
||||
|
||||
/**
|
||||
* Create embedding store based on configuration
|
||||
*/
|
||||
private EmbeddingStore<TextSegment> createEmbeddingStore() {
|
||||
switch (config.getVectorStoreType().toLowerCase()) {
|
||||
case "pinecone":
|
||||
return PineconeEmbeddingStore.builder()
|
||||
.apiKey(config.getPineconeApiKey())
|
||||
.environment(config.getPineconeEnvironment())
|
||||
.index(config.getPineconeIndex())
|
||||
.dimension(config.getEmbeddingDimension())
|
||||
.build();
|
||||
|
||||
case "chroma":
|
||||
return ChromaEmbeddingStore.builder()
|
||||
.collectionName(config.getChromaCollection())
|
||||
.persistDirectory(config.getChromaPersistPath())
|
||||
.build();
|
||||
|
||||
case "qdrant":
|
||||
return QdrantEmbeddingStore.builder()
|
||||
.host(config.getQdrantHost())
|
||||
.port(config.getQdrantPort())
|
||||
.collectionName(config.getQdrantCollection())
|
||||
.dimension(config.getEmbeddingDimension())
|
||||
.build();
|
||||
|
||||
case "memory":
|
||||
default:
|
||||
return new InMemoryEmbeddingStore<>();
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Create document splitter
|
||||
*/
|
||||
private DocumentSplitter createDocumentSplitter() {
|
||||
return new RecursiveCharacterTextSplitter(
|
||||
config.getChunkSize(),
|
||||
config.getChunkOverlap()
|
||||
);
|
||||
}
|
||||
|
||||
/**
|
||||
* Load documents from directory
|
||||
*/
|
||||
public List<Document> loadDocuments(String directoryPath) {
|
||||
try {
|
||||
Path directory = Paths.get(directoryPath);
|
||||
List<Document> documents = FileSystemDocumentLoader.loadDocuments(directory);
|
||||
|
||||
// Add metadata to documents
|
||||
for (Document document : documents) {
|
||||
Map<String, Object> metadata = new HashMap<>(document.metadata().toMap());
|
||||
metadata.put("loaded_at", System.currentTimeMillis());
|
||||
metadata.put("source_directory", directoryPath);
|
||||
|
||||
// Update document metadata
|
||||
document = Document.from(document.text(), metadata);
|
||||
}
|
||||
|
||||
return documents;
|
||||
} catch (Exception e) {
|
||||
throw new RuntimeException("Failed to load documents from " + directoryPath, e);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Process and ingest documents
|
||||
*/
|
||||
public void ingestDocuments(List<Document> documents) {
|
||||
// Split documents into segments
|
||||
List<TextSegment> segments = documentSplitter.split(documents);
|
||||
|
||||
// Add additional metadata to segments
|
||||
for (int i = 0; i < segments.size(); i++) {
|
||||
TextSegment segment = segments.get(i);
|
||||
Map<String, Object> metadata = new HashMap<>(segment.metadata().toMap());
|
||||
metadata.put("segment_index", i);
|
||||
metadata.put("total_segments", segments.size());
|
||||
metadata.put("processed_at", System.currentTimeMillis());
|
||||
|
||||
segments.set(i, TextSegment.from(segment.text(), metadata));
|
||||
}
|
||||
|
||||
// Ingest into embedding store
|
||||
EmbeddingStoreIngestor.ingest(segments, embeddingStore);
|
||||
|
||||
System.out.println("Ingested " + documents.size() + " documents into " +
|
||||
segments.size() + " segments");
|
||||
}
|
||||
|
||||
/**
|
||||
* Search documents with optional filtering
|
||||
*/
|
||||
public List<TextSegment> search(String query, int maxResults, Filter filter) {
|
||||
Embedding queryEmbedding = embeddingModel.embed(query).content();
|
||||
|
||||
return embeddingStore.findRelevant(queryEmbedding, maxResults, filter);
|
||||
}
|
||||
|
||||
/**
|
||||
* Search documents with metadata filtering
|
||||
*/
|
||||
public List<TextSegment> searchWithMetadataFilter(String query, int maxResults,
|
||||
Map<String, Object> metadataFilters) {
|
||||
Filter filter = null;
|
||||
|
||||
if (metadataFilters != null && !metadataFilters.isEmpty()) {
|
||||
MetadataFilterBuilder filterBuilder = new MetadataFilterBuilder();
|
||||
|
||||
for (Map.Entry<String, Object> entry : metadataFilters.entrySet()) {
|
||||
String key = entry.getKey();
|
||||
Object value = entry.getValue();
|
||||
|
||||
if (value instanceof String) {
|
||||
filterBuilder = filterBuilder.metadata(key).isEqualTo((String) value);
|
||||
} else if (value instanceof Number) {
|
||||
filterBuilder = filterBuilder.metadata(key).isEqualTo(((Number) value).doubleValue());
|
||||
}
|
||||
// Add more type handling as needed
|
||||
}
|
||||
|
||||
filter = filterBuilder.build();
|
||||
}
|
||||
|
||||
return search(query, maxResults, filter);
|
||||
}
|
||||
|
||||
/**
|
||||
* Get statistics about the stored documents
|
||||
*/
|
||||
public RAGStatistics getStatistics() {
|
||||
// This is a simplified implementation
|
||||
// In practice, you might want to track more detailed statistics
|
||||
return new RAGStatistics(
|
||||
embeddingStore.getClass().getSimpleName(),
|
||||
config.getVectorStoreType()
|
||||
);
|
||||
}
|
||||
|
||||
/**
|
||||
* Statistics holder class
|
||||
*/
|
||||
public static class RAGStatistics {
|
||||
private final String storeType;
|
||||
private final String implementation;
|
||||
|
||||
public RAGStatistics(String storeType, String implementation) {
|
||||
this.storeType = storeType;
|
||||
this.implementation = implementation;
|
||||
}
|
||||
|
||||
public String getStoreType() { return storeType; }
|
||||
public String getImplementation() { return implementation; }
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
return "RAGStatistics{" +
|
||||
"storeType='" + storeType + '\'' +
|
||||
", implementation='" + implementation + '\'' +
|
||||
'}';
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Example usage
|
||||
*/
|
||||
public static void main(String[] args) {
|
||||
// Configure the pipeline
|
||||
RAGConfig config = new RAGConfig();
|
||||
config.setVectorStoreType("chroma"); // or "pinecone", "qdrant", "memory"
|
||||
config.setOpenAiApiKey("your-openai-api-key");
|
||||
config.setChunkSize(1000);
|
||||
config.setChunkOverlap(200);
|
||||
|
||||
// Create pipeline
|
||||
RAGPipeline pipeline = new RAGPipeline(config);
|
||||
|
||||
// Load documents
|
||||
List<Document> documents = pipeline.loadDocuments("./documents");
|
||||
|
||||
// Ingest documents
|
||||
pipeline.ingestDocuments(documents);
|
||||
|
||||
// Search for relevant content
|
||||
List<TextSegment> results = pipeline.search("What is machine learning?", 5, null);
|
||||
|
||||
// Print results
|
||||
for (int i = 0; i < results.size(); i++) {
|
||||
TextSegment segment = results.get(i);
|
||||
System.out.println("Result " + (i + 1) + ":");
|
||||
System.out.println("Content: " + segment.text().substring(0, Math.min(200, segment.text().length())) + "...");
|
||||
System.out.println("Metadata: " + segment.metadata());
|
||||
System.out.println();
|
||||
}
|
||||
|
||||
// Print statistics
|
||||
System.out.println("Pipeline Statistics: " + pipeline.getStatistics());
|
||||
}
|
||||
}
|
||||
127
skills/ai/rag/assets/vector-store-config.yaml
Normal file
127
skills/ai/rag/assets/vector-store-config.yaml
Normal file
@@ -0,0 +1,127 @@
|
||||
# Vector Store Configuration Templates
|
||||
# This file contains configuration templates for different vector databases
|
||||
|
||||
# Chroma (Local/Development)
|
||||
chroma:
|
||||
type: chroma
|
||||
settings:
|
||||
persist_directory: "./chroma_db"
|
||||
collection_name: "rag_documents"
|
||||
host: "localhost"
|
||||
port: 8000
|
||||
|
||||
# Recommended for: Development, small-scale applications
|
||||
# Pros: Easy setup, local deployment, free
|
||||
# Cons: Limited scalability, single-node only
|
||||
|
||||
# Pinecone (Cloud/Production)
|
||||
pinecone:
|
||||
type: pinecone
|
||||
settings:
|
||||
api_key: "${PINECONE_API_KEY}"
|
||||
environment: "us-west1-gcp"
|
||||
index_name: "rag-documents"
|
||||
dimension: 1536
|
||||
metric: "cosine"
|
||||
pods: 1
|
||||
pod_type: "p1.x1"
|
||||
|
||||
# Recommended for: Production applications, large-scale
|
||||
# Pros: Managed service, scalable, fast
|
||||
# Cons: Cost, requires internet connection
|
||||
|
||||
# Weaviate (Open-source/Cloud)
|
||||
weaviate:
|
||||
type: weaviate
|
||||
settings:
|
||||
url: "http://localhost:8080"
|
||||
api_key: "${WEAVIATE_API_KEY}"
|
||||
class_name: "Document"
|
||||
text_key: "content"
|
||||
vectorizer: "text2vec-openai"
|
||||
module_config:
|
||||
text2vec-openai:
|
||||
model: "ada"
|
||||
modelVersion: "002"
|
||||
type: "text"
|
||||
baseUrl: "https://api.openai.com/v1"
|
||||
|
||||
# Recommended for: Hybrid search, GraphQL API
|
||||
# Pros: Open-source, hybrid search, flexible
|
||||
# Cons: More complex setup
|
||||
|
||||
# Qdrant (Performance-focused)
|
||||
qdrant:
|
||||
type: qdrant
|
||||
settings:
|
||||
host: "localhost"
|
||||
port: 6333
|
||||
collection_name: "rag_documents"
|
||||
vector_size: 1536
|
||||
distance: "Cosine"
|
||||
api_key: "${QDRANT_API_KEY}"
|
||||
|
||||
# Recommended for: Performance, advanced filtering
|
||||
# Pros: Fast, good filtering, open-source
|
||||
# Cons: Newer project, smaller community
|
||||
|
||||
# Milvus (Enterprise/Scale)
|
||||
milvus:
|
||||
type: milvus
|
||||
settings:
|
||||
host: "localhost"
|
||||
port: 19530
|
||||
collection_name: "rag_documents"
|
||||
dimension: 1536
|
||||
index_type: "IVF_FLAT"
|
||||
metric_type: "COSINE"
|
||||
nlist: 1024
|
||||
|
||||
# Recommended for: Enterprise, large-scale deployments
|
||||
# Pros: High performance, distributed
|
||||
# Cons: Complex setup, resource intensive
|
||||
|
||||
# FAISS (Local/Research)
|
||||
faiss:
|
||||
type: faiss
|
||||
settings:
|
||||
index_type: "IndexFlatL2"
|
||||
dimension: 1536
|
||||
save_path: "./faiss_index"
|
||||
|
||||
# Recommended for: Research, local processing
|
||||
# Pros: Fast, local, no dependencies
|
||||
# Cons: No persistence, limited features
|
||||
|
||||
# Common Configuration Parameters
|
||||
common:
|
||||
chunking:
|
||||
chunk_size: 1000
|
||||
chunk_overlap: 200
|
||||
separators: ["\n\n", "\n", " ", ""]
|
||||
|
||||
embedding:
|
||||
model: "text-embedding-ada-002"
|
||||
batch_size: 100
|
||||
max_retries: 3
|
||||
timeout: 30
|
||||
|
||||
retrieval:
|
||||
default_k: 5
|
||||
similarity_threshold: 0.7
|
||||
max_results: 20
|
||||
|
||||
performance:
|
||||
cache_embeddings: true
|
||||
cache_size: 1000
|
||||
parallel_processing: true
|
||||
batch_size: 50
|
||||
|
||||
# Environment Variables Template
|
||||
# Copy this to .env file and fill in your values
|
||||
environment:
|
||||
OPENAI_API_KEY: "your-openai-api-key-here"
|
||||
PINECONE_API_KEY: "your-pinecone-api-key-here"
|
||||
PINECONE_ENVIRONMENT: "us-west1-gcp"
|
||||
WEAVIATE_API_KEY: "your-weaviate-api-key-here"
|
||||
QDRANT_API_KEY: "your-qdrant-api-key-here"
|
||||
137
skills/ai/rag/references/document-chunking.md
Normal file
137
skills/ai/rag/references/document-chunking.md
Normal file
@@ -0,0 +1,137 @@
|
||||
# Document Chunking Strategies
|
||||
|
||||
## Overview
|
||||
Document chunking is the process of breaking large documents into smaller, manageable pieces that can be effectively embedded and retrieved.
|
||||
|
||||
## Chunking Strategies
|
||||
|
||||
### 1. Recursive Character Text Splitter
|
||||
**Method**: Split text based on character count, trying separators in order
|
||||
**Use Case**: General purpose text splitting
|
||||
**Advantages**: Preserves sentence and paragraph boundaries when possible
|
||||
|
||||
```python
|
||||
from langchain.text_splitters import RecursiveCharacterTextSplitter
|
||||
|
||||
splitter = RecursiveCharacterTextSplitter(
|
||||
chunk_size=1000,
|
||||
chunk_overlap=200,
|
||||
length_function=len,
|
||||
separators=["\n\n", "\n", " ", ""] # Try these in order
|
||||
)
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### 2. Token-Based Splitting
|
||||
**Method**: Split based on token count rather than characters
|
||||
**Use Case**: When working with token limits of language models
|
||||
**Advantages**: Better control over context window usage
|
||||
|
||||
```python
|
||||
from langchain.text_splitters import TokenTextSplitter
|
||||
|
||||
splitter = TokenTextSplitter(
|
||||
chunk_size=512,
|
||||
chunk_overlap=50
|
||||
)
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### 3. Semantic Chunking
|
||||
**Method**: Split based on semantic similarity
|
||||
**Use Case**: When maintaining semantic coherence is important
|
||||
**Advantages**: Chunks are more semantically meaningful
|
||||
|
||||
```python
|
||||
from langchain.text_splitters import SemanticChunker
|
||||
|
||||
splitter = SemanticChunker(
|
||||
embeddings=OpenAIEmbeddings(),
|
||||
breakpoint_threshold_type="percentile"
|
||||
)
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### 4. Markdown Header Splitter
|
||||
**Method**: Split based on markdown headers
|
||||
**Use Case**: Structured documents with clear hierarchical organization
|
||||
**Advantages**: Maintains document structure and context
|
||||
|
||||
```python
|
||||
from langchain.text_splitters import MarkdownHeaderTextSplitter
|
||||
|
||||
headers_to_split_on = [
|
||||
("#", "Header 1"),
|
||||
("##", "Header 2"),
|
||||
("###", "Header 3"),
|
||||
]
|
||||
|
||||
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
### 5. HTML Splitter
|
||||
**Method**: Split based on HTML tags
|
||||
**Use Case**: Web pages and HTML documents
|
||||
**Advantages**: Preserves HTML structure and metadata
|
||||
|
||||
```python
|
||||
from langchain.text_splitters import HTMLHeaderTextSplitter
|
||||
|
||||
headers_to_split_on = [
|
||||
("h1", "Header 1"),
|
||||
("h2", "Header 2"),
|
||||
("h3", "Header 3"),
|
||||
]
|
||||
|
||||
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
|
||||
chunks = splitter.split_documents(documents)
|
||||
```
|
||||
|
||||
## Parameter Tuning
|
||||
|
||||
### Chunk Size
|
||||
- **Small chunks (200-400 tokens)**: More precise retrieval, but may lose context
|
||||
- **Medium chunks (500-1000 tokens)**: Good balance of precision and context
|
||||
- **Large chunks (1000-2000 tokens)**: More context, but less precise retrieval
|
||||
|
||||
### Chunk Overlap
|
||||
- **Purpose**: Preserve context at chunk boundaries
|
||||
- **Typical range**: 10-20% of chunk size
|
||||
- **Higher overlap**: Better context preservation, but more redundancy
|
||||
- **Lower overlap**: Less redundancy, but may lose important context
|
||||
|
||||
### Separators
|
||||
- **Hierarchical separators**: Start with larger boundaries (paragraphs), then smaller (sentences)
|
||||
- **Custom separators**: Add domain-specific separators for better results
|
||||
- **Language-specific**: Adjust for different languages and writing styles
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Preserve Context**: Ensure chunks contain enough surrounding context
|
||||
2. **Maintain Coherence**: Keep semantically related content together
|
||||
3. **Respect Boundaries**: Avoid breaking sentences or important phrases
|
||||
4. **Consider Query Types**: Adapt chunking strategy to typical user queries
|
||||
5. **Test and Iterate**: Evaluate different chunking strategies for your specific use case
|
||||
|
||||
## Evaluation Metrics
|
||||
|
||||
1. **Retrieval Quality**: How well chunks answer user queries
|
||||
2. **Context Preservation**: Whether important context is maintained
|
||||
3. **Chunk Distribution**: Evenness of chunk sizes
|
||||
4. **Boundary Quality**: How natural chunk boundaries are
|
||||
5. **Retrieval Efficiency**: Impact on retrieval speed and accuracy
|
||||
|
||||
## Advanced Techniques
|
||||
|
||||
### Adaptive Chunking
|
||||
Adjust chunk size based on document structure and content density.
|
||||
|
||||
### Hierarchical Chunking
|
||||
Create multiple levels of chunks for different retrieval scenarios.
|
||||
|
||||
### Query-Aware Chunking
|
||||
Optimize chunk boundaries based on typical query patterns.
|
||||
|
||||
### Domain-Specific Splitting
|
||||
Use specialized splitters for specific document types (legal, medical, technical).
|
||||
88
skills/ai/rag/references/embedding-models.md
Normal file
88
skills/ai/rag/references/embedding-models.md
Normal file
@@ -0,0 +1,88 @@
|
||||
# Embedding Models Guide
|
||||
|
||||
## Overview
|
||||
Embedding models convert text into numerical vectors that capture semantic meaning for similarity search in RAG systems.
|
||||
|
||||
## Popular Embedding Models
|
||||
|
||||
### 1. text-embedding-ada-002 (OpenAI)
|
||||
- **Dimensions**: 1536
|
||||
- **Type**: General purpose
|
||||
- **Use Case**: Most applications requiring high quality embeddings
|
||||
- **Performance**: Excellent balance of quality and speed
|
||||
|
||||
### 2. all-MiniLM-L6-v2 (Sentence Transformers)
|
||||
- **Dimensions**: 384
|
||||
- **Type**: Lightweight
|
||||
- **Use Case**: Applications requiring fast inference
|
||||
- **Performance**: Good quality, very fast
|
||||
|
||||
### 3. e5-large-v2
|
||||
- **Dimensions**: 1024
|
||||
- **Type**: High quality
|
||||
- **Use Case**: Applications needing superior performance
|
||||
- **Performance**: Excellent quality, multilingual support
|
||||
|
||||
### 4. Instructor
|
||||
- **Dimensions**: Variable (768)
|
||||
- **Type**: Task-specific
|
||||
- **Use Case**: Domain-specific applications
|
||||
- **Performance**: Can be fine-tuned for specific tasks
|
||||
|
||||
### 5. bge-large-en-v1.5
|
||||
- **Dimensions**: 1024
|
||||
- **Type**: State-of-the-art
|
||||
- **Use Case**: Applications requiring best possible quality
|
||||
- **Performance**: SOTA performance on benchmarks
|
||||
|
||||
## Selection Criteria
|
||||
|
||||
1. **Quality vs Speed**: Balance between embedding quality and inference speed
|
||||
2. **Dimension Size**: Impact on storage and retrieval performance
|
||||
3. **Domain**: Specific language or domain requirements
|
||||
4. **Cost**: API costs vs local deployment
|
||||
5. **Batch Size**: Throughput requirements
|
||||
6. **Language**: Multilingual support needs
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### OpenAI Embeddings
|
||||
```python
|
||||
from langchain.embeddings import OpenAIEmbeddings
|
||||
|
||||
embeddings = OpenAIEmbeddings()
|
||||
vector = embeddings.embed_query("Your text here")
|
||||
```
|
||||
|
||||
### Sentence Transformers
|
||||
```python
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
model = SentenceTransformer('all-MiniLM-L6-v2')
|
||||
vector = model.encode("Your text here")
|
||||
```
|
||||
|
||||
### Hugging Face Models
|
||||
```python
|
||||
from langchain.embeddings import HuggingFaceEmbeddings
|
||||
|
||||
embeddings = HuggingFaceEmbeddings(
|
||||
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
||||
)
|
||||
```
|
||||
|
||||
## Optimization Tips
|
||||
|
||||
1. **Batch Processing**: Process multiple texts together for efficiency
|
||||
2. **Model Quantization**: Reduce model size for faster inference
|
||||
3. **Caching**: Cache embeddings for frequently used texts
|
||||
4. **GPU Acceleration**: Use GPU for faster processing when available
|
||||
5. **Model Selection**: Choose appropriate model size for your use case
|
||||
|
||||
## Evaluation Metrics
|
||||
|
||||
1. **Semantic Similarity**: How well embeddings capture meaning
|
||||
2. **Retrieval Performance**: Quality of retrieved documents
|
||||
3. **Speed**: Inference time per document
|
||||
4. **Memory Usage**: RAM requirements for the model
|
||||
5. **Cost**: API costs or infrastructure requirements
|
||||
94
skills/ai/rag/references/langchain4j-rag-guide.md
Normal file
94
skills/ai/rag/references/langchain4j-rag-guide.md
Normal file
@@ -0,0 +1,94 @@
|
||||
# LangChain4j RAG Implementation Guide
|
||||
|
||||
## Overview
|
||||
RAG (Retrieval-Augmented Generation) extends LLM knowledge by finding and injecting relevant information from your data into prompts before sending to the LLM.
|
||||
|
||||
## What is RAG?
|
||||
RAG helps LLMs answer questions using domain-specific knowledge by retrieving relevant information to reduce hallucinations.
|
||||
|
||||
## RAG Flavors in LangChain4j
|
||||
|
||||
### 1. Easy RAG
|
||||
Simplest way to start with minimal setup. Handles document loading, splitting, and embedding automatically.
|
||||
|
||||
### 2. Core RAG APIs
|
||||
Modular components including:
|
||||
- Document
|
||||
- TextSegment
|
||||
- EmbeddingModel
|
||||
- EmbeddingStore
|
||||
- DocumentSplitter
|
||||
|
||||
### 3. Advanced RAG
|
||||
Complex pipelines supporting:
|
||||
- Query transformation
|
||||
- Multi-source retrieval
|
||||
- Re-ranking with components like QueryTransformer and ContentRetriever
|
||||
|
||||
## RAG Stages
|
||||
|
||||
### 1. Indexing
|
||||
Pre-process documents for efficient search
|
||||
|
||||
### 2. Retrieval
|
||||
Find relevant content based on user queries
|
||||
|
||||
## Core Components
|
||||
|
||||
### Documents with metadata
|
||||
Structured representation of your content with associated metadata for filtering and context.
|
||||
|
||||
### Text segments (chunks)
|
||||
Smaller, manageable pieces of documents that are embedded and stored in vector databases.
|
||||
|
||||
### Embedding models
|
||||
Convert text segments into numerical vectors for similarity search.
|
||||
|
||||
### Embedding stores (vector databases)
|
||||
Store and efficiently retrieve embedded text segments.
|
||||
|
||||
### Content retrievers
|
||||
Find relevant content based on user queries.
|
||||
|
||||
### Query transformers
|
||||
Transform and optimize user queries for better retrieval.
|
||||
|
||||
### Content aggregators
|
||||
Combine and rank retrieved content.
|
||||
|
||||
## Advanced Features
|
||||
|
||||
- Query transformation and routing
|
||||
- Multiple retrievers for different data sources
|
||||
- Re-ranking models for improved relevance
|
||||
- Metadata filtering for targeted retrieval
|
||||
- Parallel processing for performance
|
||||
|
||||
## Implementation Example (Easy RAG)
|
||||
|
||||
```java
|
||||
// Load documents
|
||||
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/path/to/docs");
|
||||
|
||||
// Create embedding store
|
||||
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
|
||||
|
||||
// Ingest documents
|
||||
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
|
||||
|
||||
// Create AI service
|
||||
Assistant assistant = AiServices.builder(Assistant.class)
|
||||
.chatModel(chatModel)
|
||||
.chatMemory(MessageWindowChatMemory.withMaxMessages(10))
|
||||
.contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
|
||||
.build();
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Document Preparation**: Clean and structure documents before ingestion
|
||||
2. **Chunk Size**: Balance between context preservation and retrieval precision
|
||||
3. **Metadata Strategy**: Include relevant metadata for filtering and context
|
||||
4. **Embedding Model Selection**: Choose models appropriate for your domain
|
||||
5. **Retrieval Strategy**: Select appropriate k values and filtering criteria
|
||||
6. **Evaluation**: Continuously evaluate retrieval quality and answer accuracy
|
||||
161
skills/ai/rag/references/retrieval-strategies.md
Normal file
161
skills/ai/rag/references/retrieval-strategies.md
Normal file
@@ -0,0 +1,161 @@
|
||||
# Advanced Retrieval Strategies
|
||||
|
||||
## Overview
|
||||
Different retrieval approaches for finding relevant documents in RAG systems, each with specific strengths and use cases.
|
||||
|
||||
## Retrieval Approaches
|
||||
|
||||
### 1. Dense Retrieval
|
||||
**Method**: Semantic similarity via embeddings
|
||||
**Use Case**: Understanding meaning and context
|
||||
**Example**: Finding documents about "machine learning" when query is "AI algorithms"
|
||||
|
||||
```python
|
||||
from langchain.vectorstores import Chroma
|
||||
|
||||
vectorstore = Chroma.from_documents(chunks, embeddings)
|
||||
results = vectorstore.similarity_search("query", k=5)
|
||||
```
|
||||
|
||||
### 2. Sparse Retrieval
|
||||
**Method**: Keyword matching (BM25, TF-IDF)
|
||||
**Use Case**: Exact term matching and keyword-specific queries
|
||||
**Example**: Finding documents containing specific technical terms
|
||||
|
||||
```python
|
||||
from langchain.retrievers import BM25Retriever
|
||||
|
||||
bm25_retriever = BM25Retriever.from_documents(chunks)
|
||||
bm25_retriever.k = 5
|
||||
results = bm25_retriever.get_relevant_documents("query")
|
||||
```
|
||||
|
||||
### 3. Hybrid Search
|
||||
**Method**: Combine dense + sparse retrieval
|
||||
**Use Case**: Balance between semantic understanding and keyword matching
|
||||
|
||||
```python
|
||||
from langchain.retrievers import BM25Retriever, EnsembleRetriever
|
||||
|
||||
# Sparse retriever (BM25)
|
||||
bm25_retriever = BM25Retriever.from_documents(chunks)
|
||||
bm25_retriever.k = 5
|
||||
|
||||
# Dense retriever (embeddings)
|
||||
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
|
||||
|
||||
# Combine with weights
|
||||
ensemble_retriever = EnsembleRetriever(
|
||||
retrievers=[bm25_retriever, embedding_retriever],
|
||||
weights=[0.3, 0.7]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Multi-Query Retrieval
|
||||
**Method**: Generate multiple query variations
|
||||
**Use Case**: Complex queries that can be interpreted in multiple ways
|
||||
|
||||
```python
|
||||
from langchain.retrievers.multi_query import MultiQueryRetriever
|
||||
|
||||
# Generate multiple query perspectives
|
||||
retriever = MultiQueryRetriever.from_llm(
|
||||
retriever=vectorstore.as_retriever(),
|
||||
llm=OpenAI()
|
||||
)
|
||||
|
||||
# Single query → multiple variations → combined results
|
||||
results = retriever.get_relevant_documents("What is the main topic?")
|
||||
```
|
||||
|
||||
### 5. HyDE (Hypothetical Document Embeddings)
|
||||
**Method**: Generate hypothetical documents for better retrieval
|
||||
**Use Case**: When queries are very different from document style
|
||||
|
||||
```python
|
||||
# Generate hypothetical document based on query
|
||||
hypothetical_doc = llm.generate(f"Write a document about: {query}")
|
||||
# Use hypothetical doc for retrieval
|
||||
results = vectorstore.similarity_search(hypothetical_doc, k=5)
|
||||
```
|
||||
|
||||
## Advanced Retrieval Patterns
|
||||
|
||||
### Contextual Compression
|
||||
Compress retrieved documents to only include relevant parts
|
||||
|
||||
```python
|
||||
from langchain.retrievers import ContextualCompressionRetriever
|
||||
from langchain.retrievers.document_compressors import LLMChainExtractor
|
||||
|
||||
compressor = LLMChainExtractor.from_llm(llm)
|
||||
compression_retriever = ContextualCompressionRetriever(
|
||||
base_compressor=compressor,
|
||||
base_retriever=vectorstore.as_retriever()
|
||||
)
|
||||
```
|
||||
|
||||
### Parent Document Retriever
|
||||
Store small chunks for retrieval, return larger chunks for context
|
||||
|
||||
```python
|
||||
from langchain.retrievers import ParentDocumentRetriever
|
||||
from langchain.storage import InMemoryStore
|
||||
|
||||
store = InMemoryStore()
|
||||
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
|
||||
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
|
||||
|
||||
retriever = ParentDocumentRetriever(
|
||||
vectorstore=vectorstore,
|
||||
docstore=store,
|
||||
child_splitter=child_splitter,
|
||||
parent_splitter=parent_splitter
|
||||
)
|
||||
```
|
||||
|
||||
## Retrieval Optimization Techniques
|
||||
|
||||
### 1. Metadata Filtering
|
||||
Filter results based on document metadata
|
||||
|
||||
```python
|
||||
results = vectorstore.similarity_search(
|
||||
"query",
|
||||
filter={"category": "technical", "date": {"$gte": "2023-01-01"}},
|
||||
k=5
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Maximal Marginal Relevance (MMR)
|
||||
Balance relevance with diversity
|
||||
|
||||
```python
|
||||
results = vectorstore.max_marginal_relevance_search(
|
||||
"query",
|
||||
k=5,
|
||||
fetch_k=20,
|
||||
lambda_mult=0.5 # 0=max diversity, 1=max relevance
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Reranking
|
||||
Improve top results with cross-encoder
|
||||
|
||||
```python
|
||||
from sentence_transformers import CrossEncoder
|
||||
|
||||
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
|
||||
candidates = vectorstore.similarity_search("query", k=20)
|
||||
pairs = [[query, doc.page_content] for doc in candidates]
|
||||
scores = reranker.predict(pairs)
|
||||
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
|
||||
```
|
||||
|
||||
## Selection Guidelines
|
||||
|
||||
1. **Query Type**: Choose strategy based on typical query patterns
|
||||
2. **Document Type**: Consider document structure and content
|
||||
3. **Performance Requirements**: Balance quality vs speed
|
||||
4. **Domain Knowledge**: Leverage domain-specific patterns
|
||||
5. **User Expectations**: Match retrieval behavior to user expectations
|
||||
86
skills/ai/rag/references/vector-databases.md
Normal file
86
skills/ai/rag/references/vector-databases.md
Normal file
@@ -0,0 +1,86 @@
|
||||
# Vector Database Comparison and Configuration
|
||||
|
||||
## Overview
|
||||
Vector databases store and efficiently retrieve document embeddings for semantic search in RAG systems.
|
||||
|
||||
## Popular Vector Database Options
|
||||
|
||||
### 1. Pinecone
|
||||
- **Type**: Managed cloud service
|
||||
- **Features**: Scalable, fast queries, managed infrastructure
|
||||
- **Use Case**: Production applications requiring high availability
|
||||
|
||||
### 2. Weaviate
|
||||
- **Type**: Open-source, hybrid search
|
||||
- **Features**: Combines vector and keyword search, GraphQL API
|
||||
- **Use Case**: Applications needing both semantic and traditional search
|
||||
|
||||
### 3. Milvus
|
||||
- **Type**: High performance, on-premise
|
||||
- **Features**: Distributed architecture, GPU acceleration
|
||||
- **Use Case**: Large-scale deployments with custom infrastructure
|
||||
|
||||
### 4. Chroma
|
||||
- **Type**: Lightweight, easy to use
|
||||
- **Features**: Local deployment, simple API
|
||||
- **Use Case**: Development and small-scale applications
|
||||
|
||||
### 5. Qdrant
|
||||
- **Type**: Fast, filtered search
|
||||
- **Features**: Advanced filtering, payload support
|
||||
- **Use Case**: Applications requiring complex metadata filtering
|
||||
|
||||
### 6. FAISS
|
||||
- **Type**: Meta's library, local deployment
|
||||
- **Features**: High performance, CPU/GPU optimized
|
||||
- **Use Case**: Research and applications needing full control
|
||||
|
||||
## Configuration Examples
|
||||
|
||||
### Pinecone Setup
|
||||
```python
|
||||
import pinecone
|
||||
from langchain.vectorstores import Pinecone
|
||||
|
||||
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
|
||||
index = pinecone.Index("your-index-name")
|
||||
vectorstore = Pinecone(index, embeddings.embed_query, "text")
|
||||
```
|
||||
|
||||
### Weaviate Setup
|
||||
```python
|
||||
import weaviate
|
||||
from langchain.vectorstores import Weaviate
|
||||
|
||||
client = weaviate.Client("http://localhost:8080")
|
||||
vectorstore = Weaviate(client, "Document", "content", embeddings)
|
||||
```
|
||||
|
||||
### Chroma Local Setup
|
||||
```python
|
||||
from langchain.vectorstores import Chroma
|
||||
|
||||
vectorstore = Chroma(
|
||||
collection_name="my_collection",
|
||||
embedding_function=embeddings,
|
||||
persist_directory="./chroma_db"
|
||||
)
|
||||
```
|
||||
|
||||
## Selection Criteria
|
||||
|
||||
1. **Scale**: Number of documents and expected query volume
|
||||
2. **Performance**: Latency requirements and throughput needs
|
||||
3. **Deployment**: Cloud vs on-premise preferences
|
||||
4. **Features**: Filtering, hybrid search, metadata support
|
||||
5. **Cost**: Budget constraints and operational overhead
|
||||
6. **Maintenance**: Team expertise and available resources
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Indexing Strategy**: Choose appropriate distance metrics (cosine, euclidean)
|
||||
2. **Sharding**: Distribute data for large-scale deployments
|
||||
3. **Monitoring**: Track query performance and system health
|
||||
4. **Backups**: Implement regular backup procedures
|
||||
5. **Security**: Secure access to sensitive data
|
||||
6. **Optimization**: Tune parameters for your specific use case
|
||||
Reference in New Issue
Block a user