# Evaluation Best Practices

Practical guidelines for effective evaluation of LangGraph applications.

## 🎯 Evaluation Best Practices

### 1. Ensuring Consistency

#### Evaluation Under the Same Conditions
```python
from typing import Dict, List
import json


class EvaluationConfig:
    """Pin evaluation settings to ensure consistency"""

    def __init__(self):
        self.test_cases_path = "tests/evaluation/test_cases.json"
        self.seed = 42  # For reproducibility
        self.iterations = 5
        self.timeout = 30  # seconds
        self.model = "claude-3-5-sonnet-20241022"

    def load_test_cases(self) -> List[Dict]:
        """Load the same test cases for every run"""
        with open(self.test_cases_path) as f:
            data = json.load(f)
        return data["test_cases"]


# Usage
config = EvaluationConfig()
test_cases = config.load_test_cases()
# Use the same test cases for all evaluations
```
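
To make runs comparable later, it can also help to record a fingerprint of the settings alongside each result. The snippet below is a minimal sketch using only the standard library; `config_fingerprint` is an illustrative helper, not part of the config class above.

```python
import hashlib
import json

def config_fingerprint(config: EvaluationConfig) -> str:
    """Return a short, stable hash of the evaluation settings (illustrative helper)."""
    payload = json.dumps(vars(config), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Example: store the fingerprint with every result so runs can be matched later
print(config_fingerprint(config))
```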

### 2. Staged Evaluation

#### Start Small and Gradually Expand

```python
# Phase 1: Quick check (3 cases, 1 iteration)
quick_results = evaluate(test_cases[:3], iterations=1)

if quick_results["accuracy"] > baseline["accuracy"]:
    # Phase 2: Medium check (10 cases, 3 iterations)
    medium_results = evaluate(test_cases[:10], iterations=3)

    if medium_results["accuracy"] > baseline["accuracy"]:
        # Phase 3: Full evaluation (all cases, 5 iterations)
        full_results = evaluate(test_cases, iterations=5)
```
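
The staged flow above assumes an `evaluate(cases, iterations=...)` helper that returns aggregate metrics. A minimal sketch of what such a helper could look like is shown below; `run_case` is a hypothetical function that executes one test case against the graph and reports whether it passed and how long it took.

```python
from statistics import mean
from typing import Dict, List

def evaluate(cases: List[Dict], iterations: int = 1) -> Dict:
    """Illustrative helper: run each case several times and aggregate accuracy/latency."""
    accuracies, latencies = [], []
    for _ in range(iterations):
        outcomes = [run_case(case) for case in cases]  # run_case is assumed, not shown here
        accuracies.append(mean(1.0 if o["passed"] else 0.0 for o in outcomes))
        latencies.append(mean(o["latency"] for o in outcomes))
    return {"accuracy": mean(accuracies), "latency": mean(latencies)}
```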

### 3. Recording Evaluation Results

#### Structured Logging

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict


def save_evaluation_result(
    results: Dict,
    version: str,
    output_dir: Path = Path("evaluation_results")
):
    """Save evaluation results to a timestamped JSON file"""
    output_dir.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{version}_{timestamp}.json"

    full_results = {
        "version": version,
        "timestamp": timestamp,
        "metrics": results,
        "config": {
            "model": "claude-3-5-sonnet-20241022",
            "test_cases": len(test_cases),  # test_cases loaded via EvaluationConfig above
            "iterations": 5
        }
    }

    with open(output_dir / filename, "w") as f:
        json.dump(full_results, f, indent=2)

    print(f"Results saved to: {output_dir / filename}")


# Usage
save_evaluation_result(results, version="baseline")
save_evaluation_result(results, version="iteration_1")
```
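
When comparing versions later, the saved files have to be located again. A small sketch for picking up the most recent result per version, assuming the filename scheme used above, might look like this (`load_latest_result` is an illustrative helper):

```python
import json
from pathlib import Path
from typing import Dict, Optional

def load_latest_result(version: str, output_dir: Path = Path("evaluation_results")) -> Optional[Dict]:
    """Return the most recently saved result for a version, or None if none exist."""
    candidates = sorted(output_dir.glob(f"{version}_*.json"))
    if not candidates:
        return None
    with open(candidates[-1]) as f:  # "%Y%m%d_%H%M%S" timestamps sort lexicographically
        return json.load(f)

baseline = load_latest_result("baseline")
latest = load_latest_result("iteration_1")
```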

### 4. Visualization

#### Visualizing Results

```python
from typing import Dict, List

import matplotlib.pyplot as plt


def visualize_improvement(
    baseline: Dict,
    iterations: List[Dict],
    metrics: List[str] = ["accuracy", "latency", "cost"]
):
    """Visualize improvement progress across iterations"""
    fig, axes = plt.subplots(1, len(metrics), figsize=(15, 5))

    for idx, metric in enumerate(metrics):
        ax = axes[idx]

        # Prepare data
        x = ["Baseline"] + [f"Iter {i+1}" for i in range(len(iterations))]
        y = [baseline[metric]] + [it[metric] for it in iterations]

        # Plot
        ax.plot(x, y, marker='o', linewidth=2)
        ax.set_title(f"{metric.capitalize()} Progress")
        ax.set_ylabel(metric.capitalize())
        ax.grid(True, alpha=0.3)

        # Goal line
        if metric in baseline.get("goals", {}):
            goal = baseline["goals"][metric]
            ax.axhline(y=goal, color='r', linestyle='--', label='Goal')
            ax.legend()

    plt.tight_layout()
    plt.savefig("evaluation_results/improvement_progress.png")
    print("Visualization saved to: evaluation_results/improvement_progress.png")
```
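
A usage example under the same assumptions (metric values as plain floats, optional goals stored on the baseline dict; the numbers below are illustrative):

```python
baseline = {
    "accuracy": 0.75, "latency": 2.5, "cost": 0.018,
    "goals": {"accuracy": 0.90, "latency": 2.0, "cost": 0.010},
}
iterations = [
    {"accuracy": 0.82, "latency": 2.5, "cost": 0.016},
    {"accuracy": 0.86, "latency": 2.4, "cost": 0.014},
]

visualize_improvement(baseline, iterations)
```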

## 📋 Evaluation Report Template

### Standard Report Format

```markdown
# Evaluation Report - [Version/Iteration]

Execution Date: 2024-11-24 12:00:00
Executed by: Claude Code (fine-tune skill)

## Configuration

- **Model**: claude-3-5-sonnet-20241022
- **Number of Test Cases**: 20
- **Number of Runs**: 5
- **Evaluation Duration**: 10 minutes

## Results Summary

| Metric | Mean | Std Dev | 95% CI | Goal | Achievement |
|--------|------|---------|--------|------|-------------|
| Accuracy | 86.0% | 2.1% | [83.9%, 88.1%] | 90.0% | 95.6% |
| Latency | 2.4s | 0.3s | [2.1s, 2.7s] | 2.0s | 83.3% |
| Cost | $0.014 | $0.001 | [$0.013, $0.015] | $0.010 | 71.4% |

(Achievement is mean ÷ goal for higher-is-better metrics and goal ÷ mean for lower-is-better metrics.)

## Detailed Analysis

### Accuracy
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Statistical Significance**: p < 0.01 ✅
- **Effect Size**: Cohen's d = 2.3 (large)

### Latency
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Statistical Significance**: p = 0.12 ❌ (not significant)
- **Effect Size**: Cohen's d = 0.3 (small)

## Error Analysis

- **Total Errors**: 0
- **Error Rate**: 0.0%
- **Retry Rate**: 0.0%

## Next Actions

1. ✅ Accuracy significantly improved → continue on this path
2. ⚠️ Latency improvement is small → focus on it in the next iteration
3. ⚠️ Cost is still above the goal → consider a max_tokens limit
```
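
The template can also be filled in programmatically from a saved results file. Below is a minimal sketch that reuses the illustrative `load_latest_result` helper from above and assumes the field names produced by `save_evaluation_result`:

```python
def render_report(saved: dict) -> str:
    """Illustrative: turn a saved result dict into a short markdown report."""
    m = saved["metrics"]
    lines = [
        f"# Evaluation Report - {saved['version']}",
        "",
        f"Execution Date: {saved['timestamp']}",
        "",
        "| Metric | Mean |",
        "|--------|------|",
        f"| Accuracy | {m['accuracy']:.1%} |",
        f"| Latency | {m['latency']:.1f}s |",
    ]
    return "\n".join(lines)

latest = load_latest_result("iteration_1")
if latest is not None:
    print(render_report(latest))
```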

## 🔍 Troubleshooting

### Common Problems and Solutions

#### 1. Large Variance in Evaluation Results

**Symptom**: Standard deviation > 20% of the mean

**Causes**:
- LLM temperature is too high
- Test cases are uneven in difficulty
- Network latency effects

**Solutions**:
```python
from langchain_anthropic import ChatAnthropic

# Lower the temperature
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.3  # Set lower
)

# Increase the number of runs
iterations = 10  # 5 → 10

# Remove outliers
results_clean = remove_outliers(results)
```
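
`remove_outliers` above is not defined in this document; a simple IQR-based version could look like the following sketch:

```python
from statistics import quantiles
from typing import List

def remove_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (illustrative implementation)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]
```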

#### 2. Evaluation Takes Too Long

**Symptom**: Evaluation takes over 1 hour

**Causes**:
- Too many test cases
- Not running in parallel
- Timeout setting too long

**Solutions**:
```python
# Subset evaluation
quick_test_cases = test_cases[:10]  # First 10 cases only

# Parallel execution
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(evaluate_case, case) for case in test_cases]
    results = [f.result() for f in futures]

# Timeout setting
timeout = 10  # 30s → 10s
```
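
To make the shorter timeout actually take effect per case, it can be combined with the futures shown above. The sketch below still assumes the hypothetical `evaluate_case` function and simply treats slow cases as failures:

```python
import concurrent.futures

def evaluate_with_timeout(cases, timeout: float = 10.0, max_workers: int = 5):
    """Illustrative: run cases in parallel and mark cases that exceed the timeout as failed."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(evaluate_case, case): case for case in cases}
        for future, case in futures.items():
            try:
                results.append(future.result(timeout=timeout))
            except concurrent.futures.TimeoutError:
                results.append({"case": case, "passed": False, "error": "timeout"})
    return results
```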

#### 3. No Statistical Significance

**Symptom**: p-value ≥ 0.05

**Causes**:
- Improvement effect is small
- Insufficient sample size
- High data variance

**Solutions**:
```python
# Aim for larger improvements
# - Apply multiple optimizations simultaneously
# - Choose more effective techniques

# Increase sample size
iterations = 20  # 5 → 20

# Reduce variance
# - Lower temperature
# - Stabilize evaluation environment
```
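
Since the report template above cites p-values and Cohen's d, here is a hedged sketch of how they could be computed for paired per-case scores using `scipy.stats`, assuming baseline and candidate were run on the same test cases in the same order:

```python
from statistics import mean, stdev
from scipy import stats

def paired_significance(baseline_scores, candidate_scores):
    """Illustrative: paired t-test plus Cohen's d on per-case score differences."""
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    res = stats.ttest_rel(candidate_scores, baseline_scores)
    cohens_d = mean(diffs) / stdev(diffs)  # effect size of the paired differences
    return {"t": res.statistic, "p": res.pvalue, "d": cohens_d}

# Example with per-case accuracy scores from two runs
print(paired_significance([0.7, 0.8, 0.75, 0.7], [0.85, 0.9, 0.8, 0.85]))
```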

## 📊 Continuous Evaluation

### Scheduled Evaluation

```yaml
evaluation_schedule:
  daily:
    - quick_check: 3 test cases, 1 iteration
    - purpose: Detect major regressions

  weekly:
    - medium_check: 10 test cases, 3 iterations
    - purpose: Continuous quality monitoring

  before_release:
    - full_evaluation: all test cases, 5-10 iterations
    - purpose: Release quality assurance

  after_major_changes:
    - comprehensive_evaluation: all test cases, 10+ iterations
    - purpose: Impact assessment of major changes
```
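
One way to wire this schedule into code is a small lookup that maps each trigger to an evaluation scope. The sketch below uses illustrative tier values matching the YAML and assumes the `evaluate` helper sketched earlier:

```python
# Illustrative mapping from schedule trigger to (number of cases, iterations)
EVAL_TIERS = {
    "daily": (3, 1),
    "weekly": (10, 3),
    "before_release": (None, 5),         # None = all test cases
    "after_major_changes": (None, 10),
}

def run_scheduled_evaluation(trigger: str, test_cases):
    n_cases, iterations = EVAL_TIERS[trigger]
    cases = test_cases if n_cases is None else test_cases[:n_cases]
    return evaluate(cases, iterations=iterations)
```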

### Automated Evaluation Pipeline

```bash
#!/bin/bash
# continuous_evaluation.sh
# Daily evaluation script

DATE=$(date +%Y%m%d)
RESULTS_DIR="evaluation_results/continuous/$DATE"
mkdir -p "$RESULTS_DIR"

# Quick check
echo "Running quick evaluation..."
uv run python -m tests.evaluation.evaluator \
    --test-cases 3 \
    --iterations 1 \
    --output "$RESULTS_DIR/quick.json"

# Compare with previous results
uv run python -m tests.evaluation.compare \
    --baseline "evaluation_results/baseline/summary.json" \
    --current "$RESULTS_DIR/quick.json" \
    --threshold 0.05

# Notify if a regression was detected
if [ $? -ne 0 ]; then
    echo "⚠️ Regression detected! Sending notification..."
    # Notification process (Slack, Email, etc.)
fi
```
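
The `tests.evaluation.compare` module invoked above is project-specific and not shown in this document. A hedged sketch of what such a comparison entry point could do, exiting non-zero when accuracy drops by more than the threshold, is:

```python
# Sketch of a possible tests/evaluation/compare.py (illustrative, not the actual module)
import argparse
import json
import sys

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args()

    with open(args.baseline) as f:
        baseline = json.load(f)
    with open(args.current) as f:
        current = json.load(f)

    drop = baseline["metrics"]["accuracy"] - current["metrics"]["accuracy"]
    if drop > args.threshold:
        print(f"Regression: accuracy dropped by {drop:.3f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```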

## Summary

For effective evaluation:
- ✅ **Multiple Metrics**: Quality, performance, cost, reliability
- ✅ **Statistical Validation**: Multiple runs and significance testing
- ✅ **Consistency**: Same test cases, same conditions
- ✅ **Visualization**: Track improvements with graphs and tables
- ✅ **Documentation**: Record evaluation results and analysis

## 📋 Related Documentation

- [Evaluation Metrics](./evaluation_metrics.md) - Metric definitions and calculation methods
- [Test Case Design](./evaluation_testcases.md) - Test case structure
- [Statistical Significance](./evaluation_statistics.md) - Statistical analysis methods