# Evaluation Best Practices

Practical guidelines for effective evaluation of LangGraph applications.

## 🎯 Evaluation Best Practices

### 1. Ensuring Consistency

#### Evaluate Under the Same Conditions

```python
import json
from typing import Dict, List


class EvaluationConfig:
    """Fixed evaluation settings to ensure consistency."""

    def __init__(self):
        self.test_cases_path = "tests/evaluation/test_cases.json"
        self.seed = 42  # For reproducibility
        self.iterations = 5
        self.timeout = 30  # seconds
        self.model = "claude-3-5-sonnet-20241022"

    def load_test_cases(self) -> List[Dict]:
        """Load the same test cases every time."""
        with open(self.test_cases_path) as f:
            data = json.load(f)
        return data["test_cases"]


# Usage
config = EvaluationConfig()
test_cases = config.load_test_cases()
# Use the same test cases for all evaluations
```
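
The config pins a seed and a model but does not show where they are applied. As a minimal sketch, assuming the graph under test uses a `ChatAnthropic` client (the `build_eval_llm` helper below is hypothetical, not part of the project), the fixed settings might be wired in like this:

```python
import random

from langchain_anthropic import ChatAnthropic


def build_eval_llm(config: EvaluationConfig) -> ChatAnthropic:
    """Hypothetical helper: construct an LLM client with fixed settings for evaluation runs."""
    random.seed(config.seed)  # Seed any sampling done by the evaluation harness itself
    return ChatAnthropic(
        model=config.model,
        temperature=0,  # Keep outputs as deterministic as possible across runs
    )
```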

### 2. Staged Evaluation

#### Start Small and Gradually Expand

```python
# Phase 1: Quick check (3 cases, 1 iteration)
quick_results = evaluate(test_cases[:3], iterations=1)

if quick_results["accuracy"] > baseline["accuracy"]:
    # Phase 2: Medium check (10 cases, 3 iterations)
    medium_results = evaluate(test_cases[:10], iterations=3)

    if medium_results["accuracy"] > baseline["accuracy"]:
        # Phase 3: Full evaluation (all cases, 5 iterations)
        full_results = evaluate(test_cases, iterations=5)
```
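
The snippet above assumes an `evaluate()` helper. Its exact shape depends on the application under test; a minimal sketch, assuming test cases with `input`/`expected` fields and a hypothetical `run_app(text) -> str` callable for the graph being evaluated:

```python
from typing import Callable, Dict, List


def evaluate(
    test_cases: List[Dict],
    iterations: int,
    run_app: Callable[[str], str],
) -> Dict:
    """Run every test case `iterations` times and report mean accuracy."""
    correct = 0
    total = 0
    for _ in range(iterations):
        for case in test_cases:
            output = run_app(case["input"])
            correct += int(case["expected"] in output)  # Simple containment check
            total += 1
    return {"accuracy": correct / total if total else 0.0}
```

In the staged snippet the helper is called without `run_app`; presumably the project's real helper closes over the application instead of taking it as an argument.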

### 3. Recording Evaluation Results

#### Structured Logging

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict


def save_evaluation_result(
    results: Dict,
    version: str,
    output_dir: Path = Path("evaluation_results")
):
    """Save evaluation results to a timestamped JSON file."""
    output_dir.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{version}_{timestamp}.json"

    full_results = {
        "version": version,
        "timestamp": timestamp,
        "metrics": results,
        "config": {
            "model": "claude-3-5-sonnet-20241022",
            "test_cases": len(test_cases),
            "iterations": 5
        }
    }

    with open(output_dir / filename, "w") as f:
        json.dump(full_results, f, indent=2)

    print(f"Results saved to: {output_dir / filename}")


# Usage
save_evaluation_result(results, version="baseline")
save_evaluation_result(results, version="iteration_1")
```
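
Saved results are only useful if they can be read back and compared later. A minimal sketch that loads the newest file for a version and flags a drop in accuracy (`load_latest_result` is a hypothetical helper, and the 5-point threshold is an arbitrary example):

```python
import json
from pathlib import Path
from typing import Dict


def load_latest_result(version: str, output_dir: Path = Path("evaluation_results")) -> Dict:
    """Load the most recently saved result for a given version prefix."""
    candidates = sorted(output_dir.glob(f"{version}_*.json"))  # Timestamped names sort chronologically
    if not candidates:
        raise FileNotFoundError(f"No saved results for version '{version}'")
    with open(candidates[-1]) as f:
        return json.load(f)


# Usage: flag a regression if accuracy dropped by more than 5 points vs. the baseline
baseline_result = load_latest_result("baseline")
current_result = load_latest_result("iteration_1")
if current_result["metrics"]["accuracy"] < baseline_result["metrics"]["accuracy"] - 0.05:
    print("⚠️ Possible accuracy regression")
```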

### 4. Visualization

#### Visualizing Results

```python
from typing import Dict, List

import matplotlib.pyplot as plt


def visualize_improvement(
    baseline: Dict,
    iterations: List[Dict],
    metrics: List[str] = ["accuracy", "latency", "cost"]
):
    """Visualize improvement progress."""
    fig, axes = plt.subplots(1, len(metrics), figsize=(15, 5))

    for idx, metric in enumerate(metrics):
        ax = axes[idx]

        # Prepare data
        x = ["Baseline"] + [f"Iter {i+1}" for i in range(len(iterations))]
        y = [baseline[metric]] + [it[metric] for it in iterations]

        # Plot
        ax.plot(x, y, marker='o', linewidth=2)
        ax.set_title(f"{metric.capitalize()} Progress")
        ax.set_ylabel(metric.capitalize())
        ax.grid(True, alpha=0.3)

        # Goal line
        if metric in baseline.get("goals", {}):
            goal = baseline["goals"][metric]
            ax.axhline(y=goal, color='r', linestyle='--', label='Goal')
            ax.legend()

    plt.tight_layout()
    plt.savefig("evaluation_results/improvement_progress.png")
    print("Visualization saved to: evaluation_results/improvement_progress.png")
```
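
For reference, a call might look like the following; the numbers are placeholders, and the `goals` entry on the baseline dict is what drives the dashed goal lines:

```python
baseline = {
    "accuracy": 0.75, "latency": 2.5, "cost": 0.015,
    "goals": {"accuracy": 0.90, "latency": 2.0, "cost": 0.010},
}
iteration_results = [
    {"accuracy": 0.82, "latency": 2.5, "cost": 0.014},
    {"accuracy": 0.86, "latency": 2.4, "cost": 0.014},
]
visualize_improvement(baseline, iteration_results)
```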

## 📋 Evaluation Report Template

### Standard Report Format

```markdown
# Evaluation Report - [Version/Iteration]

Execution Date: 2024-11-24 12:00:00
Executed by: Claude Code (fine-tune skill)

## Configuration

- **Model**: claude-3-5-sonnet-20241022
- **Number of Test Cases**: 20
- **Number of Runs**: 5
- **Evaluation Duration**: 10 minutes

## Results Summary

| Metric | Mean | Std Dev | 95% CI | Goal | Achievement |
|--------|------|---------|--------|------|-------------|
| Accuracy | 86.0% | 2.1% | [83.9%, 88.1%] | 90.0% | 95.6% |
| Latency | 2.4s | 0.3s | [2.1s, 2.7s] | 2.0s | 83.3% |
| Cost | $0.014 | $0.001 | [$0.013, $0.015] | $0.010 | 71.4% |

## Detailed Analysis

### Accuracy
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Statistical Significance**: p < 0.01 ✅
- **Effect Size**: Cohen's d = 2.3 (large)

### Latency
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Statistical Significance**: p = 0.12 ❌ (not significant)
- **Effect Size**: Cohen's d = 0.3 (small)

## Error Analysis

- **Total Errors**: 0
- **Error Rate**: 0.0%
- **Retry Rate**: 0.0%

## Next Actions

1. ✅ Accuracy improved significantly → Continue
2. ⚠️ Latency improvement is small → Focus on it in the next iteration
3. ⚠️ Cost has not yet reached its goal → Consider limiting max_tokens
```
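
The statistics in this template (mean, sample standard deviation, 95% confidence interval, Cohen's d) can be computed directly from the per-run metric values. A minimal sketch using `numpy` and `scipy`; the function and variable names are illustrative:

```python
from typing import Dict, List

import numpy as np
from scipy import stats


def summarize_metric(runs: List[float], confidence: float = 0.95) -> Dict:
    """Mean, sample standard deviation, and confidence interval for one metric."""
    arr = np.asarray(runs)
    mean = arr.mean()
    std = arr.std(ddof=1)
    ci = stats.t.interval(confidence, len(arr) - 1, loc=mean, scale=stats.sem(arr))
    return {"mean": mean, "std": std, "ci": ci}


def cohens_d(baseline_runs: List[float], new_runs: List[float]) -> float:
    """Effect size between two sets of runs, using the pooled standard deviation."""
    a, b = np.asarray(baseline_runs), np.asarray(new_runs)
    pooled = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (b.mean() - a.mean()) / pooled
```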

## 🔍 Troubleshooting

### Common Problems and Solutions

#### 1. Large Variance in Evaluation Results

**Symptom**: Standard deviation exceeds 20% of the mean

**Causes**:
- LLM temperature is too high
- Test cases are uneven in difficulty
- Network latency effects

**Solutions**:
```python
from langchain_anthropic import ChatAnthropic

# Lower temperature
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.3  # Set lower
)

# Increase number of runs
iterations = 10  # 5 → 10

# Remove outliers
results_clean = remove_outliers(results)
```
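
`remove_outliers` is not defined elsewhere in this document; assuming `results` is a list of numeric metric values, a simple IQR-based version might look like this:

```python
from typing import List

import numpy as np


def remove_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]
```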

#### 2. Evaluation Takes Too Long

**Symptom**: Evaluation takes over 1 hour

**Causes**:
- Too many test cases
- Not running in parallel
- Timeout setting too long

**Solutions**:
```python
# Subset evaluation
quick_test_cases = test_cases[:10]  # First 10 cases only

# Parallel execution
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(evaluate_case, case) for case in test_cases]
    results = [f.result() for f in futures]

# Timeout setting
timeout = 10  # 30s → 10s
```
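
If the parallel execution and the shorter timeout are combined, the timeout can be enforced per test case through `concurrent.futures`, so one slow case cannot block the whole run:

```python
import concurrent.futures

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(evaluate_case, case) for case in test_cases]
    for future in futures:
        try:
            results.append(future.result(timeout=10))  # Per-case timeout in seconds
        except concurrent.futures.TimeoutError:
            results.append({"error": "timeout"})  # Record the failure and move on
```

Note that a timed-out worker thread keeps running in the background; the timeout only stops the collector from waiting on it.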

#### 3. No Statistical Significance

**Symptom**: p-value ≥ 0.05

**Causes**:
- Improvement effect is small
- Insufficient sample size
- High data variance

**Solutions**:
```python
# Aim for larger improvements
# - Apply multiple optimizations simultaneously
# - Choose more effective techniques

# Increase sample size
iterations = 20  # 5 → 20

# Reduce variance
# - Lower temperature
# - Stabilize evaluation environment
```
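
For the significance check itself, Welch's t-test over the per-run metric values is one straightforward option; a sketch, with illustrative numbers standing in for real per-run accuracies:

```python
from scipy import stats

baseline_runs = [0.74, 0.75, 0.76, 0.75, 0.75]  # Illustrative per-run accuracies
improved_runs = [0.85, 0.87, 0.86, 0.84, 0.88]

t_stat, p_value = stats.ttest_ind(improved_runs, baseline_runs, equal_var=False)
print(f"p = {p_value:.4f}:", "significant" if p_value < 0.05 else "not significant")
```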

## 📊 Continuous Evaluation

### Scheduled Evaluation

```yaml
evaluation_schedule:
  daily:
    - quick_check: 3 test cases, 1 iteration
    - purpose: Detect major regressions

  weekly:
    - medium_check: 10 test cases, 3 iterations
    - purpose: Continuous quality monitoring

  before_release:
    - full_evaluation: all test cases, 5-10 iterations
    - purpose: Release quality assurance

  after_major_changes:
    - comprehensive_evaluation: all test cases, 10+ iterations
    - purpose: Impact assessment of major changes
```

### Automated Evaluation Pipeline

```bash
#!/bin/bash
# continuous_evaluation.sh
# Daily evaluation script

DATE=$(date +%Y%m%d)
RESULTS_DIR="evaluation_results/continuous/$DATE"
mkdir -p "$RESULTS_DIR"

# Quick check
echo "Running quick evaluation..."
uv run python -m tests.evaluation.evaluator \
    --test-cases 3 \
    --iterations 1 \
    --output "$RESULTS_DIR/quick.json"

# Compare with previous results
uv run python -m tests.evaluation.compare \
    --baseline "evaluation_results/baseline/summary.json" \
    --current "$RESULTS_DIR/quick.json" \
    --threshold 0.05

# Notify if regression detected
if [ $? -ne 0 ]; then
    echo "⚠️ Regression detected! Sending notification..."
    # Notification process (Slack, Email, etc.)
fi
```

## Summary

For effective evaluation:

- ✅ **Multiple Metrics**: Quality, performance, cost, reliability
- ✅ **Statistical Validation**: Multiple runs and significance testing
- ✅ **Consistency**: Same test cases, same conditions
- ✅ **Visualization**: Track improvements with graphs and tables
- ✅ **Documentation**: Record evaluation results and analysis

## 📋 Related Documentation

- [Evaluation Metrics](./evaluation_metrics.md) - Metric definitions and calculation methods
- [Test Case Design](./evaluation_testcases.md) - Test case structure
- [Statistical Significance](./evaluation_statistics.md) - Statistical analysis methods
|