# Evaluation Best Practices

Practical guidelines for effective evaluation of LangGraph applications.

## 🎯 Evaluation Best Practices
### 1. Ensuring Consistency

**Evaluate Under the Same Conditions**

```python
import json
from typing import Dict, List


class EvaluationConfig:
    """Pin evaluation settings to ensure consistency."""

    def __init__(self):
        self.test_cases_path = "tests/evaluation/test_cases.json"
        self.seed = 42  # For reproducibility
        self.iterations = 5
        self.timeout = 30  # seconds
        self.model = "claude-3-5-sonnet-20241022"

    def load_test_cases(self) -> List[Dict]:
        """Load the same test cases for every run."""
        with open(self.test_cases_path) as f:
            data = json.load(f)
        return data["test_cases"]


# Usage
config = EvaluationConfig()
test_cases = config.load_test_cases()
# Use the same test cases for all evaluations
```
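Storing a seed only helps if it is actually applied. A minimal sketch of doing so, assuming Python's `random` module drives any test-case sampling and that the model is created with a fixed temperature (both are assumptions, not part of the config above):

```python
import random

from langchain_anthropic import ChatAnthropic

config = EvaluationConfig()
random.seed(config.seed)  # makes any sampling/shuffling of test cases reproducible

# Keep model settings identical across evaluation runs as well
llm = ChatAnthropic(model=config.model, temperature=0.0)
```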
### 2. Staged Evaluation

**Start Small and Gradually Expand**

```python
# Phase 1: Quick check (3 cases, 1 iteration)
quick_results = evaluate(test_cases[:3], iterations=1)

if quick_results["accuracy"] > baseline["accuracy"]:
    # Phase 2: Medium check (10 cases, 3 iterations)
    medium_results = evaluate(test_cases[:10], iterations=3)

    if medium_results["accuracy"] > baseline["accuracy"]:
        # Phase 3: Full evaluation (all cases, 5 iterations)
        full_results = evaluate(test_cases, iterations=5)
```
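The snippet above assumes an `evaluate()` helper and a `baseline` result from an earlier run. A minimal sketch of such a helper, assuming a `run_agent()` function that invokes your LangGraph application and test cases shaped like `{"input": ..., "expected": ...}` (both assumptions for illustration):

```python
import statistics
from typing import Dict, List


def run_agent(user_input: str) -> str:
    """Placeholder: invoke your LangGraph application here (assumption)."""
    raise NotImplementedError


def evaluate(cases: List[Dict], iterations: int) -> Dict:
    """Run every case `iterations` times and aggregate a simple accuracy score."""
    run_scores = []
    for _ in range(iterations):
        correct = sum(
            1 for case in cases
            if case["expected"] in run_agent(case["input"])  # naive substring match
        )
        run_scores.append(correct / len(cases))
    return {
        "accuracy": statistics.mean(run_scores),
        "accuracy_std": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
    }
```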
### 3. Recording Evaluation Results

**Structured Logging**

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict


def save_evaluation_result(
    results: Dict,
    version: str,
    output_dir: Path = Path("evaluation_results"),
):
    """Save evaluation results as a timestamped JSON file."""
    output_dir.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{version}_{timestamp}.json"

    full_results = {
        "version": version,
        "timestamp": timestamp,
        "metrics": results,
        "config": {
            "model": "claude-3-5-sonnet-20241022",
            "test_cases": len(test_cases),  # test_cases loaded earlier via EvaluationConfig
            "iterations": 5,
        },
    }

    with open(output_dir / filename, "w") as f:
        json.dump(full_results, f, indent=2)

    print(f"Results saved to: {output_dir / filename}")


# Usage
save_evaluation_result(results, version="baseline")
save_evaluation_result(results, version="iteration_1")
```
### 4. Visualization

**Visualizing Results**

```python
from typing import Dict, List

import matplotlib.pyplot as plt


def visualize_improvement(
    baseline: Dict,
    iterations: List[Dict],
    metrics: List[str] = ["accuracy", "latency", "cost"],
):
    """Visualize improvement progress across iterations."""
    fig, axes = plt.subplots(1, len(metrics), figsize=(15, 5))

    for idx, metric in enumerate(metrics):
        ax = axes[idx]

        # Prepare data
        x = ["Baseline"] + [f"Iter {i + 1}" for i in range(len(iterations))]
        y = [baseline[metric]] + [it[metric] for it in iterations]

        # Plot
        ax.plot(x, y, marker="o", linewidth=2)
        ax.set_title(f"{metric.capitalize()} Progress")
        ax.set_ylabel(metric.capitalize())
        ax.grid(True, alpha=0.3)

        # Goal line
        if metric in baseline.get("goals", {}):
            goal = baseline["goals"][metric]
            ax.axhline(y=goal, color="r", linestyle="--", label="Goal")
            ax.legend()

    plt.tight_layout()
    plt.savefig("evaluation_results/improvement_progress.png")
    print("Visualization saved to: evaluation_results/improvement_progress.png")
```
## 📋 Evaluation Report Template

**Standard Report Format**

```markdown
# Evaluation Report - [Version/Iteration]

Execution Date: 2024-11-24 12:00:00
Executed by: Claude Code (fine-tune skill)

## Configuration

- **Model**: claude-3-5-sonnet-20241022
- **Number of Test Cases**: 20
- **Number of Runs**: 5
- **Evaluation Duration**: 10 minutes

## Results Summary

| Metric   | Mean   | Std Dev | 95% CI           | Goal   | Achievement |
|----------|--------|---------|------------------|--------|-------------|
| Accuracy | 86.0%  | 2.1%    | [83.9%, 88.1%]   | 90.0%  | 95.6%       |
| Latency  | 2.4s   | 0.3s    | [2.1s, 2.7s]     | 2.0s   | 83.3%       |
| Cost     | $0.014 | $0.001  | [$0.013, $0.015] | $0.010 | 71.4%       |

## Detailed Analysis

### Accuracy
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Statistical Significance**: p < 0.01 ✅
- **Effect Size**: Cohen's d = 2.3 (large)

### Latency
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Statistical Significance**: p = 0.12 ❌ (not significant)
- **Effect Size**: Cohen's d = 0.3 (small)

## Error Analysis

- **Total Errors**: 0
- **Error Rate**: 0.0%
- **Retry Rate**: 0.0%

## Next Actions

1. ✅ Accuracy significantly improved → Continue
2. ⚠️ Latency improvement is small → Focus on it in the next iteration
3. ⚠️ Cost is still above the goal → Consider a max_tokens limit
```
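The Mean, Std Dev, 95% CI, and Achievement columns can be derived from the per-run scores. A minimal sketch using `scipy.stats` for the t-based confidence interval (the function and variable names are illustrative, not part of a shipped module):

```python
import statistics
from typing import Dict, List

from scipy import stats


def summarize(run_scores: List[float], goal: float, higher_is_better: bool = True) -> Dict:
    """Compute the per-metric row used in the Results Summary table."""
    n = len(run_scores)
    mean = statistics.mean(run_scores)
    std = statistics.stdev(run_scores)
    # 95% confidence interval via the t-distribution
    margin = stats.t.ppf(0.975, df=n - 1) * std / (n ** 0.5)
    achievement = mean / goal if higher_is_better else goal / mean
    return {
        "mean": mean,
        "std": std,
        "ci_95": (mean - margin, mean + margin),
        "achievement": achievement,
    }


# Example: five accuracy runs against a 90% goal
print(summarize([0.85, 0.88, 0.84, 0.87, 0.86], goal=0.90))
```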
## 🔍 Troubleshooting

### Common Problems and Solutions

#### 1. Large Variance in Evaluation Results

**Symptom**: Standard deviation > 20% of the mean

**Causes**:
- LLM temperature is too high
- Test cases are uneven
- Network latency effects

**Solutions**:

```python
from langchain_anthropic import ChatAnthropic

# Lower the temperature
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.3,  # Set lower
)

# Increase the number of runs
iterations = 10  # 5 → 10

# Remove outliers
results_clean = remove_outliers(results)
```
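`remove_outliers` is not defined in this document. One simple sketch using the common 1.5 × IQR rule, assuming `results` is a list of per-run scores:

```python
import statistics
from typing import List


def remove_outliers(values: List[float]) -> List[float]:
    """Drop values outside 1.5 * IQR of the first/third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]
```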
#### 2. Evaluation Takes Too Long

**Symptom**: Evaluation takes over 1 hour

**Causes**:
- Too many test cases
- Not running in parallel
- Timeout setting is too long

**Solutions**:

```python
# Subset evaluation
quick_test_cases = test_cases[:10]  # First 10 cases only

# Parallel execution (evaluate_case is your per-case evaluation function)
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(evaluate_case, case) for case in test_cases]
    results = [f.result() for f in futures]

# Timeout setting
timeout = 10  # 30s → 10s
```
#### 3. No Statistical Significance

**Symptom**: p-value ≥ 0.05

**Causes**:
- Improvement effect is small
- Insufficient sample size
- High data variance

**Solutions**:

```python
# Aim for larger improvements
# - Apply multiple optimizations simultaneously
# - Choose more effective techniques

# Increase the sample size
iterations = 20  # 5 → 20

# Reduce variance
# - Lower the temperature
# - Stabilize the evaluation environment
```
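The p-values and effect sizes referenced in the report template can be computed with standard tools. A minimal sketch using Welch's t-test from `scipy.stats` plus a pooled-standard-deviation Cohen's d (the score lists are illustrative):

```python
import statistics

from scipy import stats

# Per-run accuracy scores for the baseline and the new iteration (illustrative)
baseline_scores = [0.74, 0.76, 0.75, 0.73, 0.77]
improved_scores = [0.85, 0.88, 0.84, 0.87, 0.86]

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(baseline_scores, improved_scores, equal_var=False)
print(f"p-value: {p_value:.4f}")  # p < 0.05 → statistically significant

# Cohen's d: difference of means over the pooled standard deviation
pooled_std = (
    (statistics.variance(baseline_scores) + statistics.variance(improved_scores)) / 2
) ** 0.5
cohens_d = (statistics.mean(improved_scores) - statistics.mean(baseline_scores)) / pooled_std
print(f"Cohen's d: {cohens_d:.2f}")
```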
## 📊 Continuous Evaluation

**Scheduled Evaluation**

```yaml
evaluation_schedule:
  daily:
    - quick_check: 3 test cases, 1 iteration
    - purpose: Detect major regressions

  weekly:
    - medium_check: 10 test cases, 3 iterations
    - purpose: Continuous quality monitoring

  before_release:
    - full_evaluation: all test cases, 5-10 iterations
    - purpose: Release quality assurance

  after_major_changes:
    - comprehensive_evaluation: all test cases, 10+ iterations
    - purpose: Impact assessment of major changes
```
**Automated Evaluation Pipeline**

```bash
#!/bin/bash
# continuous_evaluation.sh
# Daily evaluation script

DATE=$(date +%Y%m%d)
RESULTS_DIR="evaluation_results/continuous/$DATE"
mkdir -p "$RESULTS_DIR"

# Quick check
echo "Running quick evaluation..."
uv run python -m tests.evaluation.evaluator \
  --test-cases 3 \
  --iterations 1 \
  --output "$RESULTS_DIR/quick.json"

# Compare with previous results
uv run python -m tests.evaluation.compare \
  --baseline "evaluation_results/baseline/summary.json" \
  --current "$RESULTS_DIR/quick.json" \
  --threshold 0.05

# Notify if a regression is detected
if [ $? -ne 0 ]; then
  echo "⚠️ Regression detected! Sending notification..."
  # Notification process (Slack, Email, etc.)
fi
```
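The `tests.evaluation.compare` module is project-specific and not shown here. Its core regression check could look roughly like this sketch, which exits non-zero when the current accuracy drops more than `--threshold` below the baseline (the module layout and JSON keys are assumptions matching the `save_evaluation_result` format above):

```python
# compare.py - minimal sketch of the regression check
import argparse
import json
import sys


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args()

    with open(args.baseline) as f:
        baseline = json.load(f)["metrics"]
    with open(args.current) as f:
        current = json.load(f)["metrics"]

    drop = baseline["accuracy"] - current["accuracy"]
    if drop > args.threshold:
        print(f"Regression: accuracy dropped by {drop:.1%}")
        return 1  # non-zero exit triggers the notification branch in the shell script
    return 0


if __name__ == "__main__":
    sys.exit(main())
```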
## Summary

For effective evaluation:

- ✅ **Multiple Metrics**: Quality, performance, cost, reliability
- ✅ **Statistical Validation**: Multiple runs and significance testing
- ✅ **Consistency**: Same test cases, same conditions
- ✅ **Visualization**: Track improvements with graphs and tables
- ✅ **Documentation**: Record evaluation results and analysis
## 📋 Related Documentation

- Evaluation Metrics - Metric definitions and calculation methods
- Test Case Design - Test case structure
- Statistical Significance - Statistical analysis methods