# Phase 2: Baseline Evaluation Examples
Examples of evaluation scripts and result reports.
📋 Related Documentation: Examples Home | Workflow Phase 2 | Evaluation Methods
## Example 2.1: Evaluation Script

File: `tests/evaluation/evaluator.py`
```python
import argparse
import json
import time
from pathlib import Path
from typing import Dict, List


def evaluate_test_cases(test_cases: List[Dict]) -> Dict:
    """Run every test case through the app and collect accuracy, latency, and cost."""
    results = {
        "total_cases": len(test_cases),
        "correct": 0,
        "total_latency": 0.0,
        "total_cost": 0.0,
        "case_results": []
    }

    for case in test_cases:
        start_time = time.time()
        # Run the LangGraph application (entry point provided by the project under test)
        output = run_langgraph_app(case["input"])
        latency = time.time() - start_time

        # Correctness judgment: exact match against the expected answer
        is_correct = output["answer"] == case["expected_answer"]
        if is_correct:
            results["correct"] += 1

        # Cost calculation (from token usage)
        cost = calculate_cost(output["token_usage"])

        results["total_latency"] += latency
        results["total_cost"] += cost
        results["case_results"].append({
            "case_id": case["id"],
            "correct": is_correct,
            "latency": latency,
            "cost": cost
        })

    # Calculate aggregate metrics
    results["accuracy"] = (results["correct"] / results["total_cases"]) * 100
    results["avg_latency"] = results["total_latency"] / results["total_cases"]
    results["avg_cost"] = results["total_cost"] / results["total_cases"]
    return results


def calculate_cost(token_usage: Dict) -> float:
    """Calculate the per-request cost in USD from token usage."""
    # Claude 3.5 Sonnet pricing
    INPUT_COST_PER_1M = 3.0    # $3.00 per 1M input tokens
    OUTPUT_COST_PER_1M = 15.0  # $15.00 per 1M output tokens

    input_cost = (token_usage["input_tokens"] / 1_000_000) * INPUT_COST_PER_1M
    output_cost = (token_usage["output_tokens"] / 1_000_000) * OUTPUT_COST_PER_1M
    return input_cost + output_cost


if __name__ == "__main__":
    # CLI flags match the invocation in scripts/baseline_evaluation.sh
    parser = argparse.ArgumentParser(description="Run the baseline evaluation")
    parser.add_argument("--output", default="evaluation_results/baseline_run.json",
                        help="Path of the JSON results file")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-case results")
    args = parser.parse_args()

    # Load test cases
    with open("tests/evaluation/test_cases.json") as f:
        test_cases = json.load(f)["test_cases"]

    # Execute evaluation
    results = evaluate_test_cases(test_cases)

    if args.verbose:
        for case_result in results["case_results"]:
            print(case_result)

    # Save results (create the output directory if it does not exist yet)
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w") as f:
        json.dump(results, f, indent=2)

    print(f"Accuracy: {results['accuracy']:.1f}%")
    print(f"Avg Latency: {results['avg_latency']:.2f}s")
    print(f"Avg Cost: ${results['avg_cost']:.4f}")
```
## Example 2.2: Baseline Measurement Script

File: `scripts/baseline_evaluation.sh`
```bash
#!/bin/bash

ITERATIONS=5
RESULTS_DIR="evaluation_results/baseline"

mkdir -p "$RESULTS_DIR"

echo "Starting baseline evaluation: $ITERATIONS iterations"

for i in $(seq 1 $ITERATIONS); do
    echo "----------------------------------------"
    echo "Iteration $i/$ITERATIONS"
    echo "----------------------------------------"

    uv run python -m tests.evaluation.evaluator \
        --output "$RESULTS_DIR/run_$i.json" \
        --verbose

    echo "Completed iteration $i"

    # API rate limit mitigation
    if [ $i -lt $ITERATIONS ]; then
        echo "Waiting 5 seconds before next iteration..."
        sleep 5
    fi
done

echo ""
echo "All iterations completed. Aggregating results..."

# Aggregate results
uv run python -m tests.evaluation.aggregate \
    --input-dir "$RESULTS_DIR" \
    --output "$RESULTS_DIR/summary.json"

echo "Baseline evaluation complete!"
echo "Results saved to: $RESULTS_DIR/summary.json"
```
## Example 2.3: Baseline Results Report

```markdown
# Baseline Evaluation Results
Execution Date/Time: 2024-11-24 10:00:00
Number of Runs: 5
Number of Test Cases: 20
## Evaluation Metrics Summary
| Metric | Average | Std Dev | Min | Max | Target | Gap |
| -------- | ------- | ------- | ------ | ------ | ------ | ---------- |
| Accuracy | 75.0% | 3.2% | 70.0% | 80.0% | 90.0% | **-15.0%** |
| Latency | 2.5s | 0.4s | 2.1s | 3.2s | 2.0s | **+0.5s** |
| Cost/req | $0.015 | $0.002 | $0.013 | $0.018 | $0.010 | **+$0.005** |
## Detailed Analysis
### Accuracy Issues
- **Current**: 75.0% (Target: 90.0%)
- **Main incorrect answer patterns**:
1. Intent classification errors: 12 cases (60% of errors)
2. Insufficient context understanding: 5 cases (25% of errors)
3. Ambiguous question handling: 3 cases (15% of errors)
### Latency Issues
- **Current**: 2.5s (Target: 2.0s)
- **Bottlenecks**:
1. generate_response node: Average 1.8s (72% of total)
2. analyze_intent node: Average 0.5s (20% of total)
3. Other: Average 0.2s (8% of total)
### Cost Issues
- **Current**: $0.015/req (Target: $0.010/req)
- **Cost breakdown**:
1. generate_response: $0.011 (73%)
2. analyze_intent: $0.003 (20%)
3. Other: $0.001 (7%)
- **Main factor**: High output token count (average 800 tokens)
## Improvement Directions
### Priority 1: Improve analyze_intent accuracy
- **Impact**: Direct impact on Accuracy (accounts for 60% of the -15% gap)
- **Improvement measures**: Few-shot examples, clear classification criteria, JSON output format
- **Estimated effect**: +10-12% accuracy
### Priority 2: Optimize generate_response efficiency
- **Impact**: Affects both Latency and Cost
- **Improvement measures**: Conciseness instructions, max_tokens limit, temperature adjustment
- **Estimated effect**: -0.4s latency, -$0.004 cost
```
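The improvement measures in the report above are listed only as bullet points; the two sketches below illustrate one possible shape for each. Both are hedged assumptions for illustration, not the project's actual prompts or settings.

For Priority 1, a few-shot prompt for the `analyze_intent` node with explicit classification criteria and JSON output could look roughly like this (the intent labels and example questions are invented for the sketch):

```python
# Hypothetical few-shot classification prompt for the analyze_intent node.
# Intended for str.format(question=...) or a prompt template, hence the doubled braces.
INTENT_PROMPT = """Classify the user's question into exactly one intent.

Classification criteria:
- "billing":   charges, invoices, refunds
- "technical": errors, setup, product behavior
- "general":   anything else

Examples:
Q: Why was I charged twice this month?
A: {{"intent": "billing"}}

Q: The app crashes when I upload a file.
A: {{"intent": "technical"}}

Return only a JSON object of the form {{"intent": "<label>"}}.

Q: {question}
A:"""
```

For Priority 2, the conciseness instruction, `max_tokens` limit, and temperature adjustment could be configured along these lines, assuming the node builds its model with `langchain_anthropic` (the model name and the specific values are illustrative):

```python
# Sketch of the Priority 2 measures; the token cap and temperature are
# illustrative starting points, not project settings.
from langchain_anthropic import ChatAnthropic

generate_llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,    # cap output length: reduces both latency and output-token cost
    temperature=0.2,   # lower variance for more stable, comparable runs
)

CONCISE_INSTRUCTION = (
    "Answer directly and concisely. Keep the response under 200 words "
    "unless the user explicitly asks for more detail."
)
```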