# Phase 2: Baseline Evaluation Examples
Examples of evaluation scripts and result reports.

**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 2](./workflow_phase2.md) | [Evaluation Methods](./evaluation.md)

---
### Example 2.1: Evaluation Script
**File**: `tests/evaluation/evaluator.py`
```python
import argparse
import json
import time
from pathlib import Path
from typing import Dict, List

# NOTE: run_langgraph_app is assumed to be provided by the application under
# test (the project's LangGraph entry point). It takes the test-case input and
# returns a dict with at least "answer" and "token_usage".


def evaluate_test_cases(test_cases: List[Dict]) -> Dict:
    """Evaluate test cases and collect accuracy, latency, and cost metrics."""
    results = {
        "total_cases": len(test_cases),
        "correct": 0,
        "total_latency": 0.0,
        "total_cost": 0.0,
        "case_results": []
    }

    for case in test_cases:
        start_time = time.time()

        # Run LangGraph application
        output = run_langgraph_app(case["input"])

        latency = time.time() - start_time

        # Correctness judgment
        is_correct = output["answer"] == case["expected_answer"]
        if is_correct:
            results["correct"] += 1

        # Cost calculation (from token usage)
        cost = calculate_cost(output["token_usage"])

        results["total_latency"] += latency
        results["total_cost"] += cost

        results["case_results"].append({
            "case_id": case["id"],
            "correct": is_correct,
            "latency": latency,
            "cost": cost
        })

    # Calculate aggregate metrics
    results["accuracy"] = (results["correct"] / results["total_cases"]) * 100
    results["avg_latency"] = results["total_latency"] / results["total_cases"]
    results["avg_cost"] = results["total_cost"] / results["total_cases"]

    return results


def calculate_cost(token_usage: Dict) -> float:
    """Calculate cost in USD from token usage."""
    # Claude 3.5 Sonnet pricing
    INPUT_COST_PER_1M = 3.0    # $3.00 per 1M input tokens
    OUTPUT_COST_PER_1M = 15.0  # $15.00 per 1M output tokens

    input_cost = (token_usage["input_tokens"] / 1_000_000) * INPUT_COST_PER_1M
    output_cost = (token_usage["output_tokens"] / 1_000_000) * OUTPUT_COST_PER_1M

    return input_cost + output_cost


if __name__ == "__main__":
    # CLI flags match the baseline script in Example 2.2
    parser = argparse.ArgumentParser(description="Run the baseline evaluation")
    parser.add_argument("--output", default="evaluation_results/baseline_run.json",
                        help="Path of the results JSON file")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-case results")
    args = parser.parse_args()

    # Load test cases
    with open("tests/evaluation/test_cases.json") as f:
        test_cases = json.load(f)["test_cases"]

    # Execute evaluation
    results = evaluate_test_cases(test_cases)

    if args.verbose:
        for case_result in results["case_results"]:
            print(case_result)

    # Save results
    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Accuracy: {results['accuracy']:.1f}%")
    print(f"Avg Latency: {results['avg_latency']:.2f}s")
    print(f"Avg Cost: ${results['avg_cost']:.4f}")
```
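The evaluator reads its cases from `tests/evaluation/test_cases.json` and expects each case to carry an `id`, an `input`, and an `expected_answer`. Below is a minimal sketch of that file's shape, written from Python so the field names stay aligned with the evaluator; the concrete case values are illustrative placeholders, not part of any real test set.

```python
# Illustrative only: creates a minimal tests/evaluation/test_cases.json with
# the fields the evaluator reads (id, input, expected_answer).
import json
from pathlib import Path

sample_cases = {
    "test_cases": [
        {
            "id": "case_001",                        # hypothetical case ID
            "input": "How do I reset my password?",  # hypothetical user input
            "expected_answer": "password_reset",     # hypothetical expected answer
        }
    ]
}

Path("tests/evaluation").mkdir(parents=True, exist_ok=True)
with open("tests/evaluation/test_cases.json", "w") as f:
    json.dump(sample_cases, f, indent=2)
```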
### Example 2.2: Baseline Measurement Script

**File**: `scripts/baseline_evaluation.sh`
```bash
#!/bin/bash

ITERATIONS=5
RESULTS_DIR="evaluation_results/baseline"
mkdir -p "$RESULTS_DIR"

echo "Starting baseline evaluation: $ITERATIONS iterations"

for i in $(seq 1 $ITERATIONS); do
    echo "----------------------------------------"
    echo "Iteration $i/$ITERATIONS"
    echo "----------------------------------------"

    uv run python -m tests.evaluation.evaluator \
        --output "$RESULTS_DIR/run_$i.json" \
        --verbose

    echo "Completed iteration $i"

    # API rate limit mitigation
    if [ $i -lt $ITERATIONS ]; then
        echo "Waiting 5 seconds before next iteration..."
        sleep 5
    fi
done

echo ""
echo "All iterations completed. Aggregating results..."

# Aggregate results
uv run python -m tests.evaluation.aggregate \
    --input-dir "$RESULTS_DIR" \
    --output "$RESULTS_DIR/summary.json"

echo "Baseline evaluation complete!"
echo "Results saved to: $RESULTS_DIR/summary.json"
```
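The script invokes a `tests.evaluation.aggregate` module that is not shown among this phase's examples. Below is a minimal sketch of what such a module might do, assuming each `run_*.json` has the shape produced by `evaluator.py` and that the summary needs the average, standard deviation, min, and max columns used in the report in Example 2.3.

```python
# Hypothetical sketch of tests/evaluation/aggregate.py: combines the per-run
# result files written by evaluator.py into summary statistics (average,
# standard deviation, min, max) for each headline metric.
import argparse
import json
import statistics
from pathlib import Path


def aggregate(input_dir: str, output: str) -> None:
    runs = [json.loads(p.read_text()) for p in sorted(Path(input_dir).glob("run_*.json"))]
    summary = {}
    for metric in ("accuracy", "avg_latency", "avg_cost"):
        values = [run[metric] for run in runs]
        summary[metric] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    Path(output).write_text(json.dumps(summary, indent=2))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Aggregate baseline evaluation runs")
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    aggregate(args.input_dir, args.output)
```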
### Example 2.3: Baseline Results Report
```markdown
# Baseline Evaluation Results

Execution Date/Time: 2024-11-24 10:00:00
Number of Runs: 5
Number of Test Cases: 20

## Evaluation Metrics Summary

| Metric   | Average | Std Dev | Min    | Max    | Target | Gap         |
| -------- | ------- | ------- | ------ | ------ | ------ | ----------- |
| Accuracy | 75.0%   | 3.2%    | 70.0%  | 80.0%  | 90.0%  | **-15.0%**  |
| Latency  | 2.5s    | 0.4s    | 2.1s   | 3.2s   | 2.0s   | **+0.5s**   |
| Cost/req | $0.015  | $0.002  | $0.013 | $0.018 | $0.010 | **+$0.005** |

## Detailed Analysis

### Accuracy Issues

- **Current**: 75.0% (Target: 90.0%)
- **Main incorrect answer patterns**:
  1. Intent classification errors: 12 cases (60% of errors)
  2. Insufficient context understanding: 5 cases (25% of errors)
  3. Ambiguous question handling: 3 cases (15% of errors)

### Latency Issues

- **Current**: 2.5s (Target: 2.0s)
- **Bottlenecks**:
  1. generate_response node: Average 1.8s (72% of total)
  2. analyze_intent node: Average 0.5s (20% of total)
  3. Other: Average 0.2s (8% of total)

### Cost Issues

- **Current**: $0.015/req (Target: $0.010/req)
- **Cost breakdown**:
  1. generate_response: $0.011 (73%)
  2. analyze_intent: $0.003 (20%)
  3. Other: $0.001 (7%)
- **Main factor**: High output token count (average 800 tokens)

## Improvement Directions

### Priority 1: Improve analyze_intent accuracy

- **Impact**: Direct impact on Accuracy (accounts for 60% of the -15% gap)
- **Improvement measures**: Few-shot examples, clear classification criteria, JSON output format
- **Estimated effect**: +10-12% accuracy

### Priority 2: Optimize generate_response efficiency

- **Impact**: Affects both Latency and Cost
- **Improvement measures**: Conciseness instructions, max_tokens limit, temperature adjustment
- **Estimated effect**: -0.4s latency, -$0.004 cost
```
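The Priority 2 measures in the report (conciseness instructions, a `max_tokens` cap, and a lower temperature on the response model) are plain model settings rather than structural changes. The sketch below assumes the application builds its response model with `langchain_anthropic`'s `ChatAnthropic`; the model ID, token cap, temperature, and prompt wording are illustrative values, not tuned recommendations.

```python
# Illustrative only: applies the report's Priority 2 direction by capping
# output length and lowering temperature on the generate_response model.
from langchain_anthropic import ChatAnthropic

response_model = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",  # assumed model; use the one measured in the baseline
    max_tokens=512,                      # cap output tokens to cut latency and cost
    temperature=0.2,                     # lower temperature for more focused answers
)

# Hypothetical conciseness instruction prepended to the generate_response prompt.
CONCISE_SYSTEM_PROMPT = "Answer the user's question accurately in at most three sentences."
```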
---