
Evaluation Best Practices

Practical guidelines for effective evaluation of LangGraph applications.

🎯 Evaluation Best Practices

1. Ensuring Consistency

Evaluation Under Same Conditions

import json
from typing import Dict, List

class EvaluationConfig:
    """Fixed evaluation settings shared by every run, to keep results comparable"""

    def __init__(self):
        self.test_cases_path = "tests/evaluation/test_cases.json"
        self.seed = 42  # For reproducibility
        self.iterations = 5
        self.timeout = 30  # seconds
        self.model = "claude-3-5-sonnet-20241022"

    def load_test_cases(self) -> List[Dict]:
        """Load the same test cases"""
        with open(self.test_cases_path) as f:
            data = json.load(f)
        return data["test_cases"]

# Usage
config = EvaluationConfig()
test_cases = config.load_test_cases()
# Use the same test cases for all evaluations
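
For the remaining settings to take effect, they have to be applied when the model and the run loop are constructed. A minimal sketch, assuming the LangChain ChatAnthropic client used elsewhere in this document and that Python-side randomness (e.g., case sampling) is where the seed matters:

import random

from langchain_anthropic import ChatAnthropic

random.seed(config.seed)  # pin Python-side randomness (e.g., any case sampling or shuffling)

llm = ChatAnthropic(
    model=config.model,
    temperature=0.0,  # a fixed, low temperature keeps repeated runs comparable
)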

2. Staged Evaluation

Start Small and Gradually Expand

# Phase 1: Quick check (3 cases, 1 iteration)
quick_results = evaluate(test_cases[:3], iterations=1)

if quick_results["accuracy"] > baseline["accuracy"]:
    # Phase 2: Medium check (10 cases, 3 iterations)
    medium_results = evaluate(test_cases[:10], iterations=3)

    if medium_results["accuracy"] > baseline["accuracy"]:
        # Phase 3: Full evaluation (all cases, 5 iterations)
        full_results = evaluate(test_cases, iterations=5)
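
The snippets above assume an evaluate() helper and a previously recorded baseline result, neither of which is defined here. A minimal sketch of what such a helper might look like, where run_agent and the metric fields are assumptions about your application:

from typing import Dict, List

def evaluate(test_cases: List[Dict], iterations: int) -> Dict:
    """Run every test case `iterations` times and average the metrics."""
    correct, latencies, costs = [], [], []
    for case in test_cases:
        for _ in range(iterations):
            result = run_agent(case["input"])  # hypothetical entry point of your LangGraph app
            correct.append(result["output"] == case["expected_output"])
            latencies.append(result["latency"])
            costs.append(result["cost"])
    return {
        "accuracy": sum(correct) / len(correct),
        "latency": sum(latencies) / len(latencies),
        "cost": sum(costs) / len(costs),
    }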

3. Recording Evaluation Results

Structured Logging

import json
from datetime import datetime
from pathlib import Path
from typing import Dict

def save_evaluation_result(
    results: Dict,
    version: str,
    output_dir: Path = Path("evaluation_results")
):
    """Save evaluation results"""
    output_dir.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{version}_{timestamp}.json"

    full_results = {
        "version": version,
        "timestamp": timestamp,
        "metrics": results,
        "config": {
            "model": "claude-3-5-sonnet-20241022",
            "test_cases": len(test_cases),
            "iterations": 5
        }
    }

    with open(output_dir / filename, "w") as f:
        json.dump(full_results, f, indent=2)

    print(f"Results saved to: {output_dir / filename}")

# Usage
save_evaluation_result(results, version="baseline")
save_evaluation_result(results, version="iteration_1")

4. Visualization

Visualizing Results

from pathlib import Path
from typing import Dict, List

import matplotlib.pyplot as plt

def visualize_improvement(
    baseline: Dict,
    iterations: List[Dict],
    metrics: List[str] = ["accuracy", "latency", "cost"]
):
    """Visualize improvement progress"""
    fig, axes = plt.subplots(1, len(metrics), figsize=(15, 5))
    if len(metrics) == 1:
        axes = [axes]  # plt.subplots returns a single Axes, not an array, for one column

    for idx, metric in enumerate(metrics):
        ax = axes[idx]

        # Prepare data
        x = ["Baseline"] + [f"Iter {i+1}" for i in range(len(iterations))]
        y = [baseline[metric]] + [it[metric] for it in iterations]

        # Plot
        ax.plot(x, y, marker='o', linewidth=2)
        ax.set_title(f"{metric.capitalize()} Progress")
        ax.set_ylabel(metric.capitalize())
        ax.grid(True, alpha=0.3)

        # Goal line
        if metric in baseline.get("goals", {}):
            goal = baseline["goals"][metric]
            ax.axhline(y=goal, color='r', linestyle='--', label='Goal')
            ax.legend()

    plt.tight_layout()
    Path("evaluation_results").mkdir(exist_ok=True)  # make sure the output directory exists
    plt.savefig("evaluation_results/improvement_progress.png")
    print("Visualization saved to: evaluation_results/improvement_progress.png")

📋 Evaluation Report Template

Standard Report Format

# Evaluation Report - [Version/Iteration]

Execution Date: 2024-11-24 12:00:00
Executed by: Claude Code (fine-tune skill)

## Configuration

- **Model**: claude-3-5-sonnet-20241022
- **Number of Test Cases**: 20
- **Number of Runs**: 5
- **Evaluation Duration**: 10 minutes

## Results Summary

| Metric | Mean | Std Dev | 95% CI | Goal | Achievement |
|--------|------|---------|--------|------|-------------|
| Accuracy | 86.0% | 2.1% | [83.9%, 88.1%] | 90.0% | 95.6% |
| Latency | 2.4s | 0.3s | [2.1s, 2.7s] | 2.0s | 83.3% |
| Cost | $0.014 | $0.001 | [$0.013, $0.015] | $0.010 | 71.4% |

## Detailed Analysis

### Accuracy
- **Improvement**: +11.0 points (75.0% → 86.0%)
- **Statistical Significance**: p < 0.01 ✅
- **Effect Size**: Cohen's d = 2.3 (large)

### Latency
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Statistical Significance**: p = 0.12 ❌ (not significant)
- **Effect Size**: Cohen's d = 0.3 (small)

## Error Analysis

- **Total Errors**: 0
- **Error Rate**: 0.0%
- **Retry Rate**: 0.0%

## Next Actions

1. ✅ Accuracy significantly improved → Continue
2. ⚠️ Latency improvement is small → Focus on it in the next iteration
3. ⚠️ Cost is still above the goal ($0.014 vs $0.010) → Consider limiting max_tokens
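
The summary statistics in the table (mean, standard deviation, 95% CI) and the significance figures (p-value, Cohen's d) can be computed along these lines. A sketch assuming the per-run metric values are collected into plain lists and that numpy and scipy are available:

import numpy as np
from scipy import stats

def summarize(samples: list) -> dict:
    """Mean, sample standard deviation, and 95% confidence interval for one metric."""
    arr = np.asarray(samples, dtype=float)
    mean = arr.mean()
    ci = stats.t.interval(0.95, df=len(arr) - 1, loc=mean, scale=stats.sem(arr))
    return {"mean": mean, "std": arr.std(ddof=1), "ci95": ci}

def compare(baseline_runs: list, candidate_runs: list) -> dict:
    """Welch's t-test p-value and Cohen's d between baseline and candidate runs."""
    _, p = stats.ttest_ind(candidate_runs, baseline_runs, equal_var=False)
    pooled_sd = np.sqrt((np.var(baseline_runs, ddof=1) + np.var(candidate_runs, ddof=1)) / 2)
    d = (np.mean(candidate_runs) - np.mean(baseline_runs)) / pooled_sd
    return {"p_value": p, "cohens_d": d}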

🔍 Troubleshooting

Common Problems and Solutions

1. Large Variance in Evaluation Results

Symptom: Standard deviation > 20% of mean

Causes:

  • LLM temperature is too high
  • Test cases are uneven
  • Network latency effects

Solutions:

# Lower temperature
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.3  # lower temperature reduces run-to-run variance
)

# Increase number of runs
iterations = 10  # 5 → 10

# Remove outliers (see the IQR-based sketch below)
results_clean = remove_outliers(results)
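
remove_outliers above is not a library function; a minimal IQR-based sketch:

import numpy as np

def remove_outliers(values, k: float = 1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (the standard IQR rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]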

2. Evaluation Takes Too Long

Symptom: Evaluation takes over 1 hour

Causes:

  • Too many test cases
  • Not running in parallel
  • Timeout setting too long

Solutions:

# Subset evaluation
quick_test_cases = test_cases[:10]  # First 10 cases only

# Parallel execution
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(evaluate_case, case) for case in test_cases]
    results = [f.result() for f in futures]

# Timeout setting
timeout = 10  # 30s → 10s

3. No Statistical Significance

Symptom: p-value ≥ 0.05

Causes:

  • Improvement effect is small
  • Insufficient sample size
  • High data variance

Solutions:

# Aim for larger improvements
# - Apply multiple optimizations simultaneously
# - Choose more effective techniques

# Increase sample size
iterations = 20  # 5 → 20

# Reduce variance
# - Lower temperature
# - Stabilize evaluation environment
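
Before simply raising the number of iterations, a quick power analysis shows how many runs are actually needed for a given effect size. A sketch assuming statsmodels is installed (80% power, alpha = 0.05, two-sided):

import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.3, 0.5, 0.8):  # small / medium / large Cohen's d
    n = analysis.solve_power(effect_size=effect_size, power=0.8, alpha=0.05)
    print(f"d={effect_size}: ~{math.ceil(n)} runs per variant")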

📊 Continuous Evaluation

Scheduled Evaluation

evaluation_schedule:
  daily:
    - quick_check: 3 test cases, 1 iteration
    - purpose: Detect major regressions

  weekly:
    - medium_check: 10 test cases, 3 iterations
    - purpose: Continuous quality monitoring

  before_release:
    - full_evaluation: all test cases, 5-10 iterations
    - purpose: Release quality assurance

  after_major_changes:
    - comprehensive_evaluation: all test cases, 10+ iterations
    - purpose: Impact assessment of major changes

Automated Evaluation Pipeline

#!/bin/bash
# continuous_evaluation.sh

# Daily evaluation script

DATE=$(date +%Y%m%d)
RESULTS_DIR="evaluation_results/continuous/$DATE"
mkdir -p "$RESULTS_DIR"

# Quick check
echo "Running quick evaluation..."
uv run python -m tests.evaluation.evaluator \
    --test-cases 3 \
    --iterations 1 \
    --output "$RESULTS_DIR/quick.json"

# Compare with previous results
uv run python -m tests.evaluation.compare \
    --baseline "evaluation_results/baseline/summary.json" \
    --current "$RESULTS_DIR/quick.json" \
    --threshold 0.05

# Notify if regression detected
if [ $? -ne 0 ]; then
    echo "⚠️ Regression detected! Sending notification..."
    # Notification process (Slack, Email, etc.)
fi
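
To run the daily check on the schedule above, a crontab entry along these lines works (the project path is an assumption; adjust it to your checkout):

# Example crontab entry: daily quick check at 02:00
0 2 * * * cd /path/to/project && ./continuous_evaluation.sh >> evaluation_results/cron.log 2>&1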

Summary

For effective evaluation:

  • Multiple Metrics: Quality, performance, cost, reliability
  • Statistical Validation: Multiple runs and significance testing
  • Consistency: Same test cases, same conditions
  • Visualization: Track improvements with graphs and tables
  • Documentation: Record evaluation results and analysis