# Agent Prompt Metrics
**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents
---
## Core Metrics
### 1. Success Rate
**Definition**: Percentage of tasks completed correctly
**Calculation**:
```
Success Rate = correct_completions / total_tasks
```
**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)
**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1
Success Rate = 17/20 = 85% (Good)
```
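
A minimal Python sketch of the calculation (function and argument names are illustrative, not part of the framework); note that partial completions count toward the total but not as successes:

```python
def success_rate(correct: int, partial: int, failed: int) -> float:
    """Fraction of tasks completed correctly; partials do not count as correct."""
    total = correct + partial + failed
    return correct / total if total else 0.0

print(success_rate(correct=17, partial=2, failed=1))  # 0.85
```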
---
### 2. Quality Score
**Definition**: Average quality rating of agent outputs (1-5 scale)
**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable
**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor
**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)
Average: 4.35/5 (Good)
```
---
### 3. Efficiency
**Definition**: Time and token usage per task
**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```
**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens
**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k
Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```
---
### 4. Reliability
**Definition**: Consistency of agent performance
**Calculation** (one minus the coefficient of variation across evaluation batches):
```
Reliability = 1 - (std_dev(batch_success_rates) / mean(batch_success_rates))
```
**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)
**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success
Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```
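
The same calculation as a short Python sketch (names are illustrative). It uses the sample standard deviation, which reproduces the 2.08 above:

```python
from statistics import mean, stdev

def reliability(batch_success_rates: list[float]) -> float:
    """One minus the coefficient of variation across evaluation batches."""
    return 1 - stdev(batch_success_rates) / mean(batch_success_rates)

print(round(reliability([85, 90, 87, 88]), 3))  # 0.976
```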
---
## Composite Metrics
### V_instance (Agent Performance)
**Formula**:
```
V_instance = 0.40 × success_rate +
             0.30 × (quality_score / 5) +
             0.20 × efficiency_score +
             0.10 × reliability

Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], normalized to [0, 1]
- efficiency_score = min(target_time / actual_time, 1), so meeting or beating the time budget scores 1.0
- reliability ∈ [0, 1]
**Target**: V_instance ≥ 0.80
**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: 3 min target / 2.8 min actual = 1.07, capped at 1.0 (under budget)
Reliability: 0.976

V_instance = 0.40 × 0.85 +
             0.30 × 0.84 +
             0.20 × 1.0 +
             0.10 × 0.976
           = 0.34 + 0.252 + 0.20 + 0.0976
           = 0.890 ✅ (exceeds target)
```
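
A minimal Python sketch of the composite (names are illustrative), assuming the capped efficiency score defined above:

```python
def v_instance(success_rate: float, quality_score: float,
               actual_time: float, target_time: float,
               reliability: float) -> float:
    """Weighted composite of the four core metrics."""
    efficiency = min(target_time / actual_time, 1.0)  # under budget caps at 1.0
    return (0.40 * success_rate +
            0.30 * (quality_score / 5) +
            0.20 * efficiency +
            0.10 * reliability)

score = v_instance(0.85, 4.2, actual_time=2.8, target_time=3.0, reliability=0.976)
print(f"{score:.3f}")  # 0.890
```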
---
### V_meta (Prompt Quality)
**Formula**:
```
V_meta = 0.35 × completeness +
         0.30 × clarity +
         0.20 × adaptability +
         0.15 × maintainability

Where:
- completeness = features_implemented / features_needed
- clarity = 1 - (ambiguous_instructions / total_instructions)
- adaptability = successful_task_types / tested_task_types
- maintainability = 1 - (prompt_complexity / max_complexity)
**Target**: V_meta ≥ 0.80
**Example**:
```
Completeness: 8/8 features = 1.0
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
Adaptability: 5/6 task types = 0.83
Maintainability: 1 - (150 lines / 300 max) = 0.50

V_meta = 0.35 × 1.0 +
         0.30 × 0.90 +
         0.20 × 0.83 +
         0.15 × 0.50
       = 0.35 + 0.27 + 0.166 + 0.075
       = 0.861 ✅ (exceeds target)
```
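
The same composite as a Python sketch (illustrative names), using the rounded inputs from the example:

```python
def v_meta(completeness: float, clarity: float,
           adaptability: float, maintainability: float) -> float:
    """Weighted composite of the four prompt-quality dimensions."""
    return (0.35 * completeness +
            0.30 * clarity +
            0.20 * adaptability +
            0.15 * maintainability)

score = v_meta(completeness=1.0, clarity=0.90,
               adaptability=0.83, maintainability=0.50)
print(f"{score:.3f}")  # 0.861
```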
---
## Metric Collection
### Automated Collection
**Session Analysis**:
```bash
# Average duration of successful Task tool calls in this session
query_tools --tool="Task" --scope=session | \
  jq -r '.[] | select(.status == "success") | .duration' | \
  awk '{sum+=$1; n++} END {print sum/n}'
```
**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh
# Usage: measure-agent-metrics.sh <agent-name> <session-log>
AGENT_NAME=$1
SESSION=$2

# Success rate (guard against empty input)
total=$(grep -c "agent=$AGENT_NAME" "$SESSION")
success=$(grep -c "agent=$AGENT_NAME.*success" "$SESSION")
if [ "$total" -eq 0 ]; then
  echo "No tasks found for agent: $AGENT_NAME" >&2
  exit 1
fi
success_rate=$(echo "scale=2; $success / $total" | bc)

# Average time (assumes JSONL entries with a .duration field)
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
  jq -r '.duration' | \
  awk '{sum+=$1; n++} END {if (n) print sum/n}')

# Quality (requires a manual ratings file: "<agent> <rating>" per line)
avg_quality=$(grep "$AGENT_NAME" "${SESSION}.ratings" | \
  awk '{sum+=$2; n++} END {if (n) print sum/n}')

echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```
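
A hypothetical invocation (the session path is an assumption about your local layout; the ratings file must sit next to the session log):

```bash
# Illustrative paths; adjust to where your session logs live
./scripts/measure-agent-metrics.sh explore sessions/2025-11-30.jsonl
```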
---
### Manual Collection
**Test Suite Template**:
```markdown
## Agent Test Suite: [Agent Name]
**Iteration**: [N]
**Date**: [YYYY-MM-DD]
### Test Cases
| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...
### Summary
- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```
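
A small Python sketch for turning the filled-in table into the summary numbers (the row structure and names are illustrative):

```python
# Each row: (passed, quality on the 1-5 scale, minutes)
cases = [(True, 5, 2.5), (True, 4, 3.0), (False, 2, 4.1)]

passed = sum(1 for ok, _, _ in cases if ok)
print(f"Success Rate: {passed / len(cases):.0%} ({passed}/{len(cases)})")
print(f"Avg Quality: {sum(q for _, q, _ in cases) / len(cases):.1f}/5")
print(f"Avg Time: {sum(t for _, _, t in cases) / len(cases):.1f} min")
```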
---
## Benchmarking
### Cross-Agent Comparison
**Standard Test Suite**: 20 representative tasks
**Example Results**:
```
| Agent | Success | Quality | Time | V_inst |
|-------------|---------|---------|-------|--------|
| Explore v1 | 60% | 3.1 | 4.2m | 0.62 |
| Explore v2 | 87.5% | 4.2 | 2.8m | 0.89 |
| Explore v3 | 90% | 4.3 | 2.6m | 0.91 |
```
**Improvement** (v1 → v3): +30 points success (60% → 90%), +1.2 quality (3.1 → 4.3), 38% faster (4.2 → 2.6 min)
---
### Baseline Comparison
**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success
---
## Regression Testing
### Track Metrics Over Time
**Regression Detection**:
```
if current_metric < (previous_metric - threshold):
    alert("REGRESSION DETECTED")
```
**Thresholds**:
- Success Rate: -5% (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)
**Example**:
```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 87% success, 4.1 quality, 2.8 min ⚠️ (no single threshold crossed, but all three metrics declined at once)
Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```
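
A Python sketch of the check with the thresholds above (names and numbers are illustrative; the efficiency threshold is a relative time increase rather than an absolute delta):

```python
def detect_regressions(prev: dict, curr: dict) -> list[str]:
    """Return the metrics that breached their regression thresholds."""
    alerts = []
    if curr["success_rate"] < prev["success_rate"] - 0.05:
        alerts.append("success rate")
    if curr["quality"] < prev["quality"] - 0.3:
        alerts.append("quality score")
    if curr["avg_time"] > prev["avg_time"] * 1.20:
        alerts.append("efficiency")
    return alerts

prev = {"success_rate": 0.90, "quality": 4.3, "avg_time": 2.6}
curr = {"success_rate": 0.84, "quality": 4.1, "avg_time": 3.3}
print(detect_regressions(prev, curr))  # ['success rate', 'efficiency']
```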
---
## Reporting Template
```markdown
## Agent Metrics Report
**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]
### Performance Metrics
**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌
**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌
**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌
**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌
### Composite Scores
**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
### Comparison to Baseline
| Metric | Baseline | Current | Δ |
|---------------|----------|---------|--------|
| Success Rate | [X]% | [Y]% | [+/-]% |
| Quality | [X.X] | [Y.Y] | [+/-] |
| Time | [X.X]m | [Y.Y]m | [+/-]% |
| V_instance | [X.XX] | [Y.YY] | [+/-] |
### Recommendations
1. [Action item based on metrics]
2. [Action item based on metrics]
### Next Steps
- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite