# Agent Prompt Metrics

**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents

---

## Core Metrics

### 1. Success Rate

**Definition**: Percentage of tasks completed fully and correctly (partial completions count as failures)

**Calculation**:
```
Success Rate = correct_completions / total_tasks
```

**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)

**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1

Success Rate = 17/20 = 85% (Good)
```

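A minimal Python sketch of this tally (the outcome labels `correct`/`partial`/`failed` are illustrative, not from the source; note that, per the example above, partial completions do not count as correct):

```python
from collections import Counter

def success_rate(outcomes: list[str]) -> float:
    """Fraction of tasks completed correctly; partials count as failures."""
    counts = Counter(outcomes)
    return counts["correct"] / len(outcomes)

# 17 correct, 2 partial, 1 failed -> 0.85 (Good)
outcomes = ["correct"] * 17 + ["partial"] * 2 + ["failed"] * 1
print(f"Success Rate = {success_rate(outcomes):.0%}")
```
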
---

### 2. Quality Score

**Definition**: Average quality rating of agent outputs (1-5 scale)

**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable

**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor

**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)

Average: 4.35/5 (Good)
```

---

### 3. Efficiency

**Definition**: Time and token usage per task

**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```

**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens

**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k

Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```

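A small sketch of a budget check against the per-type thresholds above (the `BUDGETS` table is transcribed from that list; the function name is illustrative):

```python
# Per-agent-type budgets from the thresholds above: (minutes, tokens) per task
BUDGETS = {
    "explore": (3, 5_000),
    "code_generation": (5, 10_000),
    "analysis": (10, 20_000),
}

def within_budget(agent_type: str, total_min: float,
                  total_tokens: int, tasks: int) -> bool:
    """True if average time and token usage per task are both under budget."""
    max_min, max_tokens = BUDGETS[agent_type]
    return total_min / tasks < max_min and total_tokens / tasks < max_tokens

# 20 tasks, 56 min, 92k tokens -> 2.8 min/task and 4.6k tokens/task
print(within_budget("explore", 56, 92_000, 20))  # True
```
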
---

### 4. Reliability

**Definition**: Consistency of agent performance

**Calculation**:
```
Reliability = 1 - (std_dev(success_rate) / mean(success_rate))
```

**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)

**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success

Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```

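A sketch reproducing this calculation; note that the 2.08 in the example is the sample standard deviation (what `statistics.stdev` computes), not the population value:

```python
import statistics

def reliability(batch_success_rates: list[float]) -> float:
    """1 minus the coefficient of variation across test batches."""
    mean = statistics.mean(batch_success_rates)
    stdev = statistics.stdev(batch_success_rates)  # sample std dev (n - 1)
    return 1 - stdev / mean

# Batches: 85%, 90%, 87%, 88% -> mean 87.5, stdev ~2.08
print(f"{reliability([85, 90, 87, 88]):.3f}")  # 0.976 (Very reliable)
```
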
---

## Composite Metrics

### V_instance (Agent Performance)

**Formula**:
```
V_instance = 0.40 × success_rate +
             0.30 × (quality_score / 5) +
             0.20 × efficiency_score +
             0.10 × reliability

Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], divided by 5 to normalize into [0, 1]
- efficiency_score = min(target_time / actual_time, 1), so runs at or under
  the time budget score 1.0
- reliability ∈ [0, 1]
```

**Target**: V_instance ≥ 0.80

**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: 2.8 min vs. 3 min target → min(3/2.8, 1) = 1.0 (under budget)
Reliability: 0.976

V_instance = 0.40 × 0.85 +
             0.30 × 0.84 +
             0.20 × 1.0 +
             0.10 × 0.976

           = 0.34 + 0.252 + 0.20 + 0.0976
           = 0.890 ✅ (exceeds target)
```

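A sketch of the full computation under the definitions above (a hypothetical helper, with efficiency capped at 1.0 for under-budget runs):

```python
def v_instance(success_rate: float, quality: float,
               actual_time: float, target_time: float,
               reliability: float) -> float:
    """Composite agent-performance score per the weights above."""
    efficiency = min(target_time / actual_time, 1.0)  # 1.0 when under budget
    return (0.40 * success_rate
            + 0.30 * (quality / 5)
            + 0.20 * efficiency
            + 0.10 * reliability)

# 85% success, 4.2/5 quality, 2.8 min vs. 3 min target, 0.976 reliability
print(f"{v_instance(0.85, 4.2, 2.8, 3.0, 0.976):.3f}")  # 0.890, exceeds 0.80
```
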
---

### V_meta (Prompt Quality)

**Formula**:
```
V_meta = 0.35 × completeness +
         0.30 × clarity +
         0.20 × adaptability +
         0.15 × maintainability

Where:
- completeness = features_implemented / features_needed
- clarity = 1 - (ambiguous_instructions / total_instructions)
- adaptability = successful_task_types / tested_task_types
- maintainability = 1 - (prompt_complexity / max_complexity)
```

**Target**: V_meta ≥ 0.80

**Example**:
```
Completeness: 8/8 features = 1.0
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
Adaptability: 5/6 task types = 0.83
Maintainability: 1 - (150 lines / 300 max) = 0.50

V_meta = 0.35 × 1.0 +
         0.30 × 0.90 +
         0.20 × 0.83 +
         0.15 × 0.50

       = 0.35 + 0.27 + 0.166 + 0.075
       = 0.861 ✅ (exceeds target)
```

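The same pattern works for V_meta; a sketch computing the four components from raw counts (argument names are illustrative, and the 300-line complexity cap is taken from the example above):

```python
def v_meta(features_done: int, features_needed: int,
           ambiguous: int, instructions: int,
           ok_task_types: int, tested_task_types: int,
           prompt_lines: int, max_lines: int = 300) -> float:
    """Composite prompt-quality score per the weights above."""
    completeness = features_done / features_needed
    clarity = 1 - ambiguous / instructions
    adaptability = ok_task_types / tested_task_types
    maintainability = 1 - prompt_lines / max_lines
    return (0.35 * completeness + 0.30 * clarity
            + 0.20 * adaptability + 0.15 * maintainability)

# 8/8 features, 2/20 ambiguous, 5/6 task types, 150-line prompt
print(f"{v_meta(8, 8, 2, 20, 5, 6, 150):.3f}")
# 0.862 (the worked example rounds adaptability to 0.83, giving 0.861)
```
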
---

## Metric Collection

### Automated Collection

**Session Analysis**:
```bash
# Average duration of successful Task-tool calls in the current session
query_tools --tool="Task" --scope=session | \
  jq -r '.[] | select(.status == "success") | .duration' | \
  awk '{sum+=$1; n++} END {print sum/n}'
```

**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh
# Usage: measure-agent-metrics.sh <agent-name> <session-file>

AGENT_NAME=$1
SESSION=$2

# Success rate (guard against division by zero when no tasks match)
total=$(grep -c "agent=$AGENT_NAME" "$SESSION")
if [ "$total" -eq 0 ]; then
    echo "No tasks found for agent: $AGENT_NAME" >&2
    exit 1
fi
success=$(grep -c "agent=$AGENT_NAME.*success" "$SESSION")
success_rate=$(echo "scale=2; $success / $total" | bc)

# Average time (duration field of each matching session line)
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
           jq -r '.duration' | \
           awk '{sum+=$1; n++} END {print sum/n}')

# Quality (requires manual rating file: "<agent> <rating>" per line)
avg_quality=$(grep "$AGENT_NAME" "${SESSION}.ratings" | \
              awk '{sum+=$2; n++} END {print sum/n}')

echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```

---

### Manual Collection

**Test Suite Template**:
```markdown
## Agent Test Suite: [Agent Name]

**Iteration**: [N]
**Date**: [YYYY-MM-DD]

### Test Cases

| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...

### Summary

- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```

---

## Benchmarking

### Cross-Agent Comparison

**Standard Test Suite**: 20 representative tasks

**Example Results**:
```
| Agent       | Success | Quality | Time  | V_inst |
|-------------|---------|---------|-------|--------|
| Explore v1  | 60%     | 3.1     | 4.2m  | 0.62   |
| Explore v2  | 87.5%   | 4.2     | 2.8m  | 0.89   |
| Explore v3  | 90%     | 4.3     | 2.6m  | 0.91   |
```

**Improvement**: v1 → v3 = +30 points success rate (60% → 90%), +1.2 quality (3.1 → 4.3), 38% faster (4.2 → 2.6 min)

---

### Baseline Comparison

**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success

---

## Regression Testing

### Track Metrics Over Time

**Regression Detection**:
```
if current_metric < (previous_metric - threshold):
    alert("REGRESSION DETECTED")
```

**Thresholds** (see the sketch below for a direction-aware check):
- Success Rate: -5 points (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)

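The pseudocode above treats every metric the same way, but the direction of "worse" differs: success and quality regress downward, time regresses upward, and the time threshold is relative while the others are absolute. A minimal Python sketch under those assumptions (names and the threshold encoding are illustrative, not from the source):

```python
# Negative thresholds: absolute drop floors. Positive: relative rise ceilings.
THRESHOLDS = {
    "success_rate": -0.05,  # e.g., 90% -> 85%
    "quality": -0.3,        # e.g., 4.5 -> 4.2
    "time_min": +0.20,      # +20% time, e.g., 2.8 -> 3.4 min
}

def regressions(prev: dict, curr: dict) -> list[str]:
    """Return the metrics whose change breaches their threshold."""
    flagged = []
    for metric, threshold in THRESHOLDS.items():
        if threshold < 0 and curr[metric] < prev[metric] + threshold:
            flagged.append(metric)
        elif threshold > 0 and curr[metric] > prev[metric] * (1 + threshold):
            flagged.append(metric)
    return flagged

prev = {"success_rate": 0.90, "quality": 4.3, "time_min": 2.6}
curr = {"success_rate": 0.87, "quality": 4.1, "time_min": 2.8}
print(regressions(prev, curr))  # [] : no single-metric breach (see note below)
```
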
**Example**:
```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 87% success, 4.1 quality, 2.8 min ⚠️ REGRESSION
             (no single threshold is breached, but all three
              metrics moved adversely at once)

Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```

---

## Reporting Template

```markdown
## Agent Metrics Report

**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]

### Performance Metrics

**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌

**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌

**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌

**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌

### Composite Scores

**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌

**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌

### Comparison to Baseline

| Metric       | Baseline | Current | Δ      |
|--------------|----------|---------|--------|
| Success Rate | [X]%     | [Y]%    | [+/-]% |
| Quality      | [X.X]    | [Y.Y]   | [+/-]  |
| Time         | [X.X]m   | [Y.Y]m  | [+/-]% |
| V_instance   | [X.XX]   | [Y.YY]  | [+/-]  |

### Recommendations

1. [Action item based on metrics]
2. [Action item based on metrics]

### Next Steps

- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```

---

**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite