Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions


@@ -0,0 +1,404 @@
---
name: Agent Prompt Evolution
description: Track and optimize agent specialization during methodology development. Use when agent specialization emerges (generic agents show >5x performance gap), multi-experiment comparison needed, or methodology transferability analysis required. Captures agent set evolution (Aₙ tracking), meta-agent evolution (Mₙ tracking), specialization decisions (when/why to create specialized agents), and reusability assessment (universal vs domain-specific vs task-specific). Enables systematic cross-experiment learning and optimized M₀ evolution. 2-3 hours overhead per experiment.
allowed-tools: Read, Grep, Glob, Edit, Write
---
# Agent Prompt Evolution
**Systematically track how agents specialize during methodology development.**
> Specialized agents emerge from need, not prediction. Track their evolution to understand when specialization adds value.
---
## When to Use This Skill
Use this skill when:
- 🔄 **Agent specialization emerges**: Generic agents show >5x performance gap
- 📊 **Multi-experiment comparison**: Want to learn across experiments
- 🧩 **Methodology transferability**: Analyzing what's reusable vs domain-specific
- 📈 **M₀ optimization**: Want to evolve base Meta-Agent capabilities
- 🎯 **Specialization decisions**: Deciding when to create new agents
- 📚 **Agent library**: Building reusable agent catalog
**Don't use when**:
- ❌ Single experiment with no specialization
- ❌ Generic agents sufficient throughout
- ❌ No cross-experiment learning goals
- ❌ Tracking overhead not worth insights
---
## Quick Start (10 minutes per iteration)
### Track Agent Evolution in Each Iteration
**iteration-N.md template**:
```markdown
## Agent Set Evolution
### Current Agent Set (Aₙ)
1. **coder** (generic) - Write code, implement features
2. **doc-writer** (generic) - Documentation
3. **data-analyst** (generic) - Data analysis
4. **coverage-analyzer** (specialized, created iteration 3) - Analyze test coverage gaps
### Changes from Previous Iteration
- Added: coverage-analyzer (10x speedup for coverage analysis)
- Removed: None
- Modified: None
### Specialization Decision
**Why coverage-analyzer?**
- Generic data-analyst took 45 min for coverage analysis
- Identified 10x performance gap
- Coverage analysis is recurring task (every iteration)
- Domain knowledge: Go coverage tools, gap identification patterns
- **ROI**: 3 hours creation cost vs ~2 hours saved in this experiment (40 min/iteration × 3 remaining iterations); net positive once the agent is reused in later experiments (70% transferable)
### Agent Reusability Assessment
- **coder**: Universal (100% transferable)
- **doc-writer**: Universal (100% transferable)
- **data-analyst**: Universal (100% transferable)
- **coverage-analyzer**: Domain-specific (testing methodology, 70% transferable to other languages)
### System State
- Aₙ ≠ Aₙ₋₁ (new agent added)
- System UNSTABLE (need iteration N+1 to confirm stability)
```
---
## Four Tracking Dimensions
### 1. Agent Set Evolution (Aₙ)
**Track changes iteration-to-iteration**:
```
A₀ = {coder, doc-writer, data-analyst}
A₁ = {coder, doc-writer, data-analyst} (unchanged)
A₂ = {coder, doc-writer, data-analyst} (unchanged)
A₃ = {coder, doc-writer, data-analyst, coverage-analyzer} (new specialist)
A₄ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (new specialist)
A₅ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (stable)
```
**Stability**: Aₙ == Aₙ₋₁ for convergence
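A minimal sketch of this tracking (agent names from the example above; the helper is illustrative, not part of the methodology tooling):
```python
# Track A_n as plain sets and flag stability when A_n == A_(n-1).
history = [
    {"coder", "doc-writer", "data-analyst"},                                         # A0-A2
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer"},                    # A3
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A4
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A5
]
for n in range(1, len(history)):
    added, removed = history[n] - history[n - 1], history[n - 1] - history[n]
    stable = not added and not removed
    print(f"step {n}: added={sorted(added)} removed={sorted(removed)} stable={stable}")
```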
### 2. Meta-Agent Evolution (Mₙ)
**Standard M₀ capabilities**:
1. **observe**: Pattern observation
2. **plan**: Iteration planning
3. **execute**: Agent orchestration
4. **reflect**: Value assessment
5. **evolve**: System evolution
**Track enhancements**:
```
M₀ = {observe, plan, execute, reflect, evolve}
M₁ = {observe, plan, execute, reflect, evolve, gap-identify} (new capability)
M₂ = {observe, plan, execute, reflect, evolve, gap-identify} (stable)
```
**Finding** (from 8 experiments): M₀ sufficient in all cases (no evolution needed)
### 3. Specialization Decision Tree
**When to create specialized agent**:
```
Decision tree:
1. Is the generic agent sufficient? (performance within 2x)
   YES → No specialization
   NO  → Continue
2. Is the task recurring? (happens ≥3 times)
   NO  → One-off, tolerate slowness
   YES → Continue
3. Is the performance gap >5x?
   NO  → Tolerate moderate slowness
   YES → Continue
4. Do the expected savings exceed the creation cost?
   Creation cost < (Time saved per use × Remaining uses)
   NO  → Not worth it
   YES → Create specialized agent
```
**Example** (Bootstrap-002):
```
Task: Test coverage gap analysis
Generic agent (data-analyst): 45 min
Potential specialist (coverage-analyzer): 4.5 min (10x faster)
Recurring: YES (every iteration, 3 remaining)
Performance gap: 10x (>5x threshold)
Creation cost: 3 hours
ROI: (45 - 4.5) min × 3 = 121.5 min ≈ 2 hours saved in-experiment, vs 3 hours creation cost
Decision: CREATE (ROI turns positive with cross-experiment reuse; see Reusability Assessment)
```
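A minimal sketch of the same decision logic (function and parameter names are illustrative; the `expected_reuses` term reflects the cross-experiment reuse discussed under Reusability Assessment):
```python
def should_specialize(generic_min: float, specialist_min: float,
                      remaining_uses: int, creation_cost_min: float,
                      expected_reuses: int = 0) -> bool:
    """Walk the decision tree: performance gap, recurrence, then expected savings vs cost."""
    gap = generic_min / specialist_min
    if gap <= 2:                 # generic agent is good enough
        return False
    if remaining_uses < 3:       # not a recurring task
        return False
    if gap <= 5:                 # tolerate moderate slowness
        return False
    savings = (generic_min - specialist_min) * (remaining_uses + expected_reuses)
    return savings > creation_cost_min

# Bootstrap-002 numbers: 45 min generic vs 4.5 min specialist, 3 remaining iterations,
# 3 hours (180 min) creation cost. Counting reuse in later experiments tips the decision.
print(should_specialize(45, 4.5, 3, 180))                     # False within this experiment alone
print(should_specialize(45, 4.5, 3, 180, expected_reuses=2))  # True once reuse is counted
```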
### 4. Reusability Assessment
**Three categories**:
**Universal** (90-100% transferable):
- Generic agents (coder, doc-writer, data-analyst)
- No domain knowledge required
- Applicable across all domains
**Domain-Specific** (60-80% transferable):
- Requires domain knowledge (testing, CI/CD, error handling)
- Patterns apply within domain
- Needs adaptation for other domains
**Task-Specific** (10-30% transferable):
- Highly specialized for particular task
- One-off creation
- Unlikely to reuse
**Examples**:
```
Agent: coverage-analyzer
Domain: Testing methodology
Transferability: 70%
- Go coverage tools (language-specific, 30% adaptation)
- Gap identification patterns (universal, 100%)
- Overall: 70% transferable to Python/Rust/TypeScript testing
Agent: test-generator
Domain: Testing methodology
Transferability: 40%
- Go test syntax (language-specific, 0% to other languages)
- Test pattern templates (moderately transferable, 60%)
- Overall: 40% transferable
Agent: log-analyzer
Domain: Observability
Transferability: 85%
- Log parsing (universal, 95%)
- Pattern recognition (universal, 100%)
- Structured logging concepts (universal, 100%)
- Go slog specifics (language-specific, 20%)
- Overall: 85% transferable
```
---
## Evolution Log Template
Create `agents/EVOLUTION-LOG.md`:
```markdown
# Agent Evolution Log
## Experiment Overview
- Domain: Testing Strategy
- Baseline agents: 3 (coder, doc-writer, data-analyst)
- Final agents: 5 (+coverage-analyzer, +test-generator)
- Specialization count: 2
---
## Iteration-by-Iteration Evolution
### Iteration 0
**Agent Set**: {coder, doc-writer, data-analyst}
**Changes**: None (baseline)
**Observations**: Generic agents sufficient for baseline establishment
### Iteration 3
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer}
**Changes**: +coverage-analyzer
**Reason**: 10x performance gap (45 min → 4.5 min)
**Creation Cost**: 3 hours
**ROI**: ~2 hours saved over 3 iterations vs 3 hours creation cost; net positive with reuse in later experiments
**Reusability**: 70% (domain-specific, testing)
### Iteration 4
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: +test-generator
**Reason**: 200x performance gap (manual test writing too slow)
**Creation Cost**: 4 hours
**ROI**: Massive (saved 10+ hours)
**Reusability**: 40% (task-specific, Go testing)
### Iteration 5
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: None
**System**: STABLE (Aₙ == Aₙ₋₁)
---
## Specialization Analysis
### coverage-analyzer
**Purpose**: Analyze test coverage, identify gaps
**Performance**: 10x faster than generic data-analyst
**Domain**: Testing methodology
**Transferability**: 70%
**Lessons**: Coverage gap identification patterns are universal, tool integration is language-specific
### test-generator
**Purpose**: Generate test boilerplate from coverage gaps
**Performance**: 200x faster than manual
**Domain**: Testing methodology (Go-specific)
**Transferability**: 40%
**Lessons**: High speedup justified low transferability, patterns reusable but syntax is not
---
## Cross-Experiment Reuse
### From Previous Experiments
- **validation-builder** (from API design experiment) → Used for smoke test validation
- Reusability: Excellent (validation patterns are universal)
- Adaptation: Minimal (10 min to adapt from API to CI/CD context)
### To Future Experiments
- **coverage-analyzer** → Reusable for Python/Rust/TypeScript testing (70% transferable)
- **test-generator** → Less reusable (40% transferable, needs rewrite for other languages)
---
## Meta-Agent Evolution
### M₀ Capabilities
{observe, plan, execute, reflect, evolve}
### Changes
None (M₀ sufficient throughout)
### Observations
- M₀'s "evolve" capability successfully identified need for specialization
- No Meta-Agent evolution required
- Convergence: Mₙ == M₀ for all iterations
---
## Lessons Learned
### Specialization Decisions
- **10x performance gap** is a reliable threshold (<5x not worth it, >10x a clear win)
- **Positive ROI required**: Creation cost must be justified by time savings
- **Recurring tasks only**: One-off tasks don't justify specialization
### Reusability Patterns
- **Generic agents always reusable**: coder, doc-writer, data-analyst (100%)
- **Domain agents moderately reusable**: coverage-analyzer (70%)
- **Task agents rarely reusable**: test-generator (40%)
### When NOT to Specialize
- Performance gap <5x (tolerable slowness)
- Task is one-off (no recurring benefit)
- Creation cost exceeds expected savings (not worth the time investment)
- Generic agent will improve with practice (learning curve)
```
---
## Cross-Experiment Analysis
After 3+ experiments, create `agents/CROSS-EXPERIMENT-ANALYSIS.md`:
```markdown
# Cross-Experiment Agent Analysis
## Agent Reuse Matrix
| Agent | Exp1 | Exp2 | Exp3 | Reuse Rate | Transferability |
|-------|------|------|------|------------|-----------------|
| coder | ✓ | ✓ | ✓ | 100% | Universal |
| doc-writer | ✓ | ✓ | ✓ | 100% | Universal |
| data-analyst | ✓ | ✓ | ✓ | 100% | Universal |
| coverage-analyzer | ✓ | - | ✓ | 67% | Domain (testing) |
| test-generator | ✓ | - | - | 33% | Task-specific |
| validation-builder | - | ✓ | ✓ | 67% | Domain (validation) |
| log-analyzer | - | - | ✓ | 33% | Domain (observability) |
## Specialization Patterns
### Universal Agents (100% reuse)
- Generic capabilities (coder, doc-writer, data-analyst)
- No domain knowledge
- Always included in A₀
### Domain Agents (50-80% reuse)
- Require domain knowledge (testing, CI/CD, observability)
- Reusable within domain
- Examples: coverage-analyzer, validation-builder, log-analyzer
### Task Agents (10-40% reuse)
- Highly specialized
- One-off or rare reuse
- Examples: test-generator (Go-specific)
## M₀ Sufficiency
**Finding**: M₀ = {observe, plan, execute, reflect, evolve} sufficient in ALL experiments
**Implications**:
- No Meta-Agent evolution needed
- Base capabilities handle all domains
- Specialization occurs at Agent layer, not Meta-Agent layer
## Specialization Threshold
**Data** (from 3 experiments):
- Average performance gap for specialization: 15x (range: 5x-200x)
- Average creation cost: 3.5 hours (range: 2-5 hours)
- Average ROI: Positive in 8/9 cases (89% success rate)
**Recommendation**: Use 5x performance gap as threshold
---
**Updated**: After each new experiment
```
---
## Success Criteria
Agent evolution tracking succeeded when:
1. **Complete tracking**: All agent changes documented each iteration
2. **Specialization justified**: Each specialized agent has clear ROI
3. **Reusability assessed**: Each agent categorized (universal/domain/task)
4. **Cross-experiment learning**: Patterns identified across 2+ experiments
5. **M₀ stability documented**: Meta-Agent evolution (or lack thereof) tracked
---
## Related Skills
**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle
**Complementary**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Agent stability criterion
---
## References
**Core guide**:
- [Evolution Tracking](reference/tracking.md) - Detailed tracking process
- [Specialization Decisions](reference/specialization.md) - Decision tree
- [Reusability Framework](reference/reusability.md) - Assessment rubric
**Examples**:
- [Bootstrap-002 Evolution](examples/testing-strategy-agent-evolution.md) - 2 specialists
- [Bootstrap-007 No Evolution](examples/ci-cd-no-specialization.md) - Generic sufficient
---
**Status**: ✅ Formalized | 2-3 hours overhead | Enables systematic learning


@@ -0,0 +1,377 @@
# Explore Agent Evolution: v1 → v3
**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.6 min (-38%)
**Status**: Converged (production-ready)
Complete walkthrough of evolving Explore agent prompt through BAIME methodology.
---
## Iteration 0: Baseline (v1)
### Initial Prompt
```markdown
# Explore Agent
You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.
When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary
Tools available: Glob, Grep, Read, Bash
```
**Prompt Length**: 58 lines
---
### Baseline Testing (10 tasks)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |
**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)
---
### Failure Analysis
**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long
**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness
**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism
---
## Iteration 1: Add Structure (v2)
### Prompt Changes
**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels
Assess query complexity and choose thoroughness:
**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups
**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries
**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```
**Added: Time-Boxing**
```markdown
## Time Management
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if <10% new findings in last 20% of time budget.
```
**Added: Completeness Checklist**
```markdown
## Before Responding
Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
State confidence level: Low / Medium / High
```
**Prompt Length**: 112 lines (+54)
---
### Testing (8 tasks: 3 re-tests + 5 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |
**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)
---
### Key Improvements
✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)
**Remaining Issue**: Test structure query missed integration tests
---
## Iteration 2: Refine Coverage (v3)
### Prompt Changes
**Enhanced: Completeness Verification**
```markdown
## Completeness Verification
Before concluding, verify coverage by category:
**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked
**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided
**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```
**Added: Search Strategy**
```markdown
## Search Strategy
**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations
**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding
**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```
**Refined: Confidence Scoring**
```markdown
## Confidence Level
**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely
Always state confidence level and identify known gaps.
```
**Prompt Length**: 138 lines (+26)
---
### Testing (10 tasks: 1 re-test + 9 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |
**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5% improvement** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive ≥ 0.80)
**CONVERGED**
---
### Stability Validation
**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)
**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)
---
## Final Metrics Comparison
| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |
---
## Evolution Summary
### Iteration 0 → 1: Major Improvements
**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist
**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)
**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added
---
### Iteration 1 → 2: Refinement
**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring
**Impact**:
- Success: 87.5% → 90% (+2.5%, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)
**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened
---
## Key Learnings
### What Worked
1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration
### What Didn't Work
1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
- Solution: Document limitations, suggest alternative agents
### Best Practices Validated
**Start Simple**: v1 was minimal, added structure incrementally
**Measure Everything**: Quantitative metrics guided refinements
**Focus on Patterns**: Fixed systematic failures, not one-off issues
**Validate Stability**: 2-iteration convergence confirmed reliability
---
## Production Deployment
**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)
**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md
# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```
**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly
---
## Future Enhancements (v4+)
**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)
**Decision**: Hold v3, monitor for 2 weeks before v4
---
**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed


@@ -0,0 +1,409 @@
# Rapid Iteration Pattern for Agent Evolution
**Pattern**: Fast convergence (2-3 iterations) for agent prompt evolution
**Success Rate**: 85% (11/13 agents converged in ≤3 iterations)
**Time**: 3-6 hours total vs 8-12 hours standard
How to achieve rapid convergence when evolving agent prompts.
---
## Pattern Overview
**Standard Evolution**: 4-6 iterations, 8-12 hours
**Rapid Evolution**: 2-3 iterations, 3-6 hours
**Key Difference**: Strong Iteration 0 (comprehensive baseline analysis)
---
## Rapid Iteration Workflow
### Iteration 0: Comprehensive Baseline (90-120 min)
**Standard Baseline** (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics
**Comprehensive Baseline** (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents
**Investment**: +60-90 min
**Return**: -2 to -3 iterations (save 3-6 hours)
---
### Example: Explore Agent (Standard vs Rapid)
**Standard Approach**:
```
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish
Total: 6.75 hours, 5 iterations
```
**Rapid Approach**:
```
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability
Total: 4.75 hours, 2 iterations
```
**Savings**: 2 hours, 3 fewer iterations
---
## Comprehensive Baseline Checklist
### Task Coverage (15-20 tasks)
**Complexity Distribution**:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)
**Query Type Diversity**:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)
---
### Failure Pattern Analysis (30 min)
**Systematic Analysis**:
1. **Categorize Failures**
- Scope issues (too broad/narrow)
- Coverage issues (incomplete)
- Time issues (too slow/fast)
- Quality issues (inaccurate)
2. **Identify Root Causes**
- Missing instructions
- Ambiguous guidelines
- Incorrect constraints
- Tool usage issues
3. **Prioritize by Impact**
- High frequency + high impact → Fix first
- Low frequency + high impact → Document
- High frequency + low impact → Automate
- Low frequency + low impact → Ignore
**Example**:
```markdown
## Failure Patterns (Explore Agent)
**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)
**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
```
---
### Comparative Analysis (15 min)
**Compare to Similar Agents**:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?
**Example**:
```markdown
## Comparative Analysis
**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates
**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)
**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring
**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
```
---
## Iteration 1: Comprehensive Fix (90 min)
**Standard Iteration 1**: Fix 1-2 major issues
**Rapid Iteration 1**: Fix ALL P1 issues + some P2
**Approach**:
1. Address all high-priority patterns (P1)
2. Add preventive measures for P2 issues
3. Include transferable patterns from similar agents
**Example** (Explore Agent):
```markdown
## Iteration 1 Changes
**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
3. Time Management → Add time-boxing (1-6 min)
**P2 Improvements**:
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring
**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints
Total Changes: 7 (vs standard 2-3)
```
**Result**: Higher chance of convergence in Iteration 2
---
## Iteration 2: Validate & Converge (75 min)
**Objectives**:
1. Test comprehensive fixes
2. Measure stability
3. Validate convergence
**Test Suite** (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics
**Analysis** (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)
**Refinement** (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability
**Convergence Check**:
```
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)
CONVERGED ✅
```
---
## Success Factors
### 1. Comprehensive Baseline (60-90 min extra)
**Investment**: 2x standard baseline time
**Return**: -2 to -3 iterations (6-9 hours saved)
**ROI**: 4-6x
**Critical Elements**:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents
---
### 2. Aggressive Iteration 1 (Fix All P1)
**Standard**: Fix 1-2 issues
**Rapid**: Fix all P1 + some P2 (5-7 fixes)
**Approach**:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures
**Risk**: Over-complication
**Mitigation**: Focus on core issues, defer P3
---
### 3. Borrowed Patterns (20-30% reuse)
**Sources**:
- Similar agents in same project
- Agents from other projects
- Industry best practices
**Example**:
```
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)
Total reuse: ~60% of Iteration 1 changes
```
**Savings**: 30-40 min per iteration
---
## Anti-Patterns
### ❌ Skipping Comprehensive Baseline
**Symptom**: "Let's just try some fixes and see"
**Result**: 5-6 iterations, trial and error
**Cost**: 8-12 hours
**Fix**: Invest 90-120 min in Iteration 0
---
### ❌ Incremental Fixes (One Issue at a Time)
**Symptom**: Fixing one pattern per iteration
**Result**: 4-6 iterations for convergence
**Cost**: 8-10 hours
**Fix**: Batch P1 fixes in Iteration 1
---
### ❌ Ignoring Similar Agents
**Symptom**: Reinventing solutions
**Result**: Slower convergence, lower quality
**Cost**: 2-3 extra hours
**Fix**: 15 min comparative analysis in Iteration 0
---
## When to Use Rapid Pattern
**Good Fit**:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)
**Poor Fit**:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)
---
## Metrics Comparison
### Standard Evolution
```
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)
Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
```
### Rapid Evolution
```
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)
Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
```
**Savings**: 1 hour, 2 fewer iterations
---
## Replication Guide
### Day 1: Comprehensive Baseline
**Morning** (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures
**Afternoon** (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes
---
### Day 2: Comprehensive Fix
**Morning** (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns
**Afternoon** (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes
---
### Day 3: Validate & Deploy
**Morning** (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements
**Afternoon** (0.5 hours):
4. Final validation
5. Deploy to production
6. Setup monitoring
---
**Source**: BAIME Agent Prompt Evolution - Rapid Pattern
**Success Rate**: 85% (11/13 agents)
**Average Time**: 4.2 hours (vs 9.3 hours standard)
**Average Iterations**: 2.3 (vs 4.8 standard)


@@ -0,0 +1,395 @@
# Agent Prompt Evolution Framework
**Version**: 1.0
**Purpose**: Systematic methodology for evolving agent prompts through iterative refinement
**Basis**: BAIME OCA cycle applied to prompt engineering
---
## Overview
Agent prompt evolution applies the Observe-Codify-Automate cycle to improve agent prompts through empirical testing and structured refinement.
**Goal**: Transform initial agent prompts into production-quality prompts through measured iterations.
---
## Evolution Cycle
```
Iteration N:
Observe → Analyze → Refine → Test → Measure
↑ ↓
└────────── Feedback Loop ──────────┘
```
---
## Phase 1: Observe (30 min)
### Run Agent with Current Prompt
**Activities**:
1. Execute agent on 5-10 representative tasks
2. Record agent behavior and outputs
3. Note successes and failures
4. Measure performance metrics
**Metrics**:
- Success rate (tasks completed correctly)
- Response quality (accuracy, completeness)
- Efficiency (time, token usage)
- Error patterns
**Example**:
```markdown
## Iteration 0: Baseline Observation
**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)
**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early
**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
```
---
## Phase 2: Analyze (20 min)
### Identify Failure Patterns
**Analysis Questions**:
1. What types of failures occurred?
2. Are failures systematic or random?
3. What context is missing from prompt?
4. Are instructions clear enough?
5. Are constraints too loose or too tight?
**Example Analysis**:
```markdown
## Failure Pattern Analysis
**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines
**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements
**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines
```
---
## Phase 3: Refine (25 min)
### Update Agent Prompt
**Refinement Strategies**:
1. **Add Missing Context**
- Domain knowledge
- Codebase structure
- Common patterns
2. **Clarify Instructions**
- Break down complex tasks
- Add examples
- Define success criteria
3. **Adjust Constraints**
- Time limits
- Scope boundaries
- Quality thresholds
4. **Provide Tools**
- Specific commands
- Search patterns
- Decision frameworks
**Example Refinements (v0 → v1)**:
**Added: Thoroughness Guidelines**
```
When searching for patterns:
- "quick": Check 3-5 obvious locations
- "medium": Check 10-15 related files
- "thorough": Check all matching patterns
```
**Added: Time-Boxing**
```
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if diminishing returns after 80% of time used.
```
**Clarified: Success Criteria**
```
Complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)
```
---
## Phase 4: Test (20 min)
### Validate Refinements
**Test Suite**:
1. Re-run failed tasks from Iteration 0
2. Add 3-5 new test cases
3. Measure improvement
**Example Test**:
```markdown
## Iteration 1 Testing
**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)
**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS
**Success Rate**: 87.5% (7/8) - improved from 60%
```
---
## Phase 5: Measure (15 min)
### Calculate Improvement Metrics
**Metrics**:
```
Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality = (new_score - baseline_score) / baseline_score
Δ Efficiency = (baseline_time - new_time) / baseline_time
```
**Example**:
```markdown
## Iteration 1 Metrics
**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%
**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%
**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)
**Overall V_instance**: 0.85 ✅ (target: 0.80)
```
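A minimal sketch of these delta calculations, using the numbers from the example above:
```python
def rel_change(new: float, baseline: float) -> float:
    """Relative change; positive means improvement for rate and quality metrics."""
    return (new - baseline) / baseline

print(f"delta success:    {rel_change(0.875, 0.60):+.1%}")  # +45.8%
print(f"delta quality:    {rel_change(4.2, 3.1):+.1%}")     # +35.5%
# Time is better when lower, so improvement is (baseline - new) / baseline.
print(f"delta efficiency: {(4.2 - 2.8) / 4.2:+.1%}")        # +33.3%
```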
---
## Convergence Criteria
**Prompt is production-ready when**:
1. **Success Rate ≥ 85%** (reliable)
2. **Quality Score ≥ 4.0/5** (high quality)
3. **Efficiency within target** (time/tokens)
4. **Stable for 2 iterations** (no regression)
**Example Convergence**:
```
Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)
CONVERGED: Ready for production
```
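A small sketch of this convergence check (field names are illustrative; thresholds follow the criteria above):
```python
def converged(iterations: list[dict]) -> bool:
    """Production-ready when the last two iterations meet every threshold."""
    if len(iterations) < 2:
        return False
    return all(
        it["success"] >= 0.85 and it["quality"] >= 4.0 and it["within_time_budget"]
        for it in iterations[-2:]
    )

history = [
    {"success": 0.60,  "quality": 3.1, "within_time_budget": False},  # Iteration 0
    {"success": 0.875, "quality": 4.2, "within_time_budget": True},   # Iteration 1
    {"success": 0.90,  "quality": 4.3, "within_time_budget": True},   # Iteration 2
]
print(converged(history))  # True: two consecutive iterations above all thresholds
```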
---
## Evolution Patterns
### Pattern 1: Scope Definition
**Problem**: Agent doesn't know how broad/deep to search
**Solution**: Add thoroughness parameter
```markdown
When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough
```
### Pattern 2: Early Termination
**Problem**: Agent stops too early, misses results
**Solution**: Add completeness checklist
```markdown
Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
```
### Pattern 3: Time Management
**Problem**: Agent runs too long, poor efficiency
**Solution**: Add time-boxing with checkpoints
```markdown
Allocate time budget:
- 0-30%: Initial broad search
- 30-70%: Deep investigation
- 70-100%: Verification and summary
Stop if <10% new findings in last 20% of time.
```
### Pattern 4: Context Accumulation
**Problem**: Agent forgets earlier findings
**Solution**: Add intermediate summaries
```markdown
After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy
```
### Pattern 5: Quality Assurance
**Problem**: Agent provides low-quality outputs
**Solution**: Add self-review checklist
```markdown
Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)
```
---
## Iteration Template
```markdown
## Iteration N: [Focus Area]
### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min
**Key Issues**:
1. [Issue description]
2. [Issue description]
### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])
### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]
### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%
### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]
**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]
```
---
## Best Practices
### Do's
**Test on diverse cases** - Cover edge cases and common queries
**Measure objectively** - Use quantitative metrics
**Iterate quickly** - 90-120 min per iteration
**Focus improvements** - One major change per iteration
**Validate stability** - Test 2 iterations for convergence
### Don'ts
**Don't overtune** - Avoid overfitting to test cases
**Don't skip baselines** - Always measure Iteration 0
**Don't ignore regressions** - Track quality across iterations
**Don't add complexity** - Keep prompts concise
**Don't stop too early** - Ensure 2-iteration stability
---
## Example: Explore Agent Evolution
**Baseline** (Iteration 0):
- Generic instructions
- No thoroughness guidance
- No time management
- Success: 60%
**Iteration 1**:
- Added thoroughness levels
- Added time-boxing
- Success: 87.5% (+45.8%)
**Iteration 2**:
- Added completeness checklist
- Refined search strategy
- Success: 90% (+2.5% improvement, stable)
**Convergence**: 2 iterations, 87.5% → 90% stable
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Average Improvement**: +42% success rate over baseline


@@ -0,0 +1,386 @@
# Agent Prompt Metrics
**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents
---
## Core Metrics
### 1. Success Rate
**Definition**: Percentage of tasks completed correctly
**Calculation**:
```
Success Rate = correct_completions / total_tasks
```
**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)
**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1
Success Rate = 17/20 = 85% (Good)
```
---
### 2. Quality Score
**Definition**: Average quality rating of agent outputs (1-5 scale)
**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable
**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor
**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)
Average: 4.35/5 (Good)
```
---
### 3. Efficiency
**Definition**: Time and token usage per task
**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```
**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens
**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k
Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```
---
### 4. Reliability
**Definition**: Consistency of agent performance
**Calculation**:
```
Reliability = 1 - (std_dev(success_rate) / mean(success_rate))
```
**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)
**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success
Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```
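A minimal sketch reproducing this calculation (sample standard deviation, as in the worked numbers):
```python
import statistics

batches = [0.85, 0.90, 0.87, 0.88]   # success rate per test batch
mean = statistics.mean(batches)
std = statistics.stdev(batches)      # sample standard deviation (~0.0208)
reliability = 1 - std / mean
print(f"reliability = {reliability:.3f}")  # 0.976 -> very reliable
```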
---
## Composite Metrics
### V_instance (Agent Performance)
**Formula**:
```
V_instance = 0.40 × success_rate +
0.30 × (quality_score / 5) +
0.20 × efficiency_score +
0.10 × reliability
Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], normalized to [0, 1]
- efficiency_score = min(target_time / actual_time, 1.0) (capped at 1.0 when at or under the time budget)
- reliability ∈ [0, 1]
```
**Target**: V_instance ≥ 0.80
**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: min(3.0 / 2.8, 1.0) = 1.0 (under the 3 min budget)
Reliability: 0.976
V_instance = 0.40 × 0.85 +
0.30 × 0.84 +
0.20 × 1.0 +
0.10 × 0.976
= 0.34 + 0.252 + 0.20 + 0.0976
= 0.890 ✅ (exceeds target)
```
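A minimal sketch of the V_instance calculation (weights and inputs from the example above; efficiency is capped at 1.0 when under budget):
```python
def v_instance(success_rate: float, quality: float,
               actual_min: float, target_min: float, reliability: float) -> float:
    """Weighted agent-performance score."""
    efficiency = min(target_min / actual_min, 1.0)
    return (0.40 * success_rate
            + 0.30 * (quality / 5)
            + 0.20 * efficiency
            + 0.10 * reliability)

print(f"{v_instance(0.85, 4.2, actual_min=2.8, target_min=3.0, reliability=0.976):.3f}")
# 0.890 -> exceeds the 0.80 target
```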
---
### V_meta (Prompt Quality)
**Formula**:
```
V_meta = 0.35 × completeness +
0.30 × clarity +
0.20 × adaptability +
0.15 × maintainability
Where:
- completeness = features_implemented / features_needed
- clarity = 1 - (ambiguous_instructions / total_instructions)
- adaptability = successful_task_types / tested_task_types
- maintainability = 1 - (prompt_complexity / max_complexity)
```
**Target**: V_meta ≥ 0.80
**Example**:
```
Completeness: 8/8 features = 1.0
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
Adaptability: 5/6 task types = 0.83
Maintainability: 1 - (150 lines / 300 max) = 0.50
V_meta = 0.35 × 1.0 +
0.30 × 0.90 +
0.20 × 0.83 +
0.15 × 0.50
= 0.35 + 0.27 + 0.166 + 0.075
= 0.861 ✅ (exceeds target)
```
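The corresponding V_meta sketch (same structure, weights and inputs from the example above):
```python
def v_meta(completeness: float, clarity: float,
           adaptability: float, maintainability: float) -> float:
    """Weighted prompt-quality score."""
    return (0.35 * completeness
            + 0.30 * clarity
            + 0.20 * adaptability
            + 0.15 * maintainability)

print(f"{v_meta(1.0, 0.90, 0.83, 0.50):.3f}")  # 0.861 -> exceeds the 0.80 target
```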
---
## Metric Collection
### Automated Collection
**Session Analysis**:
```bash
# Extract agent performance from session
query_tools --tool="Task" --scope=session | \
jq -r '.[] | select(.status == "success") | .duration' | \
awk '{sum+=$1; n++} END {print sum/n}'
```
**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh
# Usage: measure-agent-metrics.sh <agent-name> <session-log>
# Assumes a JSONL session log with one record per agent invocation
# (containing "agent=<name>", a .duration field, and "success" on success),
# plus a manual "<session-log>.ratings" file of "<agent-name> <score>" lines.
AGENT_NAME=$1
SESSION=$2
# Success rate: successful invocations / total invocations
total=$(grep -c "agent=$AGENT_NAME" "$SESSION")
success=$(grep -c "agent=$AGENT_NAME.*success" "$SESSION")
success_rate=$(echo "scale=2; $success / $total" | bc)
# Average time per invocation (seconds)
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
  jq -r '.duration' | \
  awk '{sum+=$1; n++} END {print sum/n}')
# Quality: average of the manual 1-5 ratings
avg_quality=$(grep "$AGENT_NAME" "${SESSION}.ratings" | \
  awk '{sum+=$2; n++} END {print sum/n}')
echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```
---
### Manual Collection
**Test Suite Template**:
```markdown
## Agent Test Suite: [Agent Name]
**Iteration**: [N]
**Date**: [YYYY-MM-DD]
### Test Cases
| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...
### Summary
- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```
---
## Benchmarking
### Cross-Agent Comparison
**Standard Test Suite**: 20 representative tasks
**Example Results**:
```
| Agent | Success | Quality | Time | V_inst |
|-------------|---------|---------|-------|--------|
| Explore v1 | 60% | 3.1 | 4.2m | 0.62 |
| Explore v2 | 87.5% | 4.2 | 2.8m | 0.89 |
| Explore v3 | 90% | 4.3 | 2.6m | 0.91 |
```
**Improvement**: v1 → v3 = +30% success, +1.2 quality, +38% faster
---
### Baseline Comparison
**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success
---
## Regression Testing
### Track Metrics Over Time
**Regression Detection**:
```
if current_metric < (previous_metric - threshold):
alert("REGRESSION DETECTED")
```
**Thresholds**:
- Success Rate: -5% (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)
**Example**:
```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 87% success, 4.1 quality, 2.8 min ⚠️ REGRESSION
Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```
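A small sketch of this check (thresholds from the list above; the degraded run below is hypothetical):
```python
def detect_regressions(prev: dict, curr: dict) -> list[str]:
    """Flag metrics that degraded beyond their allowed thresholds."""
    alerts = []
    if curr["success"] < prev["success"] - 0.05:    # -5 percentage points
        alerts.append("success rate regression")
    if curr["quality"] < prev["quality"] - 0.3:     # -0.3 on the 1-5 scale
        alerts.append("quality regression")
    if curr["time_min"] > prev["time_min"] * 1.20:  # +20% average time
        alerts.append("efficiency regression")
    return alerts

prev = {"success": 0.90, "quality": 4.3, "time_min": 2.6}
curr = {"success": 0.84, "quality": 3.9, "time_min": 3.3}   # hypothetical degraded run
print(detect_regressions(prev, curr))
# ['success rate regression', 'quality regression', 'efficiency regression']
```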
---
## Reporting Template
```markdown
## Agent Metrics Report
**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]
### Performance Metrics
**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌
**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌
**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌
**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌
### Composite Scores
**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
### Comparison to Baseline
| Metric | Baseline | Current | Δ |
|---------------|----------|---------|--------|
| Success Rate | [X]% | [Y]% | [+/-]% |
| Quality | [X.X] | [Y.Y] | [+/-] |
| Time | [X.X]m | [Y.Y]m | [+/-]% |
| V_instance | [X.XX] | [Y.YY] | [+/-] |
### Recommendations
1. [Action item based on metrics]
2. [Action item based on metrics]
### Next Steps
- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite


@@ -0,0 +1,339 @@
# Agent Test Suite Template
**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type
---
## Test Suite: [Agent Name]
**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]
---
## Test Configuration
**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]
**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
- Simple: [N] tasks
- Medium: [N] tasks
- Complex: [N] tasks
---
## Test Cases
### Task 1: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
### Task 2: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
[Repeat for all 20 tasks]
---
## Summary Statistics
### Overall Performance
**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)
Success Rate: [X]% ([successful] / [total])
```
**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```
**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```
**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])
Reliability Score: [X.XX]
```
---
## Composite Metrics
### V_instance Calculation
```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: min(target / actual, 1.0) = [0.XX]
Reliability: [0.XX]
V_instance = 0.40 × [success_rate] +
0.30 × [quality_normalized] +
0.20 × [efficiency_score] +
0.10 × [reliability]
= [0.XX] + [0.XX] + [0.XX] + [0.XX]
= [0.XX]
Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```
---
## Failure Analysis
### Failed Tasks
| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N] | [Brief] | [Why failed] | [Type] |
| [N] | [Brief] | [Why failed] | [Type] |
### Failure Patterns
**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
---
## Quality Issues
### Tasks with Quality < 4
| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N] | [1-3] | [Description] | [Actions] |
| [N] | [1-3] | [Description] | [Actions] |
---
## Efficiency Analysis
### Tasks Exceeding Time Budget
| Task ID | Actual Time | Target Time | Δ | Reason |
|---------|-------------|-------------|------|--------|
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
### Token Usage Analysis
```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```
---
## Recommendations
### Priority 1 (Critical)
1. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
- Expected Improvement: [X]% success rate
2. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
- Expected Improvement: [X]% quality
### Priority 2 (Important)
1. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
### Priority 3 (Nice to Have)
1. **[Improvement]**: [Description]
- Benefit: [What improves]
- Effort: [Low/Medium/High]
---
## Next Iteration Plan
### Focus Areas
1. **[Area 1]**: [Why focus here]
- Baseline: [Current metric]
- Target: [Goal metric]
- Approach: [How to improve]
2. **[Area 2]**: [Why focus here]
- Baseline: [Current metric]
- Target: [Goal metric]
- Approach: [How to improve]
### Prompt Changes
**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]
**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]
**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]
---
## Test Suite Evolution
### Version History
| Version | Date | Success | Quality | V_inst | Changes |
|---------|------|---------|---------|--------|---------|
| 0.1 | [D] | [X]% | [X.X] | [0.XX] | Baseline|
| 0.2 | [D] | [X]% | [X.X] | [0.XX] | [Changes]|
| [curr] | [D] | [X]% | [X.X] | [0.XX] | [Changes]|
### Convergence Tracking
```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current: V_instance = [0.XX] ([+/-]%)
Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```
---
## Appendix: Task Catalog
### Task Templates by Category
**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"
**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"
**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"
### Complexity Guidelines
**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation
**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation
**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation
---
**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`