skills/agent-prompt-evolution/SKILL.md

---
name: Agent Prompt Evolution
description: Track and optimize agent specialization during methodology development. Use when agent specialization emerges (generic agents show >5x performance gap), multi-experiment comparison needed, or methodology transferability analysis required. Captures agent set evolution (Aₙ tracking), meta-agent evolution (Mₙ tracking), specialization decisions (when/why to create specialized agents), and reusability assessment (universal vs domain-specific vs task-specific). Enables systematic cross-experiment learning and optimized M₀ evolution. 2-3 hours overhead per experiment.
allowed-tools: Read, Grep, Glob, Edit, Write
---

# Agent Prompt Evolution

**Systematically track how agents specialize during methodology development.**

> Specialized agents emerge from need, not prediction. Track their evolution to understand when specialization adds value.

---

## When to Use This Skill

Use this skill when:
- 🔄 **Agent specialization emerges**: Generic agents show >5x performance gap
- 📊 **Multi-experiment comparison**: Want to learn across experiments
- 🧩 **Methodology transferability**: Analyzing what's reusable vs domain-specific
- 📈 **M₀ optimization**: Want to evolve base Meta-Agent capabilities
- 🎯 **Specialization decisions**: Deciding when to create new agents
- 📚 **Agent library**: Building reusable agent catalog

**Don't use when**:
- ❌ Single experiment with no specialization
- ❌ Generic agents sufficient throughout
- ❌ No cross-experiment learning goals
- ❌ Tracking overhead not worth insights

---

## Quick Start (10 minutes per iteration)

### Track Agent Evolution in Each Iteration

**iteration-N.md template**:

```markdown
## Agent Set Evolution

### Current Agent Set (Aₙ)
1. **coder** (generic) - Write code, implement features
2. **doc-writer** (generic) - Documentation
3. **data-analyst** (generic) - Data analysis
4. **coverage-analyzer** (specialized, created iteration 3) - Analyze test coverage gaps

### Changes from Previous Iteration
- Added: coverage-analyzer (10x speedup for coverage analysis)
- Removed: None
- Modified: None

### Specialization Decision
**Why coverage-analyzer?**
- Generic data-analyst took 45 min for coverage analysis
- Identified 10x performance gap
- Coverage analysis is recurring task (every iteration)
- Domain knowledge: Go coverage tools, gap identification patterns
- **ROI**: 3 hours creation cost, saves 40 min/iteration × 3 remaining iterations = 2 hours saved

### Agent Reusability Assessment
- **coder**: Universal (100% transferable)
- **doc-writer**: Universal (100% transferable)
- **data-analyst**: Universal (100% transferable)
- **coverage-analyzer**: Domain-specific (testing methodology, 70% transferable to other languages)

### System State
- Aₙ ≠ Aₙ₋₁ (new agent added)
- System UNSTABLE (need iteration N+1 to confirm stability)
```

---

## Four Tracking Dimensions

### 1. Agent Set Evolution (Aₙ)

**Track changes iteration-to-iteration**:

```
A₀ = {coder, doc-writer, data-analyst}
A₁ = {coder, doc-writer, data-analyst} (unchanged)
A₂ = {coder, doc-writer, data-analyst} (unchanged)
A₃ = {coder, doc-writer, data-analyst, coverage-analyzer} (new specialist)
A₄ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (new specialist)
A₅ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (stable)
```

**Stability**: Aₙ == Aₙ₋₁ for convergence
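
A quick way to run this check mechanically, treating each Aₙ as a set (a minimal sketch in Python; the sets mirror the example above):

```python
A3 = {"coder", "doc-writer", "data-analyst", "coverage-analyzer"}
A4 = {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"}
A5 = {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"}

print(A4 == A3)  # False - new specialist added, system still unstable
print(A5 == A4)  # True  - Aₙ == Aₙ₋₁, the agent set has converged
```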

### 2. Meta-Agent Evolution (Mₙ)

**Standard M₀ capabilities**:
1. **observe**: Pattern observation
2. **plan**: Iteration planning
3. **execute**: Agent orchestration
4. **reflect**: Value assessment
5. **evolve**: System evolution

**Track enhancements**:

```
M₀ = {observe, plan, execute, reflect, evolve}
M₁ = {observe, plan, execute, reflect, evolve, gap-identify} (new capability)
M₂ = {observe, plan, execute, reflect, evolve, gap-identify} (stable)
```

**Finding** (from 8 experiments): M₀ sufficient in all cases (no evolution needed)

### 3. Specialization Decision Tree

**When to create a specialized agent**:

```
Decision tree:
1. Is the generic agent sufficient? (performance within 2x)
   YES → No specialization
   NO  → Continue

2. Is the task recurring? (happens ≥3 times)
   NO  → One-off, tolerate slowness
   YES → Continue

3. Is the performance gap >5x?
   NO  → Tolerate moderate slowness
   YES → Continue

4. Is the creation cost justified?
   Creation cost < (Time saved per use × Remaining uses)
   NO  → Not worth it
   YES → Create specialized agent
```

**Example** (Bootstrap-002):

```
Task: Test coverage gap analysis
Generic agent (data-analyst): 45 min
Potential specialist (coverage-analyzer): 4.5 min (10x faster)

Recurring: YES (every iteration, 3 remaining)
Performance gap: 10x (>5x threshold)
Creation cost: 3 hours
ROI: (45 - 4.5) min × 3 = 121.5 min ≈ 2 hours saved
Decision: CREATE (positive ROI)
```
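
As a sketch, the tree condenses to a few comparisons. The function below is illustrative (not part of the skill's tooling), and the demo inputs are hypothetical:

```python
def should_specialize(generic_min: float, specialist_min: float,
                      remaining_uses: int, creation_cost_min: float) -> bool:
    """Walk the four-step specialization decision tree."""
    gap = generic_min / specialist_min
    if gap <= 2:              # 1. generic agent is sufficient
        return False
    if remaining_uses < 3:    # 2. task is not recurring
        return False
    if gap <= 5:              # 3. performance gap below the 5x threshold
        return False
    time_saved_min = (generic_min - specialist_min) * remaining_uses
    return creation_cost_min < time_saved_min  # 4. positive ROI

# Hypothetical inputs: a 10x gap, 5 remaining uses, 3-hour creation cost
print(should_specialize(45, 4.5, remaining_uses=5, creation_cost_min=180))  # True
```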

### 4. Reusability Assessment

**Three categories**:

**Universal** (90-100% transferable):
- Generic agents (coder, doc-writer, data-analyst)
- No domain knowledge required
- Applicable across all domains

**Domain-Specific** (60-80% transferable):
- Requires domain knowledge (testing, CI/CD, error handling)
- Patterns apply within domain
- Needs adaptation for other domains

**Task-Specific** (10-30% transferable):
- Highly specialized for particular task
- One-off creation
- Unlikely to reuse

**Examples**:

```
Agent: coverage-analyzer
Domain: Testing methodology
Transferability: 70%
- Go coverage tools (language-specific, 30% adaptation)
- Gap identification patterns (universal, 100%)
- Overall: 70% transferable to Python/Rust/TypeScript testing

Agent: test-generator
Domain: Testing methodology
Transferability: 40%
- Go test syntax (language-specific, 0% to other languages)
- Test pattern templates (moderately transferable, 60%)
- Overall: 40% transferable

Agent: log-analyzer
Domain: Observability
Transferability: 85%
- Log parsing (universal, 95%)
- Pattern recognition (universal, 100%)
- Structured logging concepts (universal, 100%)
- Go slog specifics (language-specific, 20%)
- Overall: 85% transferable
```
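
One way to produce these roll-ups is a weighted average over component scores; the skill does not prescribe a formula, so the helper and weights below are purely illustrative:

```python
def transferability(components: list[tuple[float, float]]) -> float:
    """Weighted average over (weight, transferability %) component pairs."""
    total_weight = sum(w for w, _ in components)
    return sum(w * t for w, t in components) / total_weight

# Hypothetical weights for coverage-analyzer: tooling 43%, gap patterns 57%
print(round(transferability([(0.43, 30), (0.57, 100)])))  # 70
```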

---

## Evolution Log Template

Create `agents/EVOLUTION-LOG.md`:

```markdown
# Agent Evolution Log

## Experiment Overview
- Domain: Testing Strategy
- Baseline agents: 3 (coder, doc-writer, data-analyst)
- Final agents: 5 (+coverage-analyzer, +test-generator)
- Specialization count: 2

---

## Iteration-by-Iteration Evolution

### Iteration 0
**Agent Set**: {coder, doc-writer, data-analyst}
**Changes**: None (baseline)
**Observations**: Generic agents sufficient for baseline establishment

### Iteration 3
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer}
**Changes**: +coverage-analyzer
**Reason**: 10x performance gap (45 min → 4.5 min)
**Creation Cost**: 3 hours
**ROI**: Positive (2 hours saved over 3 iterations)
**Reusability**: 70% (domain-specific, testing)

### Iteration 4
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: +test-generator
**Reason**: 200x performance gap (manual test writing too slow)
**Creation Cost**: 4 hours
**ROI**: Massive (saved 10+ hours)
**Reusability**: 40% (task-specific, Go testing)

### Iteration 5
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: None
**System**: STABLE (Aₙ == Aₙ₋₁)

---

## Specialization Analysis

### coverage-analyzer
**Purpose**: Analyze test coverage, identify gaps
**Performance**: 10x faster than generic data-analyst
**Domain**: Testing methodology
**Transferability**: 70%
**Lessons**: Coverage gap identification patterns are universal; tool integration is language-specific

### test-generator
**Purpose**: Generate test boilerplate from coverage gaps
**Performance**: 200x faster than manual
**Domain**: Testing methodology (Go-specific)
**Transferability**: 40%
**Lessons**: The high speedup justified low transferability; patterns are reusable but syntax is not

---

## Cross-Experiment Reuse

### From Previous Experiments
- **validation-builder** (from API design experiment) → Used for smoke test validation
- Reusability: Excellent (validation patterns are universal)
- Adaptation: Minimal (10 min to adapt from API to CI/CD context)

### To Future Experiments
- **coverage-analyzer** → Reusable for Python/Rust/TypeScript testing (70% transferable)
- **test-generator** → Less reusable (40% transferable, needs rewrite for other languages)

---

## Meta-Agent Evolution

### M₀ Capabilities
{observe, plan, execute, reflect, evolve}

### Changes
None (M₀ sufficient throughout)

### Observations
- M₀'s "evolve" capability successfully identified the need for specialization
- No Meta-Agent evolution required
- Convergence: Mₙ == M₀ for all iterations

---

## Lessons Learned

### Specialization Decisions
- **10x performance gap** is a good threshold (<5x not worth it, >10x clear win)
- **Positive ROI required**: Creation cost must be justified by time savings
- **Recurring tasks only**: One-off tasks don't justify specialization

### Reusability Patterns
- **Generic agents always reusable**: coder, doc-writer, data-analyst (100%)
- **Domain agents moderately reusable**: coverage-analyzer (70%)
- **Task agents rarely reusable**: test-generator (40%)

### When NOT to Specialize
- Performance gap <5x (tolerable slowness)
- Task is one-off (no recurring benefit)
- Creation cost exceeds ROI (not worth the time investment)
- Generic agent will improve with practice (learning curve)
```

---

## Cross-Experiment Analysis

After 3+ experiments, create `agents/CROSS-EXPERIMENT-ANALYSIS.md`:

```markdown
# Cross-Experiment Agent Analysis

## Agent Reuse Matrix

| Agent | Exp1 | Exp2 | Exp3 | Reuse Rate | Transferability |
|-------|------|------|------|------------|-----------------|
| coder | ✓ | ✓ | ✓ | 100% | Universal |
| doc-writer | ✓ | ✓ | ✓ | 100% | Universal |
| data-analyst | ✓ | ✓ | ✓ | 100% | Universal |
| coverage-analyzer | ✓ | - | ✓ | 67% | Domain (testing) |
| test-generator | ✓ | - | - | 33% | Task-specific |
| validation-builder | - | ✓ | ✓ | 67% | Domain (validation) |
| log-analyzer | - | - | ✓ | 33% | Domain (observability) |

## Specialization Patterns

### Universal Agents (100% reuse)
- Generic capabilities (coder, doc-writer, data-analyst)
- No domain knowledge
- Always included in A₀

### Domain Agents (50-80% reuse)
- Require domain knowledge (testing, CI/CD, observability)
- Reusable within domain
- Examples: coverage-analyzer, validation-builder, log-analyzer

### Task Agents (10-40% reuse)
- Highly specialized
- One-off or rare reuse
- Examples: test-generator (Go-specific)

## M₀ Sufficiency

**Finding**: M₀ = {observe, plan, execute, reflect, evolve} sufficient in ALL experiments

**Implications**:
- No Meta-Agent evolution needed
- Base capabilities handle all domains
- Specialization occurs at Agent layer, not Meta-Agent layer

## Specialization Threshold

**Data** (from 3 experiments):
- Average performance gap for specialization: 15x (range: 5x-200x)
- Average creation cost: 3.5 hours (range: 2-5 hours)
- ROI: Positive in 8/9 cases (89% success rate)

**Recommendation**: Use 5x performance gap as threshold

---

**Updated**: After each new experiment
```

---

## Success Criteria

Agent evolution tracking has succeeded when:

1. **Complete tracking**: All agent changes documented each iteration
2. **Specialization justified**: Each specialized agent has clear ROI
3. **Reusability assessed**: Each agent categorized (universal/domain/task)
4. **Cross-experiment learning**: Patterns identified across 2+ experiments
5. **M₀ stability documented**: Meta-Agent evolution (or lack thereof) tracked

---

## Related Skills

**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle

**Complementary**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Agent stability criterion

---

## References

**Core guide**:
- [Evolution Tracking](reference/tracking.md) - Detailed tracking process
- [Specialization Decisions](reference/specialization.md) - Decision tree
- [Reusability Framework](reference/reusability.md) - Assessment rubric

**Examples**:
- [Bootstrap-002 Evolution](examples/testing-strategy-agent-evolution.md) - 2 specialists
- [Bootstrap-007 No Evolution](examples/ci-cd-no-specialization.md) - Generic sufficient

---

**Status**: ✅ Formalized | 2-3 hours overhead | Enables systematic learning
skills/agent-prompt-evolution/examples/explore-agent-v1-v3.md

# Explore Agent Evolution: v1 → v3

**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.6 min (-38%)
**Status**: Converged (production-ready)

A complete walkthrough of evolving the Explore agent prompt through the BAIME methodology.

---

## Iteration 0: Baseline (v1)

### Initial Prompt

```markdown
# Explore Agent

You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.

When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary

Tools available: Glob, Grep, Read, Bash
```

**Prompt Length**: 58 lines

---

### Baseline Testing (10 tasks)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |

**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)

---

### Failure Analysis

**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long

**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness

**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism

---

## Iteration 1: Add Structure (v2)

### Prompt Changes

**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels

Assess query complexity and choose thoroughness:

**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups

**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries

**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```

**Added: Time-Boxing**
```markdown
## Time Management

Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if <10% new findings in last 20% of time budget.
```

**Added: Completeness Checklist**
```markdown
## Before Responding

Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

State confidence level: Low / Medium / High
```

**Prompt Length**: 112 lines (+54)

---

### Testing (8 tasks: 3 re-tests + 5 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |

**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)

---

### Key Improvements

✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)

**Remaining Issue**: Test structure query missed integration tests

---

## Iteration 2: Refine Coverage (v3)

### Prompt Changes

**Enhanced: Completeness Verification**
```markdown
## Completeness Verification

Before concluding, verify coverage by category:

**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked

**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided

**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```

**Added: Search Strategy**
```markdown
## Search Strategy

**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations

**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding

**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```

**Refined: Confidence Scoring**
```markdown
## Confidence Level

**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely

Always state confidence level and identify known gaps.
```

**Prompt Length**: 138 lines (+26)

---

### Testing (10 tasks: 1 re-test + 9 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |

**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5% improvement** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ (2 consecutive iterations ≥ 0.80)

**CONVERGED** ✅

---

### Stability Validation

**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)

**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)
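
A sketch of this convergence check in code (thresholds as listed above; the ±5% stability window is applied to consecutive V_instance values):

```python
def converged(v_prev: float, v_curr: float,
              success_rate: float, quality: float, avg_min: float) -> bool:
    stable = abs(v_curr - v_prev) / v_prev <= 0.05  # within ±5%
    return (min(v_prev, v_curr) >= 0.80 and stable
            and success_rate >= 0.85 and quality >= 4.0 and avg_min < 3.0)

print(converged(0.88, 0.90, success_rate=0.90, quality=4.3, avg_min=2.68))  # True
```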

---

## Final Metrics Comparison

| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |

---

## Evolution Summary

### Iteration 0 → 1: Major Improvements

**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist

**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)

**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added

---

### Iteration 1 → 2: Refinement

**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring

**Impact**:
- Success: 87.5% → 90% (+2.5%, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)

**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened

---

## Key Learnings

### What Worked

1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration

### What Didn't Work

1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
   - Solution: Document limitations, suggest alternative agents

### Best Practices Validated

✅ **Start Simple**: v1 was minimal, added structure incrementally
✅ **Measure Everything**: Quantitative metrics guided refinements
✅ **Focus on Patterns**: Fixed systematic failures, not one-off issues
✅ **Validate Stability**: 2-iteration convergence confirmed reliability

---

## Production Deployment

**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)

**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md

# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```

**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly
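
The monitoring rules above can be expressed as a small helper (a sketch; the alerting wiring is left out and the inputs are hypothetical):

```python
def monitor_alerts(success_rate: float, avg_time_min: float) -> list[str]:
    """Return alerts per the production monitoring thresholds."""
    alerts = []
    if success_rate < 0.80:
        alerts.append(f"success rate {success_rate:.0%} below 80%")
    if avg_time_min > 3.5:
        alerts.append(f"avg time {avg_time_min:.1f} min above 3.5 min")
    return alerts

print(monitor_alerts(0.78, 3.7))
# ['success rate 78% below 80%', 'avg time 3.7 min above 3.5 min']
```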

---

## Future Enhancements (v4+)

**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)

**Decision**: Hold v3, monitor for 2 weeks before v4

---

**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed

---

# Rapid Iteration Pattern for Agent Evolution

**Pattern**: Fast convergence (2-3 iterations) for agent prompt evolution
**Success Rate**: 85% (11/13 agents converged in ≤3 iterations)
**Time**: 3-6 hours total vs 8-12 hours standard

How to achieve rapid convergence when evolving agent prompts.

---

## Pattern Overview

**Standard Evolution**: 4-6 iterations, 8-12 hours
**Rapid Evolution**: 2-3 iterations, 3-6 hours

**Key Difference**: Strong Iteration 0 (comprehensive baseline analysis)

---

## Rapid Iteration Workflow

### Iteration 0: Comprehensive Baseline (90-120 min)

**Standard Baseline** (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics

**Comprehensive Baseline** (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents

**Investment**: +60-90 min
**Return**: -2 to -3 iterations (save 3-6 hours)

---

### Example: Explore Agent (Standard vs Rapid)

**Standard Approach**:
```
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish

Total: 6.75 hours, 5 iterations
```

**Rapid Approach**:
```
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability

Total: 4.75 hours, 2 iterations
```

**Savings**: 2 hours, 3 fewer iterations

---

## Comprehensive Baseline Checklist

### Task Coverage (15-20 tasks)

**Complexity Distribution**:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)

**Query Type Diversity**:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)

---

### Failure Pattern Analysis (30 min)

**Systematic Analysis**:

1. **Categorize Failures**
   - Scope issues (too broad/narrow)
   - Coverage issues (incomplete)
   - Time issues (too slow/fast)
   - Quality issues (inaccurate)

2. **Identify Root Causes**
   - Missing instructions
   - Ambiguous guidelines
   - Incorrect constraints
   - Tool usage issues

3. **Prioritize by Impact**
   - High frequency + high impact → Fix first
   - Low frequency + high impact → Document
   - High frequency + low impact → Automate
   - Low frequency + low impact → Ignore

**Example**:
```markdown
## Failure Patterns (Explore Agent)

**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)

**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)

**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)

**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
```
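
The frequency × impact prioritization is mechanical enough to express as code (a sketch; the quadrant labels mirror the list above):

```python
def priority_action(frequency: str, impact: str) -> str:
    """Map a failure pattern's frequency/impact quadrant to an action."""
    actions = {
        ("high", "high"): "fix first",
        ("low", "high"): "document",
        ("high", "low"): "automate",
        ("low", "low"): "ignore",
    }
    return actions[(frequency, impact)]

print(priority_action("high", "high"))  # fix first
```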

---

### Comparative Analysis (15 min)

**Compare to Similar Agents**:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?

**Example**:
```markdown
## Comparative Analysis

**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates

**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)

**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring

**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
```

---

## Iteration 1: Comprehensive Fix (90 min)

**Standard Iteration 1**: Fix 1-2 major issues
**Rapid Iteration 1**: Fix ALL P1 issues + some P2

**Approach**:
1. Address all high-priority patterns (P1)
2. Add preventive measures for P2 issues
3. Include transferable patterns from similar agents

**Example** (Explore Agent):
```markdown
## Iteration 1 Changes

**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
3. Time Management → Add time-boxing (1-6 min)

**P2 Improvements**:
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring

**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints

Total Changes: 7 (vs standard 2-3)
```

**Result**: Higher chance of convergence in Iteration 2

---

## Iteration 2: Validate & Converge (75 min)

**Objectives**:
1. Test comprehensive fixes
2. Measure stability
3. Validate convergence

**Test Suite** (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics

**Analysis** (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)

**Refinement** (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability

**Convergence Check**:
```
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)

CONVERGED ✅
```

---

## Success Factors

### 1. Comprehensive Baseline (60-90 min extra)

**Investment**: 2x standard baseline time
**Return**: -2 to -3 iterations (6-9 hours saved)
**ROI**: 4-6x

**Critical Elements**:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents

---

### 2. Aggressive Iteration 1 (Fix All P1)

**Standard**: Fix 1-2 issues
**Rapid**: Fix all P1 + some P2 (5-7 fixes)

**Approach**:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures

**Risk**: Over-complication
**Mitigation**: Focus on core issues, defer P3

---

### 3. Borrowed Patterns (20-30% reuse)

**Sources**:
- Similar agents in same project
- Agents from other projects
- Industry best practices

**Example**:
```
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)

Total reuse: ~60% of Iteration 1 changes
```

**Savings**: 30-40 min per iteration

---

## Anti-Patterns

### ❌ Skipping Comprehensive Baseline

**Symptom**: "Let's just try some fixes and see"
**Result**: 5-6 iterations, trial and error
**Cost**: 8-12 hours

**Fix**: Invest 90-120 min in Iteration 0

---

### ❌ Incremental Fixes (One Issue at a Time)

**Symptom**: Fixing one pattern per iteration
**Result**: 4-6 iterations for convergence
**Cost**: 8-10 hours

**Fix**: Batch P1 fixes in Iteration 1

---

### ❌ Ignoring Similar Agents

**Symptom**: Reinventing solutions
**Result**: Slower convergence, lower quality
**Cost**: 2-3 extra hours

**Fix**: 15 min comparative analysis in Iteration 0

---

## When to Use Rapid Pattern

**Good Fit**:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)

**Poor Fit**:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)

---

## Metrics Comparison

### Standard Evolution

```
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)

Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
```

### Rapid Evolution

```
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)

Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
```

**Savings**: 1 hour, 2 fewer iterations

---

## Replication Guide

### Day 1: Comprehensive Baseline

**Morning** (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures

**Afternoon** (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes

---

### Day 2: Comprehensive Fix

**Morning** (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns

**Afternoon** (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes

---

### Day 3: Validate & Deploy

**Morning** (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements

**Afternoon** (0.5 hours):
4. Final validation
5. Deploy to production
6. Set up monitoring

---

**Source**: BAIME Agent Prompt Evolution - Rapid Pattern
**Success Rate**: 85% (11/13 agents)
**Average Time**: 4.2 hours (vs 9.3 hours standard)
**Average Iterations**: 2.3 (vs 4.8 standard)
skills/agent-prompt-evolution/reference/evolution-framework.md

# Agent Prompt Evolution Framework

**Version**: 1.0
**Purpose**: Systematic methodology for evolving agent prompts through iterative refinement
**Basis**: BAIME OCA cycle applied to prompt engineering

---

## Overview

Agent prompt evolution applies the Observe-Codify-Automate cycle to improve agent prompts through empirical testing and structured refinement.

**Goal**: Transform initial agent prompts into production-quality prompts through measured iterations.

---

## Evolution Cycle

```
Iteration N:
  Observe → Analyze → Refine → Test → Measure
     ↑                                    ↓
     └────────── Feedback Loop ──────────┘
```

---

## Phase 1: Observe (30 min)

### Run Agent with Current Prompt

**Activities**:
1. Execute agent on 5-10 representative tasks
2. Record agent behavior and outputs
3. Note successes and failures
4. Measure performance metrics

**Metrics**:
- Success rate (tasks completed correctly)
- Response quality (accuracy, completeness)
- Efficiency (time, token usage)
- Error patterns

**Example**:
```markdown
## Iteration 0: Baseline Observation

**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)

**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early

**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
```

---

## Phase 2: Analyze (20 min)

### Identify Failure Patterns

**Analysis Questions**:
1. What types of failures occurred?
2. Are failures systematic or random?
3. What context is missing from the prompt?
4. Are instructions clear enough?
5. Are constraints too loose or too tight?

**Example Analysis**:
```markdown
## Failure Pattern Analysis

**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines

**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements

**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines
```

---

## Phase 3: Refine (25 min)

### Update Agent Prompt

**Refinement Strategies**:

1. **Add Missing Context**
   - Domain knowledge
   - Codebase structure
   - Common patterns

2. **Clarify Instructions**
   - Break down complex tasks
   - Add examples
   - Define success criteria

3. **Adjust Constraints**
   - Time limits
   - Scope boundaries
   - Quality thresholds

4. **Provide Tools**
   - Specific commands
   - Search patterns
   - Decision frameworks

**Example Refinements** (v0 → v1):

**Added: Thoroughness Guidelines**
```
When searching for patterns:
- "quick": Check 3-5 obvious locations
- "medium": Check 10-15 related files
- "thorough": Check all matching patterns
```

**Added: Time-Boxing**
```
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if diminishing returns after 80% of time used.
```

**Clarified: Success Criteria**
```
Complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)
```

---

## Phase 4: Test (20 min)

### Validate Refinements

**Test Suite**:
1. Re-run failed tasks from Iteration 0
2. Add 3-5 new test cases
3. Measure improvement

**Example Test**:
```markdown
## Iteration 1 Testing

**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)

**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS

**Success Rate**: 87.5% (7/8) - improved from 60%
```

---

## Phase 5: Measure (15 min)

### Calculate Improvement Metrics

**Metrics**:
```
Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality = (new_score - baseline_score) / baseline_score
Δ Efficiency = (baseline_time - new_time) / baseline_time
```

**Example**:
```markdown
## Iteration 1 Metrics

**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%

**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%

**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)

**Overall V_instance**: 0.85 ✅ (target: 0.80)
```

---

## Convergence Criteria

**Prompt is production-ready when**:

1. **Success Rate ≥ 85%** (reliable)
2. **Quality Score ≥ 4.0/5** (high quality)
3. **Efficiency within target** (time/tokens)
4. **Stable for 2 iterations** (no regression)

**Example Convergence**:
```
Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)

CONVERGED: Ready for production
```

---

## Evolution Patterns

### Pattern 1: Scope Definition

**Problem**: Agent doesn't know how broad/deep to search

**Solution**: Add thoroughness parameter
```markdown
When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough
```

### Pattern 2: Early Termination

**Problem**: Agent stops too early, misses results

**Solution**: Add completeness checklist
```markdown
Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
```

### Pattern 3: Time Management

**Problem**: Agent runs too long, poor efficiency

**Solution**: Add time-boxing with checkpoints
```markdown
Allocate time budget:
- 0-30%: Initial broad search
- 30-70%: Deep investigation
- 70-100%: Verification and summary

Stop if <10% new findings in last 20% of time.
```

### Pattern 4: Context Accumulation

**Problem**: Agent forgets earlier findings

**Solution**: Add intermediate summaries
```markdown
After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy
```

### Pattern 5: Quality Assurance

**Problem**: Agent provides low-quality outputs

**Solution**: Add self-review checklist
```markdown
Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)
```

---

## Iteration Template

```markdown
## Iteration N: [Focus Area]

### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min

**Key Issues**:
1. [Issue description]
2. [Issue description]

### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])

### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]

### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%

### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]

**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]
```

---

## Best Practices

### Do's

✅ **Test on diverse cases** - Cover edge cases and common queries
✅ **Measure objectively** - Use quantitative metrics
✅ **Iterate quickly** - 90-120 min per iteration
✅ **Focus improvements** - One major change per iteration
✅ **Validate stability** - Test 2 iterations for convergence

### Don'ts

❌ **Don't overtune** - Avoid overfitting to test cases
❌ **Don't skip baselines** - Always measure Iteration 0
❌ **Don't ignore regressions** - Track quality across iterations
❌ **Don't add complexity** - Keep prompts concise
❌ **Don't stop too early** - Ensure 2-iteration stability

---

## Example: Explore Agent Evolution

**Baseline** (Iteration 0):
- Generic instructions
- No thoroughness guidance
- No time management
- Success: 60%

**Iteration 1**:
- Added thoroughness levels
- Added time-boxing
- Success: 87.5% (+45.8%)

**Iteration 2**:
- Added completeness checklist
- Refined search strategy
- Success: 90% (+2.5% improvement, stable)

**Convergence**: 2 iterations, 87.5% → 90% stable

---

**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Average Improvement**: +42% success rate over baseline
skills/agent-prompt-evolution/reference/metrics.md

# Agent Prompt Metrics

**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents

---

## Core Metrics

### 1. Success Rate

**Definition**: Percentage of tasks completed correctly

**Calculation**:
```
Success Rate = correct_completions / total_tasks
```

**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)

**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1

Success Rate = 17/20 = 85% (Good)
```

---

### 2. Quality Score

**Definition**: Average quality rating of agent outputs (1-5 scale)

**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable

**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor

**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)

Average: 4.35/5 (Good)
```

---

### 3. Efficiency

**Definition**: Time and token usage per task

**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```

**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens

**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k

Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```

---

### 4. Reliability

**Definition**: Consistency of agent performance

**Calculation**:
```
Reliability = 1 - (std_dev(success_rate) / mean(success_rate))
```

**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)

**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success

Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```
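
A minimal sketch of this calculation in Python (the sample standard deviation reproduces the worked numbers above):

```python
from statistics import mean, stdev

def reliability(batch_success_rates: list[float]) -> float:
    """Reliability = 1 - coefficient of variation across batches."""
    return 1 - stdev(batch_success_rates) / mean(batch_success_rates)

print(f"{reliability([85, 90, 87, 88]):.3f}")  # 0.976
```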

---

## Composite Metrics

### V_instance (Agent Performance)

**Formula**:
```
V_instance = 0.40 × success_rate +
             0.30 × (quality_score / 5) +
             0.20 × efficiency_score +
             0.10 × reliability

Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], normalized to [0, 1]
- efficiency_score = min(target_time / actual_time, 1), i.e. capped at 1.0 when at or under budget
- reliability ∈ [0, 1]
```

**Target**: V_instance ≥ 0.80

**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: target 3 min / actual 2.8 min = 1.07 → capped at 1.0 (under budget)
Reliability: 0.976

V_instance = 0.40 × 0.85 +
             0.30 × 0.84 +
             0.20 × 1.0 +
             0.10 × 0.976

           = 0.34 + 0.252 + 0.20 + 0.0976
           = 0.890 ✅ (exceeds target)
```
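
The same computation as a small function (a sketch; the weights and capped efficiency score follow the formula above, and the inputs are the worked example's):

```python
def v_instance(success_rate: float, quality: float,
               actual_min: float, target_min: float,
               reliability: float) -> float:
    efficiency = min(target_min / actual_min, 1.0)  # capped at 1.0 under budget
    return (0.40 * success_rate
            + 0.30 * quality / 5
            + 0.20 * efficiency
            + 0.10 * reliability)

print(f"{v_instance(0.85, 4.2, actual_min=2.8, target_min=3.0, reliability=0.976):.3f}")
# 0.890
```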
|
||||
|
||||
---
|
||||
|
||||
### V_meta (Prompt Quality)
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
V_meta = 0.35 × completeness +
|
||||
0.30 × clarity +
|
||||
0.20 × adaptability +
|
||||
0.15 × maintainability
|
||||
|
||||
Where:
|
||||
- completeness = features_implemented / features_needed
|
||||
- clarity = 1 - (ambiguous_instructions / total_instructions)
|
||||
- adaptability = successful_task_types / tested_task_types
|
||||
- maintainability = 1 - (prompt_complexity / max_complexity)
|
||||
```
|
||||
|
||||
**Target**: V_meta ≥ 0.80
|
||||
|
||||
**Example**:
|
||||
```
|
||||
Completeness: 8/8 features = 1.0
|
||||
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
|
||||
Adaptability: 5/6 task types = 0.83
|
||||
Maintainability: 1 - (150 lines / 300 max) = 0.50
|
||||
|
||||
V_meta = 0.35 × 1.0 +
|
||||
0.30 × 0.90 +
|
||||
0.20 × 0.83 +
|
||||
0.15 × 0.50
|
||||
|
||||
= 0.35 + 0.27 + 0.166 + 0.075
|
||||
= 0.861 ✅ (exceeds target)
|
||||
```

---

## Metric Collection

### Automated Collection

**Session Analysis**:
```bash
# Average duration of successful agent tasks in the session
query_tools --tool="Task" --scope=session | \
  jq -r '.[] | select(.status == "success") | .duration' | \
  awk '{sum+=$1; n++} END {print sum/n}'
```

**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh

AGENT_NAME=$1
SESSION=$2

# Success rate
total=$(grep -c "agent=$AGENT_NAME" "$SESSION")
success=$(grep -c "agent=$AGENT_NAME.*success" "$SESSION")
success_rate=$(echo "scale=2; $success / $total" | bc)

# Average time
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
  jq -r '.duration' | \
  awk '{sum+=$1; n++} END {print sum/n}')

# Quality (requires manual rating file)
avg_quality=$(grep "$AGENT_NAME" "${SESSION}.ratings" | \
  awk '{sum+=$2; n++} END {print sum/n}')

echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```
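
Assuming the session log is line-oriented JSON carrying `agent=` markers (as the greps imply), a plausible invocation is `scripts/measure-agent-metrics.sh coverage-analyzer session.jsonl`; the agent name and filename here are illustrative. The quality step additionally expects a companion `<session>.ratings` file with one `agent score` pair per line.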

---

### Manual Collection

**Test Suite Template**:

```markdown
## Agent Test Suite: [Agent Name]

**Iteration**: [N]
**Date**: [YYYY-MM-DD]

### Test Cases

| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...

### Summary

- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```

---

## Benchmarking

### Cross-Agent Comparison

**Standard Test Suite**: 20 representative tasks

**Example Results**:

| Agent | Success | Quality | Time | V_inst |
|------------|---------|---------|------|--------|
| Explore v1 | 60% | 3.1 | 4.2m | 0.62 |
| Explore v2 | 87.5% | 4.2 | 2.8m | 0.89 |
| Explore v3 | 90% | 4.3 | 2.6m | 0.91 |

**Improvement (v1 → v3)**: +30 points success rate, +1.2 quality, 38% faster

---

### Baseline Comparison

**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success

---

## Regression Testing

### Track Metrics Over Time

**Regression Detection**:

```
if current_metric < (previous_metric - threshold):
    alert("REGRESSION DETECTED")
```

**Thresholds**:
- Success Rate: -5% (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)

**Example**:

```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 84% success, 4.1 quality, 3.2 min ⚠️ REGRESSION
             (success -6 points and time +23% both breach thresholds)

Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```
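
A sketch of the same check applied across all three metrics at once, assuming per-iteration metrics are stored as plain dicts (the helper and field names are illustrative):

```python
def detect_regressions(prev: dict, curr: dict) -> list[str]:
    """Flag metrics that worsened beyond their allowed threshold."""
    alerts = []
    if curr["success_rate"] < prev["success_rate"] - 0.05:  # -5 points
        alerts.append("success_rate")
    if curr["quality"] < prev["quality"] - 0.3:             # -0.3 on 1-5 scale
        alerts.append("quality")
    if curr["time_min"] > prev["time_min"] * 1.20:          # +20% time
        alerts.append("time")
    return alerts

iter3 = {"success_rate": 0.90, "quality": 4.3, "time_min": 2.6}
iter4 = {"success_rate": 0.84, "quality": 4.1, "time_min": 3.2}
print(detect_regressions(iter3, iter4))  # ['success_rate', 'time']
```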

---

## Reporting Template

```markdown
## Agent Metrics Report

**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]

### Performance Metrics

**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌

**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌

**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌

**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌

### Composite Scores

**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌

**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌

### Comparison to Baseline

| Metric | Baseline | Current | Δ |
|--------------|----------|---------|--------|
| Success Rate | [X]% | [Y]% | [+/-]% |
| Quality | [X.X] | [Y.Y] | [+/-] |
| Time | [X.X]m | [Y.Y]m | [+/-]% |
| V_instance | [X.XX] | [Y.YY] | [+/-] |

### Recommendations

1. [Action item based on metrics]
2. [Action item based on metrics]

### Next Steps

- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```

---

**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite
339
skills/agent-prompt-evolution/templates/test-suite-template.md
Normal file
@@ -0,0 +1,339 @@
# Agent Test Suite Template

**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type

---

## Test Suite: [Agent Name]

**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]

---

## Test Configuration

**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]

**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
  - Simple: [N] tasks
  - Medium: [N] tasks
  - Complex: [N] tasks

---

## Test Cases

### Task 1: [Brief Description]

**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]

**Input**:
```
[Exact prompt or command given to agent]
```

**Expected Outcome**:
```
[What a successful completion looks like]
```

**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k

**Notes**:
```
[Any observations, issues, or improvements identified]
```

---

### Task 2: [Brief Description]

**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]

**Input**:
```
[Exact prompt or command given to agent]
```

**Expected Outcome**:
```
[What a successful completion looks like]
```

**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k

**Notes**:
```
[Any observations, issues, or improvements identified]
```

---

[Repeat for all 20 tasks]

---

## Summary Statistics

### Overall Performance

**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)

Success Rate: [X]% ([successful] / [total])
```

**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```

**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```

**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])

Reliability Score: [X.XX]
```

---

## Composite Metrics

### V_instance Calculation

```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: min(target / actual, 1) = [0.XX]
Reliability: [0.XX]

V_instance = 0.40 × [success_rate] +
             0.30 × [quality_normalized] +
             0.20 × [efficiency_score] +
             0.10 × [reliability]

           = [0.XX] + [0.XX] + [0.XX] + [0.XX]
           = [0.XX]

Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```

---

## Failure Analysis

### Failed Tasks

| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N] | [Brief] | [Why failed] | [Type] |
| [N] | [Brief] | [Why failed] | [Type] |

### Failure Patterns

**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]

**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]

---

## Quality Issues

### Tasks with Quality < 4

| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N] | [1-3] | [Description] | [Actions] |
| [N] | [1-3] | [Description] | [Actions] |

---

## Efficiency Analysis

### Tasks Exceeding Time Budget

| Task ID | Actual Time | Target Time | Δ | Reason |
|---------|-------------|-------------|------|--------|
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |

### Token Usage Analysis

```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```

---

## Recommendations

### Priority 1 (Critical)

1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% success rate

2. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% quality

### Priority 2 (Important)

1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]

### Priority 3 (Nice to Have)

1. **[Improvement]**: [Description]
   - Benefit: [What improves]
   - Effort: [Low/Medium/High]

---

## Next Iteration Plan

### Focus Areas

1. **[Area 1]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]

2. **[Area 2]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]

### Prompt Changes

**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]

**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]

**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]

---

## Test Suite Evolution

### Version History

| Version | Date | Success | Quality | V_inst | Changes |
|---------|------|---------|---------|--------|-----------|
| 0.1 | [D] | [X]% | [X.X] | [0.XX] | Baseline |
| 0.2 | [D] | [X]% | [X.X] | [0.XX] | [Changes] |
| [curr] | [D] | [X]% | [X.X] | [0.XX] | [Changes] |

### Convergence Tracking

```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current:     V_instance = [0.XX] ([+/-]%)

Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```

---

## Appendix: Task Catalog

### Task Templates by Category

**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"

**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"

**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"

### Complexity Guidelines

**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation

**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation

**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation

---

**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`