Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions


@@ -0,0 +1,404 @@
---
name: Agent Prompt Evolution
description: Track and optimize agent specialization during methodology development. Use when agent specialization emerges (generic agents show >5x performance gap), multi-experiment comparison needed, or methodology transferability analysis required. Captures agent set evolution (Aₙ tracking), meta-agent evolution (Mₙ tracking), specialization decisions (when/why to create specialized agents), and reusability assessment (universal vs domain-specific vs task-specific). Enables systematic cross-experiment learning and optimized M₀ evolution. 2-3 hours overhead per experiment.
allowed-tools: Read, Grep, Glob, Edit, Write
---
# Agent Prompt Evolution
**Systematically track how agents specialize during methodology development.**
> Specialized agents emerge from need, not prediction. Track their evolution to understand when specialization adds value.
---
## When to Use This Skill
Use this skill when:
- 🔄 **Agent specialization emerges**: Generic agents show >5x performance gap
- 📊 **Multi-experiment comparison**: Want to learn across experiments
- 🧩 **Methodology transferability**: Analyzing what's reusable vs domain-specific
- 📈 **M₀ optimization**: Want to evolve base Meta-Agent capabilities
- 🎯 **Specialization decisions**: Deciding when to create new agents
- 📚 **Agent library**: Building reusable agent catalog
**Don't use when**:
- ❌ Single experiment with no specialization
- ❌ Generic agents sufficient throughout
- ❌ No cross-experiment learning goals
- ❌ Tracking overhead not worth insights
---
## Quick Start (10 minutes per iteration)
### Track Agent Evolution in Each Iteration
**iteration-N.md template**:
```markdown
## Agent Set Evolution
### Current Agent Set (Aₙ)
1. **coder** (generic) - Write code, implement features
2. **doc-writer** (generic) - Documentation
3. **data-analyst** (generic) - Data analysis
4. **coverage-analyzer** (specialized, created iteration 3) - Analyze test coverage gaps
### Changes from Previous Iteration
- Added: coverage-analyzer (10x speedup for coverage analysis)
- Removed: None
- Modified: None
### Specialization Decision
**Why coverage-analyzer?**
- Generic data-analyst took 45 min for coverage analysis
- Identified 10x performance gap
- Coverage analysis is recurring task (every iteration)
- Domain knowledge: Go coverage tools, gap identification patterns
- **ROI**: 3 hours creation cost vs ~2 hours saved in this experiment (40 min/iteration × 3 remaining iterations); net positive once the agent is reused in later experiments (70% transferable)
### Agent Reusability Assessment
- **coder**: Universal (100% transferable)
- **doc-writer**: Universal (100% transferable)
- **data-analyst**: Universal (100% transferable)
- **coverage-analyzer**: Domain-specific (testing methodology, 70% transferable to other languages)
### System State
- Aₙ ≠ Aₙ₋₁ (new agent added)
- System UNSTABLE (need iteration N+1 to confirm stability)
```
---
## Four Tracking Dimensions
### 1. Agent Set Evolution (Aₙ)
**Track changes iteration-to-iteration**:
```
A₀ = {coder, doc-writer, data-analyst}
A₁ = {coder, doc-writer, data-analyst} (unchanged)
A₂ = {coder, doc-writer, data-analyst} (unchanged)
A₃ = {coder, doc-writer, data-analyst, coverage-analyzer} (new specialist)
A₄ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (new specialist)
A₅ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (stable)
```
**Stability**: Aₙ == Aₙ₋₁ for convergence
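A minimal sketch of this tracking (agent names from the example above; the helper is illustrative, not part of the methodology tooling):
```python
# Track A_n as plain sets and flag stability when A_n == A_(n-1).
history = [
    {"coder", "doc-writer", "data-analyst"},                                         # A0-A2
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer"},                    # A3
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A4
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A5
]
for n in range(1, len(history)):
    added, removed = history[n] - history[n - 1], history[n - 1] - history[n]
    stable = not added and not removed
    print(f"step {n}: added={sorted(added)} removed={sorted(removed)} stable={stable}")
```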
### 2. Meta-Agent Evolution (Mₙ)
**Standard M₀ capabilities**:
1. **observe**: Pattern observation
2. **plan**: Iteration planning
3. **execute**: Agent orchestration
4. **reflect**: Value assessment
5. **evolve**: System evolution
**Track enhancements**:
```
M₀ = {observe, plan, execute, reflect, evolve}
M₁ = {observe, plan, execute, reflect, evolve, gap-identify} (new capability)
M₂ = {observe, plan, execute, reflect, evolve, gap-identify} (stable)
```
**Finding** (from 8 experiments): M₀ sufficient in all cases (no evolution needed)
### 3. Specialization Decision Tree
**When to create specialized agent**:
```
Decision tree:
1. Is the generic agent sufficient? (performance within 2x)
   YES → No specialization
   NO  → Continue
2. Is the task recurring? (happens ≥3 times)
   NO  → One-off, tolerate slowness
   YES → Continue
3. Is the performance gap >5x?
   NO  → Tolerate moderate slowness
   YES → Continue
4. Do the expected savings exceed the creation cost?
   Creation cost < (Time saved per use × Remaining uses)
   NO  → Not worth it
   YES → Create specialized agent
```
**Example** (Bootstrap-002):
```
Task: Test coverage gap analysis
Generic agent (data-analyst): 45 min
Potential specialist (coverage-analyzer): 4.5 min (10x faster)
Recurring: YES (every iteration, 3 remaining)
Performance gap: 10x (>5x threshold)
Creation cost: 3 hours
ROI: (45 - 4.5) min × 3 = 121.5 min ≈ 2 hours saved in-experiment, vs 3 hours creation cost
Decision: CREATE (ROI turns positive with cross-experiment reuse; see Reusability Assessment)
```
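A minimal sketch of the same decision logic (function and parameter names are illustrative; the `expected_reuses` term reflects the cross-experiment reuse discussed under Reusability Assessment):
```python
def should_specialize(generic_min: float, specialist_min: float,
                      remaining_uses: int, creation_cost_min: float,
                      expected_reuses: int = 0) -> bool:
    """Walk the decision tree: performance gap, recurrence, then expected savings vs cost."""
    gap = generic_min / specialist_min
    if gap <= 2:                 # generic agent is good enough
        return False
    if remaining_uses < 3:       # not a recurring task
        return False
    if gap <= 5:                 # tolerate moderate slowness
        return False
    savings = (generic_min - specialist_min) * (remaining_uses + expected_reuses)
    return savings > creation_cost_min

# Bootstrap-002 numbers: 45 min generic vs 4.5 min specialist, 3 remaining iterations,
# 3 hours (180 min) creation cost. Counting reuse in later experiments tips the decision.
print(should_specialize(45, 4.5, 3, 180))                     # False within this experiment alone
print(should_specialize(45, 4.5, 3, 180, expected_reuses=2))  # True once reuse is counted
```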
### 4. Reusability Assessment
**Three categories**:
**Universal** (90-100% transferable):
- Generic agents (coder, doc-writer, data-analyst)
- No domain knowledge required
- Applicable across all domains
**Domain-Specific** (60-80% transferable):
- Requires domain knowledge (testing, CI/CD, error handling)
- Patterns apply within domain
- Needs adaptation for other domains
**Task-Specific** (10-30% transferable):
- Highly specialized for particular task
- One-off creation
- Unlikely to reuse
**Examples**:
```
Agent: coverage-analyzer
Domain: Testing methodology
Transferability: 70%
- Go coverage tools (language-specific, 30% adaptation)
- Gap identification patterns (universal, 100%)
- Overall: 70% transferable to Python/Rust/TypeScript testing
Agent: test-generator
Domain: Testing methodology
Transferability: 40%
- Go test syntax (language-specific, 0% to other languages)
- Test pattern templates (moderately transferable, 60%)
- Overall: 40% transferable
Agent: log-analyzer
Domain: Observability
Transferability: 85%
- Log parsing (universal, 95%)
- Pattern recognition (universal, 100%)
- Structured logging concepts (universal, 100%)
- Go slog specifics (language-specific, 20%)
- Overall: 85% transferable
```
---
## Evolution Log Template
Create `agents/EVOLUTION-LOG.md`:
```markdown
# Agent Evolution Log
## Experiment Overview
- Domain: Testing Strategy
- Baseline agents: 3 (coder, doc-writer, data-analyst)
- Final agents: 5 (+coverage-analyzer, +test-generator)
- Specialization count: 2
---
## Iteration-by-Iteration Evolution
### Iteration 0
**Agent Set**: {coder, doc-writer, data-analyst}
**Changes**: None (baseline)
**Observations**: Generic agents sufficient for baseline establishment
### Iteration 3
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer}
**Changes**: +coverage-analyzer
**Reason**: 10x performance gap (45 min → 4.5 min)
**Creation Cost**: 3 hours
**ROI**: ~2 hours saved over 3 iterations vs 3 hours creation cost; net positive with reuse in later experiments
**Reusability**: 70% (domain-specific, testing)
### Iteration 4
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: +test-generator
**Reason**: 200x performance gap (manual test writing too slow)
**Creation Cost**: 4 hours
**ROI**: Massive (saved 10+ hours)
**Reusability**: 40% (task-specific, Go testing)
### Iteration 5
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: None
**System**: STABLE (Aₙ == Aₙ₋₁)
---
## Specialization Analysis
### coverage-analyzer
**Purpose**: Analyze test coverage, identify gaps
**Performance**: 10x faster than generic data-analyst
**Domain**: Testing methodology
**Transferability**: 70%
**Lessons**: Coverage gap identification patterns are universal, tool integration is language-specific
### test-generator
**Purpose**: Generate test boilerplate from coverage gaps
**Performance**: 200x faster than manual
**Domain**: Testing methodology (Go-specific)
**Transferability**: 40%
**Lessons**: High speedup justified low transferability, patterns reusable but syntax is not
---
## Cross-Experiment Reuse
### From Previous Experiments
- **validation-builder** (from API design experiment) → Used for smoke test validation
- Reusability: Excellent (validation patterns are universal)
- Adaptation: Minimal (10 min to adapt from API to CI/CD context)
### To Future Experiments
- **coverage-analyzer** → Reusable for Python/Rust/TypeScript testing (70% transferable)
- **test-generator** → Less reusable (40% transferable, needs rewrite for other languages)
---
## Meta-Agent Evolution
### M₀ Capabilities
{observe, plan, execute, reflect, evolve}
### Changes
None (M₀ sufficient throughout)
### Observations
- M₀'s "evolve" capability successfully identified need for specialization
- No Meta-Agent evolution required
- Convergence: Mₙ == M₀ for all iterations
---
## Lessons Learned
### Specialization Decisions
- **10x performance gap** is a reliable threshold (<5x not worth it, >10x a clear win)
- **Positive ROI required**: Creation cost must be justified by time savings
- **Recurring tasks only**: One-off tasks don't justify specialization
### Reusability Patterns
- **Generic agents always reusable**: coder, doc-writer, data-analyst (100%)
- **Domain agents moderately reusable**: coverage-analyzer (70%)
- **Task agents rarely reusable**: test-generator (40%)
### When NOT to Specialize
- Performance gap <5x (tolerable slowness)
- Task is one-off (no recurring benefit)
- Creation cost exceeds expected savings (not worth the time investment)
- Generic agent will improve with practice (learning curve)
```
---
## Cross-Experiment Analysis
After 3+ experiments, create `agents/CROSS-EXPERIMENT-ANALYSIS.md`:
```markdown
# Cross-Experiment Agent Analysis
## Agent Reuse Matrix
| Agent | Exp1 | Exp2 | Exp3 | Reuse Rate | Transferability |
|-------|------|------|------|------------|-----------------|
| coder | ✓ | ✓ | ✓ | 100% | Universal |
| doc-writer | ✓ | ✓ | ✓ | 100% | Universal |
| data-analyst | ✓ | ✓ | ✓ | 100% | Universal |
| coverage-analyzer | ✓ | - | ✓ | 67% | Domain (testing) |
| test-generator | ✓ | - | - | 33% | Task-specific |
| validation-builder | - | ✓ | ✓ | 67% | Domain (validation) |
| log-analyzer | - | - | ✓ | 33% | Domain (observability) |
## Specialization Patterns
### Universal Agents (100% reuse)
- Generic capabilities (coder, doc-writer, data-analyst)
- No domain knowledge
- Always included in A₀
### Domain Agents (50-80% reuse)
- Require domain knowledge (testing, CI/CD, observability)
- Reusable within domain
- Examples: coverage-analyzer, validation-builder, log-analyzer
### Task Agents (10-40% reuse)
- Highly specialized
- One-off or rare reuse
- Examples: test-generator (Go-specific)
## M₀ Sufficiency
**Finding**: M₀ = {observe, plan, execute, reflect, evolve} sufficient in ALL experiments
**Implications**:
- No Meta-Agent evolution needed
- Base capabilities handle all domains
- Specialization occurs at Agent layer, not Meta-Agent layer
## Specialization Threshold
**Data** (from 3 experiments):
- Average performance gap for specialization: 15x (range: 5x-200x)
- Average creation cost: 3.5 hours (range: 2-5 hours)
- Average ROI: Positive in 8/9 cases (89% success rate)
**Recommendation**: Use 5x performance gap as threshold
---
**Updated**: After each new experiment
```
---
## Success Criteria
Agent evolution tracking succeeded when:
1. **Complete tracking**: All agent changes documented each iteration
2. **Specialization justified**: Each specialized agent has clear ROI
3. **Reusability assessed**: Each agent categorized (universal/domain/task)
4. **Cross-experiment learning**: Patterns identified across 2+ experiments
5. **M₀ stability documented**: Meta-Agent evolution (or lack thereof) tracked
---
## Related Skills
**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle
**Complementary**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Agent stability criterion
---
## References
**Core guide**:
- [Evolution Tracking](reference/tracking.md) - Detailed tracking process
- [Specialization Decisions](reference/specialization.md) - Decision tree
- [Reusability Framework](reference/reusability.md) - Assessment rubric
**Examples**:
- [Bootstrap-002 Evolution](examples/testing-strategy-agent-evolution.md) - 2 specialists
- [Bootstrap-007 No Evolution](examples/ci-cd-no-specialization.md) - Generic sufficient
---
**Status**: ✅ Formalized | 2-3 hours overhead | Enables systematic learning


@@ -0,0 +1,377 @@
# Explore Agent Evolution: v1 → v3
**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.6 min (-38%)
**Status**: Converged (production-ready)
Complete walkthrough of evolving Explore agent prompt through BAIME methodology.
---
## Iteration 0: Baseline (v1)
### Initial Prompt
```markdown
# Explore Agent
You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.
When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary
Tools available: Glob, Grep, Read, Bash
```
**Prompt Length**: 58 lines
---
### Baseline Testing (10 tasks)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |
**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)
---
### Failure Analysis
**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long
**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness
**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism
---
## Iteration 1: Add Structure (v2)
### Prompt Changes
**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels
Assess query complexity and choose thoroughness:
**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups
**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries
**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```
**Added: Time-Boxing**
```markdown
## Time Management
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if <10% new findings in last 20% of time budget.
```
**Added: Completeness Checklist**
```markdown
## Before Responding
Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
State confidence level: Low / Medium / High
```
**Prompt Length**: 112 lines (+54)
---
### Testing (8 tasks: 3 re-tests + 5 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |
**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)
---
### Key Improvements
✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)
**Remaining Issue**: Test structure query missed integration tests
---
## Iteration 2: Refine Coverage (v3)
### Prompt Changes
**Enhanced: Completeness Verification**
```markdown
## Completeness Verification
Before concluding, verify coverage by category:
**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked
**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided
**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```
**Added: Search Strategy**
```markdown
## Search Strategy
**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations
**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding
**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```
**Refined: Confidence Scoring**
```markdown
## Confidence Level
**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely
Always state confidence level and identify known gaps.
```
**Prompt Length**: 138 lines (+26)
---
### Testing (10 tasks: 1 re-test + 9 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |
**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5% improvement** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive ≥ 0.80)
**CONVERGED**
---
### Stability Validation
**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)
**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)
---
## Final Metrics Comparison
| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |
---
## Evolution Summary
### Iteration 0 → 1: Major Improvements
**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist
**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)
**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added
---
### Iteration 1 → 2: Refinement
**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring
**Impact**:
- Success: 87.5% → 90% (+2.5%, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)
**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened
---
## Key Learnings
### What Worked
1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration
### What Didn't Work
1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
- Solution: Document limitations, suggest alternative agents
### Best Practices Validated
**Start Simple**: v1 was minimal, added structure incrementally
**Measure Everything**: Quantitative metrics guided refinements
**Focus on Patterns**: Fixed systematic failures, not one-off issues
**Validate Stability**: 2-iteration convergence confirmed reliability
---
## Production Deployment
**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)
**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md
# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```
**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly
---
## Future Enhancements (v4+)
**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)
**Decision**: Hold v3, monitor for 2 weeks before v4
---
**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed


@@ -0,0 +1,409 @@
# Rapid Iteration Pattern for Agent Evolution
**Pattern**: Fast convergence (2-3 iterations) for agent prompt evolution
**Success Rate**: 85% (11/13 agents converged in ≤3 iterations)
**Time**: 3-6 hours total vs 8-12 hours standard
How to achieve rapid convergence when evolving agent prompts.
---
## Pattern Overview
**Standard Evolution**: 4-6 iterations, 8-12 hours
**Rapid Evolution**: 2-3 iterations, 3-6 hours
**Key Difference**: Strong Iteration 0 (comprehensive baseline analysis)
---
## Rapid Iteration Workflow
### Iteration 0: Comprehensive Baseline (90-120 min)
**Standard Baseline** (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics
**Comprehensive Baseline** (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents
**Investment**: +60-90 min
**Return**: -2 to -3 iterations (save 3-6 hours)
---
### Example: Explore Agent (Standard vs Rapid)
**Standard Approach**:
```
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish
Total: 6.75 hours, 5 iterations
```
**Rapid Approach**:
```
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability
Total: 4.75 hours, 2 iterations
```
**Savings**: 2 hours, 3 fewer iterations
---
## Comprehensive Baseline Checklist
### Task Coverage (15-20 tasks)
**Complexity Distribution**:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)
**Query Type Diversity**:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)
---
### Failure Pattern Analysis (30 min)
**Systematic Analysis**:
1. **Categorize Failures**
- Scope issues (too broad/narrow)
- Coverage issues (incomplete)
- Time issues (too slow/fast)
- Quality issues (inaccurate)
2. **Identify Root Causes**
- Missing instructions
- Ambiguous guidelines
- Incorrect constraints
- Tool usage issues
3. **Prioritize by Impact**
- High frequency + high impact → Fix first
- Low frequency + high impact → Document
- High frequency + low impact → Automate
- Low frequency + low impact → Ignore
**Example**:
```markdown
## Failure Patterns (Explore Agent)
**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)
**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
```
---
### Comparative Analysis (15 min)
**Compare to Similar Agents**:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?
**Example**:
```markdown
## Comparative Analysis
**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates
**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)
**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring
**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
```
---
## Iteration 1: Comprehensive Fix (90 min)
**Standard Iteration 1**: Fix 1-2 major issues
**Rapid Iteration 1**: Fix ALL P1 issues + some P2
**Approach**:
1. Address all high-priority patterns (P1)
2. Add preventive measures for P2 issues
3. Include transferable patterns from similar agents
**Example** (Explore Agent):
```markdown
## Iteration 1 Changes
**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
3. Time Management → Add time-boxing (1-6 min)
**P2 Improvements**:
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring
**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints
Total Changes: 7 (vs standard 2-3)
```
**Result**: Higher chance of convergence in Iteration 2
---
## Iteration 2: Validate & Converge (75 min)
**Objectives**:
1. Test comprehensive fixes
2. Measure stability
3. Validate convergence
**Test Suite** (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics
**Analysis** (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)
**Refinement** (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability
**Convergence Check**:
```
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)
CONVERGED ✅
```
---
## Success Factors
### 1. Comprehensive Baseline (60-90 min extra)
**Investment**: 2x standard baseline time
**Return**: -2 to -3 iterations (6-9 hours saved)
**ROI**: 4-6x
**Critical Elements**:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents
---
### 2. Aggressive Iteration 1 (Fix All P1)
**Standard**: Fix 1-2 issues
**Rapid**: Fix all P1 + some P2 (5-7 fixes)
**Approach**:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures
**Risk**: Over-complication
**Mitigation**: Focus on core issues, defer P3
---
### 3. Borrowed Patterns (20-30% reuse)
**Sources**:
- Similar agents in same project
- Agents from other projects
- Industry best practices
**Example**:
```
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)
Total reuse: ~60% of Iteration 1 changes
```
**Savings**: 30-40 min per iteration
---
## Anti-Patterns
### ❌ Skipping Comprehensive Baseline
**Symptom**: "Let's just try some fixes and see"
**Result**: 5-6 iterations, trial and error
**Cost**: 8-12 hours
**Fix**: Invest 90-120 min in Iteration 0
---
### ❌ Incremental Fixes (One Issue at a Time)
**Symptom**: Fixing one pattern per iteration
**Result**: 4-6 iterations for convergence
**Cost**: 8-10 hours
**Fix**: Batch P1 fixes in Iteration 1
---
### ❌ Ignoring Similar Agents
**Symptom**: Reinventing solutions
**Result**: Slower convergence, lower quality
**Cost**: 2-3 extra hours
**Fix**: 15 min comparative analysis in Iteration 0
---
## When to Use Rapid Pattern
**Good Fit**:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)
**Poor Fit**:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)
---
## Metrics Comparison
### Standard Evolution
```
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)
Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
```
### Rapid Evolution
```
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)
Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
```
**Savings**: 1 hour, 2 fewer iterations
---
## Replication Guide
### Day 1: Comprehensive Baseline
**Morning** (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures
**Afternoon** (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes
---
### Day 2: Comprehensive Fix
**Morning** (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns
**Afternoon** (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes
---
### Day 3: Validate & Deploy
**Morning** (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements
**Afternoon** (0.5 hours):
4. Final validation
5. Deploy to production
6. Setup monitoring
---
**Source**: BAIME Agent Prompt Evolution - Rapid Pattern
**Success Rate**: 85% (11/13 agents)
**Average Time**: 4.2 hours (vs 9.3 hours standard)
**Average Iterations**: 2.3 (vs 4.8 standard)


@@ -0,0 +1,395 @@
# Agent Prompt Evolution Framework
**Version**: 1.0
**Purpose**: Systematic methodology for evolving agent prompts through iterative refinement
**Basis**: BAIME OCA cycle applied to prompt engineering
---
## Overview
Agent prompt evolution applies the Observe-Codify-Automate cycle to improve agent prompts through empirical testing and structured refinement.
**Goal**: Transform initial agent prompts into production-quality prompts through measured iterations.
---
## Evolution Cycle
```
Iteration N:
Observe → Analyze → Refine → Test → Measure
↑ ↓
└────────── Feedback Loop ──────────┘
```
---
## Phase 1: Observe (30 min)
### Run Agent with Current Prompt
**Activities**:
1. Execute agent on 5-10 representative tasks
2. Record agent behavior and outputs
3. Note successes and failures
4. Measure performance metrics
**Metrics**:
- Success rate (tasks completed correctly)
- Response quality (accuracy, completeness)
- Efficiency (time, token usage)
- Error patterns
**Example**:
```markdown
## Iteration 0: Baseline Observation
**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)
**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early
**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
```
---
## Phase 2: Analyze (20 min)
### Identify Failure Patterns
**Analysis Questions**:
1. What types of failures occurred?
2. Are failures systematic or random?
3. What context is missing from prompt?
4. Are instructions clear enough?
5. Are constraints too loose or too tight?
**Example Analysis**:
```markdown
## Failure Pattern Analysis
**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines
**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements
**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines
```
---
## Phase 3: Refine (25 min)
### Update Agent Prompt
**Refinement Strategies**:
1. **Add Missing Context**
- Domain knowledge
- Codebase structure
- Common patterns
2. **Clarify Instructions**
- Break down complex tasks
- Add examples
- Define success criteria
3. **Adjust Constraints**
- Time limits
- Scope boundaries
- Quality thresholds
4. **Provide Tools**
- Specific commands
- Search patterns
- Decision frameworks
**Example Refinements (v0 → v1)**:
**Added: Thoroughness Guidelines**
```
When searching for patterns:
- "quick": Check 3-5 obvious locations
- "medium": Check 10-15 related files
- "thorough": Check all matching patterns
```
**Added: Time-Boxing**
```
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if diminishing returns after 80% of time used.
```
**Clarified: Success Criteria**
```
Complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)
```
---
## Phase 4: Test (20 min)
### Validate Refinements
**Test Suite**:
1. Re-run failed tasks from Iteration 0
2. Add 3-5 new test cases
3. Measure improvement
**Example Test**:
```markdown
## Iteration 1 Testing
**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)
**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS
**Success Rate**: 87.5% (7/8) - improved from 60%
```
---
## Phase 5: Measure (15 min)
### Calculate Improvement Metrics
**Metrics**:
```
Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality = (new_score - baseline_score) / baseline_score
Δ Efficiency = (baseline_time - new_time) / baseline_time
```
**Example**:
```markdown
## Iteration 1 Metrics
**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%
**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%
**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)
**Overall V_instance**: 0.85 ✅ (target: 0.80)
```
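A minimal sketch of these delta calculations, using the numbers from the example above:
```python
def rel_change(new: float, baseline: float) -> float:
    """Relative change; positive means improvement for rate and quality metrics."""
    return (new - baseline) / baseline

print(f"delta success:    {rel_change(0.875, 0.60):+.1%}")  # +45.8%
print(f"delta quality:    {rel_change(4.2, 3.1):+.1%}")     # +35.5%
# Time is better when lower, so improvement is (baseline - new) / baseline.
print(f"delta efficiency: {(4.2 - 2.8) / 4.2:+.1%}")        # +33.3%
```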
---
## Convergence Criteria
**Prompt is production-ready when**:
1. **Success Rate ≥ 85%** (reliable)
2. **Quality Score ≥ 4.0/5** (high quality)
3. **Efficiency within target** (time/tokens)
4. **Stable for 2 iterations** (no regression)
**Example Convergence**:
```
Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)
CONVERGED: Ready for production
```
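A small sketch of this convergence check (field names are illustrative; thresholds follow the criteria above):
```python
def converged(iterations: list[dict]) -> bool:
    """Production-ready when the last two iterations meet every threshold."""
    if len(iterations) < 2:
        return False
    return all(
        it["success"] >= 0.85 and it["quality"] >= 4.0 and it["within_time_budget"]
        for it in iterations[-2:]
    )

history = [
    {"success": 0.60,  "quality": 3.1, "within_time_budget": False},  # Iteration 0
    {"success": 0.875, "quality": 4.2, "within_time_budget": True},   # Iteration 1
    {"success": 0.90,  "quality": 4.3, "within_time_budget": True},   # Iteration 2
]
print(converged(history))  # True: two consecutive iterations above all thresholds
```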
---
## Evolution Patterns
### Pattern 1: Scope Definition
**Problem**: Agent doesn't know how broad/deep to search
**Solution**: Add thoroughness parameter
```markdown
When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough
```
### Pattern 2: Early Termination
**Problem**: Agent stops too early, misses results
**Solution**: Add completeness checklist
```markdown
Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
```
### Pattern 3: Time Management
**Problem**: Agent runs too long, poor efficiency
**Solution**: Add time-boxing with checkpoints
```markdown
Allocate time budget:
- 0-30%: Initial broad search
- 30-70%: Deep investigation
- 70-100%: Verification and summary
Stop if <10% new findings in last 20% of time.
```
### Pattern 4: Context Accumulation
**Problem**: Agent forgets earlier findings
**Solution**: Add intermediate summaries
```markdown
After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy
```
### Pattern 5: Quality Assurance
**Problem**: Agent provides low-quality outputs
**Solution**: Add self-review checklist
```markdown
Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)
```
---
## Iteration Template
```markdown
## Iteration N: [Focus Area]
### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min
**Key Issues**:
1. [Issue description]
2. [Issue description]
### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])
### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]
### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%
### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]
**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]
```
---
## Best Practices
### Do's
**Test on diverse cases** - Cover edge cases and common queries
**Measure objectively** - Use quantitative metrics
**Iterate quickly** - 90-120 min per iteration
**Focus improvements** - One major change per iteration
**Validate stability** - Test 2 iterations for convergence
### Don'ts
**Don't overtune** - Avoid overfitting to test cases
**Don't skip baselines** - Always measure Iteration 0
**Don't ignore regressions** - Track quality across iterations
**Don't add complexity** - Keep prompts concise
**Don't stop too early** - Ensure 2-iteration stability
---
## Example: Explore Agent Evolution
**Baseline** (Iteration 0):
- Generic instructions
- No thoroughness guidance
- No time management
- Success: 60%
**Iteration 1**:
- Added thoroughness levels
- Added time-boxing
- Success: 87.5% (+45.8%)
**Iteration 2**:
- Added completeness checklist
- Refined search strategy
- Success: 90% (+2.5% improvement, stable)
**Convergence**: 2 iterations, 87.5% → 90% stable
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Average Improvement**: +42% success rate over baseline


@@ -0,0 +1,386 @@
# Agent Prompt Metrics
**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents
---
## Core Metrics
### 1. Success Rate
**Definition**: Percentage of tasks completed correctly
**Calculation**:
```
Success Rate = correct_completions / total_tasks
```
**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)
**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1
Success Rate = 17/20 = 85% (Good)
```
---
### 2. Quality Score
**Definition**: Average quality rating of agent outputs (1-5 scale)
**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable
**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor
**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)
Average: 4.35/5 (Good)
```
---
### 3. Efficiency
**Definition**: Time and token usage per task
**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```
**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens
**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k
Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```
---
### 4. Reliability
**Definition**: Consistency of agent performance
**Calculation**:
```
Reliability = 1 - (std_dev(success_rate) / mean(success_rate))
```
**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)
**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success
Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```
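A minimal sketch reproducing this calculation (sample standard deviation, as in the worked numbers):
```python
import statistics

batches = [0.85, 0.90, 0.87, 0.88]   # success rate per test batch
mean = statistics.mean(batches)
std = statistics.stdev(batches)      # sample standard deviation (~0.0208)
reliability = 1 - std / mean
print(f"reliability = {reliability:.3f}")  # 0.976 -> very reliable
```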
---
## Composite Metrics
### V_instance (Agent Performance)
**Formula**:
```
V_instance = 0.40 × success_rate +
0.30 × (quality_score / 5) +
0.20 × efficiency_score +
0.10 × reliability
Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], normalized to [0, 1]
- efficiency_score = min(target_time / actual_time, 1.0) (capped at 1.0 when at or under the time budget)
- reliability ∈ [0, 1]
```
**Target**: V_instance ≥ 0.80
**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: min(3.0 / 2.8, 1.0) = 1.0 (under the 3 min budget)
Reliability: 0.976
V_instance = 0.40 × 0.85 +
0.30 × 0.84 +
0.20 × 1.0 +
0.10 × 0.976
= 0.34 + 0.252 + 0.20 + 0.0976
= 0.890 ✅ (exceeds target)
```
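A minimal sketch of the V_instance calculation (weights and inputs from the example above; efficiency is capped at 1.0 when under budget):
```python
def v_instance(success_rate: float, quality: float,
               actual_min: float, target_min: float, reliability: float) -> float:
    """Weighted agent-performance score."""
    efficiency = min(target_min / actual_min, 1.0)
    return (0.40 * success_rate
            + 0.30 * (quality / 5)
            + 0.20 * efficiency
            + 0.10 * reliability)

print(f"{v_instance(0.85, 4.2, actual_min=2.8, target_min=3.0, reliability=0.976):.3f}")
# 0.890 -> exceeds the 0.80 target
```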
---
### V_meta (Prompt Quality)
**Formula**:
```
V_meta = 0.35 × completeness +
0.30 × clarity +
0.20 × adaptability +
0.15 × maintainability
Where:
- completeness = features_implemented / features_needed
- clarity = 1 - (ambiguous_instructions / total_instructions)
- adaptability = successful_task_types / tested_task_types
- maintainability = 1 - (prompt_complexity / max_complexity)
```
**Target**: V_meta ≥ 0.80
**Example**:
```
Completeness: 8/8 features = 1.0
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
Adaptability: 5/6 task types = 0.83
Maintainability: 1 - (150 lines / 300 max) = 0.50
V_meta = 0.35 × 1.0 +
0.30 × 0.90 +
0.20 × 0.83 +
0.15 × 0.50
= 0.35 + 0.27 + 0.166 + 0.075
= 0.861 ✅ (exceeds target)
```
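The corresponding V_meta sketch (same structure, weights and inputs from the example above):
```python
def v_meta(completeness: float, clarity: float,
           adaptability: float, maintainability: float) -> float:
    """Weighted prompt-quality score."""
    return (0.35 * completeness
            + 0.30 * clarity
            + 0.20 * adaptability
            + 0.15 * maintainability)

print(f"{v_meta(1.0, 0.90, 0.83, 0.50):.3f}")  # 0.861 -> exceeds the 0.80 target
```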
---
## Metric Collection
### Automated Collection
**Session Analysis**:
```bash
# Extract agent performance from session
query_tools --tool="Task" --scope=session | \
jq -r '.[] | select(.status == "success") | .duration' | \
awk '{sum+=$1; n++} END {print sum/n}'
```
**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh
# Usage: measure-agent-metrics.sh <agent-name> <session-log>
# Assumes a JSONL session log with one record per agent invocation
# (containing "agent=<name>", a .duration field, and "success" on success),
# plus a manual "<session-log>.ratings" file of "<agent-name> <score>" lines.
AGENT_NAME=$1
SESSION=$2
# Success rate: successful invocations / total invocations
total=$(grep -c "agent=$AGENT_NAME" "$SESSION")
success=$(grep -c "agent=$AGENT_NAME.*success" "$SESSION")
success_rate=$(echo "scale=2; $success / $total" | bc)
# Average time per invocation (seconds)
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
  jq -r '.duration' | \
  awk '{sum+=$1; n++} END {print sum/n}')
# Quality: average of the manual 1-5 ratings
avg_quality=$(grep "$AGENT_NAME" "${SESSION}.ratings" | \
  awk '{sum+=$2; n++} END {print sum/n}')
echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```
---
### Manual Collection
**Test Suite Template**:
```markdown
## Agent Test Suite: [Agent Name]
**Iteration**: [N]
**Date**: [YYYY-MM-DD]
### Test Cases
| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...
### Summary
- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```
---
## Benchmarking
### Cross-Agent Comparison
**Standard Test Suite**: 20 representative tasks
**Example Results**:
```
| Agent | Success | Quality | Time | V_inst |
|-------------|---------|---------|-------|--------|
| Explore v1 | 60% | 3.1 | 4.2m | 0.62 |
| Explore v2 | 87.5% | 4.2 | 2.8m | 0.89 |
| Explore v3 | 90% | 4.3 | 2.6m | 0.91 |
```
**Improvement**: v1 → v3 = +30% success, +1.2 quality, +38% faster
---
### Baseline Comparison
**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success
---
## Regression Testing
### Track Metrics Over Time
**Regression Detection**:
```
if current_metric < (previous_metric - threshold):
alert("REGRESSION DETECTED")
```
**Thresholds**:
- Success Rate: -5% (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)
**Example**:
```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 87% success, 4.1 quality, 2.8 min ⚠️ REGRESSION
Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```
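A small sketch of this check (thresholds from the list above; the degraded run below is hypothetical):
```python
def detect_regressions(prev: dict, curr: dict) -> list[str]:
    """Flag metrics that degraded beyond their allowed thresholds."""
    alerts = []
    if curr["success"] < prev["success"] - 0.05:    # -5 percentage points
        alerts.append("success rate regression")
    if curr["quality"] < prev["quality"] - 0.3:     # -0.3 on the 1-5 scale
        alerts.append("quality regression")
    if curr["time_min"] > prev["time_min"] * 1.20:  # +20% average time
        alerts.append("efficiency regression")
    return alerts

prev = {"success": 0.90, "quality": 4.3, "time_min": 2.6}
curr = {"success": 0.84, "quality": 3.9, "time_min": 3.3}   # hypothetical degraded run
print(detect_regressions(prev, curr))
# ['success rate regression', 'quality regression', 'efficiency regression']
```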
---
## Reporting Template
```markdown
## Agent Metrics Report
**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]
### Performance Metrics
**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌
**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌
**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌
**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌
### Composite Scores
**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
### Comparison to Baseline
| Metric | Baseline | Current | Δ |
|---------------|----------|---------|--------|
| Success Rate | [X]% | [Y]% | [+/-]% |
| Quality | [X.X] | [Y.Y] | [+/-] |
| Time | [X.X]m | [Y.Y]m | [+/-]% |
| V_instance | [X.XX] | [Y.YY] | [+/-] |
### Recommendations
1. [Action item based on metrics]
2. [Action item based on metrics]
### Next Steps
- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite


@@ -0,0 +1,339 @@
# Agent Test Suite Template
**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type
---
## Test Suite: [Agent Name]
**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]
---
## Test Configuration
**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]
**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
- Simple: [N] tasks
- Medium: [N] tasks
- Complex: [N] tasks
---
## Test Cases
### Task 1: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
### Task 2: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
[Repeat for all 20 tasks]
---
## Summary Statistics
### Overall Performance
**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)
Success Rate: [X]% ([successful] / [total])
```
**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```
**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```
**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])
Reliability Score: [X.XX]
```
---
## Composite Metrics
### V_instance Calculation
```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: min(target / actual, 1.0) = [0.XX]
Reliability: [0.XX]
V_instance = 0.40 × [success_rate] +
0.30 × [quality_normalized] +
0.20 × [efficiency_score] +
0.10 × [reliability]
= [0.XX] + [0.XX] + [0.XX] + [0.XX]
= [0.XX]
Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```
---
## Failure Analysis
### Failed Tasks
| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N] | [Brief] | [Why failed] | [Type] |
| [N] | [Brief] | [Why failed] | [Type] |
### Failure Patterns
**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
---
## Quality Issues
### Tasks with Quality < 4
| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N] | [1-3] | [Description] | [Actions] |
| [N] | [1-3] | [Description] | [Actions] |
---
## Efficiency Analysis
### Tasks Exceeding Time Budget
| Task ID | Actual Time | Target Time | Δ | Reason |
|---------|-------------|-------------|------|--------|
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
### Token Usage Analysis
```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```
---
## Recommendations
### Priority 1 (Critical)
1. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
- Expected Improvement: [X]% success rate
2. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
- Expected Improvement: [X]% quality
### Priority 2 (Important)
1. **[Issue]**: [Description]
- Impact: [High/Medium/Low]
- Frequency: [X] occurrences
- Proposed Fix: [Action]
### Priority 3 (Nice to Have)
1. **[Improvement]**: [Description]
- Benefit: [What improves]
- Effort: [Low/Medium/High]
---
## Next Iteration Plan
### Focus Areas
1. **[Area 1]**: [Why focus here]
- Baseline: [Current metric]
- Target: [Goal metric]
- Approach: [How to improve]
2. **[Area 2]**: [Why focus here]
- Baseline: [Current metric]
- Target: [Goal metric]
- Approach: [How to improve]
### Prompt Changes
**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]
**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]
**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]
---
## Test Suite Evolution
### Version History
| Version | Date | Success | Quality | V_inst | Changes |
|---------|------|---------|---------|--------|---------|
| 0.1 | [D] | [X]% | [X.X] | [0.XX] | Baseline|
| 0.2 | [D] | [X]% | [X.X] | [0.XX] | [Changes]|
| [curr] | [D] | [X]% | [X.X] | [0.XX] | [Changes]|
### Convergence Tracking
```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current: V_instance = [0.XX] ([+/-]%)
Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```
---
## Appendix: Task Catalog
### Task Templates by Category
**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"
**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"
**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"
### Complexity Guidelines
**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation
**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation
**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation
---
**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`