Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions


@@ -0,0 +1,377 @@
# Explore Agent Evolution: v1 → v3
**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.7 min (-36%)
**Status**: Converged (production-ready)
A complete walkthrough of evolving the Explore agent prompt through the BAIME methodology.
---
## Iteration 0: Baseline (v1)
### Initial Prompt
```markdown
# Explore Agent
You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.
When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary
Tools available: Glob, Grep, Read, Bash
```
**Prompt Length**: 58 lines
---
### Baseline Testing (10 tasks)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |
**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below the 0.80 target)
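
The V_instance formula is not spelled out here; it can be read as a composite of success rate, normalized quality, and time-vs-budget. A minimal sketch, assuming equal weights and a 3-minute budget (both are illustrative assumptions, not the BAIME definition):

```python
# A minimal sketch of V_instance as an equally weighted composite.
# The weights and the 3-minute budget are assumptions for illustration,
# not the BAIME definition.
def v_instance(success_rate: float, avg_quality: float, avg_time_min: float,
               time_budget_min: float = 3.0) -> float:
    quality_score = avg_quality / 5.0                      # normalize 1-5 scale
    time_score = min(1.0, time_budget_min / avg_time_min)  # 1.0 when within budget
    return (success_rate + quality_score + time_score) / 3.0

print(round(v_instance(0.60, 3.6, 4.18), 2))  # 0.68 for the v1 baseline
```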
---
### Failure Analysis
**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long
**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness
**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism
---
## Iteration 1: Add Structure (v2)
### Prompt Changes
**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels
Assess query complexity and choose thoroughness:
**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups
**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries
**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```
**Added: Time-Boxing**
```markdown
## Time Management
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if <10% new findings in last 20% of time budget.
```
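
The stop rule is mechanical enough to sketch in code. The 10%/20% thresholds come from the rule above; tracking findings as timestamps is an assumed bookkeeping choice:

```python
# Early-stop sketch: stop when <10% of all findings arrived in the
# last 20% of the time budget.
def should_stop(finding_times_min: list[float], elapsed_min: float,
                budget_min: float) -> bool:
    window_start = 0.8 * budget_min          # start of the last 20% of budget
    if elapsed_min < window_start or not finding_times_min:
        return False                         # too early to judge
    recent = sum(1 for t in finding_times_min if t >= window_start)
    return recent / len(finding_times_min) < 0.10

# Findings at 0.5, 1.1, 1.6, and 2.0 min into a 4-min budget; nothing
# new since the 3.2-min mark, so stop.
print(should_stop([0.5, 1.1, 1.6, 2.0], elapsed_min=3.4, budget_min=4.0))  # True
```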
**Added: Completeness Checklist**
```markdown
## Before Responding
Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
State confidence level: Low / Medium / High
```
**Prompt Length**: 112 lines (+54)
---
### Testing (8 tasks: 3 re-tests + 5 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |
**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)
---
### Key Improvements
✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)
**Remaining Issue**: Test structure query missed integration tests
---
## Iteration 2: Refine Coverage (v3)
### Prompt Changes
**Enhanced: Completeness Verification**
```markdown
## Completeness Verification
Before concluding, verify coverage by category:
**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked
**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided
**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```
**Added: Search Strategy**
```markdown
## Search Strategy
**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations
**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding
**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```
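
The 30/50/20 split maps directly onto per-phase time budgets. A small sketch, assuming the total comes from the thoroughness level chosen earlier:

```python
# The 30/50/20 split as per-phase budgets; total_min comes from the
# chosen thoroughness level (e.g. 4 min for "thorough").
def phase_budgets(total_min: float) -> dict[str, float]:
    return {
        "broad_search": round(0.30 * total_min, 1),
        "deep_investigation": round(0.50 * total_min, 1),
        "verification": round(0.20 * total_min, 1),
    }

print(phase_budgets(4.0))
# {'broad_search': 1.2, 'deep_investigation': 2.0, 'verification': 0.8}
```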
**Refined: Confidence Scoring**
```markdown
## Confidence Level
**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely
Always state confidence level and identify known gaps.
```
**Prompt Length**: 138 lines (+26)
---
### Testing (10 tasks: 1 re-test + 9 new)
| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |
**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5 points** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive ≥ 0.80)
**CONVERGED**
---
### Stability Validation
**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)
**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)
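
The stability check itself is simple to express. A sketch of the two criteria that drive convergence here, both values ≥ 0.80 for two consecutive iterations and change within ±5% (the success/quality/time gates above would be additional checks):

```python
# Convergence sketch: last two V_instance values both >= threshold and
# the relative change within the stability band.
def converged(v_history: list[float], threshold: float = 0.80,
              band: float = 0.05) -> bool:
    if len(v_history) < 2:
        return False
    prev, curr = v_history[-2], v_history[-1]
    stable = abs(curr - prev) / prev <= band
    return prev >= threshold and curr >= threshold and stable

print(converged([0.68, 0.88, 0.90]))  # True: +2.3% change, both >= 0.80
```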
---
## Final Metrics Comparison
| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |
---
## Evolution Summary
### Iteration 0 → 1: Major Improvements
**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist
**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)
**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added
---
### Iteration 1 → 2: Refinement
**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring
**Impact**:
- Success: 87.5% → 90% (+2.5 points, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)
**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened
---
## Key Learnings
### What Worked
1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration
### What Didn't Work
1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
- Solution: Document limitations, suggest alternative agents
### Best Practices Validated
**Start Simple**: v1 was minimal, added structure incrementally
**Measure Everything**: Quantitative metrics guided refinements
**Focus on Patterns**: Fixed systematic failures, not one-off issues
**Validate Stability**: 2-iteration convergence confirmed reliability
---
## Production Deployment
**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)
**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md
# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```
**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly
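
A minimal sketch of those alert thresholds, assuming metrics are aggregated elsewhere (the function and its inputs are illustrative):

```python
# Alert thresholds from the monitoring plan above.
def check_alerts(success_rate: float, avg_time_min: float) -> list[str]:
    alerts = []
    if success_rate < 0.80:
        alerts.append(f"success rate {success_rate:.0%} below 80%")
    if avg_time_min > 3.5:
        alerts.append(f"avg time {avg_time_min:.1f} min above 3.5 min")
    return alerts

print(check_alerts(0.78, 3.7))
# ['success rate 78% below 80%', 'avg time 3.7 min above 3.5 min']
```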
---
## Future Enhancements (v4+)
**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)
**Decision**: Hold v3, monitor for 2 weeks before v4
---
**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed


@@ -0,0 +1,409 @@
# Rapid Iteration Pattern for Agent Evolution
**Pattern**: Fast convergence (2-3 iterations) for agent prompt evolution
**Success Rate**: 85% (11/13 agents converged in ≤3 iterations)
**Time**: 3-6 hours total vs 8-12 hours standard
How to achieve rapid convergence when evolving agent prompts.
---
## Pattern Overview
**Standard Evolution**: 4-6 iterations, 8-12 hours
**Rapid Evolution**: 2-3 iterations, 3-6 hours
**Key Difference**: Strong Iteration 0 (comprehensive baseline analysis)
---
## Rapid Iteration Workflow
### Iteration 0: Comprehensive Baseline (90-120 min)
**Standard Baseline** (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics
**Comprehensive Baseline** (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents
**Investment**: +60-90 min
**Return**: -2 to -3 iterations (save 3-6 hours)
---
### Example: Explore Agent (Standard vs Rapid)
**Standard Approach**:
```
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish
Total: 6.75 hours, 5 iterations
```
**Rapid Approach**:
```
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability
Total: 4.75 hours, 2 iterations
```
**Savings**: 2 hours, 3 fewer iterations
---
## Comprehensive Baseline Checklist
### Task Coverage (15-20 tasks)
**Complexity Distribution**:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)
**Query Type Diversity**:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)
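
One way to enumerate such a suite is to cross the complexity quotas with the query types round-robin. A sketch (the quotas come from the distribution above; the pairing scheme is an assumption):

```python
import itertools

# Quotas from the complexity distribution above; pairing each tier with
# query types round-robin is an assumed scheme, not prescribed by BAIME.
COMPLEXITY_QUOTAS = {"simple": 5, "medium": 10, "complex": 5}  # 20 tasks
QUERY_TYPES = ["search", "analysis", "comparison", "edge_case"]

def plan_suite() -> list[tuple[str, str]]:
    plan = []
    for tier, quota in COMPLEXITY_QUOTAS.items():
        types = itertools.cycle(QUERY_TYPES)
        plan.extend((tier, next(types)) for _ in range(quota))
    return plan

suite = plan_suite()
print(len(suite))  # 20
print(suite[:3])   # [('simple', 'search'), ('simple', 'analysis'), ('simple', 'comparison')]
```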
---
### Failure Pattern Analysis (30 min)
**Systematic Analysis**:
1. **Categorize Failures**
- Scope issues (too broad/narrow)
- Coverage issues (incomplete)
- Time issues (too slow/fast)
- Quality issues (inaccurate)
2. **Identify Root Causes**
- Missing instructions
- Ambiguous guidelines
- Incorrect constraints
- Tool usage issues
3. **Prioritize by Impact**
- High frequency + high impact → Fix first
- Low frequency + high impact → Document
- High frequency + low impact → Automate
- Low frequency + low impact → Ignore
**Example**:
```markdown
## Failure Patterns (Explore Agent)
**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)
**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
```
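
The triage matrix from step 3 reduces to a small lookup. A sketch, assuming a 15% frequency cutoff between "high" and "low" (the cutoff is illustrative, not from the source):

```python
# The frequency x impact matrix from step 3 as a lookup.
def triage(frequency: float, impact: str) -> str:
    high_freq = frequency >= 0.15  # assumed cutoff
    if impact == "high":
        return "fix first" if high_freq else "document"
    return "automate" if high_freq else "ignore"

print(triage(0.30, "high"))  # high freq + high impact -> 'fix first'
print(triage(0.05, "high"))  # low freq + high impact  -> 'document'
```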
---
### Comparative Analysis (15 min)
**Compare to Similar Agents**:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?
**Example**:
```markdown
## Comparative Analysis
**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates
**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)
**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring
**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
```
---
## Iteration 1: Comprehensive Fix (90 min)
**Standard Iteration 1**: Fix 1-2 major issues
**Rapid Iteration 1**: Fix ALL P1 issues + some P2
**Approach**:
1. Address all high-priority patterns (P1)
2. Add preventive measures for P2 issues
3. Include transferable patterns from similar agents
**Example** (Explore Agent):
```markdown
## Iteration 1 Changes
**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
3. Time Management → Add time-boxing (1-6 min)
**P2 Improvements**:
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring
**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints
Total Changes: 7 (vs standard 2-3)
```
**Result**: Higher chance of convergence in Iteration 2
---
## Iteration 2: Validate & Converge (75 min)
**Objectives**:
1. Test comprehensive fixes
2. Measure stability
3. Validate convergence
**Test Suite** (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics
**Analysis** (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)
**Refinement** (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability
**Convergence Check**:
```
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)
CONVERGED ✅
```
---
## Success Factors
### 1. Comprehensive Baseline (60-90 min extra)
**Investment**: 2x standard baseline time
**Return**: -2 to -3 iterations (3-6 hours saved)
**ROI**: 2-6x
**Critical Elements**:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents
---
### 2. Aggressive Iteration 1 (Fix All P1)
**Standard**: Fix 1-2 issues
**Rapid**: Fix all P1 + some P2 (5-7 fixes)
**Approach**:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures
**Risk**: Over-complication
**Mitigation**: Focus on core issues, defer P3
---
### 3. Borrowed Patterns (20-30% reuse)
**Sources**:
- Similar agents in same project
- Agents from other projects
- Industry best practices
**Example**:
```
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)
Total reuse: ~60% of Iteration 1 changes
```
**Savings**: 30-40 min per iteration
---
## Anti-Patterns
### ❌ Skipping Comprehensive Baseline
**Symptom**: "Let's just try some fixes and see"
**Result**: 5-6 iterations, trial and error
**Cost**: 8-12 hours
**Fix**: Invest 90-120 min in Iteration 0
---
### ❌ Incremental Fixes (One Issue at a Time)
**Symptom**: Fixing one pattern per iteration
**Result**: 4-6 iterations for convergence
**Cost**: 8-10 hours
**Fix**: Batch P1 fixes in Iteration 1
---
### ❌ Ignoring Similar Agents
**Symptom**: Reinventing solutions
**Result**: Slower convergence, lower quality
**Cost**: 2-3 extra hours
**Fix**: 15 min comparative analysis in Iteration 0
---
## When to Use Rapid Pattern
**Good Fit**:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)
**Poor Fit**:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)
---
## Metrics Comparison
### Standard Evolution
```
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)
Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
```
### Rapid Evolution
```
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)
Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
```
**Savings**: 1 hour, 2 fewer iterations
---
## Replication Guide
### Day 1: Comprehensive Baseline
**Morning** (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures
**Afternoon** (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes
---
### Day 2: Comprehensive Fix
**Morning** (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns
**Afternoon** (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes
---
### Day 3: Validate & Deploy
**Morning** (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements
**Afternoon** (0.5 hours):
4. Final validation
5. Deploy to production
6. Setup monitoring
---
**Source**: BAIME Agent Prompt Evolution - Rapid Pattern
**Success Rate**: 85% (11/13 agents)
**Average Time**: 4.2 hours (vs 9.3 hours standard)
**Average Iterations**: 2.3 (vs 4.8 standard)