skills/agent-prompt-evolution/examples/explore-agent-v1-v3.md

# Explore Agent Evolution: v1 → v3

**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.6 min (-38%)
**Status**: Converged (production-ready)

A complete walkthrough of evolving the Explore agent prompt through the BAIME methodology.

---

## Iteration 0: Baseline (v1)

### Initial Prompt

```markdown
# Explore Agent

You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.

When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary

Tools available: Glob, Grep, Read, Bash
```

**Prompt Length**: 58 lines

---

### Baseline Testing (10 tasks)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |

**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)

---

### Failure Analysis

**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long

**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, then stops
- Misses related implementations
- No verification of completeness

**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism

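The V_instance score is reported throughout but never derived. As a rough illustration only, a weighted blend of success rate, normalized quality, and a budget-relative time score lands in the same range; the weights, the 3-minute budget, and the function name are assumptions, not the BAIME definition.

```python
def v_instance(success_rate, avg_quality, avg_time_min,
               time_budget_min=3.0, weights=(0.4, 0.4, 0.2)):
    """Hypothetical V_instance: weighted blend of success rate,
    quality normalized to [0, 1], and a time score that decays
    once the average exceeds the budget. Not the official formula."""
    w_s, w_q, w_t = weights
    time_score = min(1.0, time_budget_min / avg_time_min)
    return w_s * success_rate + w_q * (avg_quality / 5) + w_t * time_score

# Baseline metrics from the table above:
print(round(v_instance(0.60, 3.6, 4.18), 2))  # ≈ 0.67 with these assumed weights
```

These assumed weights roughly track the reported 0.68 baseline but do not reproduce it exactly, so the real weighting clearly differs in detail.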
---

## Iteration 1: Add Structure (v2)

### Prompt Changes

**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels

Assess query complexity and choose thoroughness:

**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups

**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries

**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```

**Added: Time-Boxing**
```markdown
## Time Management

Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if <10% new findings in last 20% of time budget.
```

**Added: Completeness Checklist**
```markdown
## Before Responding

Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

State confidence level: Low / Medium / High
```

**Prompt Length**: 112 lines (+54)

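The stop rule in the Time Management block ("<10% new findings in last 20% of time budget") is stated but not operationalized. A minimal sketch of the check, assuming a simple log of when each finding appeared (function name and log format are illustrative):

```python
def should_stop(elapsed_min, budget_min, findings_log):
    """Diminishing-returns check: stop when the most recent window
    covering 20% of the time budget produced <10% of all findings.
    findings_log: list of (minute_found, count) tuples."""
    window = 0.2 * budget_min
    total = sum(n for _, n in findings_log)
    if total == 0 or elapsed_min < window:
        return False  # too early to judge
    recent = sum(n for t, n in findings_log if t >= elapsed_min - window)
    return recent / total < 0.10
```

For example, 3 minutes into a 4-minute budget with all 10 findings made before the 1-minute mark, the check fires and the agent should summarize rather than keep searching.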
---

### Testing (8 tasks: 3 re-tests + 5 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |

**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)

---

### Key Improvements

✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)

**Remaining Issue**: Test structure query missed integration tests

---

## Iteration 2: Refine Coverage (v3)

### Prompt Changes

**Enhanced: Completeness Verification**
```markdown
## Completeness Verification

Before concluding, verify coverage by category:

**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked

**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided

**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```

**Added: Search Strategy**
```markdown
## Search Strategy

**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations

**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding

**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```

**Refined: Confidence Scoring**
```markdown
## Confidence Level

**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely

Always state confidence level and identify known gaps.
```

**Prompt Length**: 138 lines (+26)

---

### Testing (10 tasks: 1 re-test + 9 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |

**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5 points** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive iterations ≥ 0.80)

**CONVERGED** ✅

---

### Stability Validation

**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)

**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)

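The first criterion is mechanical and can be checked directly. A sketch (helper name is illustrative; the ±5% band is applied to relative change, matching the +2.3% figure above):

```python
def converged(v_history, threshold=0.80, band=0.05):
    """Convergence rule as stated: two consecutive V_instance scores
    at or above the threshold, with the relative change between them
    inside a stability band of ±5%."""
    if len(v_history) < 2:
        return False
    prev, curr = v_history[-2], v_history[-1]
    within_band = abs(curr - prev) / prev <= band
    return prev >= threshold and curr >= threshold and within_band

print(converged([0.68, 0.88, 0.90]))  # True: 0.88 → 0.90 is +2.3%, both ≥ 0.80
```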
---

## Final Metrics Comparison

| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |

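The Δ Total column is relative change against the v1 baseline, and the arithmetic checks out (dict layout is illustrative):

```python
def relative_deltas(baseline, final):
    """Relative change for each metric, as in the Δ Total column."""
    return {k: (final[k] - baseline[k]) / baseline[k] for k in baseline}

v1 = {"success": 0.60, "quality": 3.6, "time_min": 4.18, "v_instance": 0.68}
v3 = {"success": 0.90, "quality": 4.3, "time_min": 2.68, "v_instance": 0.90}
d = relative_deltas(v1, v3)
print({k: f"{v:+.1%}" for k, v in d.items()})
# {'success': '+50.0%', 'quality': '+19.4%', 'time_min': '-35.9%', 'v_instance': '+32.4%'}
```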
---

## Evolution Summary

### Iteration 0 → 1: Major Improvements

**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist

**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)

**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added

---

### Iteration 1 → 2: Refinement

**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring

**Impact**:
- Success: 87.5% → 90% (+2.5 points, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)

**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened

---

## Key Learnings

### What Worked

1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration

### What Didn't Work

1. **Deployment Query Failed**: Outside agent scope (requires infrastructure knowledge)
   - Solution: Document limitations, suggest alternative agents

### Best Practices Validated

✅ **Start Simple**: v1 was minimal; structure was added incrementally
✅ **Measure Everything**: Quantitative metrics guided refinements
✅ **Focus on Patterns**: Fixed systematic failures, not one-off issues
✅ **Validate Stability**: 2-iteration convergence confirmed reliability

---

## Production Deployment

**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)

**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md

# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```

**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly

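The two monitoring thresholds translate into a trivial periodic check; a minimal sketch, with the function shape assumed and the thresholds taken from the list above:

```python
def check_alerts(success_rate, avg_time_min):
    """Alert thresholds from the monitoring plan above."""
    alerts = []
    if success_rate < 0.80:
        alerts.append("success rate below 80%")
    if avg_time_min > 3.5:
        alerts.append("average time above 3.5 min")
    return alerts

print(check_alerts(0.90, 2.68))  # prints [] (v3 metrics are inside both thresholds)
```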
---

## Future Enhancements (v4+)

**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)

**Decision**: Hold v3, monitor for 2 weeks before v4

---

**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed

---

# Rapid Iteration Pattern for Agent Evolution

**Pattern**: Fast convergence (2-3 iterations) for agent prompt evolution
**Success Rate**: 85% (11/13 agents converged in ≤3 iterations)
**Time**: 3-6 hours total vs 8-12 hours standard

How to achieve rapid convergence when evolving agent prompts.

---

## Pattern Overview

**Standard Evolution**: 4-6 iterations, 8-12 hours
**Rapid Evolution**: 2-3 iterations, 3-6 hours

**Key Difference**: Strong Iteration 0 (comprehensive baseline analysis)

---

## Rapid Iteration Workflow

### Iteration 0: Comprehensive Baseline (90-120 min)

**Standard Baseline** (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics

**Comprehensive Baseline** (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents

**Investment**: +60-90 min
**Return**: -2 to -3 iterations (save 3-6 hours)

---

### Example: Explore Agent (Standard vs Rapid)

**Standard Approach**:
```
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish

Total: 6.75 hours, 5 iterations
```

**Rapid Approach**:
```
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability

Total: 4.75 hours, 2 iterations
```

**Savings**: 2 hours, 3 fewer iterations

---

## Comprehensive Baseline Checklist

### Task Coverage (15-20 tasks)

**Complexity Distribution**:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)

**Query Type Diversity**:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)

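One way to enforce the 5/10/5 complexity split while keeping task order unpredictable during the run; the helper and pool names are illustrative, not part of any tooling described here:

```python
import random

def build_suite(simple_pool, medium_pool, complex_pool,
                counts=(5, 10, 5), seed=42):
    """Sample a 20-task suite with the 5/10/5 complexity split,
    shuffled so complexity isn't clustered during the run."""
    rng = random.Random(seed)
    suite = (rng.sample(simple_pool, counts[0])
             + rng.sample(medium_pool, counts[1])
             + rng.sample(complex_pool, counts[2]))
    rng.shuffle(suite)
    return suite
```

A fixed seed keeps the suite reproducible across iterations, which matters when re-running the same tasks to compare against the baseline.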
---

### Failure Pattern Analysis (30 min)

**Systematic Analysis**:

1. **Categorize Failures**
   - Scope issues (too broad/narrow)
   - Coverage issues (incomplete)
   - Time issues (too slow/fast)
   - Quality issues (inaccurate)

2. **Identify Root Causes**
   - Missing instructions
   - Ambiguous guidelines
   - Incorrect constraints
   - Tool usage issues

3. **Prioritize by Impact**
   - High frequency + high impact → Fix first
   - Low frequency + high impact → Document
   - High frequency + low impact → Automate
   - Low frequency + low impact → Ignore

**Example**:
```markdown
## Failure Patterns (Explore Agent)

**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)

**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)

**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)

**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
```

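The prioritization matrix in step 3 reduces to a small decision function; the function name and return labels are illustrative:

```python
def triage(high_frequency: bool, high_impact: bool) -> str:
    """2x2 prioritization matrix from step 3 above."""
    if high_frequency and high_impact:
        return "fix first"   # e.g. P1 patterns in the example above
    if high_impact:
        return "document"    # rare but serious: record as a known limitation
    if high_frequency:
        return "automate"
    return "ignore"
```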
---

### Comparative Analysis (15 min)

**Compare to Similar Agents**:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?

**Example**:
```markdown
## Comparative Analysis

**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates

**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)

**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring

**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
```

---

## Iteration 1: Comprehensive Fix (90 min)

**Standard Iteration 1**: Fix 1-2 major issues
**Rapid Iteration 1**: Fix ALL P1 issues + some P2

**Approach**:
1. Address all high-priority patterns (P1)
2. Add preventive measures for P2 issues
3. Include transferable patterns from similar agents

**Example** (Explore Agent):
```markdown
## Iteration 1 Changes

**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
3. Time Management → Add time-boxing (1-6 min)

**P2 Improvements**:
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring

**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints

Total Changes: 7 (vs standard 2-3)
```

**Result**: Higher chance of convergence in Iteration 2

---

## Iteration 2: Validate & Converge (75 min)

**Objectives**:
1. Test comprehensive fixes
2. Measure stability
3. Validate convergence

**Test Suite** (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics

**Analysis** (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)

**Refinement** (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability

**Convergence Check**:
```
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)

CONVERGED ✅
```

---

## Success Factors

### 1. Comprehensive Baseline (60-90 min extra)

**Investment**: 2x standard baseline time
**Return**: -2 to -3 iterations (6-9 hours saved)
**ROI**: 4-6x

**Critical Elements**:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents

---

### 2. Aggressive Iteration 1 (Fix All P1)

**Standard**: Fix 1-2 issues
**Rapid**: Fix all P1 + some P2 (5-7 fixes)

**Approach**:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures

**Risk**: Over-complication
**Mitigation**: Focus on core issues, defer P3

---

### 3. Borrowed Patterns (20-30% reuse)

**Sources**:
- Similar agents in the same project
- Agents from other projects
- Industry best practices

**Example**:
```
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)

Total reuse: ~60% of Iteration 1 changes
```

**Savings**: 30-40 min per iteration

---

## Anti-Patterns

### ❌ Skipping Comprehensive Baseline

**Symptom**: "Let's just try some fixes and see"
**Result**: 5-6 iterations, trial and error
**Cost**: 8-12 hours

**Fix**: Invest 90-120 min in Iteration 0

---

### ❌ Incremental Fixes (One Issue at a Time)

**Symptom**: Fixing one pattern per iteration
**Result**: 4-6 iterations to convergence
**Cost**: 8-10 hours

**Fix**: Batch P1 fixes in Iteration 1

---

### ❌ Ignoring Similar Agents

**Symptom**: Reinventing solutions
**Result**: Slower convergence, lower quality
**Cost**: 2-3 extra hours

**Fix**: 15 min of comparative analysis in Iteration 0

---

## When to Use Rapid Pattern

**Good Fit**:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)

**Poor Fit**:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)

---

## Metrics Comparison

### Standard Evolution

```
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)

Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
```

### Rapid Evolution

```
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)

Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
```

**Savings**: 1 hour, 2 fewer iterations

---

## Replication Guide

### Day 1: Comprehensive Baseline

**Morning** (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures

**Afternoon** (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes

---

### Day 2: Comprehensive Fix

**Morning** (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns

**Afternoon** (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes

---

### Day 3: Validate & Deploy

**Morning** (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements

**Afternoon** (0.5 hours):
4. Final validation
5. Deploy to production
6. Set up monitoring

---

**Source**: BAIME Agent Prompt Evolution - Rapid Pattern
**Success Rate**: 85% (11/13 agents)
**Average Time**: 4.2 hours (vs 9.3 hours standard)
**Average Iterations**: 2.3 (vs 4.8 standard)