# Explore Agent Evolution: v1 → v3

**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50% relative)
**Time**: 4.18 min → 2.68 min (-35.9%)
**Status**: Converged (production-ready)

A complete walkthrough of evolving the Explore agent prompt through the BAIME methodology.

---

## Iteration 0: Baseline (v1)

### Initial Prompt

```markdown
# Explore Agent

You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.

When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary

Tools available: Glob, Grep, Read, Bash
```

**Prompt Length**: 58 lines

---

### Baseline Testing (10 tasks)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |

**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)

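The baseline metrics can be re-derived from the table. The following is a minimal Python sketch, assuming (as the 6/10 figure implies) that only ✅ rows count as successes and partial results do not. `Task` and `summarize` are illustrative names rather than part of BAIME, and V_instance is left out because its formula is not given here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Task:
    query: str
    result: str     # "success", "partial", or "failed"
    quality: int    # 1-5 rating from the table
    minutes: float  # wall-clock time from the table


# Rows copied from the baseline table.
BASELINE = [
    Task("show architecture", "failed", 2, 5.2),
    Task("find API endpoints", "partial", 3, 4.8),
    Task("explain auth", "partial", 3, 6.1),
    Task("list CLI commands", "success", 4, 2.8),
    Task("find database code", "success", 5, 3.2),
    Task("show test structure", "failed", 2, 4.5),
    Task("explain config", "success", 4, 3.9),
    Task("find error handlers", "success", 5, 2.9),
    Task("show imports", "success", 4, 3.1),
    Task("find middleware", "success", 4, 5.3),
]


def summarize(tasks):
    """Return success rate, average quality, and average time for a test run."""
    successes = sum(1 for t in tasks if t.result == "success")
    return {
        "success_rate": successes / len(tasks),
        "avg_quality": mean(t.quality for t in tasks),
        "avg_minutes": mean(t.minutes for t in tasks),
    }


print(summarize(BASELINE))
```
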
---

### Failure Analysis

**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long

**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness

**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism

---

## Iteration 1: Add Structure (v2)

### Prompt Changes

**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels

Assess query complexity and choose thoroughness:

**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups

**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries

**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```

**Added: Time-Boxing**
```markdown
## Time Management

Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if <10% new findings in last 20% of time budget.
```

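The stop rule in the time-boxing section can be read in more than one way. The sketch below shows one possible interpretation in Python and is not the agent's actual mechanism. It assumes each finding is logged with the minute at which it was discovered, takes the upper bound of each thoroughness range as the budget, and measures "new findings" as the share of all findings that arrived in the trailing 20% window. `BUDGET_MINUTES` and `should_stop` are hypothetical names.

```python
# Upper bound of each thoroughness range, used as the time budget (an assumption).
BUDGET_MINUTES = {"quick": 2.0, "medium": 4.0, "thorough": 6.0}


def should_stop(finding_times, elapsed, level, window_frac=0.20, min_new_frac=0.10):
    """Decide whether exploration should stop.

    finding_times: minutes since start at which each finding was recorded.
    elapsed: minutes since start.
    level: "quick", "medium", or "thorough".
    """
    budget = BUDGET_MINUTES[level]
    if elapsed >= budget:
        return True  # hard time box reached

    window = window_frac * budget
    if elapsed < window or not finding_times:
        return False  # too early to judge diminishing returns

    # Share of all findings that arrived within the trailing window.
    recent = sum(1 for t in finding_times if t > elapsed - window)
    return recent / len(finding_times) < min_new_frac


# Eleven findings, none in the last 0.8 min of a medium (4 min) budget.
print(should_stop([0.2, 0.5, 0.9, 1.0, 1.1, 1.3, 1.4, 1.5, 1.6, 1.8, 1.9], 3.0, "medium"))
```
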
**Added: Completeness Checklist**
```markdown
## Before Responding

Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

State confidence level: Low / Medium / High
```

**Prompt Length**: 112 lines (+54)

---

### Testing (8 tasks: 3 re-tests + 5 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |

**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% relative improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)

---

### Key Improvements

✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)

**Remaining Issue**: The test structure query (Task 13) missed integration tests

---

## Iteration 2: Refine Coverage (v3)

### Prompt Changes

**Enhanced: Completeness Verification**
```markdown
## Completeness Verification

Before concluding, verify coverage by category:

**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked

**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided

**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```

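The per-category checklists map naturally onto a lookup table. The Python sketch below is illustrative only: the prompt does not say how a query's category is detected, so keying on the first word of the query is an assumption, and `CHECKLISTS`, `checklist_for`, and `unchecked` are hypothetical names. Queries with other leading verbs (for example "list ...") match no category under this assumption and return an empty checklist.

```python
# Checklist items copied from the v3 prompt, keyed by query verb.
CHECKLISTS = {
    "find": [
        "Main implementations found",
        "Related utilities checked",
        "Test files reviewed (if applicable)",
        "Configuration/setup files checked",
    ],
    "show": [
        "Primary structure identified",
        "Secondary components listed",
        "Relationships mapped",
        "Examples provided",
    ],
    "explain": [
        "Core mechanism described",
        "Key components identified",
        "Data flow explained",
        "Edge cases noted",
    ],
}


def checklist_for(query):
    """Pick a checklist from the query's first word (an assumed heuristic)."""
    words = query.strip().lower().split()
    verb = words[0] if words else ""
    return CHECKLISTS.get(verb, [])


def unchecked(query, done):
    """Return checklist items for this query that are not yet marked done."""
    return [item for item in checklist_for(query) if item not in done]


print(unchecked("explain caching", {"Core mechanism described"}))
```
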
**Added: Search Strategy**
```markdown
## Search Strategy

**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations

**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding

**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```

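As a worked example of the 30/50/20 split, the Python sketch below converts a total time budget into per-phase minutes. `PHASES` and `phase_budgets` are illustrative names; the prompt itself only states the percentages.

```python
# Phase shares from the Search Strategy section.
PHASES = (
    ("broad search", 0.30),
    ("deep investigation", 0.50),
    ("verification", 0.20),
)


def phase_budgets(total_minutes):
    """Split a total time budget into per-phase minutes."""
    return {name: round(total_minutes * share, 2) for name, share in PHASES}


# A 4-minute budget (the upper bound of "medium") splits into 1.2, 2.0, and 0.8 minutes.
print(phase_budgets(4.0))
```
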
**Refined: Confidence Scoring**
```markdown
## Confidence Level

**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely

Always state confidence level and identify known gaps.
```

**Prompt Length**: 138 lines (+26)

---

### Testing (10 tasks: 1 re-test + 9 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |

**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5 percentage points** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive ≥ 0.80)

**CONVERGED** ✅

---

### Stability Validation

**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)

**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)

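The convergence criteria above can be expressed as a small check. This is a sketch under stated assumptions, not BAIME's reference implementation: it treats the ±5% stability band as a relative change between the last two V_instance values, and `has_converged` and `criteria_met` are hypothetical names.

```python
def has_converged(v_history, threshold=0.80, max_rel_change=0.05):
    """True if the last two V_instance values both meet the threshold and are stable."""
    if len(v_history) < 2:
        return False
    prev, curr = v_history[-2], v_history[-1]
    rel_change = abs(curr - prev) / prev
    return prev >= threshold and curr >= threshold and rel_change <= max_rel_change


def criteria_met(v_history, success_rate, avg_quality, avg_minutes):
    """Apply all four criteria listed under Stability Validation."""
    return (
        has_converged(v_history)
        and success_rate >= 0.85
        and avg_quality >= 4.0
        and avg_minutes < 3.0
    )


# V_instance history from this walkthrough: v1, v2, v3.
print(criteria_met([0.68, 0.88, 0.90], success_rate=0.90, avg_quality=4.3, avg_minutes=2.68))
```
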
---

## Final Metrics Comparison

| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total (relative) |
|--------|---------------|------------------|------------------|--------------------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |

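The totals in the last column are relative changes from v1 to v3, which the short Python check below reproduces from the table values (`relative_change` is an illustrative helper). The same formula gives about +2.9% for the 87.5% → 90% success-rate step, so the 2.5 quoted for that step is a percentage-point difference rather than a relative change.

```python
def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# (v1, v3) pairs copied from the comparison table.
METRICS = {
    "Success Rate": (60.0, 90.0),
    "Quality": (3.6, 4.3),
    "Time": (4.18, 2.68),
    "V_instance": (0.68, 0.90),
}

for name, (v1, v3) in METRICS.items():
    print(f"{name}: {relative_change(v1, v3):+.1f}%")
```
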
---

## Evolution Summary

### Iteration 0 → 1: Major Improvements

**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist

**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)

**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added

---

### Iteration 1 → 2: Refinement

**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring

**Impact**:
- Success: 87.5% → 90% (+2.5 percentage points, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)

**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened

---

## Key Learnings

### What Worked

1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration

### What Didn't Work

1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
   - Solution: Document limitations, suggest alternative agents

### Best Practices Validated

✅ **Start Simple**: v1 was minimal, added structure incrementally
✅ **Measure Everything**: Quantitative metrics guided refinements
✅ **Focus on Patterns**: Fixed systematic failures, not one-off issues
✅ **Validate Stability**: 2-iteration convergence confirmed reliability

---

## Production Deployment

**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)

**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md

# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```

**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly

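The two alert thresholds above could be checked with a few lines of Python. This is only a sketch: how the metrics are collected is not specified in this document, and `monitoring_alerts` is a hypothetical helper that takes already-aggregated values.

```python
def monitoring_alerts(success_rate, avg_minutes):
    """Return alert messages for the monitoring thresholds listed above."""
    alerts = []
    if success_rate < 0.80:
        alerts.append(f"success rate {success_rate:.0%} is below 80%")
    if avg_minutes > 3.5:
        alerts.append(f"average time {avg_minutes:.2f} min exceeds 3.5 min")
    return alerts


# The v3 figures from this walkthrough raise no alerts.
print(monitoring_alerts(success_rate=0.90, avg_minutes=2.68))
```
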
---

## Future Enhancements (v4+)

**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)

**Decision**: Hold v3, monitor for 2 weeks before v4

---

**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed