gh-yaleh-meta-cc-claude/skills/agent-prompt-evolution/examples/explore-agent-v1-v3.md

# Explore Agent Evolution: v1 → v3

**Agent**: Explore (codebase exploration)
**Iterations**: 3
**Improvement**: 60% → 90% success rate (+50%)
**Time**: 4.2 min → 2.6 min (-38%)
**Status**: Converged (production-ready)

Complete walkthrough of evolving Explore agent prompt through BAIME methodology.

---

## Iteration 0: Baseline (v1)

### Initial Prompt

```markdown
# Explore Agent

You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.

When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary

Tools available: Glob, Grep, Read, Bash
```

**Prompt Length**: 58 lines

---

### Baseline Testing (10 tasks)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | ❌ Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | ✅ Success | 4/5 | 2.8 min |
| 5 | "find database code" | ✅ Success | 5/5 | 3.2 min |
| 6 | "show test structure" | ❌ Failed | 2/5 | 4.5 min |
| 7 | "explain config" | ✅ Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | ✅ Success | 5/5 | 2.9 min |
| 9 | "show imports" | ✅ Success | 4/5 | 3.1 min |
| 10 | "find middleware" | ✅ Success | 4/5 | 5.3 min |

**Baseline Metrics**:
- Success Rate: 60% (6/10)
- Average Quality: 3.6/5
- Average Time: 4.18 min
- V_instance: 0.68 (below target)

---

### Failure Analysis

**Pattern 1: Scope Ambiguity** (Tasks 1, 2, 3)
- Queries too broad ("architecture", "auth")
- Agent doesn't know search depth
- Either stops too early or runs too long

**Pattern 2: Incomplete Coverage** (Tasks 2, 6)
- Agent finds 1-2 files, stops
- Misses related implementations
- No verification of completeness

**Pattern 3: Time Management** (Tasks 1, 3, 10)
- Long-running queries (>5 min)
- Diminishing returns after 3 min
- No time-boxing mechanism

---

## Iteration 1: Add Structure (v2)

### Prompt Changes

**Added: Thoroughness Guidelines**
```markdown
## Thoroughness Levels

Assess query complexity and choose thoroughness:

**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups

**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries

**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
```

**Added: Time-Boxing**
```markdown
## Time Management

Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if <10% new findings in last 20% of time budget.
```

**Added: Completeness Checklist**
```markdown
## Before Responding

Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

State confidence level: Low / Medium / High
```

**Prompt Length**: 112 lines (+54)

---

### Testing (8 tasks: 3 re-tests + 5 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | ✅ Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | ✅ Success | 5/5 | 2.9 min |
| 3R | "explain auth" | ✅ Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | ✅ Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | ✅ Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | ✅ Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | ✅ Success | 4/5 | 2.2 min |

**Iteration 1 Metrics**:
- Success Rate: 87.5% (7/8) - **+45.8% improvement**
- Average Quality: 4.25/5 - **+18.1%**
- Average Time: 2.84 min - **-32.1%**
- V_instance: 0.88 ✅ (exceeds target)

---

### Key Improvements

✅ Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
✅ Better time management (all <4 min)
✅ Higher quality outputs (4.25 avg)
⚠️ Still one partial success (Task 13)

**Remaining Issue**: Test structure query missed integration tests

---

## Iteration 2: Refine Coverage (v3)

### Prompt Changes

**Enhanced: Completeness Verification**
```markdown
## Completeness Verification

Before concluding, verify coverage by category:

**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked

**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided

**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
```

**Added: Search Strategy**
```markdown
## Search Strategy

**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations

**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding

**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
```

**Refined: Confidence Scoring**
```markdown
## Confidence Level

**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely

Always state confidence level and identify known gaps.
```

**Prompt Length**: 138 lines (+26)

---

### Testing (10 tasks: 1 re-test + 9 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | ✅ Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | ✅ Success | 5/5 | 2.3 min |
| 17 | "explain routing" | ✅ Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | ✅ Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | ✅ Success | 4/5 | 2.5 min |
| 20 | "show data models" | ✅ Success | 5/5 | 2.8 min |
| 21 | "explain caching" | ✅ Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | ✅ Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | ✅ Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | ❌ Failed | 2/5 | 3.8 min |

**Iteration 2 Metrics**:
- Success Rate: 90% (9/10) - **+2.5% improvement** (stable)
- Average Quality: 4.3/5 - **+1.2%**
- Average Time: 2.68 min - **-5.6%**
- V_instance: 0.90 ✅ ✅ (2 consecutive ≥ 0.80)

**CONVERGED** ✅

---

### Stability Validation

**Iteration 1**: V_instance = 0.88
**Iteration 2**: V_instance = 0.90
**Change**: +2.3% (stable, within ±5%)

**Criteria Met**:
✅ V_instance ≥ 0.80 for 2 consecutive iterations
✅ Success rate ≥ 85%
✅ Quality ≥ 4.0
✅ Time within budget (<3 min avg)

---

## Final Metrics Comparison

| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | **+50%** |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | **+19.4%** |
| Time | 4.18 min | 2.84 min | 2.68 min | **-35.9%** |
| V_instance | 0.68 | 0.88 | 0.90 | **+32.4%** |

---

## Evolution Summary

### Iteration 0 → 1: Major Improvements

**Key Changes**:
- Added thoroughness levels (quick/medium/thorough)
- Added time-boxing (1-6 min)
- Added completeness checklist

**Impact**:
- Success: 60% → 87.5% (+45.8%)
- Time: 4.18 → 2.84 min (-32.1%)
- Quality: 3.6 → 4.25 (+18.1%)

**Root Causes Addressed**:
✅ Scope ambiguity resolved
✅ Time management improved
✅ Completeness awareness added

---

### Iteration 1 → 2: Refinement

**Key Changes**:
- Enhanced completeness verification (by query type)
- Added search strategy (3-phase)
- Refined confidence scoring

**Impact**:
- Success: 87.5% → 90% (+2.5%, stable)
- Time: 2.84 → 2.68 min (-5.6%)
- Quality: 4.25 → 4.3 (+1.2%)

**Root Causes Addressed**:
✅ Test structure coverage gap fixed
✅ Verification process strengthened

---

## Key Learnings

### What Worked

1. **Thoroughness Levels**: Clear guidance on search depth
2. **Time-Boxing**: Prevented runaway queries
3. **Completeness Checklist**: Improved coverage
4. **Phased Search**: Structured approach to exploration

### What Didn't Work

1. **Deployment Query Failed**: Outside agent scope (requires infra knowledge)
   - Solution: Document limitations, suggest alternative agents

### Best Practices Validated

✅ **Start Simple**: v1 was minimal, added structure incrementally
✅ **Measure Everything**: Quantitative metrics guided refinements
✅ **Focus on Patterns**: Fixed systematic failures, not one-off issues
✅ **Validate Stability**: 2-iteration convergence confirmed reliability

---

## Production Deployment

**Status**: ✅ Production-ready (v3)
**Confidence**: High (90% success, 2 iterations stable)

**Deployment**:
```bash
# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md

# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min
```

**Monitoring**:
- Track success rate (alert if <80%)
- Monitor time (alert if >3.5 min avg)
- Review failures weekly

---

## Future Enhancements (v4+)

**Potential Improvements**:
1. **Context Caching**: Reuse codebase knowledge across queries (Est: -20% time)
2. **Query Classification**: Auto-detect thoroughness level (Est: +5% success)
3. **Result Ranking**: Prioritize most relevant findings (Est: +10% quality)

**Decision**: Hold v3, monitor for 2 weeks before v4

---

**Source**: Bootstrap-005 Agent Prompt Evolution
**Agent**: Explore
**Final Version**: v3 (90% success, 4.3/5 quality, 2.68 min avg)
**Status**: Production-ready, converged, deployed