
Explore Agent Evolution: v1 → v3

  • Agent: Explore (codebase exploration)
  • Iterations: 3
  • Improvement: 60% → 90% success rate (+50%)
  • Time: 4.2 min → 2.6 min (-38%)
  • Status: Converged (production-ready)

Complete walkthrough of evolving Explore agent prompt through BAIME methodology.


Iteration 0: Baseline (v1)

Initial Prompt

# Explore Agent

You are a codebase exploration agent. Your task is to help users understand
code structure, find implementations, and explain how things work.

When given a query:
1. Use Glob to find relevant files
2. Use Grep to search for patterns
3. Read files to understand implementations
4. Provide a summary

Tools available: Glob, Grep, Read, Bash

Prompt Length: 58 lines (excerpt shown above)


Baseline Testing (10 tasks)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1 | "show architecture" | Failed | 2/5 | 5.2 min |
| 2 | "find API endpoints" | ⚠️ Partial | 3/5 | 4.8 min |
| 3 | "explain auth" | ⚠️ Partial | 3/5 | 6.1 min |
| 4 | "list CLI commands" | Success | 4/5 | 2.8 min |
| 5 | "find database code" | Success | 5/5 | 3.2 min |
| 6 | "show test structure" | Failed | 2/5 | 4.5 min |
| 7 | "explain config" | Success | 4/5 | 3.9 min |
| 8 | "find error handlers" | Success | 5/5 | 2.9 min |
| 9 | "show imports" | Success | 4/5 | 3.1 min |
| 10 | "find middleware" | Success | 4/5 | 5.3 min |

Baseline Metrics:

  • Success Rate: 60% (6/10)
  • Average Quality: 3.6/5
  • Average Time: 4.18 min
  • V_instance: 0.68 (below the 0.80 target; see the scoring sketch below)
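
These aggregates can be recomputed directly from the task table. This document does not reproduce the BAIME definition of V_instance, so the composite below is a hedged sketch: a weighted blend of success rate, quality, and speed with illustrative 0.5/0.3/0.2 weights, not the actual scoring function.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    success: bool   # partial results are counted as failures
    quality: int    # 1-5 rubric score
    minutes: float  # wall-clock time

def v_instance(results: list[TaskResult], time_budget: float = 6.0) -> float:
    """Hypothetical V_instance: weighted blend of success, quality, speed.

    The weights and normalization are illustrative assumptions; the real
    BAIME scoring function is not shown in this document.
    """
    n = len(results)
    success = sum(r.success for r in results) / n        # 0..1
    quality = sum(r.quality for r in results) / (5 * n)  # 0..1
    avg_min = sum(r.minutes for r in results) / n
    speed = max(0.0, 1.0 - avg_min / time_budget)        # 0..1
    return 0.5 * success + 0.3 * quality + 0.2 * speed
```

Under these illustrative weights the baseline scores roughly 0.58 rather than the reported 0.68, which simply underlines that the real weighting differs.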

Failure Analysis

Pattern 1: Scope Ambiguity (Tasks 1, 2, 3)

  • Queries too broad ("architecture", "auth")
  • Agent doesn't know search depth
  • Either stops too early or runs too long

Pattern 2: Incomplete Coverage (Tasks 2, 6)

  • Agent finds 1-2 files, stops
  • Misses related implementations
  • No verification of completeness

Pattern 3: Time Management (Tasks 1, 3, 10)

  • Long-running queries (>5 min)
  • Diminishing returns after 3 min
  • No time-boxing mechanism

Iteration 1: Add Structure (v2)

Prompt Changes

Added: Thoroughness Guidelines

## Thoroughness Levels

Assess query complexity and choose thoroughness:

**quick** (1-2 min):
- Check 3-5 obvious locations
- Direct pattern matches only
- Use for simple lookups

**medium** (2-4 min):
- Check 10-15 related files
- Follow cross-references
- Use for typical queries

**thorough** (4-6 min):
- Comprehensive search across codebase
- Deep dependency analysis
- Use for architecture questions
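
At runtime, the level choice can be operationalized with a simple heuristic. A minimal sketch, assuming plain-string queries; the keyword lists are illustrative, not taken from the prompt:

```python
def choose_thoroughness(query: str) -> str:
    """Pick quick / medium / thorough from surface features of the query."""
    q = query.lower()
    # Broad, structural questions warrant a comprehensive pass.
    if any(kw in q for kw in ("architecture", "overview", "structure", "design")):
        return "thorough"
    # Direct enumerations usually have a handful of obvious locations.
    if any(kw in q for kw in ("list", "imports", "commands")):
        return "quick"
    return "medium"  # default for typical find/show/explain queries
```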

Added: Time-Boxing

## Time Management

Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min

Stop if <10% new findings in last 20% of time budget.
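
The stopping rule is mechanical enough to check in code. A minimal sketch, assuming the agent logs how many new findings each search step produced and when; only the 80%/10% thresholds come from the rule above:

```python
def should_stop(events: list[tuple[float, int]], budget: float) -> bool:
    """events: (elapsed_minutes, new_findings) per search step.

    Once past 80% of the time budget, stop if the last 20% of the budget
    produced fewer than 10% of all findings so far.
    """
    if not events or events[-1][0] < 0.8 * budget:
        return False  # still inside the main search window
    total = sum(n for _, n in events)
    tail = sum(n for t, n in events if t >= 0.8 * budget)
    return total == 0 or tail / total < 0.10

# Example: 12 findings in the first 3 minutes, none after the 80% mark
# of a 4-minute budget -> stop.
assert should_stop([(0.5, 8), (1.5, 4), (3.5, 0)], budget=4.0)
```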

Added: Completeness Checklist

## Before Responding

Verify completeness:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

State confidence level: Low / Medium / High

Prompt Length: 112 lines (+54)


Testing (8 tasks: 3 re-tests + 5 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 1R | "show architecture" | Success | 4/5 | 3.8 min |
| 2R | "find API endpoints" | Success | 5/5 | 2.9 min |
| 3R | "explain auth" | Success | 4/5 | 3.2 min |
| 11 | "list database schemas" | Success | 5/5 | 2.1 min |
| 12 | "find error handlers" | Success | 4/5 | 2.5 min |
| 13 | "show test structure" | ⚠️ Partial | 3/5 | 3.6 min |
| 14 | "explain config system" | Success | 5/5 | 2.4 min |
| 15 | "find CLI commands" | Success | 4/5 | 2.2 min |

Iteration 1 Metrics:

  • Success Rate: 87.5% (7/8), +45.8% vs. baseline
  • Average Quality: 4.25/5, +18.1%
  • Average Time: 2.84 min, -32.1%
  • V_instance: 0.88 (exceeds the 0.80 target)

Key Improvements

  • Fixed scope ambiguity (Tasks 1R, 2R, 3R all succeeded)
  • Better time management (all tasks <4 min)
  • Higher quality outputs (4.25 avg)
  • ⚠️ Still one partial success (Task 13)

Remaining Issue: Test structure query missed integration tests


Iteration 2: Refine Coverage (v3)

Prompt Changes

Enhanced: Completeness Verification

## Completeness Verification

Before concluding, verify coverage by category:

**For "find" queries**:
□ Main implementations found
□ Related utilities checked
□ Test files reviewed (if applicable)
□ Configuration/setup files checked

**For "show" queries**:
□ Primary structure identified
□ Secondary components listed
□ Relationships mapped
□ Examples provided

**For "explain" queries**:
□ Core mechanism described
□ Key components identified
□ Data flow explained
□ Edge cases noted
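
Encoding these per-type checklists as data keeps the prompt and any automated verification in sync. A minimal sketch; the item strings come from the prompt above, while the dictionary structure and verb-matching helper are assumptions:

```python
COMPLETENESS_CHECKS: dict[str, list[str]] = {
    "find": [
        "Main implementations found",
        "Related utilities checked",
        "Test files reviewed (if applicable)",
        "Configuration/setup files checked",
    ],
    "show": [
        "Primary structure identified",
        "Secondary components listed",
        "Relationships mapped",
        "Examples provided",
    ],
    "explain": [
        "Core mechanism described",
        "Key components identified",
        "Data flow explained",
        "Edge cases noted",
    ],
}

def checklist_for(query: str) -> list[str]:
    """Select the checklist matching the query's leading verb."""
    words = query.lower().split()
    verb = words[0] if words else "find"
    return COMPLETENESS_CHECKS.get(verb, COMPLETENESS_CHECKS["find"])
```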

Added: Search Strategy

## Search Strategy

**Phase 1 (30% of time)**: Broad search
- Glob for file patterns
- Grep for key terms
- Identify main locations

**Phase 2 (50% of time)**: Deep investigation
- Read main files
- Follow references
- Build understanding

**Phase 3 (20% of time)**: Verification
- Check for gaps
- Validate findings
- Prepare summary
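
The 30/50/20 split composes with the v2 thoroughness budgets. A minimal sketch, assuming budgets in minutes; the split percentages and level ranges come from the prompts, while taking each range's midpoint is an assumption:

```python
THOROUGHNESS_MINUTES = {"quick": 1.5, "medium": 3.0, "thorough": 5.0}  # range midpoints
PHASE_SHARE = {"broad_search": 0.30, "deep_investigation": 0.50, "verification": 0.20}

def phase_budgets(thoroughness: str) -> dict[str, float]:
    """Allocate a thoroughness budget across the three search phases."""
    total = THOROUGHNESS_MINUTES[thoroughness]
    return {phase: round(total * share, 2) for phase, share in PHASE_SHARE.items()}

# phase_budgets("medium") -> {"broad_search": 0.9, "deep_investigation": 1.5, "verification": 0.6}
```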

Refined: Confidence Scoring

## Confidence Level

**High**: All major components found, verified complete
**Medium**: Core components found, minor gaps possible
**Low**: Partial findings, significant gaps likely

Always state confidence level and identify known gaps.

Prompt Length: 138 lines (+26)


Testing (10 tasks: 1 re-test + 9 new)

| Task | Query | Result | Quality | Time |
|------|-------|--------|---------|------|
| 13R | "show test structure" | Success | 5/5 | 2.9 min |
| 16 | "find auth middleware" | Success | 5/5 | 2.3 min |
| 17 | "explain routing" | Success | 4/5 | 3.1 min |
| 18 | "list validation rules" | Success | 5/5 | 2.1 min |
| 19 | "find logging setup" | Success | 4/5 | 2.5 min |
| 20 | "show data models" | Success | 5/5 | 2.8 min |
| 21 | "explain caching" | Success | 4/5 | 2.7 min |
| 22 | "find background jobs" | Success | 5/5 | 2.4 min |
| 23 | "show dependencies" | Success | 4/5 | 2.2 min |
| 24 | "explain deployment" | Failed | 2/5 | 3.8 min |

Iteration 2 Metrics:

  • Success Rate: 90% (9/10), +2.5 points (stable)
  • Average Quality: 4.3/5, +1.2%
  • Average Time: 2.68 min, -5.6%
  • V_instance: 0.90 (2 consecutive iterations ≥ 0.80)

CONVERGED


Stability Validation

  • Iteration 1: V_instance = 0.88
  • Iteration 2: V_instance = 0.90
  • Change: +2.3% (stable, within ±5%)

Criteria Met:

  • V_instance ≥ 0.80 for 2 consecutive iterations
  • Success rate ≥ 85%
  • Quality ≥ 4.0
  • Time within budget (<3 min avg)
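
The convergence test itself is mechanical: two consecutive V_instance scores at or above 0.80, with a relative change inside ±5%. A minimal sketch of that check:

```python
def converged(v_history: list[float], target: float = 0.80, tol: float = 0.05) -> bool:
    """True when the last two V_instance scores meet the target and are stable."""
    if len(v_history) < 2:
        return False
    prev, curr = v_history[-2], v_history[-1]
    stable = abs(curr - prev) / prev <= tol  # within ±5% relative change
    return prev >= target and curr >= target and stable

# converged([0.68, 0.88, 0.90]) -> True (0.88 and 0.90 both >= 0.80, change +2.3%)
```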


Final Metrics Comparison

| Metric | v1 (Baseline) | v2 (Iteration 1) | v3 (Iteration 2) | Δ Total |
|--------|---------------|------------------|------------------|---------|
| Success Rate | 60% | 87.5% | 90% | +50% |
| Quality | 3.6/5 | 4.25/5 | 4.3/5 | +19.4% |
| Time | 4.18 min | 2.84 min | 2.68 min | -35.9% |
| V_instance | 0.68 | 0.88 | 0.90 | +32.4% |

Evolution Summary

Iteration 0 → 1: Major Improvements

Key Changes:

  • Added thoroughness levels (quick/medium/thorough)
  • Added time-boxing (1-6 min)
  • Added completeness checklist

Impact:

  • Success: 60% → 87.5% (+45.8%)
  • Time: 4.18 → 2.84 min (-32.1%)
  • Quality: 3.6 → 4.25 (+18.1%)

Root Causes Addressed:

  • Scope ambiguity resolved
  • Time management improved
  • Completeness awareness added


Iteration 1 → 2: Refinement

Key Changes:

  • Enhanced completeness verification (by query type)
  • Added search strategy (3-phase)
  • Refined confidence scoring

Impact:

  • Success: 87.5% → 90% (+2.5 points, stable)
  • Time: 2.84 → 2.68 min (-5.6%)
  • Quality: 4.25 → 4.3 (+1.2%)

Root Causes Addressed:

  • Test structure coverage gap fixed
  • Verification process strengthened


Key Learnings

What Worked

  1. Thoroughness Levels: Clear guidance on search depth
  2. Time-Boxing: Prevented runaway queries
  3. Completeness Checklist: Improved coverage
  4. Phased Search: Structured approach to exploration

What Didn't Work

  1. Deployment Query Failed: Outside agent scope (requires infra knowledge)
    • Solution: Document limitations, suggest alternative agents

Best Practices Validated

  1. Start Simple: v1 was minimal; structure was added incrementally
  2. Measure Everything: Quantitative metrics guided refinements
  3. Focus on Patterns: Fixed systematic failures, not one-off issues
  4. Validate Stability: 2-iteration convergence confirmed reliability


Production Deployment

  • Status: Production-ready (v3)
  • Confidence: High (90% success rate, 2 stable iterations)

Deployment:

# Update agent prompt
cp explore-agent-v3.md .claude/agents/explore.md

# Validate
test-agent-suite explore 20
# Expected: Success ≥ 85%, Quality ≥ 4.0, Time ≤ 3 min

Monitoring (a threshold-check sketch follows the list):

  • Track success rate (alert if <80%)
  • Monitor time (alert if >3.5 min avg)
  • Review failures weekly
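
A minimal sketch of that check, assuming recent task outcomes are available as (success, minutes) pairs; the 80% and 3.5 min thresholds come from the list above, everything else is illustrative:

```python
def monitoring_alerts(results: list[tuple[bool, float]]) -> list[str]:
    """results: (success, minutes) per recent task. Returns triggered alerts."""
    n = len(results)
    success_rate = sum(ok for ok, _ in results) / n
    avg_minutes = sum(t for _, t in results) / n
    alerts = []
    if success_rate < 0.80:
        alerts.append(f"success rate {success_rate:.0%} below 80% threshold")
    if avg_minutes > 3.5:
        alerts.append(f"average time {avg_minutes:.2f} min above 3.5 min threshold")
    return alerts
```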

Future Enhancements (v4+)

Potential Improvements:

  1. Context Caching: Reuse codebase knowledge across queries (Est: -20% time)
  2. Query Classification: Auto-detect thoroughness level (Est: +5% success)
  3. Result Ranking: Prioritize most relevant findings (Est: +10% quality)

Decision: Hold v3, monitor for 2 weeks before v4


Source: Bootstrap-005 Agent Prompt Evolution
Agent: Explore
Final Version: v3 (90% success, 4.3/5 quality, 2.68 min avg)
Status: Production-ready, converged, deployed