
Agent Prompt Evolution Framework

Version: 1.0
Purpose: Systematic methodology for evolving agent prompts through iterative refinement
Basis: BAIME OCA cycle applied to prompt engineering


Overview

Agent prompt evolution applies the Observe-Codify-Automate (OCA) cycle to prompt engineering, improving agent prompts through empirical testing and structured refinement.

Goal: Transform initial agent prompts into production-quality prompts through measured iterations.


Evolution Cycle

Iteration N:
  Observe → Analyze → Refine → Test → Measure
     ↑                                    ↓
     └────────── Feedback Loop ──────────┘
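
The loop below is a minimal Python sketch of one way to drive this cycle programmatically. The observe, analyze, refine, and meets_targets callables are hypothetical stand-ins supplied by the caller (they are not defined by the framework), and the Test phase is folded into the second observe call:

from typing import Callable

Prompt = str
Observation = dict  # whatever your harness records per run

def evolve(
    prompt: Prompt,
    observe: Callable[[Prompt], Observation],
    analyze: Callable[[Observation], list[str]],
    refine: Callable[[Prompt, list[str]], Prompt],
    meets_targets: Callable[[Observation, Observation], bool],
    max_iterations: int = 5,
) -> Prompt:
    """Drive Observe → Analyze → Refine → Test → Measure until stable for 2 rounds."""
    baseline = observe(prompt)                # Iteration 0 baseline
    stable_rounds = 0
    for _ in range(max_iterations):
        patterns = analyze(baseline)          # failure patterns from this round
        prompt = refine(prompt, patterns)     # one focused prompt change
        current = observe(prompt)             # Test: re-run tasks with the new prompt
        stable_rounds = stable_rounds + 1 if meets_targets(baseline, current) else 0
        if stable_rounds >= 2:                # convergence: 2 stable iterations
            break
        baseline = current
    return prompt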

Phase 1: Observe (30 min)

Run Agent with Current Prompt

Activities:

  1. Execute agent on 5-10 representative tasks
  2. Record agent behavior and outputs
  3. Note successes and failures
  4. Measure performance metrics

Metrics:

  • Success rate (tasks completed correctly)
  • Response quality (accuracy, completeness)
  • Efficiency (time, token usage)
  • Error patterns

Example:

## Iteration 0: Baseline Observation

**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)

**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early

**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
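
One lightweight way to capture these observations in a machine-readable form is sketched below; the field names are illustrative, not prescribed by the framework:

from dataclasses import dataclass, field

@dataclass
class TaskObservation:
    """One agent run recorded during the Observe phase."""
    query: str
    success: bool
    quality: float        # 1-5 rating
    minutes: float        # wall-clock time
    notes: str = ""       # e.g. "too broad, agent confused"

@dataclass
class IterationLog:
    observations: list[TaskObservation] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return sum(o.success for o in self.observations) / len(self.observations)

    @property
    def avg_quality(self) -> float:
        return sum(o.quality for o in self.observations) / len(self.observations)

    @property
    def avg_minutes(self) -> float:
        return sum(o.minutes for o in self.observations) / len(self.observations)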

Phase 2: Analyze (20 min)

Identify Failure Patterns

Analysis Questions:

  1. What types of failures occurred?
  2. Are failures systematic or random?
  3. What context is missing from prompt?
  4. Are instructions clear enough?
  5. Are constraints too loose or too tight?

Example Analysis:

## Failure Pattern Analysis

**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines

**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements

**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines

Phase 3: Refine (25 min)

Update Agent Prompt

Refinement Strategies:

  1. Add Missing Context
    • Domain knowledge
    • Codebase structure
    • Common patterns
  2. Clarify Instructions
    • Break down complex tasks
    • Add examples
    • Define success criteria
  3. Adjust Constraints
    • Time limits
    • Scope boundaries
    • Quality thresholds
  4. Provide Tools
    • Specific commands
    • Search patterns
    • Decision frameworks

Example Refinements:

## Prompt Changes (v0 → v1)

**Added: Thoroughness Guidelines**

When searching for patterns:

  • "quick": Check 3-5 obvious locations
  • "medium": Check 10-15 related files
  • "thorough": Check all matching patterns

**Added: Time-Boxing**

Allocate time based on thoroughness:

  • quick: 1-2 min
  • medium: 2-4 min
  • thorough: 4-6 min

Stop if returns are diminishing after 80% of the time budget has been used.


**Clarified: Success Criteria**

Complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)


Phase 4: Test (20 min)

Validate Refinements

Test Suite:

  1. Re-run tasks that failed in the previous iteration
  2. Add 3-5 new test cases
  3. Measure improvement

Example Test:

## Iteration 1 Testing

**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)

**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS

**Success Rate**: 87.5% (7/8) - improved from 60%
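
The snippet below recomputes the Iteration 1 numbers from this test round, treating the PARTIAL result as a failure; the function and variable names are illustrative:

def summarize_tests(rerun: dict[str, bool], new: dict[str, bool]) -> dict:
    """Summarize a Phase 4 round: re-run failures plus fresh test cases."""
    fixed = sum(rerun.values())
    passed_new = sum(new.values())
    total = len(rerun) + len(new)
    return {
        "fixed": f"{fixed}/{len(rerun)}",
        "new_passed": f"{passed_new}/{len(new)}",
        "success_rate": (fixed + passed_new) / total,
    }

rerun = {"show architecture": True, "find API endpoints": True, "explain auth": True}
new = {
    "list database schemas": True,
    "find error handlers": True,
    "show test structure": False,   # PARTIAL: missed integration tests
    "explain config system": True,
    "find CLI commands": True,
}
print(summarize_tests(rerun, new))  # success_rate = 0.875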

Phase 5: Measure (15 min)

Calculate Improvement Metrics

Metrics:

Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality = (new_score - baseline_score) / baseline_score
Δ Efficiency = (baseline_time - new_time) / baseline_time

Example:

## Iteration 1 Metrics

**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%

**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%

**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)

**Overall V_instance**: 0.85 ✅ (target: 0.80)
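
These deltas can be reproduced directly from the Phase 5 formulas; the sketch below does so for the baseline and Iteration 1 figures. V_instance is not computed here because its weighting is defined by BAIME rather than in this document:

def improvement_metrics(baseline: dict, current: dict) -> dict:
    """Relative deltas per the Phase 5 formulas (positive = better)."""
    return {
        "d_success": (current["success"] - baseline["success"]) / baseline["success"],
        "d_quality": (current["quality"] - baseline["quality"]) / baseline["quality"],
        "d_efficiency": (baseline["minutes"] - current["minutes"]) / baseline["minutes"],
    }

baseline = {"success": 0.60, "quality": 3.1, "minutes": 4.2}
iteration1 = {"success": 0.875, "quality": 4.2, "minutes": 2.8}
print(improvement_metrics(baseline, iteration1))
# {'d_success': 0.458..., 'd_quality': 0.354..., 'd_efficiency': 0.333...}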

Convergence Criteria

Prompt is production-ready when:

  1. Success Rate ≥ 85% (reliable)
  2. Quality Score ≥ 4.0/5 (high quality)
  3. Efficiency within target (time/tokens)
  4. Stable for 2 iterations (no regression)

Example Convergence:

Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)

CONVERGED: Ready for production
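
A convergence check over the iteration history might look like the sketch below. The efficiency threshold (max_minutes) is an assumed stand-in for "within target", and reading criterion 4 as "no regression across the last two iterations" is one interpretation:

def converged(history: list[dict],
              min_success: float = 0.85,
              min_quality: float = 4.0,
              max_minutes: float = 3.0) -> bool:
    """True when the last two iterations meet all targets with no regression."""
    if len(history) < 2:
        return False                          # need 2 stable iterations
    meets_bar = all(
        it["success"] >= min_success
        and it["quality"] >= min_quality
        and it["minutes"] <= max_minutes
        for it in history[-2:]
    )
    no_regression = (
        history[-1]["success"] >= history[-2]["success"]
        and history[-1]["quality"] >= history[-2]["quality"]
    )
    return meets_bar and no_regression

history = [
    {"success": 0.60, "quality": 3.1, "minutes": 4.2},   # Iteration 0
    {"success": 0.875, "quality": 4.2, "minutes": 2.8},  # Iteration 1
    {"success": 0.90, "quality": 4.3, "minutes": 2.6},   # Iteration 2
]
print(converged(history))  # True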

Evolution Patterns

Pattern 1: Scope Definition

Problem: Agent doesn't know how broad/deep to search

Solution: Add thoroughness parameter

When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough

Pattern 2: Early Termination

Problem: Agent stops too early, misses results

Solution: Add completeness checklist

Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining

Pattern 3: Time Management

Problem: Agent runs too long, poor efficiency

Solution: Add time-boxing with checkpoints

Allocate time budget:
- 0-30%: Initial broad search
- 30-70%: Deep investigation
- 70-100%: Verification and summary

Stop if <10% new findings in last 20% of time.
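
The stopping rule can be made concrete with a small helper like the one below; the names and the reading of the rule (findings in the last 20% of the budget versus total findings) are illustrative assumptions:

def should_stop(elapsed_min: float, budget_min: float,
                recent_findings: int, total_findings: int) -> bool:
    """Stop when the last 20% of the budget produced <10% of all findings."""
    if elapsed_min < 0.8 * budget_min:
        return False                          # still inside the main search window
    if total_findings == 0:
        return elapsed_min >= budget_min      # nothing found: stop only at the hard limit
    return recent_findings / total_findings < 0.10

# e.g. 5 min into a 6 min "thorough" budget, 1 of 14 findings came in the last window
print(should_stop(elapsed_min=5.0, budget_min=6.0, recent_findings=1, total_findings=14))  # True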

Pattern 4: Context Accumulation

Problem: Agent forgets earlier findings

Solution: Add intermediate summaries

After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy

Pattern 5: Quality Assurance

Problem: Agent provides low-quality outputs

Solution: Add self-review checklist

Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)

Iteration Template

## Iteration N: [Focus Area]

### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min

**Key Issues**:
1. [Issue description]
2. [Issue description]

### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])

### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]

### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%

### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]

**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]

Best Practices

Do's

  • Test on diverse cases - Cover edge cases and common queries
  • Measure objectively - Use quantitative metrics
  • Iterate quickly - 90-120 min per iteration
  • Focus improvements - One major change per iteration
  • Validate stability - Test 2 iterations for convergence

Don'ts

  • Don't overtune - Avoid overfitting to test cases
  • Don't skip baselines - Always measure Iteration 0
  • Don't ignore regressions - Track quality across iterations
  • Don't add complexity - Keep prompts concise
  • Don't stop too early - Ensure 2-iteration stability


Example: Explore Agent Evolution

Baseline (Iteration 0):

  • Generic instructions
  • No thoroughness guidance
  • No time management
  • Success: 60%

Iteration 1:

  • Added thoroughness levels
  • Added time-boxing
  • Success: 87.5% (+45.8%)

Iteration 2:

  • Added completeness checklist
  • Refined search strategy
  • Success: 90% (+2.5 points over Iteration 1, stable)

Convergence: 2 iterations, 87.5% → 90% stable


Source: BAIME Agent Prompt Evolution Framework
Status: Production-ready, validated across 13 agent types
Average Improvement: +42% success rate over baseline