# Agent Prompt Evolution Framework

**Version**: 1.0
**Purpose**: Systematic methodology for evolving agent prompts through iterative refinement
**Basis**: BAIME Observe-Codify-Automate (OCA) cycle applied to prompt engineering
## Overview

Agent prompt evolution applies the Observe-Codify-Automate (OCA) cycle to improve agent prompts through empirical testing and structured refinement.

**Goal**: Transform initial agent prompts into production-quality prompts through measured iterations.
## Evolution Cycle

```
Iteration N:
Observe → Analyze → Refine → Test → Measure
   ↑                                   ↓
   └────────── Feedback Loop ──────────┘
```
## Phase 1: Observe (30 min)

### Run Agent with Current Prompt

**Activities**:
- Execute the agent on 5-10 representative tasks
- Record agent behavior and outputs
- Note successes and failures
- Measure performance metrics

**Metrics**:
- Success rate (tasks completed correctly)
- Response quality (accuracy, completeness)
- Efficiency (time, token usage)
- Error patterns
**Example**:

```
## Iteration 0: Baseline Observation

**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)

**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early

**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
```
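A minimal harness sketch for collecting these observations; `run_agent` and `rate_quality` are hypothetical stand-ins for your agent runner and quality rubric, not part of the framework:

```python
import time
from dataclasses import dataclass

@dataclass
class TaskResult:
    query: str
    success: bool   # task completed correctly?
    quality: float  # 1-5 rubric rating
    minutes: float  # wall-clock minutes

def observe(tasks, run_agent, rate_quality):
    """Run the agent on each task and collect the Phase 1 metrics."""
    results = []
    for query in tasks:
        start = time.monotonic()
        output = run_agent(query)  # hypothetical: returns a dict with "success"
        minutes = (time.monotonic() - start) / 60
        results.append(TaskResult(query, output.get("success", False),
                                  rate_quality(query, output), minutes))
    n = len(results)
    print(f"Success rate: {sum(r.success for r in results) / n:.0%}")
    print(f"Avg quality:  {sum(r.quality for r in results) / n:.1f}/5")
    print(f"Avg time:     {sum(r.minutes for r in results) / n:.1f} min")
    return results
```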
## Phase 2: Analyze (20 min)

### Identify Failure Patterns

**Analysis questions**:
- What types of failures occurred?
- Are failures systematic or random?
- What context is missing from the prompt?
- Are the instructions clear enough?
- Are the constraints too loose or too tight?
**Example analysis**:

```
## Failure Pattern Analysis

**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines

**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements

**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines
```
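Whether failures are systematic or random can be checked mechanically by tagging each failure with a suspected pattern and counting the tags. The tags and the extra queries below are illustrative, mirroring the example analysis above:

```python
from collections import Counter

# Failures from Phase 1, each tagged with a suspected pattern (illustrative).
failures = [
    ("show architecture",  "scope_ambiguity"),
    ("give overview",      "scope_ambiguity"),
    ("summarize design",   "scope_ambiguity"),
    ("find API endpoints", "search_coverage"),
    ("explain auth",       "search_coverage"),
    ("trace request flow", "time_management"),
    ("map dependencies",   "time_management"),
]

for pattern, count in Counter(tag for _, tag in failures).most_common():
    print(f"{pattern}: {count} failure(s)")
# Patterns with 2+ occurrences are systematic and worth a targeted fix.
```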
## Phase 3: Refine (25 min)

### Update Agent Prompt

**Refinement strategies**:

1. **Add Missing Context**
   - Domain knowledge
   - Codebase structure
   - Common patterns

2. **Clarify Instructions**
   - Break down complex tasks
   - Add examples
   - Define success criteria

3. **Adjust Constraints**
   - Time limits
   - Scope boundaries
   - Quality thresholds

4. **Provide Tools**
   - Specific commands
   - Search patterns
   - Decision frameworks
**Example refinements**:

```
## Prompt Changes (v0 → v1)

**Added: Thoroughness Guidelines**
When searching for patterns:
- "quick": Check 3-5 obvious locations
- "medium": Check 10-15 related files
- "thorough": Check all matching patterns

**Added: Time-Boxing**
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if diminishing returns after 80% of time used.

**Clarified: Success Criteria**
A complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)
```
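One way to keep such refinements reviewable is to assemble the prompt from named sections, so each iteration changes exactly one block. This is a sketch under that assumption, not the framework's prescribed mechanism:

```python
THOROUGHNESS = """\
When searching for patterns:
- "quick": check 3-5 obvious locations (1-2 min)
- "medium": check 10-15 related files (2-4 min)
- "thorough": check all matching patterns (4-6 min)
Stop if diminishing returns after 80% of the time budget."""

SUCCESS_CRITERIA = """\
A complete search means: all direct matches found, related
implementations identified, cross-references checked, and a
confidence score (Low/Medium/High) provided."""

def build_prompt(base: str, *sections: str) -> str:
    """Compose an agent prompt from a base plus named refinement sections."""
    return "\n\n".join((base, *sections))

# v0 -> v1: the diff is exactly the two new sections.
prompt_v1 = build_prompt(
    "You are a codebase exploration agent.",  # placeholder base prompt
    THOROUGHNESS,
    SUCCESS_CRITERIA,
)
```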
## Phase 4: Test (20 min)

### Validate Refinements

**Test suite**:
- Re-run the failed tasks from Iteration 0
- Add 3-5 new test cases
- Measure improvement
**Example test**:

```
## Iteration 1 Testing

**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)

**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS

**Success Rate**: 87.5% (7/8), up from 60%
```
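Scoring the run is a one-liner once results are recorded; the pairs below reproduce the example above, with the PARTIAL case counted as a failure:

```python
# Results from the v1 run: (query, passed) pairs, as in the example above.
results_v1 = [
    ("show architecture", True), ("find API endpoints", True),
    ("explain auth", True), ("list database schemas", True),
    ("find error handlers", True), ("show test structure", False),  # partial
    ("explain config system", True), ("find CLI commands", True),
]

passed = sum(ok for _, ok in results_v1)
print(f"Success rate: {passed}/{len(results_v1)} = {passed/len(results_v1):.1%}")
# -> 7/8 = 87.5%, up from the 60% baseline
```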
## Phase 5: Measure (15 min)

### Calculate Improvement Metrics

**Metrics**:

```
Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality      = (new_score - baseline_score) / baseline_score
Δ Efficiency   = (baseline_time - new_time) / baseline_time
```
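These deltas are plain relative changes; a small helper keeps the sign conventions straight (efficiency flips the sign because lower time is better). The values reproduce the example that follows:

```python
def pct_delta(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

d_success    = pct_delta(87.5, 60.0)   # +45.8
d_quality    = pct_delta(4.2, 3.1)     # +35.5
d_efficiency = -pct_delta(2.8, 4.2)    # +33.3 (lower time is better)
```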
**Example**:

```
## Iteration 1 Metrics

**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%

**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%

**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)

**Overall V_instance**: 0.85 ✅ (target: 0.80)
```
## Convergence Criteria

A prompt is production-ready when:
- Success rate ≥ 85% (reliable)
- Quality score ≥ 4.0/5 (high quality)
- Efficiency within target (time/tokens)
- Stable for 2 consecutive iterations (no regression)
**Example convergence**:

```
Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)
CONVERGED: Ready for production
```
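The convergence check is mechanical enough to automate. The thresholds below mirror the criteria above, except `max_minutes`, which is an illustrative efficiency target rather than a framework value:

```python
def converged(history, min_success=0.85, min_quality=4.0, max_minutes=3.0):
    """history: one (success_rate, quality, avg_minutes) tuple per iteration."""
    if len(history) < 2:
        return False  # stability needs two consecutive iterations
    prev, last = history[-2:]
    meets = all(s >= min_success and q >= min_quality and t <= max_minutes
                for s, q, t in (prev, last))
    no_regression = last[0] >= prev[0] - 0.02  # tolerate 2 points of noise
    return meets and no_regression

history = [(0.60, 3.1, 4.2), (0.875, 4.2, 2.8), (0.90, 4.3, 2.6)]
print(converged(history))  # True: ready for production
```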
## Evolution Patterns

### Pattern 1: Scope Definition

**Problem**: Agent doesn't know how broad or deep to search.
**Solution**: Add a thoroughness parameter.

```
When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough
```
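In code form, the scope decision is a simple mapping from estimated breadth to a level; the file-count heuristic and the exact cutoffs are assumptions for illustration:

```python
def pick_thoroughness(estimated_files: int) -> str:
    """Map estimated query breadth to a thoroughness level (Pattern 1)."""
    if estimated_files <= 2:
        return "quick"
    if estimated_files <= 10:
        return "medium"
    return "thorough"
```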
### Pattern 2: Early Termination

**Problem**: Agent stops too early, misses results.
**Solution**: Add completeness checklist.

```
Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
```
### Pattern 3: Time Management

**Problem**: Agent runs too long, hurting efficiency.
**Solution**: Add time-boxing with checkpoints.

```
Allocate the time budget:
- 0-30%: initial broad search
- 30-70%: deep investigation
- 70-100%: verification and summary
Stop if <10% new findings in the last 20% of time.
```
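A sketch of the loop this pattern implies; `search_step` is a hypothetical callable that runs one search round and receives the elapsed fraction so it can shift from broad search to verification:

```python
import time

def time_boxed_search(search_step, budget_min: float):
    """Pattern 3 sketch: search within a time budget, stopping early when
    the trailing 20% of elapsed time yielded <10% of all findings."""
    budget = budget_min * 60.0
    start = time.monotonic()
    found = []  # (elapsed_fraction, finding) pairs
    while (frac := (time.monotonic() - start) / budget) < 1.0:
        for finding in search_step(frac):  # hypothetical one-round search
            found.append((frac, finding))
        recent = sum(1 for f, _ in found if f > frac - 0.2)
        # Check only past the halfway point, so the trailing window is meaningful.
        if frac > 0.5 and found and recent < 0.10 * len(found):
            break  # diminishing returns: stop early
    return [finding for _, finding in found]
```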
### Pattern 4: Context Accumulation

**Problem**: Agent forgets earlier findings.
**Solution**: Add intermediate summaries.

```
After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy
```
### Pattern 5: Quality Assurance

**Problem**: Agent produces low-quality outputs.
**Solution**: Add a self-review checklist.

```
Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)
```
## Iteration Template

```
## Iteration N: [Focus Area]

### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min

**Key Issues**:
1. [Issue description]
2. [Issue description]

### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])

### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]

### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%

### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]

**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]
```
## Best Practices

### Do's
- ✅ **Test on diverse cases**: cover edge cases and common queries
- ✅ **Measure objectively**: use quantitative metrics
- ✅ **Iterate quickly**: 90-120 min per iteration
- ✅ **Focus improvements**: one major change per iteration
- ✅ **Validate stability**: test 2 iterations for convergence

### Don'ts
- ❌ **Don't overtune**: avoid overfitting to test cases
- ❌ **Don't skip baselines**: always measure Iteration 0
- ❌ **Don't ignore regressions**: track quality across iterations
- ❌ **Don't add complexity**: keep prompts concise
- ❌ **Don't stop too early**: ensure 2-iteration stability
## Example: Explore Agent Evolution

**Baseline (Iteration 0)**:
- Generic instructions
- No thoroughness guidance
- No time management
- Success: 60%

**Iteration 1**:
- Added thoroughness levels
- Added time-boxing
- Success: 87.5% (+45.8% relative)

**Iteration 2**:
- Added completeness checklist
- Refined search strategy
- Success: 90% (+2.5 points, stable)

**Convergence**: 2 stable iterations, 87.5% → 90%
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Average Improvement**: +42% success rate over baseline