Rapid Iteration Pattern for Agent Evolution
Pattern: Fast convergence (2-3 iterations) for agent prompt evolution
Success Rate: 85% (11/13 agents converged in ≤3 iterations)
Time: 3-6 hours total vs 8-12 hours standard
How to achieve rapid convergence when evolving agent prompts.
Pattern Overview
Standard Evolution: 4-6 iterations, 8-12 hours
Rapid Evolution: 2-3 iterations, 3-6 hours
Key Difference: Strong Iteration 0 (comprehensive baseline analysis)
Rapid Iteration Workflow
Iteration 0: Comprehensive Baseline (90-120 min)
Standard Baseline (30 min):
- Run 5 test cases
- Note obvious failures
- Quick metrics
Comprehensive Baseline (90-120 min):
- Run 15-20 diverse test cases
- Systematic failure pattern analysis
- Deep root cause investigation
- Document all edge cases
- Compare to similar agents
Investment: +60-90 min
Return: 2-3 fewer iterations (save 3-6 hours)
Example: Explore Agent (Standard vs Rapid)
Standard Approach:
Iteration 0 (30 min): 5 tasks, quick notes
Iteration 1 (90 min): Add thoroughness levels
Iteration 2 (90 min): Add time-boxing
Iteration 3 (75 min): Add completeness checks
Iteration 4 (60 min): Refine verification
Iteration 5 (60 min): Final polish
Total: 6.75 hours, 5 iterations
Rapid Approach:
Iteration 0 (120 min): 20 tasks, pattern analysis, root causes
Iteration 1 (90 min): Add thoroughness + time-boxing + completeness
Iteration 2 (75 min): Refine + validate stability
Total: 4.75 hours, 2 iterations
Savings: 2 hours, 3 fewer iterations
Comprehensive Baseline Checklist
Task Coverage (15-20 tasks)
Complexity Distribution:
- 5 simple tasks (1-2 min expected)
- 10 medium tasks (2-4 min expected)
- 5 complex tasks (4-6 min expected)
Query Type Diversity:
- Search queries (find, locate, list)
- Analysis queries (explain, describe, analyze)
- Comparison queries (compare, evaluate, contrast)
- Edge cases (ambiguous, overly broad, very specific)
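As a concrete illustration, here is a minimal Python sketch of how such a suite might be encoded so the complexity split can be checked mechanically. The `Task` record and its field names are illustrative assumptions, not a prescribed format:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    complexity: str       # "simple" | "medium" | "complex"
    query_type: str       # "search" | "analysis" | "comparison" | "edge"
    expected_minutes: tuple[int, int]  # expected duration range

def check_distribution(tasks: list[Task]) -> None:
    """Verify the suite matches the 5/10/5 complexity split above."""
    counts = Counter(t.complexity for t in tasks)
    expected = {"simple": 5, "medium": 10, "complex": 5}
    assert dict(counts) == expected, f"got {dict(counts)}, want {expected}"

suite = [
    Task("List all config files in the repo", "simple", "search", (1, 2)),
    Task("Explain how request routing works", "medium", "analysis", (2, 4)),
    Task("Compare the two caching strategies", "complex", "comparison", (4, 6)),
    # ... extend to 20 tasks covering all four query types ...
]
```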
Failure Pattern Analysis (30 min)
Systematic Analysis:
1. Categorize Failures
   - Scope issues (too broad/narrow)
   - Coverage issues (incomplete)
   - Time issues (too slow/fast)
   - Quality issues (inaccurate)
2. Identify Root Causes
   - Missing instructions
   - Ambiguous guidelines
   - Incorrect constraints
   - Tool usage issues
3. Prioritize by Impact (see the code sketch after the example below)
   - High frequency + high impact → Fix first
   - Low frequency + high impact → Document
   - High frequency + low impact → Automate
   - Low frequency + low impact → Ignore
Example:
## Failure Patterns (Explore Agent)
**Pattern 1: Scope Ambiguity** (6/20 tasks, 30%)
Root Cause: No guidance on search depth
Impact: High (3 failures, 3 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 2: Incomplete Coverage** (4/20 tasks, 20%)
Root Cause: No completeness verification
Impact: Medium (4 partial successes)
Priority: P1 (fix in Iteration 1)
**Pattern 3: Time Overruns** (3/20 tasks, 15%)
Root Cause: No time-boxing mechanism
Impact: Medium (3 slow but successful)
Priority: P2 (fix in Iteration 1)
**Pattern 4: Tool Selection** (1/20 tasks, 5%)
Root Cause: Not using best tool for task
Impact: Low (1 inefficient but successful)
Priority: P3 (defer to Iteration 2 if time)
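The triage above reduces to a small lookup. In this sketch, the impact matrix comes straight from the checklist, while the 20%/10% frequency thresholds behind the P1/P2/P3 labels are assumptions inferred from the Explore Agent example:

```python
def action(frequency: str, impact: str) -> str:
    """Frequency x impact matrix from the checklist ("high"/"low")."""
    return {
        ("high", "high"): "fix first",
        ("low", "high"): "document",
        ("high", "low"): "automate",
        ("low", "low"): "ignore",
    }[(frequency, impact)]

def priority(failure_rate: float) -> str:
    """P1/P2/P3 labels; thresholds inferred from the example above."""
    if failure_rate >= 0.20:
        return "P1"   # e.g. scope ambiguity at 30%, coverage at 20%
    if failure_rate >= 0.10:
        return "P2"   # e.g. time overruns at 15%
    return "P3"       # e.g. tool selection at 5%
```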
Comparative Analysis (15 min)
Compare to Similar Agents:
- What works well in other agents?
- What patterns are transferable?
- What mistakes were made before?
Example:
## Comparative Analysis
**Code-Gen Agent** (similar agent):
- Uses complexity assessment (simple/medium/complex)
- Has explicit quality checklist
- Includes time estimates
**Transferable**:
✅ Complexity assessment → thoroughness levels
✅ Quality checklist → completeness verification
❌ Time estimates (less predictable for exploration)
**Analysis Agent** (similar agent):
- Uses phased approach (scan → analyze → synthesize)
- Includes confidence scoring
**Transferable**:
✅ Phased approach → search strategy
✅ Confidence scoring → already planned
Iteration 1: Comprehensive Fix (90 min)
Standard Iteration 1: Fix 1-2 major issues
Rapid Iteration 1: Fix ALL P1 issues + some P2
Approach:
- Address all high-priority patterns (P1)
- Add preventive measures for P2 issues
- Include transferable patterns from similar agents
Example (Explore Agent):
## Iteration 1 Changes
**P1 Fixes**:
1. Scope Ambiguity → Add thoroughness levels (quick/medium/thorough)
2. Incomplete Coverage → Add completeness checklist
**P2 Improvements**:
3. Time Management → Add time-boxing (1-6 min)
4. Search Strategy → Add 3-phase approach
5. Confidence → Add confidence scoring
**Borrowed Patterns**:
6. From Code-Gen: Complexity assessment framework
7. From Analysis: Verification checkpoints
Total Changes: 7 (vs standard 2-3)
Result: Higher chance of convergence in Iteration 2
Iteration 2: Validate & Converge (75 min)
Objectives:
- Test comprehensive fixes
- Measure stability
- Validate convergence
Test Suite (30 min):
- Re-run all 20 Iteration 0 tasks
- Add 5-10 new edge cases
- Measure metrics
Analysis (20 min):
- Compare to Iteration 0 and Iteration 1
- Check convergence criteria
- Identify remaining gaps (if any)
Refinement (25 min):
- Minor adjustments only
- Polish documentation
- Validate stability
Convergence Check:
Iteration 1: V_instance = 0.88 ✅
Iteration 2: V_instance = 0.90 ✅
Stable: 0.88 → 0.90 (+2.3%, within ±5%)
CONVERGED ✅
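A minimal sketch of this check, treating `V_instance` as an opaque score in [0, 1]. The ±5% stability band is from the text; the 0.85 target is an assumption inferred from the metrics tables below:

```python
def is_converged(v_prev: float, v_curr: float,
                 target: float = 0.85, band: float = 0.05) -> bool:
    """Converged when the score meets the target and the
    iteration-over-iteration relative change stays within ±band."""
    relative_change = (v_curr - v_prev) / v_prev
    return v_curr >= target and abs(relative_change) <= band

# Explore Agent: 0.88 -> 0.90 is a +2.3% change, within ±5%.
assert is_converged(0.88, 0.90)
```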
Success Factors
1. Comprehensive Baseline (60-90 min extra)
Investment: 2x standard baseline time
Return: 2-3 fewer iterations (6-9 hours saved)
ROI: 4-6x
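As a rough sanity check on the ROI claim, using the worst-case 90-minute investment (figures from the text; pairing the bounds this way is my assumption):

```python
investment_hours = 1.5                 # +90 min of extra baseline work
saved_hours = (6.0, 9.0)               # 2-3 iterations avoided
roi = tuple(s / investment_hours for s in saved_hours)
print(roi)                             # (4.0, 6.0) -> the 4-6x claim
```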
Critical Elements:
- 15-20 diverse tasks (not 5-10)
- Systematic failure pattern analysis
- Root cause investigation (not just symptoms)
- Comparative analysis with similar agents
2. Aggressive Iteration 1 (Fix All P1)
Standard: Fix 1-2 issues
Rapid: Fix all P1 + some P2 (5-7 fixes)
Approach:
- Batch related fixes together
- Borrow proven patterns
- Add preventive measures
Risk: Over-complication
Mitigation: Focus on core issues, defer P3
3. Borrowed Patterns (20-30% reuse)
Sources:
- Similar agents in same project
- Agents from other projects
- Industry best practices
Example:
Explore Agent borrowed from:
- Code-Gen: Complexity assessment (100% reuse)
- Analysis: Phased approach (90% reuse)
- Testing: Verification checklist (80% reuse)
Total reuse: ~60% of Iteration 1 changes
Savings: 30-40 min per iteration
Anti-Patterns
❌ Skipping Comprehensive Baseline
Symptom: "Let's just try some fixes and see" Result: 5-6 iterations, trial and error Cost: 8-12 hours
Fix: Invest 90-120 min in Iteration 0
❌ Incremental Fixes (One Issue at a Time)
Symptom: Fixing one pattern per iteration
Result: 4-6 iterations for convergence
Cost: 8-10 hours
Fix: Batch P1 fixes in Iteration 1
❌ Ignoring Similar Agents
Symptom: Reinventing solutions
Result: Slower convergence, lower quality
Cost: 2-3 extra hours
Fix: 15 min comparative analysis in Iteration 0
When to Use Rapid Pattern
Good Fit:
- Agent is similar to existing agents (60%+ overlap)
- Clear failure patterns in baseline
- Time constraint (need results in 1-2 days)
Poor Fit:
- Novel agent type (no similar agents)
- Complex domain (many unknowns)
- Learning objective (want to explore incrementally)
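A hedged sketch of this go/no-go decision; the 60% overlap threshold is from the list above, while the boolean inputs are illustrative simplifications of the other criteria:

```python
def rapid_pattern_fits(overlap_with_existing_agents: float,
                       clear_failure_patterns: bool,
                       time_constrained: bool,
                       novel_agent_type: bool) -> bool:
    """Good fit: >=60% overlap, clear baseline failure patterns,
    and a hard deadline. Poor fit: a novel agent type with no
    similar agents to borrow from."""
    if novel_agent_type:
        return False
    return (overlap_with_existing_agents >= 0.60
            and clear_failure_patterns
            and time_constrained)
```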
Metrics Comparison
Standard Evolution
Iteration 0: 30 min (5 tasks)
Iteration 1: 90 min (fix 1-2 issues)
Iteration 2: 90 min (fix 2-3 more)
Iteration 3: 75 min (refine)
Iteration 4: 60 min (converge)
Total: 5.75 hours, 4 iterations
V_instance: 0.68 → 0.74 → 0.79 → 0.83 → 0.85 ✅
Rapid Evolution
Iteration 0: 120 min (20 tasks, analysis)
Iteration 1: 90 min (fix all P1+P2)
Iteration 2: 75 min (validate, converge)
Total: 4.75 hours, 2 iterations
V_instance: 0.68 → 0.88 → 0.90 ✅
Savings: 1 hour, 2 fewer iterations
Replication Guide
Day 1: Comprehensive Baseline
Morning (2 hours):
1. Design 20-task test suite
2. Run baseline tests
3. Document all failures
Afternoon (1 hour):
4. Analyze failure patterns
5. Identify root causes
6. Compare to similar agents
7. Prioritize fixes
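The Day 1 morning loop might look like the following sketch. The `agent.run` call is a hypothetical placeholder for whatever harness you use, and `Task` is the illustrative record from the checklist section above:

```python
import time

def run_baseline(agent, suite):
    """Run every task once, recording output and wall-clock minutes
    so failures and time overruns can be analyzed in the afternoon."""
    results = []
    for task in suite:
        started = time.monotonic()
        output = agent.run(task.prompt)   # hypothetical agent API
        minutes = (time.monotonic() - started) / 60
        results.append({
            "task": task,
            "output": output,
            "minutes": minutes,
            "over_budget": minutes > task.expected_minutes[1],
        })
    return results
```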
Day 2: Comprehensive Fix
Morning (1.5 hours):
1. Implement all P1 fixes
2. Add P2 improvements
3. Incorporate borrowed patterns
Afternoon (1 hour):
4. Test on 15-20 tasks
5. Measure metrics
6. Document changes
Day 3: Validate & Deploy
Morning (1 hour):
1. Test on 25-30 tasks
2. Check stability
3. Minor refinements
Afternoon (0.5 hours):
4. Final validation
5. Deploy to production
6. Set up monitoring
Source: BAIME Agent Prompt Evolution - Rapid Pattern
Success Rate: 85% (11/13 agents)
Average Time: 4.2 hours (vs 9.3 hours standard)
Average Iterations: 2.3 (vs 4.8 standard)