Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions

@@ -0,0 +1,395 @@
# Agent Prompt Evolution Framework
**Version**: 1.0
**Purpose**: Systematic methodology for evolving agent prompts through iterative refinement
**Basis**: BAIME OCA cycle applied to prompt engineering
---
## Overview
Agent prompt evolution applies the Observe-Codify-Automate cycle to improve agent prompts through empirical testing and structured refinement.
**Goal**: Transform initial agent prompts into production-quality prompts through measured iterations.
---
## Evolution Cycle
```
Iteration N:
  Observe → Analyze → Refine → Test → Measure
     ↑                                   ↓
     └────────── Feedback Loop ──────────┘
```
---
## Phase 1: Observe (30 min)
### Run Agent with Current Prompt
**Activities**:
1. Execute agent on 5-10 representative tasks
2. Record agent behavior and outputs
3. Note successes and failures
4. Measure performance metrics
**Metrics**:
- Success rate (tasks completed correctly)
- Response quality (accuracy, completeness)
- Efficiency (time, token usage)
- Error patterns
**Example**:
```markdown
## Iteration 0: Baseline Observation
**Agent**: Explore subagent (codebase exploration)
**Tasks**: 10 exploration queries
**Success Rate**: 60% (6/10)
**Failures**:
1. Query "show architecture" → Too broad, agent confused
2. Query "find API endpoints" → Missed 3 key files
3. Query "explain auth" → Incomplete, stopped too early
**Time**: Avg 4.2 min per query (target: 2 min)
**Quality**: 3.1/5 average rating
```
---
## Phase 2: Analyze (20 min)
### Identify Failure Patterns
**Analysis Questions**:
1. What types of failures occurred?
2. Are failures systematic or random?
3. What context is missing from prompt?
4. Are instructions clear enough?
5. Are constraints too loose or too tight?
**Example Analysis**:
```markdown
## Failure Pattern Analysis
**Pattern 1: Scope Ambiguity** (3 failures)
- Queries too broad ("architecture", "overview")
- Agent doesn't know how deep to search
- Fix: Add explicit depth guidelines
**Pattern 2: Search Coverage** (2 failures)
- Agent stops after finding 1-2 files
- Misses related implementations
- Fix: Add thoroughness requirements
**Pattern 3: Time Management** (2 failures)
- Agent runs too long (>5 min)
- Diminishing returns after 2 min
- Fix: Add time-boxing guidelines
```
---
## Phase 3: Refine (25 min)
### Update Agent Prompt
**Refinement Strategies**:
1. **Add Missing Context**
- Domain knowledge
- Codebase structure
- Common patterns
2. **Clarify Instructions**
- Break down complex tasks
- Add examples
- Define success criteria
3. **Adjust Constraints**
- Time limits
- Scope boundaries
- Quality thresholds
4. **Provide Tools**
- Specific commands
- Search patterns
- Decision frameworks
**Example Refinements**:
````markdown
## Prompt Changes (v0 → v1)
**Added: Thoroughness Guidelines**
```
When searching for patterns:
- "quick": Check 3-5 obvious locations
- "medium": Check 10-15 related files
- "thorough": Check all matching patterns
```
**Added: Time-Boxing**
```
Allocate time based on thoroughness:
- quick: 1-2 min
- medium: 2-4 min
- thorough: 4-6 min
Stop if diminishing returns after 80% of time used.
```
**Clarified: Success Criteria**
```
Complete search means:
✓ All direct matches found
✓ Related implementations identified
✓ Cross-references checked
✓ Confidence score provided (Low/Medium/High)
```
````
---
## Phase 4: Test (20 min)
### Validate Refinements
**Test Suite**:
1. Re-run failed tasks from Iteration 0
2. Add 3-5 new test cases
3. Measure improvement
**Example Test**:
```markdown
## Iteration 1 Testing
**Re-run Failed Tasks** (3):
1. "show architecture" → ✅ SUCCESS (added thoroughness=medium)
2. "find API endpoints" → ✅ SUCCESS (found all 5 files)
3. "explain auth" → ✅ SUCCESS (complete explanation)
**New Test Cases** (5):
1. "list database schemas" → ✅ SUCCESS
2. "find error handlers" → ✅ SUCCESS
3. "show test structure" → ⚠️ PARTIAL (missed integration tests)
4. "explain config system" → ✅ SUCCESS
5. "find CLI commands" → ✅ SUCCESS
**Success Rate**: 87.5% (7/8) - improved from 60%
```
---
## Phase 5: Measure (15 min)
### Calculate Improvement Metrics
**Metrics**:
```
Δ Success Rate = (new_rate - baseline_rate) / baseline_rate
Δ Quality = (new_score - baseline_score) / baseline_score
Δ Efficiency = (baseline_time - new_time) / baseline_time
```
**Example**:
```markdown
## Iteration 1 Metrics
**Success Rate**:
- Baseline: 60% (6/10)
- Iteration 1: 87.5% (7/8)
- Improvement: +45.8%
**Quality** (1-5 scale):
- Baseline: 3.1 avg
- Iteration 1: 4.2 avg
- Improvement: +35.5%
**Efficiency**:
- Baseline: 4.2 min avg
- Iteration 1: 2.8 min avg
- Improvement: +33.3% (faster)
**Overall V_instance**: 0.85 ✅ (target: 0.80)
```
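A minimal bash/bc sketch of these delta calculations, with the iteration-1 numbers above hardcoded purely for illustration:
```bash
#!/bin/bash
# Improvement deltas, iteration 1 vs. baseline (values from the example above)
baseline_rate=0.60;  new_rate=0.875
baseline_score=3.1;  new_score=4.2
baseline_time=4.2;   new_time=2.8
d_success=$(echo "scale=3; ($new_rate - $baseline_rate) / $baseline_rate" | bc)
d_quality=$(echo "scale=3; ($new_score - $baseline_score) / $baseline_score" | bc)
d_efficiency=$(echo "scale=3; ($baseline_time - $new_time) / $baseline_time" | bc)
echo "Δ Success Rate: $d_success"   # .458 ≈ +45.8%
echo "Δ Quality:      $d_quality"   # .354 ≈ +35.5%
echo "Δ Efficiency:   $d_efficiency"   # .333 ≈ +33.3%
```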
---
## Convergence Criteria
**Prompt is production-ready when**:
1. **Success Rate ≥ 85%** (reliable)
2. **Quality Score ≥ 4.0/5** (high quality)
3. **Efficiency within target** (time/tokens)
4. **Stable for 2 iterations** (no regression)
**Example Convergence**:
```
Iteration 0: 60% success, 3.1 quality, 4.2 min
Iteration 1: 87.5% success, 4.2 quality, 2.8 min ✅
Iteration 2: 90% success, 4.3 quality, 2.6 min ✅ (stable)
CONVERGED: Ready for production
```
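A small bash sketch of the convergence check for criteria 1, 2, and 4 (efficiency is omitted for brevity; the "no regression" margins of 5 points success and 0.3 quality are an assumption of this sketch):
```bash
#!/bin/bash
# Convergence: current iteration meets the absolute bars AND has not regressed
# versus the previous iteration.
converged() {  # usage: converged <rate> <quality> <prev_rate> <prev_quality>
    local ok
    ok=$(echo "$1 >= 0.85 && $2 >= 4.0 && $1 >= $3 - 0.05 && $2 >= $4 - 0.3" | bc -l)
    if [ "$ok" -eq 1 ]; then
        echo "CONVERGED: ready for production"
    else
        echo "CONTINUE: run another iteration"
    fi
}
converged 0.90 4.3 0.875 4.2   # iteration 2 vs. iteration 1 → CONVERGED
```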
---
## Evolution Patterns
### Pattern 1: Scope Definition
**Problem**: Agent doesn't know how broad/deep to search
**Solution**: Add thoroughness parameter
```markdown
When invoked, assess query complexity:
- Simple (1-2 files): thoroughness=quick
- Medium (5-10 files): thoroughness=medium
- Complex (>10 files): thoroughness=thorough
```
### Pattern 2: Early Termination
**Problem**: Agent stops too early, misses results
**Solution**: Add completeness checklist
```markdown
Before concluding search, verify:
□ All direct matches found (Glob/Grep)
□ Related implementations checked
□ Cross-references validated
□ No obvious gaps remaining
```
### Pattern 3: Time Management
**Problem**: Agent runs too long, poor efficiency
**Solution**: Add time-boxing with checkpoints
```markdown
Allocate time budget:
- 0-30%: Initial broad search
- 30-70%: Deep investigation
- 70-100%: Verification and summary
Stop if <10% new findings in last 20% of time.
```
### Pattern 4: Context Accumulation
**Problem**: Agent forgets earlier findings
**Solution**: Add intermediate summaries
```markdown
After each major finding:
1. Summarize what was found
2. Update mental model
3. Identify remaining gaps
4. Adjust search strategy
```
### Pattern 5: Quality Assurance
**Problem**: Agent provides low-quality outputs
**Solution**: Add self-review checklist
```markdown
Before responding, verify:
□ Answer is accurate and complete
□ Examples are provided
□ Confidence level stated
□ Next steps suggested (if applicable)
```
---
## Iteration Template
```markdown
## Iteration N: [Focus Area]
### Observations (30 min)
- Tasks tested: [count]
- Success rate: [X]%
- Avg quality: [X]/5
- Avg time: [X] min
**Key Issues**:
1. [Issue description]
2. [Issue description]
### Analysis (20 min)
- Pattern 1: [Name] ([frequency])
- Pattern 2: [Name] ([frequency])
### Refinements (25 min)
- Added: [Feature/guideline]
- Clarified: [Instruction]
- Adjusted: [Constraint]
### Testing (20 min)
- Re-test failures: [X]/[Y] fixed
- New tests: [X]/[Y] passed
- Overall success: [X]%
### Metrics (15 min)
- Δ Success: [+/-X]%
- Δ Quality: [+/-X]%
- Δ Efficiency: [+/-X]%
- V_instance: [X.XX]
**Status**: [Converged/Continue]
**Next Focus**: [Area to improve]
```
---
## Best Practices
### Do's
- **Test on diverse cases** - Cover edge cases and common queries
- **Measure objectively** - Use quantitative metrics
- **Iterate quickly** - 90-120 min per iteration
- **Focus improvements** - One major change per iteration
- **Validate stability** - Test 2 iterations for convergence
### Don'ts
- **Don't overtune** - Avoid overfitting to test cases
- **Don't skip baselines** - Always measure Iteration 0
- **Don't ignore regressions** - Track quality across iterations
- **Don't add complexity** - Keep prompts concise
- **Don't stop too early** - Ensure 2-iteration stability
---
## Example: Explore Agent Evolution
**Baseline** (Iteration 0):
- Generic instructions
- No thoroughness guidance
- No time management
- Success: 60%
**Iteration 1**:
- Added thoroughness levels
- Added time-boxing
- Success: 87.5% (+45.8%)
**Iteration 2**:
- Added completeness checklist
- Refined search strategy
- Success: 90% (+2.5 points over Iteration 1, stable)
**Convergence**: 2 iterations, 87.5% → 90% stable
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Average Improvement**: +42% success rate over baseline

@@ -0,0 +1,386 @@
# Agent Prompt Metrics
**Version**: 1.0
**Purpose**: Quantitative metrics for measuring agent prompt quality
**Framework**: BAIME dual-layer value functions applied to agents
---
## Core Metrics
### 1. Success Rate
**Definition**: Percentage of tasks completed correctly
**Calculation**:
```
Success Rate = correct_completions / total_tasks
```
**Thresholds**:
- ≥90%: Excellent (production-ready)
- 80-89%: Good (minor refinements needed)
- 60-79%: Fair (needs improvement)
- <60%: Poor (major issues)
**Example**:
```
Tasks: 20
Correct: 17
Partial: 2
Failed: 1
Success Rate = 17/20 = 85% (Good)
```
---
### 2. Quality Score
**Definition**: Average quality rating of agent outputs (1-5 scale)
**Rating Criteria**:
- **5**: Perfect - Accurate, complete, well-structured
- **4**: Good - Minor gaps, mostly complete
- **3**: Fair - Acceptable but needs improvement
- **2**: Poor - Significant issues
- **1**: Failed - Incorrect or unusable
**Thresholds**:
- ≥4.5: Excellent
- 4.0-4.4: Good
- 3.5-3.9: Fair
- <3.5: Poor
**Example**:
```
Task 1: 5/5 (perfect)
Task 2: 4/5 (good)
Task 3: 5/5 (perfect)
...
Task 20: 4/5 (good)
Average: 4.35/5 (Good)
```
---
### 3. Efficiency
**Definition**: Time and token usage per task
**Metrics**:
```
Time Efficiency = avg_time_per_task
Token Efficiency = avg_tokens_per_task
```
**Thresholds** (vary by agent type):
- Explore agent: <3 min, <5k tokens
- Code generation: <5 min, <10k tokens
- Analysis: <10 min, <20k tokens
**Example**:
```
Tasks: 20
Total time: 56 min
Total tokens: 92k
Time Efficiency: 2.8 min/task ✅
Token Efficiency: 4.6k tokens/task ✅
```
---
### 4. Reliability
**Definition**: Consistency of agent performance
**Calculation**:
```
Reliability = 1 - (std_dev(success_rate) / mean(success_rate))
```
**Thresholds**:
- ≥0.90: Very reliable (consistent)
- 0.80-0.89: Reliable
- 0.70-0.79: Moderately reliable
- <0.70: Unreliable (erratic)
**Example**:
```
Batch 1: 85% success
Batch 2: 90% success
Batch 3: 87% success
Batch 4: 88% success
Mean: 87.5%
Std Dev: 2.08
Reliability: 1 - (2.08/87.5) = 0.976 (Very reliable)
```
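A bash/awk sketch of this calculation over per-batch success rates (sample standard deviation, matching the worked numbers above):
```bash
#!/bin/bash
# Reliability = 1 - (sample std dev / mean) over per-batch success rates
printf '%s\n' 85 90 87 88 | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        sd   = sqrt((sumsq - n * mean * mean) / (n - 1))   # sample standard deviation
        printf "mean=%.1f  sd=%.2f  reliability=%.3f\n", mean, sd, 1 - sd / mean
    }'
```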
---
## Composite Metrics
### V_instance (Agent Performance)
**Formula**:
```
V_instance = 0.40 × success_rate +
             0.30 × (quality_score / 5) +
             0.20 × efficiency_score +
             0.10 × reliability
Where:
- success_rate ∈ [0, 1]
- quality_score ∈ [1, 5], normalized to [0, 1]
- efficiency_score = min(target_time / actual_time, 1), so 1.0 when at or under the time budget
- reliability ∈ [0, 1]
```
**Target**: V_instance ≥ 0.80
**Example**:
```
Success Rate: 85% = 0.85
Quality Score: 4.2/5 = 0.84
Efficiency: min(3 min target / 2.8 min actual, 1) = min(1.07, 1) = 1.0 (under budget)
Reliability: 0.976
V_instance = 0.40 × 0.85 +
             0.30 × 0.84 +
             0.20 × 1.0 +
             0.10 × 0.976
           = 0.34 + 0.252 + 0.20 + 0.0976
           = 0.890 ✅ (exceeds target)
```
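A bash/bc sketch of the same computation, with the example's inputs hardcoded for illustration:
```bash
#!/bin/bash
# V_instance from the worked example above (weights as defined in the formula)
success_rate=0.85
quality=4.2          # 1-5 scale, normalized to [0,1] below
target_time=3.0      # Explore-agent time budget (min)
actual_time=2.8
reliability=0.976

# efficiency_score = min(target_time / actual_time, 1) → 1.0 when at or under budget
efficiency=$(echo "scale=4; e = $target_time / $actual_time; if (e > 1) e = 1; e" | bc)

v_instance=$(echo "scale=4; 0.40*$success_rate + 0.30*($quality/5) + 0.20*$efficiency + 0.10*$reliability" | bc)
echo "V_instance = $v_instance"   # .8896 ≈ 0.890
```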
---
### V_meta (Prompt Quality)
**Formula**:
```
V_meta = 0.35 × completeness +
         0.30 × clarity +
         0.20 × adaptability +
         0.15 × maintainability
Where:
- completeness = features_implemented / features_needed
- clarity = 1 - (ambiguous_instructions / total_instructions)
- adaptability = successful_task_types / tested_task_types
- maintainability = 1 - (prompt_complexity / max_complexity)
```
**Target**: V_meta ≥ 0.80
**Example**:
```
Completeness: 8/8 features = 1.0
Clarity: 1 - (2 ambiguous / 20 instructions) = 0.90
Adaptability: 5/6 task types = 0.83
Maintainability: 1 - (150 lines / 300 max) = 0.50
V_meta = 0.35 × 1.0 +
         0.30 × 0.90 +
         0.20 × 0.83 +
         0.15 × 0.50
       = 0.35 + 0.27 + 0.166 + 0.075
       = 0.861 ✅ (exceeds target)
```
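The same pattern applies to V_meta; a short sketch with the component values from this example:
```bash
#!/bin/bash
# V_meta from the worked example above (weights as defined in the formula)
completeness=1.0       # 8/8 features
clarity=0.90           # 1 - 2/20 ambiguous instructions
adaptability=0.83      # 5/6 task types
maintainability=0.50   # 1 - 150/300 lines

v_meta=$(echo "scale=4; 0.35*$completeness + 0.30*$clarity + 0.20*$adaptability + 0.15*$maintainability" | bc)
echo "V_meta = $v_meta"   # .8610 ≈ 0.861
```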
---
## Metric Collection
### Automated Collection
**Session Analysis**:
```bash
# Extract agent performance from session
query_tools --tool="Task" --scope=session | \
jq -r '.[] | select(.status == "success") | .duration' | \
awk '{sum+=$1; n++} END {print sum/n}'
```
**Example Script**:
```bash
#!/bin/bash
# scripts/measure-agent-metrics.sh
AGENT_NAME=$1
SESSION=$2
# Success rate
total=$(grep "agent=$AGENT_NAME" "$SESSION" | wc -l)
success=$(grep "agent=$AGENT_NAME.*success" "$SESSION" | wc -l)
success_rate=$(echo "scale=2; $success / $total" | bc)
# Average time
avg_time=$(grep "agent=$AGENT_NAME" "$SESSION" | \
jq -r '.duration' | \
awk '{sum+=$1; n++} END {print sum/n}')
# Quality (requires manual rating file)
avg_quality=$(cat "${SESSION}.ratings" | \
grep "$AGENT_NAME" | \
awk '{sum+=$2; n++} END {print sum/n}')
echo "Agent: $AGENT_NAME"
echo "Success Rate: $success_rate"
echo "Avg Time: ${avg_time}s"
echo "Avg Quality: $avg_quality/5"
```
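Assuming the session log is one JSON object per line with `agent=...` tags (as the greps above imply) and a companion `<session>.ratings` file of `agent score` pairs, an invocation might look like `./scripts/measure-agent-metrics.sh explore session.jsonl`; the file names here are placeholders, not fixed conventions.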
---
### Manual Collection
**Test Suite Template**:
```markdown
## Agent Test Suite: [Agent Name]
**Iteration**: [N]
**Date**: [YYYY-MM-DD]
### Test Cases
| ID | Task | Result | Quality | Time | Notes |
|----|------|--------|---------|------|-------|
| 1 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
| 2 | [Description] | ✅/❌ | [1-5] | [min] | [Issues] |
...
### Summary
- Success Rate: [X]% ([Y]/[Z])
- Avg Quality: [X.X]/5
- Avg Time: [X.X] min
- V_instance: [X.XX]
```
---
## Benchmarking
### Cross-Agent Comparison
**Standard Test Suite**: 20 representative tasks
**Example Results**:
```
| Agent | Success | Quality | Time | V_inst |
|-------------|---------|---------|-------|--------|
| Explore v1 | 60% | 3.1 | 4.2m | 0.62 |
| Explore v2 | 87.5% | 4.2 | 2.8m | 0.89 |
| Explore v3 | 90% | 4.3 | 2.6m | 0.91 |
```
**Improvement**: v1 → v3 = +30 points success (60% → 90%), +1.2 quality, 38% faster
---
### Baseline Comparison
**Industry Baselines** (approximate):
- Generic agent (no tuning): ~50-60% success
- Basic tuned agent: ~70-80% success
- Well-tuned agent: ~85-95% success
- Expert-tuned agent: ~95-98% success
---
## Regression Testing
### Track Metrics Over Time
**Regression Detection**:
```
if current_metric < (previous_metric - threshold):
alert("REGRESSION DETECTED")
```
**Thresholds**:
- Success Rate: -5% (e.g., 90% → 85%)
- Quality Score: -0.3 (e.g., 4.5 → 4.2)
- Efficiency: +20% time (e.g., 2.8 min → 3.4 min)
**Example**:
```
Iteration 3: 90% success, 4.3 quality, 2.6 min ✅
Iteration 4: 87% success, 4.1 quality, 2.8 min ⚠️ REGRESSION
Analysis: New constraint too restrictive
Action: Revert constraint, re-test
```
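A bash sketch of this check against all three thresholds; the input values are hypothetical, chosen so that only the success-rate check trips:
```bash
#!/bin/bash
# Regression check against the thresholds above (-5 points success, -0.3 quality, +20% time).
check() {  # usage: check <label> <bc expression that is 1 when a regression occurred>
    if [ "$(echo "$2" | bc -l)" -eq 1 ]; then
        echo "REGRESSION DETECTED: $1"
    fi
}

prev_success=0.90; cur_success=0.84   # -6 points → exceeds the 5-point threshold
prev_quality=4.3;  cur_quality=4.1    # -0.2 → within the 0.3 threshold
prev_time=2.6;     cur_time=2.8       # +7.7% → within the 20% threshold

check "success rate" "$cur_success < $prev_success - 0.05"
check "quality"      "$cur_quality < $prev_quality - 0.3"
check "time"         "$cur_time > $prev_time * 1.20"
```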
---
## Reporting Template
```markdown
## Agent Metrics Report
**Agent**: [Name]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Test Suite**: [Standard 20 | Custom N]
### Performance Metrics
**Success Rate**: [X]% ([Y]/[Z] tasks)
- Target: ≥85%
- Status: ✅/⚠️/❌
**Quality Score**: [X.X]/5
- Target: ≥4.0
- Status: ✅/⚠️/❌
**Efficiency**:
- Time: [X.X] min/task (target: [Y] min)
- Tokens: [X]k tokens/task (target: [Y]k)
- Status: ✅/⚠️/❌
**Reliability**: [X.XX]
- Target: ≥0.85
- Status: ✅/⚠️/❌
### Composite Scores
**V_instance**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
**V_meta**: [X.XX]
- Target: ≥0.80
- Status: ✅/⚠️/❌
### Comparison to Baseline
| Metric | Baseline | Current | Δ |
|---------------|----------|---------|--------|
| Success Rate | [X]% | [Y]% | [+/-]% |
| Quality | [X.X] | [Y.Y] | [+/-] |
| Time | [X.X]m | [Y.Y]m | [+/-]% |
| V_instance | [X.XX] | [Y.YY] | [+/-] |
### Recommendations
1. [Action item based on metrics]
2. [Action item based on metrics]
### Next Steps
- [ ] [Task for next iteration]
- [ ] [Task for next iteration]
```
---
**Source**: BAIME Agent Prompt Evolution Framework
**Status**: Production-ready, validated across 13 agent types
**Measurement Overhead**: ~5 min per 20-task test suite