Initial commit
skills/agent-prompt-evolution/templates/test-suite-template.md
# Agent Test Suite Template

**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type

---

## Test Suite: [Agent Name]

**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]

---

## Test Configuration

**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]

**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
  - Simple: [N] tasks
  - Medium: [N] tasks
  - Complex: [N] tasks

---

## Test Cases

### Task 1: [Brief Description]

**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]

**Input**:
```
[Exact prompt or command given to agent]
```

**Expected Outcome**:
```
[What a successful completion looks like]
```

**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k

**Notes**:
```
[Any observations, issues, or improvements identified]
```

---

### Task 2: [Brief Description]

**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]

**Input**:
```
[Exact prompt or command given to agent]
```

**Expected Outcome**:
```
[What a successful completion looks like]
```

**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k

**Notes**:
```
[Any observations, issues, or improvements identified]
```

---

[Repeat for all 20 tasks]

---

## Summary Statistics

### Overall Performance

**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)

Success Rate: [X]% ([successful] / [total])
```

**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```

**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```

**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])

Reliability Score: [X.XX]
```
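
The summary fields above can be filled in mechanically once each task's result is recorded. The following is a minimal sketch, assuming a hypothetical `TaskResult` record and `summarize` helper (neither is part of the template). It counts only ✅ results as successful, matching the `[successful] / [total]` ratio above, and it deliberately leaves the Reliability Score out because the template does not define how that score is derived. It expects at least one recorded task.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One row of the Test Cases section (hypothetical structure)."""
    status: str      # "success", "partial", or "failed"
    quality: int     # Quality Rating, 1-5
    minutes: float   # Time in minutes
    tokens_k: float  # Tokens in thousands
    complexity: str  # "simple", "medium", or "complex"


def summarize(results):
    """Compute the Summary Statistics fields from recorded task results."""
    total = len(results)
    successful = sum(r.status == "success" for r in results)

    # Success rate per complexity level, skipping levels with no tasks.
    by_complexity = {}
    for level in ("simple", "medium", "complex"):
        subset = [r for r in results if r.complexity == level]
        if subset:
            passed = sum(r.status == "success" for r in subset)
            by_complexity[level] = passed / len(subset)

    return {
        "success_rate": successful / total,
        "average_quality": mean(r.quality for r in results),
        "average_minutes": mean(r.minutes for r in results),
        "average_tokens_k": mean(r.tokens_k for r in results),
        "success_by_complexity": by_complexity,
    }
```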

---

## Composite Metrics

### V_instance Calculation

```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: [target - actual] / target = [0.XX]
Reliability: [0.XX]

V_instance = 0.40 × [success_rate] +
             0.30 × [quality_normalized] +
             0.20 × [efficiency_score] +
             0.10 × [reliability]

           = [0.XX] + [0.XX] + [0.XX] + [0.XX]
           = [0.XX]

Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```
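
As a worked illustration of the weighted sum above, here is a minimal sketch, assuming a hypothetical helper named `v_instance` and made-up input values. It applies the formulas exactly as written in the template, so the efficiency term is 0 when actual time equals the target and negative when the target is exceeded; whether to clamp that term is left open here because the template does not say. The reliability value is passed in as-is.

```python
def v_instance(success_rate, avg_quality, target_minutes, actual_minutes, reliability):
    """Combine the four component scores using the template's weights."""
    quality_normalized = avg_quality / 5  # 1-5 rating scaled to 0-1
    efficiency_score = (target_minutes - actual_minutes) / target_minutes
    return (
        0.40 * success_rate
        + 0.30 * quality_normalized
        + 0.20 * efficiency_score
        + 0.10 * reliability
    )


# Hypothetical inputs, for illustration only:
# 0.40 * 0.90 + 0.30 * 0.90 + 0.20 * 0.25 + 0.10 * 0.85
score = v_instance(
    success_rate=0.90,
    avg_quality=4.5,
    target_minutes=4.0,
    actual_minutes=3.0,
    reliability=0.85,
)
meets_target = score >= 0.80  # Target from the template
print(f"V_instance = {score:.3f}, meets target: {meets_target}")
```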

---

## Failure Analysis

### Failed Tasks

| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N]     | [Brief]     | [Why failed]   | [Type]  |
| [N]     | [Brief]     | [Why failed]   | [Type]  |

### Failure Patterns

**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]

**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
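
The occurrence counts for each pattern can be tallied from the Failed Tasks table rather than counted by hand. The following is a minimal sketch, assuming each failed task is recorded as a dictionary with a `pattern` key; the helper name `count_patterns` and the pattern labels in the comment are hypothetical.

```python
from collections import Counter


def count_patterns(failed_tasks):
    """Return (pattern, occurrences) pairs, most frequent first."""
    counts = Counter(task["pattern"] for task in failed_tasks)
    return counts.most_common()


# Example with made-up labels:
# count_patterns([{"task_id": 3, "pattern": "scope-drift"},
#                 {"task_id": 7, "pattern": "timeout"},
#                 {"task_id": 12, "pattern": "scope-drift"}])
```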

---

## Quality Issues

### Tasks with Quality < 4

| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N]     | [1-3]   | [Description]     | [Actions]           |
| [N]     | [1-3]   | [Description]     | [Actions]           |

---

## Efficiency Analysis

### Tasks Exceeding Time Budget

| Task ID | Actual Time | Target Time | Δ    | Reason |
|---------|-------------|-------------|------|--------|
| [N]     | [X.X] min   | [Y] min     | [+Z] | [Why]  |
| [N]     | [X.X] min   | [Y] min     | [+Z] | [Why]  |

### Token Usage Analysis

```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```

---

## Recommendations

### Priority 1 (Critical)

1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% success rate

2. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% quality

### Priority 2 (Important)

1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]

### Priority 3 (Nice to Have)

1. **[Improvement]**: [Description]
   - Benefit: [What improves]
   - Effort: [Low/Medium/High]

---

## Next Iteration Plan

### Focus Areas

1. **[Area 1]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]

2. **[Area 2]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]

### Prompt Changes

**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]

**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]

**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]

---

## Test Suite Evolution

### Version History

| Version | Date | Success | Quality | V_inst | Changes   |
|---------|------|---------|---------|--------|-----------|
| 0.1     | [D]  | [X]%    | [X.X]   | [0.XX] | Baseline  |
| 0.2     | [D]  | [X]%    | [X.X]   | [0.XX] | [Changes] |
| [curr]  | [D]  | [X]%    | [X.X]   | [0.XX] | [Changes] |

### Convergence Tracking

```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current: V_instance = [0.XX] ([+/-]%)

Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```
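
The convergence rule above can be checked directly against the recorded V_instance history. The following is a minimal sketch, assuming a hypothetical `has_converged` helper and reading "2 consecutive iterations" as the two most recent ones; the history values in the comments are made up for illustration.

```python
def has_converged(history, threshold=0.80, window=2):
    """True if the last `window` V_instance values all meet the threshold."""
    if len(history) < window:
        return False  # not enough iterations recorded yet
    return all(value >= threshold for value in history[-window:])


# Made-up histories, oldest iteration first:
# has_converged([0.62, 0.78, 0.81, 0.83]) is True
# has_converged([0.62, 0.81, 0.79]) is False
```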

---

## Appendix: Task Catalog

### Task Templates by Category

**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"

**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"

**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"

### Complexity Guidelines

**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation

**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation

**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation
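
The time and token ranges above can also serve as per-task budgets when filling in the efficiency tables. The following is a minimal sketch, assuming the upper end of each range is the budget (the template does not state this) and using a hypothetical `over_budget` helper that reports how far a task ran past its budget.

```python
# (low, high) ranges copied from the guidelines above.
BUDGETS = {
    "simple": {"minutes": (1, 2), "tokens_k": (1, 3)},
    "medium": {"minutes": (2, 4), "tokens_k": (3, 7)},
    "complex": {"minutes": (4, 6), "tokens_k": (7, 15)},
}


def over_budget(complexity, minutes, tokens_k):
    """Return the overrun per dimension; 0.0 means within budget."""
    budget = BUDGETS[complexity]
    # Assumption: the upper bound of each range is the budget.
    return {
        "minutes": max(0.0, minutes - budget["minutes"][1]),
        "tokens_k": max(0.0, tokens_k - budget["tokens_k"][1]),
    }
```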

---

**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`