# Agent Test Suite Template
**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type
---
## Test Suite: [Agent Name]
**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]
---
## Test Configuration
**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]
**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
  - Simple: [N] tasks
  - Medium: [N] tasks
  - Complex: [N] tasks
---
## Test Cases
### Task 1: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
### Task 2: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
[Repeat for each remaining task, up to the planned total (20 recommended)]
---
## Summary Statistics
### Overall Performance
**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)
Success Rate: [X]% ([successful] / [total])
```
**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```
**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```
**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])
Reliability Score: [X.XX]
```
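The summary blocks above can be filled in from per-task records rather than by hand. The following is a minimal sketch, assuming each task is recorded with the fields from the **Actual Result** section; the `TaskResult` class, its field names, and the `summarize` helper are hypothetical and not part of the template. Partial results are counted as non-successes, matching the `[successful] / [total]` definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One row of test data (hypothetical structure, not defined by the template)."""
    task_id: int
    complexity: str  # "Simple", "Medium", or "Complex"
    status: str      # "success", "partial", or "failed"
    quality: int     # 1-5 rating
    minutes: float
    tokens_k: float


def summarize(results):
    """Compute the Summary Statistics values from a list of TaskResult."""
    if not results:
        raise ValueError("at least one task result is required")
    total = len(results)
    counts = {
        status: sum(r.status == status for r in results)
        for status in ("success", "partial", "failed")
    }
    # Success rate per complexity level; levels with no tasks are omitted.
    by_complexity = {}
    for level in ("Simple", "Medium", "Complex"):
        group = [r for r in results if r.complexity == level]
        if group:
            by_complexity[level] = sum(r.status == "success" for r in group) / len(group)
    total_minutes = sum(r.minutes for r in results)
    total_tokens_k = sum(r.tokens_k for r in results)
    return {
        "total": total,
        "counts": counts,
        "success_rate": counts["success"] / total,
        "avg_quality": mean(r.quality for r in results),
        "total_minutes": total_minutes,
        "avg_minutes": total_minutes / total,
        "total_tokens_k": total_tokens_k,
        "avg_tokens_k": total_tokens_k / total,
        "success_by_complexity": by_complexity,
    }
```

The template does not define how the Reliability Score is derived, so this sketch stops at the per-complexity success rates and leaves that score to be entered separately.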
---
## Composite Metrics
### V_instance Calculation
```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: ([target] - [actual]) / [target] = [0.XX]
Reliability: [0.XX]
V_instance = 0.40 × [success_rate] +
             0.30 × [quality_normalized] +
             0.20 × [efficiency_score] +
             0.10 × [reliability]
           = [0.XX] + [0.XX] + [0.XX] + [0.XX]
           = [0.XX]
Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```
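The weighted sum above can be computed as in the sketch below. The weights and the 0.80 target come from the template; the function names, the clamping of the efficiency score to the range 0 to 1, and the choice to pass reliability in as a ready-made value are assumptions made for illustration.

```python
def v_instance(success_rate, avg_quality, target, actual, reliability):
    """Weighted composite score using the template's 0.40/0.30/0.20/0.10 weights.

    success_rate and reliability are fractions in [0, 1]; avg_quality is on the
    1-5 scale; target and actual are in the same unit (for example, minutes).
    """
    quality_normalized = avg_quality / 5
    efficiency_score = (target - actual) / target
    # Assumption: clamp so that overruns do not produce a negative contribution.
    efficiency_score = min(1.0, max(0.0, efficiency_score))
    return (
        0.40 * success_rate
        + 0.30 * quality_normalized
        + 0.20 * efficiency_score
        + 0.10 * reliability
    )


def meets_target(score, target_score=0.80):
    """True when the composite score reaches the template's target."""
    return score >= target_score
```

For example, a 90% success rate, an average quality of 4.5, an actual value of 3.0 against a target of 4.0, and a reliability of 0.8 give 0.36 + 0.27 + 0.05 + 0.08 = 0.76, which is below the 0.80 target.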
---
## Failure Analysis
### Failed Tasks
| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N] | [Brief] | [Why failed] | [Type] |
| [N] | [Brief] | [Why failed] | [Type] |
### Failure Patterns
**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
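The occurrence counts for each failure pattern can be tallied from the Failed Tasks table instead of counted manually. This is a small sketch using Python's standard `collections.Counter`; the `tally_patterns` helper and the sample rows are hypothetical placeholders, not real test data.

```python
from collections import Counter

# Hypothetical rows mirroring the Failed Tasks table: (task_id, pattern).
failed_tasks = [
    (3, "Incomplete search"),
    (7, "Wrong scope"),
    (12, "Incomplete search"),
]


def tally_patterns(rows):
    """Return (pattern, occurrences) pairs, most frequent first."""
    return Counter(pattern for _, pattern in rows).most_common()


pattern_counts = tally_patterns(failed_tasks)
```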
---
## Quality Issues
### Tasks with Quality < 4
| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N] | [1-3] | [Description] | [Actions] |
| [N] | [1-3] | [Description] | [Actions] |
---
## Efficiency Analysis
### Tasks Exceeding Time Budget
| Task ID | Actual Time | Target Time | Δ | Reason |
|---------|-------------|-------------|------|--------|
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
### Token Usage Analysis
```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```
---
## Recommendations
### Priority 1 (Critical)
1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% success rate
2. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% quality
### Priority 2 (Important)
1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
### Priority 3 (Nice to Have)
1. **[Improvement]**: [Description]
   - Benefit: [What improves]
   - Effort: [Low/Medium/High]
---
## Next Iteration Plan
### Focus Areas
1. **[Area 1]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]
2. **[Area 2]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]
### Prompt Changes
**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]
**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]
**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]
---
## Test Suite Evolution
### Version History
| Version | Date | Success | Quality | V_inst | Changes |
|---------|------|---------|---------|--------|---------|
| 0.1 | [D] | [X]% | [X.X] | [0.XX] | Baseline |
| 0.2 | [D] | [X]% | [X.X] | [0.XX] | [Changes] |
| [curr] | [D] | [X]% | [X.X] | [0.XX] | [Changes] |
### Convergence Tracking
```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current: V_instance = [0.XX] ([+/-]%)
Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```
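The convergence rule above (V_instance ≥ 0.80 for 2 consecutive iterations) can be checked mechanically. A minimal sketch follows, assuming the scores are kept in a list ordered from the baseline to the current iteration; the `has_converged` name and its parameters are hypothetical.

```python
def has_converged(history, threshold=0.80, required=2):
    """True when the last `required` V_instance scores all reach `threshold`.

    `history` is a list of scores ordered from iteration 0 to the current one.
    """
    if len(history) < required:
        return False
    return all(score >= threshold for score in history[-required:])
```

With this rule, a single score above the threshold is not enough: `has_converged([0.85])` is `False`, while `has_converged([0.62, 0.81, 0.84])` is `True`.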
---
## Appendix: Task Catalog
### Task Templates by Category
**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"
**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"
**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"
### Complexity Guidelines
**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation
**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation
**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation
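The guidelines above can also serve as per-task budgets when filling in the efficiency tables. The sketch below treats the upper end of each range as the budget, which is an assumption (the template does not say which end of the range is the target); the `BUDGETS` table and `budget_overrun` helper are hypothetical names.

```python
# Upper bounds taken from the Complexity Guidelines: (minutes, thousands of tokens).
BUDGETS = {
    "Simple": (2.0, 3.0),
    "Medium": (4.0, 7.0),
    "Complex": (6.0, 15.0),
}


def budget_overrun(complexity, minutes, tokens_k):
    """Return (extra_minutes, extra_tokens_k); each is 0.0 when within budget."""
    max_minutes, max_tokens_k = BUDGETS[complexity]
    return (
        max(0.0, minutes - max_minutes),
        max(0.0, tokens_k - max_tokens_k),
    )
```

A Medium task that took 5.5 minutes and 6k tokens would return `(1.5, 0.0)`: 1.5 minutes over the time budget and within the token budget.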
---
**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`