Initial commit

Zhongwei Li
2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions

# Agent Test Suite Template
**Purpose**: Standardized test suite for agent prompt validation
**Usage**: Copy and customize for your agent type

---
## Test Suite: [Agent Name]
**Agent Type**: [Explore/Code-Gen/Analysis/etc.]
**Version**: [X.Y]
**Test Date**: [YYYY-MM-DD]
**Tester**: [Name]

---
## Test Configuration
**Test Environment**:
- Claude Code Version: [version]
- Model: [model-id]
- Session ID: [session-id]
**Test Parameters**:
- Number of tasks: [20 recommended]
- Task diversity: [Low/Medium/High]
- Complexity distribution:
  - Simple: [N] tasks
  - Medium: [N] tasks
  - Complex: [N] tasks
---
## Test Cases
### Task 1: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
### Task 2: [Brief Description]
**Type**: [Simple/Medium/Complex]
**Category**: [Search/Analysis/Generation/etc.]
**Input**:
```
[Exact prompt or command given to agent]
```
**Expected Outcome**:
```
[What a successful completion looks like]
```
**Actual Result**:
- Status: ✅ Success / ⚠️ Partial / ❌ Failed
- Quality Rating: [1-5]
- Time: [X.X] min
- Tokens: [X]k
**Notes**:
```
[Any observations, issues, or improvements identified]
```
---
[Repeat the block above for each remaining task]

---
## Summary Statistics
### Overall Performance
**Success Rate**:
```
Total Tasks: [N]
Successful: [N] (✅)
Partial: [N] (⚠️)
Failed: [N] (❌)
Success Rate: [X]% ([successful] / [total])
```
**Quality Score**:
```
Task Quality Ratings: [4, 5, 3, 4, 5, ...]
Average Quality: [X.X] / 5
```
**Efficiency**:
```
Total Time: [X.X] min
Average Time: [X.X] min/task
Total Tokens: [X]k
Average Tokens: [X.X]k/task
```
**Reliability**:
```
Success by Complexity:
- Simple: [X]% ([Y]/[Z])
- Medium: [X]% ([Y]/[Z])
- Complex: [X]% ([Y]/[Z])
Reliability Score: [X.XX]
```
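The figures above can be derived from the per-task records rather than tallied by hand. The following is a minimal sketch, assuming each filled-in test case is stored as a `TaskResult` record; that class and its field names are hypothetical and not part of the template. Only ✅ results count toward the success rate, matching `[successful] / [total]`. The Reliability Score is not computed because the template does not define its formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One filled-in test case (hypothetical structure for illustration)."""
    task_id: int
    complexity: str  # "Simple", "Medium", or "Complex"
    status: str      # "success", "partial", or "failed"
    quality: int     # 1-5 rating
    minutes: float
    tokens_k: float


def summarize(results):
    """Compute the Summary Statistics fields from a list of TaskResult."""
    if not results:
        raise ValueError("at least one task result is required")

    # Success rate per complexity level, skipping levels with no tasks.
    by_complexity = {}
    for level in ("Simple", "Medium", "Complex"):
        group = [r for r in results if r.complexity == level]
        if group:
            passed = sum(r.status == "success" for r in group)
            by_complexity[level] = passed / len(group)

    return {
        "success_rate": sum(r.status == "success" for r in results) / len(results),
        "avg_quality": mean(r.quality for r in results),
        "avg_minutes": mean(r.minutes for r in results),
        "avg_tokens_k": mean(r.tokens_k for r in results),
        "success_by_complexity": by_complexity,
    }
```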
---
## Composite Metrics
### V_instance Calculation
```
Success Rate: [X]% = [0.XX]
Quality Score: [X.X]/5 = [0.XX]
Efficiency Score: [target - actual] / target = [0.XX]
Reliability: [0.XX]
V_instance = 0.40 × [success_rate] +
             0.30 × [quality_normalized] +
             0.20 × [efficiency_score] +
             0.10 × [reliability]
           = [0.XX] + [0.XX] + [0.XX] + [0.XX]
           = [0.XX]
Target: ≥ 0.80
Status: ✅ / ⚠️ / ❌
```
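The weighted sum above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function and parameter names are illustrative, "target" and "actual" are taken to be average minutes per task, and the efficiency term is used exactly as written, so it goes negative when the actual time exceeds the target (the template does not say whether to clamp it).

```python
TARGET_V_INSTANCE = 0.80  # threshold stated in the template


def v_instance(success_rate, avg_quality, target_minutes, actual_minutes, reliability):
    """Weighted composite score using the template's 0.40/0.30/0.20/0.10 weights."""
    quality_normalized = avg_quality / 5  # 1-5 rating scaled to a 0-1 fraction
    # Efficiency as written in the template: (target - actual) / target.
    efficiency_score = (target_minutes - actual_minutes) / target_minutes
    return (
        0.40 * success_rate
        + 0.30 * quality_normalized
        + 0.20 * efficiency_score
        + 0.10 * reliability
    )


# Illustrative inputs only, not measured results.
score = v_instance(
    success_rate=0.85,
    avg_quality=4.2,
    target_minutes=4.0,
    actual_minutes=3.0,
    reliability=0.90,
)
meets_target = score >= TARGET_V_INSTANCE
```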
---
## Failure Analysis
### Failed Tasks
| Task ID | Description | Failure Reason | Pattern |
|---------|-------------|----------------|---------|
| [N] | [Brief] | [Why failed] | [Type] |
| [N] | [Brief] | [Why failed] | [Type] |
### Failure Patterns
**Pattern 1: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
**Pattern 2: [Name]** ([N] occurrences)
- Description: [What went wrong]
- Root Cause: [Why it happened]
- Proposed Fix: [How to address]
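
The occurrence counts for each pattern can be tallied from the Failed Tasks table. A minimal sketch, assuming the table rows are available as `(task_id, pattern)` pairs; the helper name and the pattern labels in the example are hypothetical.

```python
from collections import Counter


def count_patterns(failed_tasks):
    """Return (pattern, occurrences) pairs, most frequent first."""
    return Counter(pattern for _task_id, pattern in failed_tasks).most_common()


# Illustrative rows only: (Task ID, Pattern)
failed = [(3, "timeout"), (8, "wrong scope"), (12, "timeout")]
patterns = count_patterns(failed)
```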
---
## Quality Issues
### Tasks with Quality < 4
| Task ID | Quality | Issues Identified | Improvements Needed |
|---------|---------|-------------------|---------------------|
| [N] | [1-3] | [Description] | [Actions] |
| [N] | [1-3] | [Description] | [Actions] |
---
## Efficiency Analysis
### Tasks Exceeding Time Budget
| Task ID | Actual Time | Target Time | Δ | Reason |
|---------|-------------|-------------|------|--------|
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
| [N] | [X.X] min | [Y] min | [+Z] | [Why] |
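
Rows for the table above can be generated from recorded timings. The sketch below assumes each task is given as a `(task_id, actual_minutes, target_minutes)` tuple; the helper name is hypothetical, and the Reason column is left as a placeholder to fill in by hand.

```python
def over_budget_rows(tasks):
    """Format a Markdown table row for every task whose actual time exceeds its target."""
    rows = []
    for task_id, actual, target in tasks:
        delta = actual - target
        if delta > 0:
            rows.append(
                f"| {task_id} | {actual:.1f} min | {target:.1f} min | +{delta:.1f} | [Why] |"
            )
    return rows


# Illustrative timings only.
rows = over_budget_rows([(3, 5.5, 4.0), (7, 1.0, 2.0)])
```

Tasks that finish at or under their target produce no row, so an empty list means the whole suite stayed within budget.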
### Token Usage Analysis
```
Tokens per task: [min-max] range
High-usage tasks: [list]
Optimization opportunities: [suggestions]
```
---
## Recommendations
### Priority 1 (Critical)
1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% success rate
2. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
   - Expected Improvement: [X]% quality
### Priority 2 (Important)
1. **[Issue]**: [Description]
   - Impact: [High/Medium/Low]
   - Frequency: [X] occurrences
   - Proposed Fix: [Action]
### Priority 3 (Nice to Have)
1. **[Improvement]**: [Description]
   - Benefit: [What improves]
   - Effort: [Low/Medium/High]
---
## Next Iteration Plan
### Focus Areas
1. **[Area 1]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]
2. **[Area 2]**: [Why focus here]
   - Baseline: [Current metric]
   - Target: [Goal metric]
   - Approach: [How to improve]
### Prompt Changes
**Planned Additions**:
- [ ] [Guideline/instruction to add]
- [ ] [Constraint to add]
- [ ] [Example to add]
**Planned Clarifications**:
- [ ] [Instruction to clarify]
- [ ] [Constraint to adjust]
**Planned Removals**:
- [ ] [Unnecessary instruction]
- [ ] [Redundant constraint]
---
## Test Suite Evolution
### Version History
| Version | Date | Success | Quality | V_inst | Changes   |
|---------|------|---------|---------|--------|-----------|
| 0.1     | [D]  | [X]%    | [X.X]   | [0.XX] | Baseline  |
| 0.2     | [D]  | [X]%    | [X.X]   | [0.XX] | [Changes] |
| [curr]  | [D]  | [X]%    | [X.X]   | [0.XX] | [Changes] |
### Convergence Tracking
```
Iteration 0: V_instance = [0.XX] (baseline)
Iteration 1: V_instance = [0.XX] ([+/-]%)
Iteration 2: V_instance = [0.XX] ([+/-]%)
Current: V_instance = [0.XX] ([+/-]%)
Converged: ✅ / ❌
(Requires V_instance ≥ 0.80 for 2 consecutive iterations)
```
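The convergence rule can be expressed as a small check over the recorded scores. This sketch assumes "2 consecutive iterations" means the two most recent entries in the history; the function name and parameters are illustrative.

```python
def has_converged(history, threshold=0.80, required=2):
    """True when the last `required` V_instance values all meet the threshold."""
    if len(history) < required:
        return False
    return all(v >= threshold for v in history[-required:])


# Illustrative score histories only, ordered from Iteration 0 onward.
converged = has_converged([0.62, 0.75, 0.81, 0.84])      # last two are >= 0.80
not_yet = has_converged([0.62, 0.81, 0.78])              # latest iteration fell below
```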
---
## Appendix: Task Catalog
### Task Templates by Category
**Search Tasks**:
- "Find all [pattern] in [scope]"
- "Locate [functionality] implementation"
- "Show [architecture aspect]"
**Analysis Tasks**:
- "Explain how [feature] works"
- "Identify [issue type] in [code]"
- "Compare [approach A] vs [approach B]"
**Generation Tasks**:
- "Create [artifact type] for [purpose]"
- "Generate [code/docs] following [pattern]"
- "Refactor [code] to [goal]"
### Complexity Guidelines
**Simple** (1-2 min, 1-3k tokens):
- Single-file search
- Direct lookup
- Straightforward generation
**Medium** (2-4 min, 3-7k tokens):
- Multi-file search
- Pattern analysis
- Moderate generation
**Complex** (4-6 min, 7-15k tokens):
- Cross-codebase search
- Deep analysis
- Complex generation
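
The guideline ranges above can be kept in a lookup table and used to check a finished task against its budget. A minimal sketch: the names `BUDGETS` and `within_budget` are hypothetical, and only the upper bound of each range is enforced, on the assumption that finishing faster or with fewer tokens than the range is not a problem.

```python
# (low, high) ranges copied from the Complexity Guidelines.
BUDGETS = {
    "Simple": {"minutes": (1, 2), "tokens_k": (1, 3)},
    "Medium": {"minutes": (2, 4), "tokens_k": (3, 7)},
    "Complex": {"minutes": (4, 6), "tokens_k": (7, 15)},
}


def within_budget(complexity, minutes, tokens_k):
    """True when both time and token usage are at or below the range's upper bound."""
    limits = BUDGETS[complexity]
    return minutes <= limits["minutes"][1] and tokens_k <= limits["tokens_k"][1]


# Illustrative measurements only.
simple_ok = within_budget("Simple", 1.5, 2.5)
medium_over = within_budget("Medium", 4.5, 5.0)  # 4.5 min exceeds the 4 min bound
```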
---
**Template Version**: 1.0
**Source**: BAIME Agent Prompt Evolution
**Usage**: Copy to `agent-test-suite-[name]-[version].md`