614 lines
14 KiB
Markdown
614 lines
14 KiB
Markdown
# Activation Testing Guide
|
|
|
|
**Version:** 1.0
|
|
**Purpose:** Comprehensive guide for testing skill activation reliability
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
This guide provides procedures, templates, and checklists for testing the 3-Layer Activation System to ensure skills activate correctly and reliably.
|
|
|
|
### Testing Philosophy
|
|
|
|
**Goal:** 95%+ activation reliability
|
|
|
|
**Approach:** Test each layer independently, then integration
|
|
|
|
**Metrics:**
|
|
- **True Positives:** Valid queries that correctly activate
|
|
- **True Negatives:** Invalid queries that correctly don't activate
|
|
- **False Positives:** Invalid queries that incorrectly activate
|
|
- **False Negatives:** Valid queries that fail to activate
|
|
|
|
**Target:** Zero false positives, <5% false negatives
|
|
|
|
---
|
|
|
|
## 🧪 Testing Methodology
|
|
|
|
### Phase 1: Layer 1 Testing (Keywords)
|
|
|
|
#### Objective
|
|
Verify that exact keyword phrases activate the skill.
|
|
|
|
#### Procedure
|
|
|
|
**Step 1:** List all keywords from marketplace.json
|
|
|
|
**Step 2:** Create test query for each keyword
|
|
|
|
**Step 3:** Test each query manually
|
|
|
|
**Step 4:** Document results
|
|
|
|
#### Template
|
|
|
|
```markdown
|
|
## Layer 1: Keywords Testing
|
|
|
|
**Keyword 1:** "create an agent for"
|
|
|
|
Test Queries:
|
|
1. "create an agent for processing invoices"
|
|
- ✅ Activated
|
|
- Via: Keyword match
|
|
|
|
2. "I want to create an agent for data analysis"
|
|
- ✅ Activated
|
|
- Via: Keyword match
|
|
|
|
3. "Create An Agent For automation" // Case variation
|
|
- ✅ Activated
|
|
- Via: Keyword match (case-insensitive)
|
|
|
|
**Keyword 2:** "automate workflow"
|
|
...
|
|
```
|
|
|
|
#### Pass Criteria
|
|
- [ ] 100% of keyword test queries activate
|
|
- [ ] Case-insensitive matching works
|
|
- [ ] Embedded keywords activate (keyword within longer query)
|
|
|
|
---
|
|
|
|
### Phase 2: Layer 2 Testing (Patterns)
|
|
|
|
#### Objective
|
|
Verify that regex patterns capture expected variations.
|
|
|
|
#### Procedure
|
|
|
|
**Step 1:** List all patterns from marketplace.json
|
|
|
|
**Step 2:** Create 5+ test queries per pattern
|
|
|
|
**Step 3:** Test pattern matching (can use regex tester)
|
|
|
|
**Step 4:** Test in Claude Code
|
|
|
|
**Step 5:** Document results
|
|
|
|
#### Template
|
|
|
|
```markdown
|
|
## Layer 2: Patterns Testing
|
|
|
|
**Pattern 1:** `(?i)(create|build)\s+(an?\s+)?agent\s+for`
|
|
|
|
Designed to Match:
|
|
- Verbs: create, build
|
|
- Optional article: a, an
|
|
- Entity: agent
|
|
- Connector: for
|
|
|
|
Test Queries:
|
|
1. "create an agent for automation"
|
|
- ✅ Matches pattern
|
|
- ✅ Activated in Claude Code
|
|
|
|
2. "build a agent for processing"
|
|
- ✅ Matches pattern
|
|
- ✅ Activated
|
|
|
|
3. "create agent for data" // No article
|
|
- ✅ Matches pattern
|
|
- ✅ Activated
|
|
|
|
4. "Build Agent For Tasks" // Different case
|
|
- ✅ Matches pattern
|
|
- ✅ Activated
|
|
|
|
5. "I want to create an agent for reporting" // Embedded
|
|
- ✅ Matches pattern
|
|
- ✅ Activated
|
|
|
|
Should NOT Match:
|
|
6. "agent creation guide"
|
|
- ❌ No action verb
|
|
- ❌ Correctly did not activate
|
|
|
|
7. "create something for automation"
|
|
- ❌ No "agent" keyword
|
|
- ❌ Correctly did not activate
|
|
```
|
|
|
|
#### Pass Criteria
|
|
- [ ] 100% of positive test queries match pattern
|
|
- [ ] 100% of positive queries activate in Claude Code
|
|
- [ ] 0% of negative queries match pattern
|
|
- [ ] Pattern is flexible (captures variations)
|
|
- [ ] Pattern is specific (no false positives)
|
|
|
|
---
|
|
|
|
### Phase 3: Layer 3 Testing (Description + NLU)
|
|
|
|
#### Objective
|
|
Verify that description helps Claude understand intent for edge cases.
|
|
|
|
#### Procedure
|
|
|
|
**Step 1:** Create queries that DON'T match keywords/patterns
|
|
|
|
**Step 2:** Verify these still activate via description understanding
|
|
|
|
**Step 3:** Document which queries activate
|
|
|
|
#### Template
|
|
|
|
```markdown
|
|
## Layer 3: Description + NLU Testing
|
|
|
|
**Queries that don't match Keywords or Patterns:**
|
|
|
|
1. "I keep doing this task manually, can you help automate it?"
|
|
- ❌ No keyword match
|
|
- ❌ No pattern match
|
|
- ✅ Should activate via description understanding
|
|
- Result: {activated/did not activate}
|
|
|
|
2. "This process is repetitive and takes hours daily"
|
|
- ❌ No keyword match
|
|
- ❌ No pattern match
|
|
- ✅ Should activate (describes repetitive workflow)
|
|
- Result: {activated/did not activate}
|
|
|
|
3. "Help me build something to handle this workflow"
|
|
- ❌ No exact keyword
|
|
- ⚠️ Might match pattern
|
|
- ✅ Should activate
|
|
- Result: {activated/did not activate}
|
|
```
|
|
|
|
#### Pass Criteria
|
|
- [ ] Edge case queries activate when appropriate
|
|
- [ ] Natural language variations work
|
|
- [ ] Description provides fallback coverage
|
|
|
|
---
|
|
|
|
### Phase 4: Integration Testing
|
|
|
|
#### Objective
|
|
Test complete system with real-world query variations.
|
|
|
|
#### Procedure
|
|
|
|
**Step 1:** Create 10+ realistic query variations per capability
|
|
|
|
**Step 2:** Test all queries in actual Claude Code environment
|
|
|
|
**Step 3:** Track activation success rate
|
|
|
|
**Step 4:** Identify gaps
|
|
|
|
#### Template
|
|
|
|
```markdown
|
|
## Integration Testing
|
|
|
|
**Capability:** Agent Creation
|
|
|
|
**Test Queries:**
|
|
|
|
| # | Query | Expected | Actual | Layer | Status |
|
|
|---|-------|----------|--------|-------|--------|
|
|
| 1 | "create an agent for PDFs" | Activate | Activated | Keyword | ✅ |
|
|
| 2 | "build automation for emails" | Activate | Activated | Pattern | ✅ |
|
|
| 3 | "daily I process invoices manually" | Activate | Activated | Desc | ✅ |
|
|
| 4 | "make agent for data entry" | Activate | Activated | Pattern | ✅ |
|
|
| 5 | "automate my workflow for reports" | Activate | Activated | Keyword | ✅ |
|
|
| 6 | "I need help with automation" | Activate | NOT activated | - | ❌ |
|
|
| 7 | "turn this into automated process" | Activate | Activated | Pattern | ✅ |
|
|
| 8 | "create skill for stock analysis" | Activate | Activated | Keyword | ✅ |
|
|
| 9 | "repeatedly doing this task" | Activate | Activated | Desc | ✅ |
|
|
| 10 | "can you help automate this?" | Activate | Activated | Desc | ✅ |
|
|
|
|
**Results:**
|
|
- Total queries: 10
|
|
- Activated correctly: 9
|
|
- Failed to activate: 1 (Query #6)
|
|
- Success rate: 90%
|
|
|
|
**Issues:**
|
|
- Query #6 too generic, needs more specific keywords
|
|
```
|
|
|
|
#### Pass Criteria
|
|
- [ ] 95%+ success rate
|
|
- [ ] All capability variations covered
|
|
- [ ] Realistic query phrasings tested
|
|
- [ ] Edge cases documented
|
|
|
|
---
|
|
|
|
### Phase 5: Negative Testing (False Positives)
|
|
|
|
#### Objective
|
|
Ensure skill does NOT activate for out-of-scope queries.
|
|
|
|
#### Procedure
|
|
|
|
**Step 1:** List out-of-scope use cases (when_not_to_use)
|
|
|
|
**Step 2:** Create queries for each
|
|
|
|
**Step 3:** Verify skill does NOT activate
|
|
|
|
**Step 4:** Document any false positives
|
|
|
|
#### Template
|
|
|
|
```markdown
|
|
## Negative Testing
|
|
|
|
**Out of Scope:** General programming questions
|
|
|
|
Test Queries (Should NOT Activate):
|
|
1. "How do I write a for loop in Python?"
|
|
- Result: Did not activate ✅
|
|
|
|
2. "What's the difference between list and tuple?"
|
|
- Result: Did not activate ✅
|
|
|
|
3. "Help me debug this code"
|
|
- Result: Did not activate ✅
|
|
|
|
**Out of Scope:** Using existing skills
|
|
|
|
Test Queries (Should NOT Activate):
|
|
4. "Run the invoice processor skill"
|
|
- Result: Did not activate ✅
|
|
|
|
5. "Show me existing agents"
|
|
- Result: Did not activate ✅
|
|
|
|
**Results:**
|
|
- Total negative queries: 5
|
|
- Correctly did not activate: 5
|
|
- False positives: 0
|
|
- Success rate: 100%
|
|
```
|
|
|
|
#### Pass Criteria
|
|
- [ ] 100% of out-of-scope queries do NOT activate
|
|
- [ ] Zero false positives
|
|
- [ ] when_not_to_use cases covered
|
|
|
|
---
|
|
|
|
## 📋 Complete Testing Checklist
|
|
|
|
### Pre-Testing Setup
|
|
- [ ] marketplace.json has activation section
|
|
- [ ] Keywords defined (10-15)
|
|
- [ ] Patterns defined (5-7)
|
|
- [ ] Description includes keywords
|
|
- [ ] when_to_use / when_not_to_use defined
|
|
- [ ] test_queries array populated
|
|
|
|
### Layer 1: Keywords
|
|
- [ ] All keywords tested individually
|
|
- [ ] Case-insensitive matching verified
|
|
- [ ] Embedded keywords work
|
|
- [ ] 100% activation rate
|
|
|
|
### Layer 2: Patterns
|
|
- [ ] Each pattern tested with 5+ queries
|
|
- [ ] Pattern matches verified (regex tester)
|
|
- [ ] Claude Code activation verified
|
|
- [ ] No false positives
|
|
- [ ] Flexible enough for variations
|
|
|
|
### Layer 3: Description
|
|
- [ ] Edge cases tested
|
|
- [ ] Natural language variations work
|
|
- [ ] Fallback coverage confirmed
|
|
|
|
### Integration
|
|
- [ ] 10+ realistic queries per capability tested
|
|
- [ ] 95%+ success rate achieved
|
|
- [ ] All capabilities covered
|
|
- [ ] Results documented
|
|
|
|
### Negative Testing
|
|
- [ ] Out-of-scope queries tested
|
|
- [ ] Zero false positives
|
|
- [ ] when_not_to_use cases verified
|
|
|
|
### Documentation
|
|
- [ ] Test results documented
|
|
- [ ] Issues logged
|
|
- [ ] Recommendations made
|
|
- [ ] marketplace.json updated if needed
|
|
|
|
---
|
|
|
|
## 📊 Test Report Template
|
|
|
|
```markdown
|
|
# Activation Test Report
|
|
|
|
**Skill Name:** {skill-name}
|
|
**Version:** {version}
|
|
**Test Date:** {date}
|
|
**Tested By:** {name}
|
|
**Environment:** Claude Code {version}
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
- **Overall Success Rate:** {X}%
|
|
- **Total Queries Tested:** {N}
|
|
- **True Positives:** {N}
|
|
- **True Negatives:** {N}
|
|
- **False Positives:** {N}
|
|
- **False Negatives:** {N}
|
|
|
|
---
|
|
|
|
## Layer 1: Keywords Testing
|
|
|
|
**Keywords Tested:** {count}
|
|
**Success Rate:** {X}%
|
|
|
|
### Results
|
|
| Keyword | Test Queries | Passed | Failed |
|
|
|---------|--------------|--------|--------|
|
|
| {keyword-1} | {N} | {N} | {N} |
|
|
| {keyword-2} | {N} | {N} | {N} |
|
|
|
|
**Issues:**
|
|
- {issue-1}
|
|
- {issue-2}
|
|
|
|
---
|
|
|
|
## Layer 2: Patterns Testing
|
|
|
|
**Patterns Tested:** {count}
|
|
**Success Rate:** {X}%
|
|
|
|
### Results
|
|
| Pattern | Test Queries | Passed | Failed |
|
|
|---------|--------------|--------|--------|
|
|
| {pattern-1} | {N} | {N} | {N} |
|
|
| {pattern-2} | {N} | {N} | {N} |
|
|
|
|
**Issues:**
|
|
- {issue-1}
|
|
- {issue-2}
|
|
|
|
---
|
|
|
|
## Layer 3: Description Testing
|
|
|
|
**Edge Cases Tested:** {count}
|
|
**Success Rate:** {X}%
|
|
|
|
**Results:**
|
|
- Activated via description: {N}
|
|
- Failed to activate: {N}
|
|
|
|
---
|
|
|
|
## Integration Testing
|
|
|
|
**Total Test Queries:** {count}
|
|
**Success Rate:** {X}%
|
|
|
|
**Breakdown by Capability:**
|
|
| Capability | Queries | Success | Rate |
|
|
|------------|---------|---------|------|
|
|
| {cap-1} | {N} | {N} | {X}% |
|
|
| {cap-2} | {N} | {N} | {X}% |
|
|
|
|
---
|
|
|
|
## Negative Testing
|
|
|
|
**Out-of-Scope Queries:** {count}
|
|
**False Positives:** {N}
|
|
**Success Rate:** {X}%
|
|
|
|
---
|
|
|
|
## Issues & Recommendations
|
|
|
|
### Critical Issues
|
|
1. {issue-description}
|
|
- Impact: {high/medium/low}
|
|
- Recommendation: {action}
|
|
|
|
### Minor Issues
|
|
1. {issue-description}
|
|
- Impact: {low}
|
|
- Recommendation: {action}
|
|
|
|
### Recommendations
|
|
1. {recommendation-1}
|
|
2. {recommendation-2}
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
{Summary of test results and next steps}
|
|
|
|
**Status:** {PASS / NEEDS WORK / FAIL}
|
|
|
|
---
|
|
|
|
**Appendix A:** Full Test Query List
|
|
**Appendix B:** Failed Query Analysis
|
|
**Appendix C:** Updated marketplace.json (if changes needed)
|
|
```
|
|
|
|
---
|
|
|
|
## 🔄 Iterative Testing Process
|
|
|
|
### Step 1: Initial Test
|
|
- Run complete test suite
|
|
- Document results
|
|
- Identify failures
|
|
|
|
### Step 2: Analysis
|
|
- Analyze failed queries
|
|
- Determine root cause
|
|
- Plan fixes
|
|
|
|
### Step 3: Fix
|
|
- Update keywords/patterns/description
|
|
- Document changes
|
|
|
|
### Step 4: Retest
|
|
- Test only failed queries
|
|
- Verify fixes work
|
|
- Ensure no regressions
|
|
|
|
### Step 5: Full Regression Test
|
|
- Run complete test suite again
|
|
- Verify 95%+ success rate
|
|
- Document final results
|
|
|
|
---
|
|
|
|
## 🎯 Sample Test Suite
|
|
|
|
### Example: Agent Creation Skill
|
|
|
|
```markdown
|
|
## Test Suite: Agent Creation Skill
|
|
|
|
### Layer 1 Tests (Keywords)
|
|
|
|
**Keyword:** "create an agent for"
|
|
- ✅ "create an agent for processing PDFs"
|
|
- ✅ "I want to create an agent for automation"
|
|
- ✅ "Create An Agent For daily tasks"
|
|
|
|
**Keyword:** "automate workflow"
|
|
- ✅ "automate workflow for invoices"
|
|
- ✅ "need to automate workflow"
|
|
- ✅ "Automate Workflow handling"
|
|
|
|
[... more keywords]
|
|
|
|
### Layer 2 Tests (Patterns)
|
|
|
|
**Pattern:** `(?i)(create|build)\s+(an?\s+)?agent`
|
|
- ✅ "create an agent for X"
|
|
- ✅ "build a agent for Y"
|
|
- ✅ "create agent for Z"
|
|
- ✅ "Build Agent for tasks"
|
|
- ❌ "agent creation guide" (should not match)
|
|
|
|
[... more patterns]
|
|
|
|
### Integration Tests
|
|
|
|
**Capability:** Agent Creation
|
|
1. ✅ "create an agent for processing CSVs"
|
|
2. ✅ "build automation for email handling"
|
|
3. ✅ "automate this workflow: download, process, upload"
|
|
4. ✅ "every day I have to categorize files manually"
|
|
5. ✅ "turn this process into an automated agent"
|
|
6. ✅ "I need a skill for data extraction"
|
|
7. ✅ "daily workflow automation needed"
|
|
8. ✅ "repeatedly doing manual data entry"
|
|
9. ✅ "develop an agent to monitor APIs"
|
|
10. ✅ "make something to handle invoices automatically"
|
|
|
|
**Success Rate:** 10/10 = 100%
|
|
|
|
### Negative Tests
|
|
|
|
**Should NOT Activate:**
|
|
1. ✅ "How do I use an existing agent?" (did not activate)
|
|
2. ✅ "Explain what agents are" (did not activate)
|
|
3. ✅ "Debug this code" (did not activate)
|
|
4. ✅ "Write a Python function" (did not activate)
|
|
5. ✅ "Run the invoice agent" (did not activate)
|
|
|
|
**Success Rate:** 5/5 = 100%
|
|
```
|
|
|
|
---
|
|
|
|
## 📚 Additional Resources
|
|
|
|
- `phase4-detection.md` - Detection methodology
|
|
- `activation-patterns-guide.md` - Pattern library
|
|
- `activation-quality-checklist.md` - Quality standards
|
|
- `ACTIVATION_BEST_PRACTICES.md` - Best practices
|
|
|
|
---
|
|
|
|
## 🔧 Troubleshooting
|
|
|
|
### Issue: Low Success Rate (<90%)
|
|
|
|
**Diagnosis:**
|
|
1. Review failed queries
|
|
2. Check if keywords/patterns too narrow
|
|
3. Verify description includes key concepts
|
|
|
|
**Solution:**
|
|
1. Add more keyword variations
|
|
2. Broaden patterns slightly
|
|
3. Enhance description with synonyms
|
|
|
|
### Issue: False Positives
|
|
|
|
**Diagnosis:**
|
|
1. Review activated queries
|
|
2. Check if patterns too broad
|
|
3. Verify keywords not too generic
|
|
|
|
**Solution:**
|
|
1. Narrow patterns (add context requirements)
|
|
2. Use complete phrases for keywords
|
|
3. Add negative scope to description
|
|
|
|
### Issue: Inconsistent Activation
|
|
|
|
**Diagnosis:**
|
|
1. Test same query multiple times
|
|
2. Check for Claude Code updates
|
|
3. Verify marketplace.json structure
|
|
|
|
**Solution:**
|
|
1. Use all 3 layers (keywords + patterns + description)
|
|
2. Increase keyword/pattern coverage
|
|
3. Validate JSON syntax
|
|
|
|
---
|
|
|
|
**Version:** 1.0
|
|
**Last Updated:** 2025-10-23
|
|
**Maintained By:** Agent-Skill-Creator Team
|