Activation Testing Guide
Version: 1.0
Purpose: Comprehensive guide for testing skill activation reliability
Overview
This guide provides procedures, templates, and checklists for testing the 3-Layer Activation System to ensure skills activate correctly and reliably.
Testing Philosophy
Goal: 95%+ activation reliability
Approach: Test each layer independently, then test all layers together (integration)
Metrics:
- True Positives: Valid queries that correctly activate
- True Negatives: Invalid queries that correctly don't activate
- False Positives: Invalid queries that incorrectly activate
- False Negatives: Valid queries that fail to activate
Target: Zero false positives, <5% false negatives
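The four metrics can be tallied from a list of recorded test outcomes. The following is a minimal sketch, not part of any tool this guide names: the `Outcome` record and the `summarize` helper are hypothetical, and it assumes activation results were observed by hand and typed in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One manually recorded test result (hypothetical record shape)."""
    query: str
    should_activate: bool  # True for valid (in-scope) queries
    did_activate: bool     # what was observed during testing


def summarize(outcomes):
    """Count TP/TN/FP/FN and compare them with the targets stated above."""
    tp = sum(o.should_activate and o.did_activate for o in outcomes)
    tn = sum(not o.should_activate and not o.did_activate for o in outcomes)
    fp = sum(not o.should_activate and o.did_activate for o in outcomes)
    fn = sum(o.should_activate and not o.did_activate for o in outcomes)
    valid = tp + fn
    fn_rate = fn / valid if valid else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "false_negative_rate": fn_rate,
        # Targets from this guide: zero false positives, <5% false negatives.
        "meets_target": fp == 0 and fn_rate < 0.05,
    }


if __name__ == "__main__":
    recorded = [
        Outcome("create an agent for PDFs", True, True),
        Outcome("I need help with automation", True, False),
        Outcome("How do I write a for loop in Python?", False, False),
    ]
    print(summarize(recorded))
```

The false-negative rate is computed over valid queries only (true positives plus false negatives), so adding more negative tests does not dilute it.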
🧪 Testing Methodology
Phase 1: Layer 1 Testing (Keywords)
Objective
Verify that exact keyword phrases activate the skill.
Procedure
Step 1: List all keywords from marketplace.json
Step 2: Create test query for each keyword
Step 3: Test each query manually
Step 4: Document results
Template
## Layer 1: Keywords Testing
**Keyword 1:** "create an agent for"
Test Queries:
1. "create an agent for processing invoices"
- ✅ Activated
- Via: Keyword match
2. "I want to create an agent for data analysis"
- ✅ Activated
- Via: Keyword match
3. "Create An Agent For automation" // Case variation
- ✅ Activated
- Via: Keyword match (case-insensitive)
**Keyword 2:** "automate workflow"
...
Pass Criteria
- 100% of keyword test queries activate
- Case-insensitive matching works
- Embedded keywords activate (keyword within longer query)
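Keyword queries can be pre-checked locally before the manual run. This sketch assumes Layer 1 behaves like a case-insensitive substring match, which is how the pass criteria above describe it; the `keyword_hits` helper is hypothetical, and a local hit does not replace confirming activation in Claude Code.

```python
def keyword_hits(query, keywords):
    """Return the keywords found in the query, ignoring case.

    Assumption: Layer 1 is a case-insensitive substring match, as the
    pass criteria describe. The real matching happens in Claude Code.
    """
    lowered = query.casefold()
    return [kw for kw in keywords if kw.casefold() in lowered]


# Keywords and queries taken from the template above.
KEYWORDS = ["create an agent for", "automate workflow"]

QUERIES = [
    "create an agent for processing invoices",      # exact phrase
    "I want to create an agent for data analysis",  # embedded
    "Create An Agent For automation",               # case variation
]

for q in QUERIES:
    print(q, "->", keyword_hits(q, KEYWORDS))
```

A query that returns an empty list here is a candidate for Layer 2 or Layer 3 testing rather than a Layer 1 test case.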
Phase 2: Layer 2 Testing (Patterns)
Objective
Verify that regex patterns capture expected variations.
Procedure
Step 1: List all patterns from marketplace.json
Step 2: Create 5+ test queries per pattern
Step 3: Test pattern matching (can use regex tester)
Step 4: Test in Claude Code
Step 5: Document results
Template
## Layer 2: Patterns Testing
**Pattern 1:** `(?i)(create|build)\s+(an?\s+)?agent\s+for`
Designed to Match:
- Verbs: create, build
- Optional article: a, an
- Entity: agent
- Connector: for
Test Queries:
1. "create an agent for automation"
- ✅ Matches pattern
- ✅ Activated in Claude Code
2. "build a agent for processing"
- ✅ Matches pattern
- ✅ Activated
3. "create agent for data" // No article
- ✅ Matches pattern
- ✅ Activated
4. "Build Agent For Tasks" // Different case
- ✅ Matches pattern
- ✅ Activated
5. "I want to create an agent for reporting" // Embedded
- ✅ Matches pattern
- ✅ Activated
Should NOT Match:
6. "agent creation guide"
- ❌ No action verb
- ❌ Correctly did not activate
7. "create something for automation"
- ❌ No "agent" keyword
- ❌ Correctly did not activate
Pass Criteria
- 100% of positive test queries match pattern
- 100% of positive queries activate in Claude Code
- 0% of negative queries match pattern
- Pattern is flexible (captures variations)
- Pattern is specific (no false positives)
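The regex-tester step can be scripted. The sketch below runs Pattern 1 from the template against its positive and negative queries using Python's `re` module. It assumes the pattern behaves the same in Python as in the engine Claude Code uses, which this guide does not confirm; the `check_pattern` helper is hypothetical.

```python
import re

# Pattern 1 from the template above.
PATTERN = re.compile(r"(?i)(create|build)\s+(an?\s+)?agent\s+for")

SHOULD_MATCH = [
    "create an agent for automation",
    "build a agent for processing",
    "create agent for data",                    # no article
    "Build Agent For Tasks",                    # different case
    "I want to create an agent for reporting",  # embedded
]

SHOULD_NOT_MATCH = [
    "agent creation guide",             # no action verb
    "create something for automation",  # no "agent"
]


def check_pattern(pattern, positives, negatives):
    """Return (missed, false_hits); two empty lists mean the pattern passes."""
    # search() rather than match() so embedded phrases are found.
    missed = [q for q in positives if not pattern.search(q)]
    false_hits = [q for q in negatives if pattern.search(q)]
    return missed, false_hits


missed, false_hits = check_pattern(PATTERN, SHOULD_MATCH, SHOULD_NOT_MATCH)
print("missed:", missed)
print("false hits:", false_hits)
```

This covers only the "matches pattern" half of each test entry; the "activated in Claude Code" half still has to be verified in the real environment.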
Phase 3: Layer 3 Testing (Description + NLU)
Objective
Verify that the description helps Claude understand intent in edge cases that keywords and patterns miss.
Procedure
Step 1: Create queries that DON'T match keywords/patterns
Step 2: Verify these still activate via description understanding
Step 3: Document which queries activate
Template
## Layer 3: Description + NLU Testing
**Queries that don't match Keywords or Patterns:**
1. "I keep doing this task manually, can you help automate it?"
- ❌ No keyword match
- ❌ No pattern match
- ✅ Should activate via description understanding
- Result: {activated/did not activate}
2. "This process is repetitive and takes hours daily"
- ❌ No keyword match
- ❌ No pattern match
- ✅ Should activate (describes repetitive workflow)
- Result: {activated/did not activate}
3. "Help me build something to handle this workflow"
- ❌ No exact keyword
- ⚠️ Might match pattern
- ✅ Should activate
- Result: {activated/did not activate}
Pass Criteria
- Edge case queries activate when appropriate
- Natural language variations work
- Description provides fallback coverage
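A Layer 3 test is only meaningful if the query really bypasses Layers 1 and 2. The sketch below filters candidate queries down to those that match no keyword and no pattern. It uses a deliberately small, illustrative keyword and pattern set; `matched_layer` is a hypothetical helper, and the substring and regex semantics are assumptions about how the earlier layers behave.

```python
import re

# Illustrative subset; a real check would load the full lists.
KEYWORDS = ["create an agent for", "automate workflow"]
PATTERNS = [re.compile(r"(?i)(create|build)\s+(an?\s+)?agent\s+for")]


def matched_layer(query):
    """Return 'keyword', 'pattern', or None for a query (local check only)."""
    lowered = query.casefold()
    if any(kw.casefold() in lowered for kw in KEYWORDS):
        return "keyword"
    if any(p.search(query) for p in PATTERNS):
        return "pattern"
    return None


candidates = [
    "I keep doing this task manually, can you help automate it?",
    "This process is repetitive and takes hours daily",
    "Help me build something to handle this workflow",
]

# Keep only queries that neither earlier layer would catch.
layer3_only = [q for q in candidates if matched_layer(q) is None]
print(layer3_only)
```

With the full keyword and pattern lists loaded, a query such as the third candidate may turn out to match a pattern, in which case it belongs in the Layer 2 tests instead.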
Phase 4: Integration Testing
Objective
Test complete system with real-world query variations.
Procedure
Step 1: Create 10+ realistic query variations per capability
Step 2: Test all queries in actual Claude Code environment
Step 3: Track activation success rate
Step 4: Identify gaps
Template
## Integration Testing
**Capability:** Agent Creation
**Test Queries:**
| # | Query | Expected | Actual | Layer | Status |
|---|-------|----------|--------|-------|--------|
| 1 | "create an agent for PDFs" | Activate | Activated | Keyword | ✅ |
| 2 | "build automation for emails" | Activate | Activated | Pattern | ✅ |
| 3 | "daily I process invoices manually" | Activate | Activated | Desc | ✅ |
| 4 | "make agent for data entry" | Activate | Activated | Pattern | ✅ |
| 5 | "automate my workflow for reports" | Activate | Activated | Keyword | ✅ |
| 6 | "I need help with automation" | Activate | NOT activated | - | ❌ |
| 7 | "turn this into automated process" | Activate | Activated | Pattern | ✅ |
| 8 | "create skill for stock analysis" | Activate | Activated | Keyword | ✅ |
| 9 | "repeatedly doing this task" | Activate | Activated | Desc | ✅ |
| 10 | "can you help automate this?" | Activate | Activated | Desc | ✅ |
**Results:**
- Total queries: 10
- Activated correctly: 9
- Failed to activate: 1 (Query #6)
- Success rate: 90%
**Issues:**
- Query #6 too generic, needs more specific keywords
Pass Criteria
- 95%+ success rate
- All capability variations covered
- Realistic query phrasings tested
- Edge cases documented
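The results section of the template can be computed rather than counted by hand. This sketch transcribes the example table into tuples and derives the success rate, a per-layer breakdown, and the list of gaps. The tuple layout and the `integration_summary` helper are hypothetical.

```python
from collections import Counter

# (query, activated, layer) rows transcribed from the example table above.
RESULTS = [
    ("create an agent for PDFs", True, "Keyword"),
    ("build automation for emails", True, "Pattern"),
    ("daily I process invoices manually", True, "Desc"),
    ("make agent for data entry", True, "Pattern"),
    ("automate my workflow for reports", True, "Keyword"),
    ("I need help with automation", False, None),
    ("turn this into automated process", True, "Pattern"),
    ("create skill for stock analysis", True, "Keyword"),
    ("repeatedly doing this task", True, "Desc"),
    ("can you help automate this?", True, "Desc"),
]


def integration_summary(results, threshold=0.95):
    """Summarize integration results against the 95% pass criterion."""
    if not results:
        raise ValueError("no results to summarize")
    activated = [row for row in results if row[1]]
    rate = len(activated) / len(results)
    return {
        "success_rate": rate,
        "by_layer": Counter(layer for _, _, layer in activated),
        "gaps": [query for query, ok, _ in results if not ok],
        "passed": rate >= threshold,
    }


summary = integration_summary(RESULTS)
print(summary)
```

For the example table this reports a 90% success rate with one gap, so the run falls short of the 95% criterion and the failing query feeds the next iteration.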
Phase 5: Negative Testing (False Positives)
Objective
Ensure the skill does NOT activate for out-of-scope queries.
Procedure
Step 1: List out-of-scope use cases (when_not_to_use)
Step 2: Create queries for each
Step 3: Verify skill does NOT activate
Step 4: Document any false positives
Template
## Negative Testing
**Out of Scope:** General programming questions
Test Queries (Should NOT Activate):
1. "How do I write a for loop in Python?"
- Result: Did not activate ✅
2. "What's the difference between list and tuple?"
- Result: Did not activate ✅
3. "Help me debug this code"
- Result: Did not activate ✅
**Out of Scope:** Using existing skills
Test Queries (Should NOT Activate):
4. "Run the invoice processor skill"
- Result: Did not activate ✅
5. "Show me existing agents"
- Result: Did not activate ✅
**Results:**
- Total negative queries: 5
- Correctly did not activate: 5
- False positives: 0
- Success rate: 100%
Pass Criteria
- 100% of out-of-scope queries do NOT activate
- Zero false positives
- when_not_to_use cases covered
📋 Complete Testing Checklist
Pre-Testing Setup
- marketplace.json has activation section
- Keywords defined (10-15)
- Patterns defined (5-7)
- Description includes keywords
- when_to_use / when_not_to_use defined
- test_queries array populated
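Parts of this setup checklist can be checked mechanically. The sketch below is an illustration under a loudly stated assumption: it supposes the parsed marketplace.json has an `activation` object holding `keywords`, `patterns`, and `test_queries` lists. Only `test_queries`, `when_to_use`, and `when_not_to_use` are named in this guide; the other key names and the nesting are guesses, so adjust them to the real file. `check_setup` is a hypothetical helper.

```python
import re


def check_setup(manifest):
    """Return a list of setup problems; an empty list means the checks pass.

    Assumed layout (hypothetical): manifest["activation"] contains
    "keywords", "patterns", and "test_queries" lists.
    """
    activation = manifest.get("activation")
    if not isinstance(activation, dict):
        return ["missing activation section"]

    problems = []
    keywords = activation.get("keywords", [])
    patterns = activation.get("patterns", [])

    # Ranges from the checklist above.
    if not 10 <= len(keywords) <= 15:
        problems.append(f"expected 10-15 keywords, found {len(keywords)}")
    if not 5 <= len(patterns) <= 7:
        problems.append(f"expected 5-7 patterns, found {len(patterns)}")

    # A pattern that fails to compile can never match anything.
    for p in patterns:
        try:
            re.compile(p)
        except re.error as exc:
            problems.append(f"pattern {p!r} does not compile: {exc}")

    if not activation.get("test_queries"):
        problems.append("test_queries array is empty or missing")
    return problems
```

Calling `check_setup(json.load(f))` on the opened file would print nothing alarming for a well-formed setup and a short list of findings otherwise. The description and when_to_use / when_not_to_use items are left to manual review here because their location in the file is not specified in this guide.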
Layer 1: Keywords
- All keywords tested individually
- Case-insensitive matching verified
- Embedded keywords work
- 100% activation rate
Layer 2: Patterns
- Each pattern tested with 5+ queries
- Pattern matches verified (regex tester)
- Claude Code activation verified
- No false positives
- Flexible enough for variations
Layer 3: Description
- Edge cases tested
- Natural language variations work
- Fallback coverage confirmed
Integration
- 10+ realistic queries per capability tested
- 95%+ success rate achieved
- All capabilities covered
- Results documented
Negative Testing
- Out-of-scope queries tested
- Zero false positives
- when_not_to_use cases verified
Documentation
- Test results documented
- Issues logged
- Recommendations made
- marketplace.json updated if needed
📊 Test Report Template
# Activation Test Report
**Skill Name:** {skill-name}
**Version:** {version}
**Test Date:** {date}
**Tested By:** {name}
**Environment:** Claude Code {version}
---
## Executive Summary
- **Overall Success Rate:** {X}%
- **Total Queries Tested:** {N}
- **True Positives:** {N}
- **True Negatives:** {N}
- **False Positives:** {N}
- **False Negatives:** {N}
---
## Layer 1: Keywords Testing
**Keywords Tested:** {count}
**Success Rate:** {X}%
### Results
| Keyword | Test Queries | Passed | Failed |
|---------|--------------|--------|--------|
| {keyword-1} | {N} | {N} | {N} |
| {keyword-2} | {N} | {N} | {N} |
**Issues:**
- {issue-1}
- {issue-2}
---
## Layer 2: Patterns Testing
**Patterns Tested:** {count}
**Success Rate:** {X}%
### Results
| Pattern | Test Queries | Passed | Failed |
|---------|--------------|--------|--------|
| {pattern-1} | {N} | {N} | {N} |
| {pattern-2} | {N} | {N} | {N} |
**Issues:**
- {issue-1}
- {issue-2}
---
## Layer 3: Description Testing
**Edge Cases Tested:** {count}
**Success Rate:** {X}%
**Results:**
- Activated via description: {N}
- Failed to activate: {N}
---
## Integration Testing
**Total Test Queries:** {count}
**Success Rate:** {X}%
**Breakdown by Capability:**
| Capability | Queries | Success | Rate |
|------------|---------|---------|------|
| {cap-1} | {N} | {N} | {X}% |
| {cap-2} | {N} | {N} | {X}% |
---
## Negative Testing
**Out-of-Scope Queries:** {count}
**False Positives:** {N}
**Success Rate:** {X}%
---
## Issues & Recommendations
### Critical Issues
1. {issue-description}
- Impact: {high/medium/low}
- Recommendation: {action}
### Minor Issues
1. {issue-description}
- Impact: {low}
- Recommendation: {action}
### Recommendations
1. {recommendation-1}
2. {recommendation-2}
---
## Conclusion
{Summary of test results and next steps}
**Status:** {PASS / NEEDS WORK / FAIL}
---
**Appendix A:** Full Test Query List
**Appendix B:** Failed Query Analysis
**Appendix C:** Updated marketplace.json (if changes needed)
🔄 Iterative Testing Process
Step 1: Initial Test
- Run complete test suite
- Document results
- Identify failures
Step 2: Analysis
- Analyze failed queries
- Determine root cause
- Plan fixes
Step 3: Fix
- Update keywords/patterns/description
- Document changes
Step 4: Retest
- Re-run the previously failed queries first
- Verify the fixes work
- Spot-check related passing queries for regressions (the full check follows in Step 5)
Step 5: Full Regression Test
- Run complete test suite again
- Verify 95%+ success rate
- Document final results
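Comparing the results of two full runs makes regressions explicit. In this sketch each run is a mapping from query text to whether the query behaved as expected; that shape and the `compare_runs` helper are hypothetical, chosen only for illustration.

```python
def compare_runs(before, after):
    """Compare two test runs and return (regressions, fixed).

    Each run maps query text -> True if the query behaved as expected.
    A query missing from `after` is treated as a regression, since it
    was not re-verified.
    """
    regressions = sorted(
        q for q in before if before[q] and not after.get(q, False)
    )
    fixed = sorted(
        q for q in before if not before[q] and after.get(q, False)
    )
    return regressions, fixed


initial = {
    "create an agent for PDFs": True,
    "I need help with automation": False,
    "How do I write a for loop in Python?": True,
}
retest = {
    "create an agent for PDFs": True,
    "I need help with automation": True,
    "How do I write a for loop in Python?": False,
}

regressions, fixed = compare_runs(initial, retest)
print("regressions:", regressions)
print("fixed:", fixed)
```

In this made-up example, broadening coverage fixed the generic automation query but caused an out-of-scope query to start activating, which is the kind of trade-off the full regression run is meant to catch.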
🎯 Sample Test Suite
Example: Agent Creation Skill
## Test Suite: Agent Creation Skill
### Layer 1 Tests (Keywords)
**Keyword:** "create an agent for"
- ✅ "create an agent for processing PDFs"
- ✅ "I want to create an agent for automation"
- ✅ "Create An Agent For daily tasks"
**Keyword:** "automate workflow"
- ✅ "automate workflow for invoices"
- ✅ "need to automate workflow"
- ✅ "Automate Workflow handling"
[... more keywords]
### Layer 2 Tests (Patterns)
**Pattern:** `(?i)(create|build)\s+(an?\s+)?agent`
- ✅ "create an agent for X"
- ✅ "build a agent for Y"
- ✅ "create agent for Z"
- ✅ "Build Agent for tasks"
- ❌ "agent creation guide" (should not match)
[... more patterns]
### Integration Tests
**Capability:** Agent Creation
1. ✅ "create an agent for processing CSVs"
2. ✅ "build automation for email handling"
3. ✅ "automate this workflow: download, process, upload"
4. ✅ "every day I have to categorize files manually"
5. ✅ "turn this process into an automated agent"
6. ✅ "I need a skill for data extraction"
7. ✅ "daily workflow automation needed"
8. ✅ "repeatedly doing manual data entry"
9. ✅ "develop an agent to monitor APIs"
10. ✅ "make something to handle invoices automatically"
**Success Rate:** 10/10 = 100%
### Negative Tests
**Should NOT Activate:**
1. ✅ "How do I use an existing agent?" (did not activate)
2. ✅ "Explain what agents are" (did not activate)
3. ✅ "Debug this code" (did not activate)
4. ✅ "Write a Python function" (did not activate)
5. ✅ "Run the invoice agent" (did not activate)
**Success Rate:** 5/5 = 100%
📚 Additional Resources
- phase4-detection.md - Detection methodology
- activation-patterns-guide.md - Pattern library
- activation-quality-checklist.md - Quality standards
- ACTIVATION_BEST_PRACTICES.md - Best practices
🔧 Troubleshooting
Issue: Low Success Rate (<90%)
Diagnosis:
- Review failed queries
- Check if keywords/patterns too narrow
- Verify description includes key concepts
Solution:
- Add more keyword variations
- Broaden patterns slightly
- Enhance description with synonyms
Issue: False Positives
Diagnosis:
- Review activated queries
- Check if patterns too broad
- Verify keywords not too generic
Solution:
- Narrow patterns (add context requirements)
- Use complete phrases for keywords
- Add negative scope to description
Issue: Inconsistent Activation
Diagnosis:
- Test same query multiple times
- Check for Claude Code updates
- Verify marketplace.json structure
Solution:
- Use all 3 layers (keywords + patterns + description)
- Increase keyword/pattern coverage
- Validate JSON syntax
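JSON syntax can be validated with Python's standard `json` module, which reports the line and column of the first error. A minimal sketch follows; `validate_json_text` is a hypothetical helper name.

```python
import json


def validate_json_text(text):
    """Return None if text is valid JSON, else a readable error location."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# A trailing comma is a common cause of an unparseable manifest.
print(validate_json_text('{"activation": {"keywords": []}}'))
print(validate_json_text('{"activation": {"keywords": [],}}'))
```

To check a real file, pass in its contents, for example `validate_json_text(open("marketplace.json", encoding="utf-8").read())`, where the path is an assumption about where the file lives. This confirms only that the file parses, not that its structure is what Claude Code expects.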
Version: 1.0
Last Updated: 2025-10-23
Maintained By: Agent-Skill-Creator Team