Activation Testing Guide

Version: 1.0

Purpose: Comprehensive guide for testing skill activation reliability


Overview

This guide provides procedures, templates, and checklists for testing the 3-Layer Activation System to ensure skills activate correctly and reliably.

Testing Philosophy

Goal: 95%+ activation reliability

Approach: Test each layer independently, then test all layers together (integration)

Metrics:

  • True Positives: Valid queries that correctly activate
  • True Negatives: Invalid queries that correctly don't activate
  • False Positives: Invalid queries that incorrectly activate
  • False Negatives: Valid queries that fail to activate

Target: Zero false positives, <5% false negatives
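The four metrics and the target above can be tallied mechanically once every test query has an expected and an observed outcome. The following is a minimal sketch, not part of the skill itself; the `Outcome` record and the `summarize` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    should_activate: bool  # expected behaviour for the query
    did_activate: bool     # behaviour observed during testing


def summarize(outcomes):
    """Count the four metrics and check them against the stated target."""
    tp = sum(o.should_activate and o.did_activate for o in outcomes)
    tn = sum(not o.should_activate and not o.did_activate for o in outcomes)
    fp = sum(not o.should_activate and o.did_activate for o in outcomes)
    fn = sum(o.should_activate and not o.did_activate for o in outcomes)
    valid = tp + fn  # every query that should have activated
    fn_rate = fn / valid if valid else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "fn_rate": fn_rate,
        # Target from this guide: zero false positives, <5% false negatives
        "meets_target": fp == 0 and fn_rate < 0.05,
    }


if __name__ == "__main__":
    sample = [Outcome(True, True)] * 39 + [Outcome(True, False)] + [Outcome(False, False)] * 10
    print(summarize(sample))
```

Note that the false-negative rate is computed over valid queries only, so adding more negative queries does not dilute it.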


🧪 Testing Methodology

Phase 1: Layer 1 Testing (Keywords)

Objective

Verify that exact keyword phrases activate the skill.

Procedure

Step 1: List all keywords from marketplace.json

Step 2: Create at least one test query for each keyword

Step 3: Test each query manually

Step 4: Document results

Template

## Layer 1: Keywords Testing

**Keyword 1:** "create an agent for"

Test Queries:
1. "create an agent for processing invoices"
   - ✅ Activated
   - Via: Keyword match

2. "I want to create an agent for data analysis"
   - ✅ Activated
   - Via: Keyword match

3. "Create An Agent For automation"  // Case variation
   - ✅ Activated
   - Via: Keyword match (case-insensitive)

**Keyword 2:** "automate workflow"
...

Pass Criteria

  • 100% of keyword test queries activate
  • Case-insensitive matching works
  • Embedded keywords activate (keyword within longer query)
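Before testing manually, it can help to pre-check queries offline. The sketch below approximates Layer 1 as a case-insensitive substring match, which is an assumption about how keyword matching behaves; actual activation is decided inside Claude Code and may differ. The `matched_keywords` helper is a hypothetical name.

```python
def matched_keywords(query, keywords):
    """Return the keywords that occur anywhere in the query, ignoring case."""
    folded = query.casefold()
    return [kw for kw in keywords if kw.casefold() in folded]


# Keywords and queries taken from the template above
KEYWORDS = ["create an agent for", "automate workflow"]
QUERIES = [
    "create an agent for processing invoices",
    "I want to create an agent for data analysis",  # embedded keyword
    "Create An Agent For automation",               # case variation
]

if __name__ == "__main__":
    for q in QUERIES:
        hits = matched_keywords(q, KEYWORDS)
        print(f"{'PASS' if hits else 'FAIL'}  {q!r} -> {hits}")
```

A query that fails this offline check is unlikely to activate via Layer 1, so it is worth fixing before spending time on manual runs.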

Phase 2: Layer 2 Testing (Patterns)

Objective

Verify that regex patterns capture expected variations.

Procedure

Step 1: List all patterns from marketplace.json

Step 2: Create 5+ test queries per pattern

Step 3: Test pattern matching offline (a regex tester works for this)

Step 4: Test in Claude Code

Step 5: Document results

Template

## Layer 2: Patterns Testing

**Pattern 1:** `(?i)(create|build)\s+(an?\s+)?agent\s+for`

Designed to Match:
- Verbs: create, build
- Optional article: a, an
- Entity: agent
- Connector: for

Test Queries:
1. "create an agent for automation"
   - ✅ Matches pattern
   - ✅ Activated in Claude Code

2. "build a agent for processing"
   - ✅ Matches pattern
   - ✅ Activated

3. "create agent for data"  // No article
   - ✅ Matches pattern
   - ✅ Activated

4. "Build Agent For Tasks"  // Different case
   - ✅ Matches pattern
   - ✅ Activated

5. "I want to create an agent for reporting"  // Embedded
   - ✅ Matches pattern
   - ✅ Activated

Should NOT Match:
6. "agent creation guide"
   - ❌ No action verb
   - ❌ Correctly did not activate

7. "create something for automation"
   - ❌ No "agent" keyword
   - ❌ Correctly did not activate

Pass Criteria

  • 100% of positive test queries match pattern
  • 100% of positive queries activate in Claude Code
  • 0% of negative queries match pattern
  • Pattern is flexible (captures variations)
  • Pattern is specific (no false positives)
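The offline part of this phase (Step 3) can be scripted instead of using an interactive regex tester. The sketch below runs Pattern 1 from the template against its positive and negative queries using Python's `re` module; it assumes the pattern is applied as an unanchored search, and `check_pattern` is a hypothetical helper name. It does not replace the Claude Code check in Step 4.

```python
import re

# Pattern 1 from the template above
PATTERN = re.compile(r"(?i)(create|build)\s+(an?\s+)?agent\s+for")

SHOULD_MATCH = [
    "create an agent for automation",
    "build a agent for processing",
    "create agent for data",
    "Build Agent For Tasks",
    "I want to create an agent for reporting",
]
SHOULD_NOT_MATCH = [
    "agent creation guide",
    "create something for automation",
]


def check_pattern(pattern, positives, negatives):
    """Return (missed positives, unexpected matches) for a compiled pattern."""
    missed = [q for q in positives if not pattern.search(q)]
    false_hits = [q for q in negatives if pattern.search(q)]
    return missed, false_hits


if __name__ == "__main__":
    missed, false_hits = check_pattern(PATTERN, SHOULD_MATCH, SHOULD_NOT_MATCH)
    print("missed positives:", missed)
    print("unexpected matches:", false_hits)
```

Both returned lists should be empty for the pattern to meet the first and third pass criteria.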

Phase 3: Layer 3 Testing (Description + NLU)

Objective

Verify that the description helps Claude understand intent in edge cases that keywords and patterns miss.

Procedure

Step 1: Create queries that DON'T match keywords/patterns

Step 2: Verify these still activate via description understanding

Step 3: Document which queries activate

Template

## Layer 3: Description + NLU Testing

**Queries that don't match Keywords or Patterns:**

1. "I keep doing this task manually, can you help automate it?"
   - ❌ No keyword match
   - ❌ No pattern match
   - ✅ Should activate via description understanding
   - Result: {activated/did not activate}

2. "This process is repetitive and takes hours daily"
   - ❌ No keyword match
   - ❌ No pattern match
   - ✅ Should activate (describes repetitive workflow)
   - Result: {activated/did not activate}

3. "Help me build something to handle this workflow"
   - ❌ No exact keyword
   - ⚠️ Might match pattern
   - ✅ Should activate
   - Result: {activated/did not activate}

Pass Criteria

  • Edge case queries activate when appropriate
  • Natural language variations work
  • Description provides fallback coverage
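Step 1 requires queries that match neither keywords nor patterns, which is easy to get wrong by eye. The sketch below labels each candidate query with the first layer that would catch it under a simple local model (substring keywords, unanchored regex search). The `classify_layer` helper is hypothetical, and a result of `"description"` only means the query is a valid Layer 3 test case, not that Claude will activate on it.

```python
import re


def classify_layer(query, keywords, patterns):
    """Return 'keyword', 'pattern', or 'description' for a candidate query."""
    folded = query.casefold()
    if any(kw.casefold() in folded for kw in keywords):
        return "keyword"
    if any(re.search(p, query) for p in patterns):
        return "pattern"
    return "description"  # reaches Layer 3: a usable edge-case query


# Example keywords and pattern taken from earlier templates in this guide
KEYWORDS = ["create an agent for", "automate workflow"]
PATTERNS = [r"(?i)(create|build)\s+(an?\s+)?agent\s+for"]

CANDIDATES = [
    "I keep doing this task manually, can you help automate it?",
    "This process is repetitive and takes hours daily",
    "Help me build something to handle this workflow",
]

if __name__ == "__main__":
    for q in CANDIDATES:
        print(f"{classify_layer(q, KEYWORDS, PATTERNS):<12} {q}")
```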

Phase 4: Integration Testing

Objective

Test the complete system with real-world query variations.

Procedure

Step 1: Create 10+ realistic query variations per capability

Step 2: Test all queries in an actual Claude Code environment

Step 3: Track activation success rate

Step 4: Identify gaps

Template

## Integration Testing

**Capability:** Agent Creation

**Test Queries:**

| # | Query | Expected | Actual | Layer | Status |
|---|-------|----------|--------|-------|--------|
| 1 | "create an agent for PDFs" | Activate | Activated | Keyword | ✅ |
| 2 | "build automation for emails" | Activate | Activated | Pattern | ✅ |
| 3 | "daily I process invoices manually" | Activate | Activated | Desc | ✅ |
| 4 | "make agent for data entry" | Activate | Activated | Pattern | ✅ |
| 5 | "automate my workflow for reports" | Activate | Activated | Keyword | ✅ |
| 6 | "I need help with automation" | Activate | NOT activated | - | ❌ |
| 7 | "turn this into automated process" | Activate | Activated | Pattern | ✅ |
| 8 | "create skill for stock analysis" | Activate | Activated | Keyword | ✅ |
| 9 | "repeatedly doing this task" | Activate | Activated | Desc | ✅ |
| 10 | "can you help automate this?" | Activate | Activated | Desc | ✅ |

**Results:**
- Total queries: 10
- Activated correctly: 9
- Failed to activate: 1 (Query #6)
- Success rate: 90% (below the 95% pass threshold)

**Issues:**
- Query #6 is too generic; it needs more specific keywords

Pass Criteria

  • 95%+ success rate
  • All capability variations covered
  • Realistic query phrasings tested
  • Edge cases documented
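The success-rate arithmetic in the sample table can be reproduced with a few lines, which also makes the comparison against the 95% criterion explicit. This is a sketch only: the `integration_summary` helper is a hypothetical name, and the activation results are the illustrative ones from the sample table, entered by hand.

```python
# (query number, query text, activated?) copied from the sample table.
# Every query in the table is expected to activate.
RESULTS = [
    (1, "create an agent for PDFs", True),
    (2, "build automation for emails", True),
    (3, "daily I process invoices manually", True),
    (4, "make agent for data entry", True),
    (5, "automate my workflow for reports", True),
    (6, "I need help with automation", False),
    (7, "turn this into automated process", True),
    (8, "create skill for stock analysis", True),
    (9, "repeatedly doing this task", True),
    (10, "can you help automate this?", True),
]

TARGET = 0.95  # pass criterion from this guide


def integration_summary(results, target=TARGET):
    """Compute the success rate and list the query numbers that failed."""
    failed = [num for num, _query, activated in results if not activated]
    rate = (len(results) - len(failed)) / len(results)
    return {"rate": rate, "failed": failed, "passes": rate >= target}


if __name__ == "__main__":
    print(integration_summary(RESULTS))
```

With ten queries, a single failure already drops the rate to 90%, so a run of this size must be failure-free to meet the 95% criterion.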

Phase 5: Negative Testing (False Positives)

Objective

Ensure the skill does NOT activate for out-of-scope queries.

Procedure

Step 1: List out-of-scope use cases (when_not_to_use)

Step 2: Create queries for each

Step 3: Verify skill does NOT activate

Step 4: Document any false positives

Template

## Negative Testing

**Out of Scope:** General programming questions

Test Queries (Should NOT Activate):
1. "How do I write a for loop in Python?"
   - Result: Did not activate ✅

2. "What's the difference between list and tuple?"
   - Result: Did not activate ✅

3. "Help me debug this code"
   - Result: Did not activate ✅

**Out of Scope:** Using existing skills

Test Queries (Should NOT Activate):
4. "Run the invoice processor skill"
   - Result: Did not activate ✅

5. "Show me existing agents"
   - Result: Did not activate ✅

**Results:**
- Total negative queries: 5
- Correctly did not activate: 5
- False positives: 0
- Success rate: 100%

Pass Criteria

  • 100% of out-of-scope queries do NOT activate
  • Zero false positives
  • when_not_to_use cases covered

📋 Complete Testing Checklist

Pre-Testing Setup

  • marketplace.json has activation section
  • Keywords defined (10-15)
  • Patterns defined (5-7)
  • Description includes keywords
  • when_to_use / when_not_to_use defined
  • test_queries array populated
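Most of the setup checklist can be verified automatically before any queries are run. The sketch below is written under a loud assumption: it expects `keywords` and `patterns` inside an `activation` object, and looks for `when_to_use`, `when_not_to_use`, and `test_queries` either there or at the top level. The real layout of marketplace.json may differ, so adjust the lookups to your file. `check_activation_config` is a hypothetical helper name.

```python
import re


def check_activation_config(config):
    """Return a list of problems found in a parsed marketplace.json dict.

    ASSUMPTION: the key nesting used here is illustrative, not confirmed.
    """
    activation = config.get("activation")
    if not isinstance(activation, dict):
        return ["missing activation section"]

    problems = []
    keywords = activation.get("keywords", [])
    patterns = activation.get("patterns", [])

    # Ranges from the checklist: 10-15 keywords, 5-7 patterns
    if not 10 <= len(keywords) <= 15:
        problems.append(f"expected 10-15 keywords, found {len(keywords)}")
    if not 5 <= len(patterns) <= 7:
        problems.append(f"expected 5-7 patterns, found {len(patterns)}")

    # Every pattern should at least compile as a regular expression
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"invalid regex {pattern!r}: {exc}")

    # Accept these fields in the activation section or at the top level
    merged = {**config, **activation}
    for field in ("when_to_use", "when_not_to_use", "test_queries"):
        if not merged.get(field):
            problems.append(f"{field} is missing or empty")
    return problems
```

Load the file with `json.load` and pass the resulting dict; an empty list means the structural checks passed. The "Description includes keywords" item still needs a manual read.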

Layer 1: Keywords

  • All keywords tested individually
  • Case-insensitive matching verified
  • Embedded keywords work
  • 100% activation rate

Layer 2: Patterns

  • Each pattern tested with 5+ queries
  • Pattern matches verified (regex tester)
  • Claude Code activation verified
  • No false positives
  • Flexible enough for variations

Layer 3: Description

  • Edge cases tested
  • Natural language variations work
  • Fallback coverage confirmed

Integration

  • 10+ realistic queries per capability tested
  • 95%+ success rate achieved
  • All capabilities covered
  • Results documented

Negative Testing

  • Out-of-scope queries tested
  • Zero false positives
  • when_not_to_use cases verified

Documentation

  • Test results documented
  • Issues logged
  • Recommendations made
  • marketplace.json updated if needed

📊 Test Report Template

# Activation Test Report

**Skill Name:** {skill-name}
**Version:** {version}
**Test Date:** {date}
**Tested By:** {name}
**Environment:** Claude Code {version}

---

## Executive Summary

- **Overall Success Rate:** {X}%
- **Total Queries Tested:** {N}
- **True Positives:** {N}
- **True Negatives:** {N}
- **False Positives:** {N}
- **False Negatives:** {N}

---

## Layer 1: Keywords Testing

**Keywords Tested:** {count}
**Success Rate:** {X}%

### Results
| Keyword | Test Queries | Passed | Failed |
|---------|--------------|--------|--------|
| {keyword-1} | {N} | {N} | {N} |
| {keyword-2} | {N} | {N} | {N} |

**Issues:**
- {issue-1}
- {issue-2}

---

## Layer 2: Patterns Testing

**Patterns Tested:** {count}
**Success Rate:** {X}%

### Results
| Pattern | Test Queries | Passed | Failed |
|---------|--------------|--------|--------|
| {pattern-1} | {N} | {N} | {N} |
| {pattern-2} | {N} | {N} | {N} |

**Issues:**
- {issue-1}
- {issue-2}

---

## Layer 3: Description Testing

**Edge Cases Tested:** {count}
**Success Rate:** {X}%

**Results:**
- Activated via description: {N}
- Failed to activate: {N}

---

## Integration Testing

**Total Test Queries:** {count}
**Success Rate:** {X}%

**Breakdown by Capability:**
| Capability | Queries | Success | Rate |
|------------|---------|---------|------|
| {cap-1} | {N} | {N} | {X}% |
| {cap-2} | {N} | {N} | {X}% |

---

## Negative Testing

**Out-of-Scope Queries:** {count}
**False Positives:** {N}
**Success Rate:** {X}%

---

## Issues & Recommendations

### Critical Issues
1. {issue-description}
   - Impact: {high/medium/low}
   - Recommendation: {action}

### Minor Issues
1. {issue-description}
   - Impact: {low}
   - Recommendation: {action}

### Recommendations
1. {recommendation-1}
2. {recommendation-2}

---

## Conclusion

{Summary of test results and next steps}

**Status:** {PASS / NEEDS WORK / FAIL}

---

**Appendix A:** Full Test Query List
**Appendix B:** Failed Query Analysis
**Appendix C:** Updated marketplace.json (if changes needed)

🔄 Iterative Testing Process

Step 1: Initial Test

  • Run complete test suite
  • Document results
  • Identify failures

Step 2: Analysis

  • Analyze failed queries
  • Determine root cause
  • Plan fixes

Step 3: Fix

  • Update keywords/patterns/description
  • Document changes

Step 4: Retest

  • Re-run the previously failed queries
  • Verify the fixes work
  • Spot-check that previously passing queries still pass (the full check is Step 5)

Step 5: Full Regression Test

  • Run complete test suite again
  • Verify 95%+ success rate
  • Document final results
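Regressions are easiest to spot by diffing the results of two runs. The sketch below assumes each run is recorded as a mapping from query text to a pass/fail flag, which is a hypothetical format chosen for illustration; `compare_runs` is likewise a hypothetical helper.

```python
def compare_runs(before, after):
    """Compare two test runs keyed by query text.

    Returns (regressions, fixed): queries that passed before and fail now,
    and queries that failed before and pass now. Queries missing from the
    second run are ignored rather than counted as regressions.
    """
    regressions = sorted(
        q for q in before if before[q] and q in after and not after[q]
    )
    fixed = sorted(q for q in before if not before[q] and after.get(q, False))
    return regressions, fixed


if __name__ == "__main__":
    initial = {
        "create an agent for PDFs": True,
        "I need help with automation": False,
        "Run the invoice processor skill": True,  # negative test: passed = did not activate
    }
    retest = {
        "create an agent for PDFs": True,
        "I need help with automation": True,
        "Run the invoice processor skill": False,  # now activates: a regression
    }
    regressions, fixed = compare_runs(initial, retest)
    print("regressions:", regressions)
    print("fixed:", fixed)
```

Recording negative tests in the same mapping matters here, because broadening keywords to fix a false negative is a common way to introduce a false positive.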

🎯 Sample Test Suite

Example: Agent Creation Skill

## Test Suite: Agent Creation Skill

### Layer 1 Tests (Keywords)

**Keyword:** "create an agent for"
- ✅ "create an agent for processing PDFs"
- ✅ "I want to create an agent for automation"
- ✅ "Create An Agent For daily tasks"

**Keyword:** "automate workflow"
- ✅ "automate workflow for invoices"
- ✅ "need to automate workflow"
- ✅ "Automate Workflow handling"

[... more keywords]

### Layer 2 Tests (Patterns)

**Pattern:** `(?i)(create|build)\s+(an?\s+)?agent`
- ✅ "create an agent for X"
- ✅ "build a agent for Y"
- ✅ "create agent for Z"
- ✅ "Build Agent for tasks"
- ❌ "agent creation guide" (should not match)

[... more patterns]

### Integration Tests

**Capability:** Agent Creation
1. ✅ "create an agent for processing CSVs"
2. ✅ "build automation for email handling"
3. ✅ "automate this workflow: download, process, upload"
4. ✅ "every day I have to categorize files manually"
5. ✅ "turn this process into an automated agent"
6. ✅ "I need a skill for data extraction"
7. ✅ "daily workflow automation needed"
8. ✅ "repeatedly doing manual data entry"
9. ✅ "develop an agent to monitor APIs"
10. ✅ "make something to handle invoices automatically"

**Success Rate:** 10/10 = 100%

### Negative Tests

**Should NOT Activate:**
1. ✅ "How do I use an existing agent?" (did not activate)
2. ✅ "Explain what agents are" (did not activate)
3. ✅ "Debug this code" (did not activate)
4. ✅ "Write a Python function" (did not activate)
5. ✅ "Run the invoice agent" (did not activate)

**Success Rate:** 5/5 = 100%

📚 Additional Resources

  • phase4-detection.md - Detection methodology
  • activation-patterns-guide.md - Pattern library
  • activation-quality-checklist.md - Quality standards
  • ACTIVATION_BEST_PRACTICES.md - Best practices

🔧 Troubleshooting

Issue: Low Success Rate (<90%)

Diagnosis:

  1. Review failed queries
  2. Check whether keywords/patterns are too narrow
  3. Verify description includes key concepts

Solution:

  1. Add more keyword variations
  2. Broaden patterns slightly
  3. Enhance description with synonyms

Issue: False Positives

Diagnosis:

  1. Review activated queries
  2. Check whether patterns are too broad
  3. Verify that keywords are not too generic

Solution:

  1. Narrow patterns (add context requirements)
  2. Use complete phrases for keywords
  3. Add negative scope to description

Issue: Inconsistent Activation

Diagnosis:

  1. Test same query multiple times
  2. Check for Claude Code updates
  3. Verify marketplace.json structure

Solution:

  1. Use all 3 layers (keywords + patterns + description)
  2. Increase keyword/pattern coverage
  3. Validate JSON syntax
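Two of the checks above lend themselves to a script: spotting queries whose outcome varies across repeated trials, and confirming that the JSON parses at all. The sketch below assumes repeated-trial outcomes have been recorded by hand as lists of booleans per query, which is a hypothetical format; `flaky_queries` and `is_valid_json` are hypothetical helper names. `json.loads` and `json.JSONDecodeError` are from Python's standard library.

```python
import json


def flaky_queries(runs):
    """Map each inconsistent query to its activation rate.

    `runs` maps query text to the outcomes observed over repeated trials.
    Queries that always or never activated are omitted.
    """
    return {
        query: sum(outcomes) / len(outcomes)
        for query, outcomes in runs.items()
        if outcomes and len(set(outcomes)) > 1
    }


def is_valid_json(text):
    """Return True if the text parses as JSON (syntax check only)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    recorded = {
        "create an agent for PDFs": [True, True, True, True],
        "can you help automate this?": [True, False, True, True],
    }
    print(flaky_queries(recorded))
    print(is_valid_json('{"activation": {"keywords": []}}'))
```

Queries that surface in `flaky_queries` are usually relying on Layer 3 alone, so they are good candidates for an added keyword or pattern.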

Version: 1.0

Last Updated: 2025-10-23

Maintained By: Agent-Skill-Creator Team