# Activation Test Automation Framework v1.0
**Version:** 1.0
**Purpose:** Automated testing system for skill activation reliability
**Target:** 99.5% activation reliability with <1% false positives
---
## 🎯 **Overview**
This framework provides automated tools to test, validate, and monitor skill activation reliability across the 3-Layer Activation System (Keywords, Patterns, Description + NLU).
### **Problem Solved**
**Before:** Manual testing was time-consuming, inconsistent, and missed edge cases
**After:** Automated testing provides consistent validation, comprehensive coverage, and continuous monitoring
---
## 🛠️ **Core Components**
### **1. Activation Test Suite Generator**
Automatically generates comprehensive test cases for any skill based on its marketplace.json configuration.
### **2. Regex Pattern Validator**
Validates regex patterns against test cases and identifies potential issues.
### **3. Coverage Analyzer**
Calculates activation coverage and identifies gaps in keyword/pattern combinations.
### **4. Continuous Monitor**
Monitors skill activation in real-time and tracks performance metrics.
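As a rough illustration of what the components above exercise, the following sketch checks a query against the first two activation layers. It is a minimal sketch, not the framework's implementation: `SAMPLE_CONFIG`, its keywords and pattern, and the `match_layer` function are hypothetical, and only the `activation.keywords` / `activation.patterns` field names are taken from this document.

```python
import re

# Hypothetical configuration; only the field names mirror the
# activation.keywords / activation.patterns keys used in this document.
SAMPLE_CONFIG = {
    "activation": {
        "keywords": ["analyze stock", "technical analysis"],
        "patterns": [r"(?i)\b(analy[sz]e|evaluate)\s+.*\bstocks?\b"],
    }
}


def match_layer(query, config):
    """Return 'keywords', 'patterns', or None for the first layer that matches."""
    activation = config["activation"]
    lowered = query.lower()
    # Layer 1: case-insensitive substring match on keywords (an assumption
    # about how keyword matching works).
    if any(keyword.lower() in lowered for keyword in activation["keywords"]):
        return "keywords"
    # Layer 2: regex patterns.
    if any(re.search(pattern, query) for pattern in activation["patterns"]):
        return "patterns"
    # Layer 3 (Description + NLU) cannot be reproduced offline, so it is skipped.
    return None
```

A query such as `"Evaluate these tech stocks"` misses both keywords but is caught by the pattern, which is the kind of layer-by-layer attribution the monitor records.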
---
## 📁 **Framework Structure**
```
references/tools/activation-tester/
├── core/
│   ├── test-generator.md              # Test case generation logic
│   ├── pattern-validator.md           # Regex validation tools
│   ├── coverage-analyzer.md           # Coverage calculation
│   └── performance-monitor.md         # Continuous monitoring
├── scripts/
│   ├── run-full-test-suite.sh         # Complete automation script
│   ├── quick-validation.sh            # Fast validation checks
│   ├── regression-test.sh             # Regression testing
│   └── performance-benchmark.sh       # Performance testing
├── templates/
│   ├── test-report-template.md        # Standardized reporting
│   ├── coverage-report-template.md    # Coverage analysis
│   └── performance-dashboard.md       # Metrics visualization
└── examples/
    ├── stock-analyzer-test-suite.md   # Example test suite
    └── agent-creator-test-suite.md    # Example reference test
```
---
## 🧪 **Test Generation System**
### **Keyword Test Generation**
For each keyword in marketplace.json, the system generates:
```bash
generate_keyword_tests() {
    local keyword="$1"
    local skill_context="$2"

    # 1. Exact match test
    echo "Test: \"${keyword}\""
    # 2. Embedded in sentence
    echo "Test: \"I need to ${keyword} for my project\""
    # 3. Case variations (quoted so multi-word keywords are not word-split or globbed)
    echo "Test: \"$(echo "${keyword}" | tr '[:lower:]' '[:upper:]')\""
    # 4. Natural language variations
    echo "Test: \"Can you help me ${keyword}?\""
    # 5. Context-specific variations
    echo "Test: \"${keyword} in ${skill_context}\""
}
```
### **Pattern Test Generation**
For each regex pattern, generate comprehensive test cases:
```bash
generate_pattern_tests() {
    local pattern="$1"
    local description="$2"

    # Extract pattern components
    local verbs=$(extract_verbs "$pattern")
    local entities=$(extract_entities "$pattern")
    local contexts=$(extract_contexts "$pattern")

    # Generate positive test cases
    for verb in $verbs; do
        for entity in $entities; do
            echo "Test: \"${verb} ${entity}\""
            echo "Test: \"I want to ${verb} ${entity} now\""
            echo "Test: \"Can you ${verb} ${entity} for me?\""
        done
    done

    # Generate negative test cases
    generate_negative_cases "$pattern"
}
```
### **Integration Test Generation**
Creates realistic user queries combining multiple elements:
```bash
generate_integration_tests() {
    local capabilities=("$@")

    for capability in "${capabilities[@]}"; do
        # Natural language variations
        echo "Test: \"How can I ${capability}?\""
        echo "Test: \"I need help with ${capability}\""
        echo "Test: \"Can you ${capability} for me?\""
        # Workflow context
        echo "Test: \"Every day I have to ${capability}\""
        echo "Test: \"I want to automate ${capability}\""
        # Complex queries
        echo "Test: \"${capability} and show me results\""
        echo "Test: \"Help me understand ${capability} better\""
    done
}
```
---
## 🔍 **Pattern Validation System**
### **Regex Pattern Analyzer**
Validates regex patterns for common issues:
```python
import re


def analyze_pattern(pattern):
    """Analyze regex pattern for potential issues"""
    issues = []
    suggestions = []

    # Check for common regex problems
    if pattern.count('*') > 2:
        issues.append("Too many wildcards - may cause false positives")
    # Look for the inline case-insensitive flag "(?i)"
    if '(?i)' not in pattern:
        suggestions.append("Add case-insensitive flag: (?i)")
    if pattern.startswith('.*') and pattern.endswith('.*'):
        issues.append("Pattern too broad - may match anything")

    # Calculate pattern specificity
    # (calculate_specificity and assess_risk are helpers assumed to be defined elsewhere)
    specificity = calculate_specificity(pattern)

    return {
        'issues': issues,
        'suggestions': suggestions,
        'specificity': specificity,
        'risk_level': assess_risk(pattern)
    }
```
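The analyzer relies on `calculate_specificity` and `assess_risk`, which this document does not define. The sketch below shows one possible heuristic for both, assuming specificity means the share of the pattern made up of literal letters and digits; the scoring rule and the `0.5` / `0.7` cut-offs are illustrative assumptions, not the framework's actual values.

```python
import re


def calculate_specificity(pattern):
    """Hypothetical heuristic: share of the pattern that is literal text (0.0 to 1.0)."""
    # Drop leading inline flags such as (?i); they do not affect what is matched.
    body = re.sub(r"^\(\?[a-zA-Z]+\)", "", pattern)
    if not body:
        return 0.0
    # Remove escape sequences (\b, \s, \d, ...) so they are not counted as literals.
    without_escapes = re.sub(r"\\.", "", body)
    literal_chars = len(re.findall(r"[A-Za-z0-9]", without_escapes))
    return round(literal_chars / len(body), 2)


def assess_risk(pattern):
    """Hypothetical risk levels derived from specificity; thresholds are assumptions."""
    specificity = calculate_specificity(pattern)
    if ".*" in pattern and specificity < 0.5:
        return "high"
    if specificity < 0.7:
        return "medium"
    return "low"
```

Under this heuristic a bare `.*` scores `0.0` and is rated high risk, while a pattern made almost entirely of literal words is rated low risk.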
### **Pattern Coverage Test**
Tests pattern against comprehensive query variations:
```bash
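test_pattern_coverage() {
    local pattern="$1"
    shift  # Drop the pattern so only the queries remain in "$@"
    local test_queries=("$@")
    local matches=0
    local total=${#test_queries[@]}

    # Avoid division by zero when no queries are supplied
    if [[ $total -eq 0 ]]; then
        echo "❌ No test queries supplied"
        return 1
    fi

    # Note: [[ =~ ]] uses POSIX ERE, so inline flags such as (?i) are not supported here
    for query in "${test_queries[@]}"; do
        if [[ $query =~ $pattern ]]; then
            matches=$((matches + 1))
            echo "✅ Match: '$query'"
        else
            echo "❌ No match: '$query'"
        fi
    done

    local coverage=$((matches * 100 / total))
    echo "Pattern coverage: ${coverage}%"
    if [[ $coverage -lt 80 ]]; then
        echo "⚠️ Low coverage - consider expanding pattern"
    fi
}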
```
---
## 📊 **Coverage Analysis System**
### **Multi-Layer Coverage Calculator**
Calculates coverage across all three activation layers:
```python
def calculate_activation_coverage(skill_config):
    """Calculate comprehensive activation coverage"""
    keywords = skill_config['activation']['keywords']
    patterns = skill_config['activation']['patterns']
    description = skill_config['metadata']['description']

    # Layer 1: Keyword coverage
    keyword_coverage = {
        'total_keywords': len(keywords),
        'categories': categorize_keywords(keywords),
        'synonym_coverage': calculate_synonym_coverage(keywords),
        'natural_language_coverage': calculate_nl_coverage(keywords)
    }

    # Layer 2: Pattern coverage
    pattern_coverage = {
        'total_patterns': len(patterns),
        'pattern_types': categorize_patterns(patterns),
        'regex_complexity': calculate_pattern_complexity(patterns),
        'overlap_analysis': analyze_pattern_overlap(patterns)
    }

    # Layer 3: Description coverage
    description_coverage = {
        'keyword_density': calculate_keyword_density(description, keywords),
        'semantic_richness': analyze_semantic_content(description),
        'concept_coverage': extract_concepts(description)
    }

    # Overall coverage score
    overall_score = calculate_overall_coverage(
        keyword_coverage, pattern_coverage, description_coverage
    )

    return {
        'overall_score': overall_score,
        'keyword_coverage': keyword_coverage,
        'pattern_coverage': pattern_coverage,
        'description_coverage': description_coverage,
        'recommendations': generate_recommendations(overall_score)
    }
```
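The calculator calls `calculate_keyword_density(description, keywords)` without defining it. One plausible reading, sketched below as an assumption, is the fraction of configured keywords that appear somewhere in the description, ignoring case. The real helper may weight or tokenize differently.

```python
def calculate_keyword_density(description, keywords):
    """Hypothetical helper: fraction of keywords found in the description."""
    if not keywords:
        return 0.0
    text = description.lower()
    # Count each keyword at most once, using case-insensitive substring matching.
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)
```

For example, a description mentioning three of four configured keywords would score `0.75`, which flags the fourth keyword as a candidate for inclusion in the description text.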
### **Gap Identification**
Identifies gaps in activation coverage:
```python
def identify_activation_gaps(skill_config, test_results):
    """Identify gaps in activation coverage"""
    gaps = []

    # Analyze failed test queries
    failed_queries = [q for q in test_results if not q['activated']]

    # Categorize failures
    failure_categories = categorize_failures(failed_queries)

    # Identify missing keyword categories
    missing_categories = find_missing_keyword_categories(
        skill_config['activation']['keywords'],
        failure_categories
    )

    # Identify pattern weaknesses
    pattern_gaps = find_pattern_gaps(
        skill_config['activation']['patterns'],
        failed_queries
    )

    # Generate specific recommendations
    for category in missing_categories:
        gaps.append({
            'type': 'missing_keyword_category',
            'category': category,
            'suggestion': f"Add 5-10 keywords from {category} category"
        })

    for gap in pattern_gaps:
        gaps.append({
            'type': 'pattern_gap',
            'gap_type': gap['type'],
            'suggestion': gap['suggestion']
        })

    return gaps
```
---
## 🚀 **Automation Scripts**
### **Full Test Suite Runner**
```bash
#!/bin/bash
# run-full-test-suite.sh

run_full_test_suite() {
    local skill_path="$1"
    local output_dir="$2"

    echo "🧪 Running Full Activation Test Suite"
    echo "Skill: $skill_path"
    echo "Output: $output_dir"

    # 1. Parse skill configuration
    echo "📋 Parsing skill configuration..."
    parse_skill_config "$skill_path"

    # 2. Generate test cases
    echo "🎲 Generating test cases..."
    generate_all_test_cases "$skill_path"

    # 3. Run keyword tests
    echo "🔑 Testing keyword activation..."
    run_keyword_tests "$skill_path"

    # 4. Run pattern tests
    echo "🔍 Testing pattern matching..."
    run_pattern_tests "$skill_path"

    # 5. Run integration tests
    echo "🔗 Testing integration scenarios..."
    run_integration_tests "$skill_path"

    # 6. Run negative tests
    echo "🚫 Testing false positives..."
    run_negative_tests "$skill_path"

    # 7. Calculate coverage
    echo "📊 Calculating coverage..."
    calculate_coverage "$skill_path"

    # 8. Generate report
    echo "📄 Generating test report..."
    generate_test_report "$skill_path" "$output_dir"

    echo "✅ Test suite completed!"
    echo "📁 Report available at: $output_dir/activation-test-report.html"
}

# Invoke the function with the script's arguments; without this call
# running the script would only define the function and exit.
run_full_test_suite "$@"
```
### **Quick Validation Script**
```bash
#!/bin/bash
# quick-validation.sh

quick_validation() {
    local skill_path="$1"
    local keyword_count pattern_count

    echo "⚡ Quick Activation Validation"

    # Fast JSON validation
    if ! python3 -m json.tool "$skill_path/marketplace.json" > /dev/null 2>&1; then
        echo "❌ Invalid JSON in marketplace.json"
        return 1
    fi

    # Check required fields
    check_required_fields "$skill_path"

    # Validate regex patterns
    validate_patterns "$skill_path"

    # Quick keyword count check
    keyword_count=$(jq '.activation.keywords | length' "$skill_path/marketplace.json")
    if [[ $keyword_count -lt 20 ]]; then
        echo "⚠️ Low keyword count: $keyword_count (recommend 50+)"
    fi

    # Pattern count check
    pattern_count=$(jq '.activation.patterns | length' "$skill_path/marketplace.json")
    if [[ $pattern_count -lt 8 ]]; then
        echo "⚠️ Low pattern count: $pattern_count (recommend 10+)"
    fi

    echo "✅ Quick validation completed"
}

# Invoke the function with the script's arguments
quick_validation "$@"
```
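The script delegates regex checking to `validate_patterns`, which is not shown. A minimal sketch of such a check follows, assuming that "valid" means each entry in `activation.patterns` compiles. The function name `find_invalid_patterns` is hypothetical, and Python's `re` dialect may differ from the engine that evaluates the patterns at activation time.

```python
import json
import re


def find_invalid_patterns(config_text):
    """Return (pattern, error message) pairs for patterns that fail to compile."""
    config = json.loads(config_text)
    invalid = []
    # Missing sections are treated as "no patterns" rather than as an error.
    for pattern in config.get("activation", {}).get("patterns", []):
        try:
            re.compile(pattern)
        except re.error as exc:
            invalid.append((pattern, str(exc)))
    return invalid
```

An empty result means every pattern compiled. Any returned pair identifies the offending pattern together with the compiler's message, which can be echoed as a validation failure.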
---
## 📈 **Performance Monitoring**
### **Real-time Activation Monitor**
```python
from datetime import datetime


class ActivationMonitor:
    """Monitor skill activation performance in real-time"""

    def __init__(self, skill_name):
        self.skill_name = skill_name
        self.activation_log = []
        self.performance_metrics = {
            'total_activations': 0,
            'successful_activations': 0,
            'failed_activations': 0,
            'average_response_time': 0,
            'activation_by_layer': {
                'keywords': 0,
                'patterns': 0,
                'description': 0
            }
        }

    def log_activation(self, query, activated, layer, response_time):
        """Log activation attempt"""
        self.activation_log.append({
            'timestamp': datetime.now(),
            'query': query,
            'activated': activated,
            'layer': layer,
            'response_time': response_time
        })
        self.update_metrics(activated, layer, response_time)

    def update_metrics(self, activated, layer, response_time):
        """Update counters and the running average response time.

        This method is called above but was not defined in the original;
        the body here is one possible implementation.
        """
        metrics = self.performance_metrics
        metrics['total_activations'] += 1
        if activated:
            metrics['successful_activations'] += 1
            if layer in metrics['activation_by_layer']:
                metrics['activation_by_layer'][layer] += 1
        else:
            metrics['failed_activations'] += 1
        # Incremental mean avoids re-scanning the whole log
        count = metrics['total_activations']
        metrics['average_response_time'] += (
            (response_time - metrics['average_response_time']) / count
        )

    def calculate_reliability_score(self):
        """Calculate current reliability score"""
        if self.performance_metrics['total_activations'] == 0:
            return 0.0
        success_rate = (
            self.performance_metrics['successful_activations'] /
            self.performance_metrics['total_activations']
        )
        return success_rate

    def generate_alerts(self):
        """Generate performance alerts"""
        alerts = []
        reliability = self.calculate_reliability_score()
        if reliability < 0.95:
            alerts.append({
                'type': 'low_reliability',
                'message': f'Reliability dropped to {reliability:.2%}',
                'severity': 'high'
            })
        avg_response_time = self.performance_metrics['average_response_time']
        if avg_response_time > 5.0:
            alerts.append({
                'type': 'slow_response',
                'message': f'Average response time: {avg_response_time:.2f}s',
                'severity': 'medium'
            })
        return alerts
```
---
## 📋 **Usage Examples**
### **Example 1: Testing Stock Analyzer Skill**
```bash
# Run full test suite
./run-full-test-suite.sh \
/path/to/stock-analyzer-cskill \
/output/test-results
# Quick validation
./quick-validation.sh /path/to/stock-analyzer-cskill
# Monitor performance
./performance-benchmark.sh stock-analyzer-cskill
```
### **Example 2: Integration with Development Workflow**
```yaml
# .github/workflows/activation-testing.yml
name: Activation Testing
on: [push, pull_request]
jobs:
  test-activation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Activation Tests
        run: |
          ./references/tools/activation-tester/scripts/run-full-test-suite.sh \
            ./references/examples/stock-analyzer-cskill \
            ./test-results
      - name: Upload Test Results
        uses: actions/upload-artifact@v4
        with:
          name: activation-test-results
          path: ./test-results/
```
---
## ✅ **Quality Standards**
### **Test Coverage Requirements**
- [ ] 100% keyword coverage testing
- [ ] 95%+ pattern coverage validation
- [ ] All capability variations tested
- [ ] Edge cases documented and tested
- [ ] Negative testing for false positives
### **Performance Benchmarks**
- [ ] Activation reliability: 99.5%+
- [ ] False positive rate: <1%
- [ ] Test execution time: <30 seconds
- [ ] Memory usage: <100MB
- [ ] Response time: <2 seconds average
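The first two benchmarks can be checked mechanically from test results. The sketch below assumes each result is a dict with two booleans, `expected` (the query should activate the skill) and `activated` (it did). That record shape and the function name `evaluate_benchmarks` are assumptions for illustration. The default thresholds are the 99.5% and <1% targets listed above.

```python
def evaluate_benchmarks(results, min_reliability=0.995, max_false_positive_rate=0.01):
    """Compute reliability and false positive rate and compare them to the targets."""
    positives = [r for r in results if r["expected"]]
    negatives = [r for r in results if not r["expected"]]
    # Reliability: share of should-activate queries that actually activated.
    reliability = (
        sum(1 for r in positives if r["activated"]) / len(positives) if positives else 0.0
    )
    # False positive rate: share of should-not-activate queries that activated anyway.
    false_positive_rate = (
        sum(1 for r in negatives if r["activated"]) / len(negatives) if negatives else 0.0
    )
    return {
        "reliability": reliability,
        "false_positive_rate": false_positive_rate,
        "passed": (
            reliability >= min_reliability
            and false_positive_rate < max_false_positive_rate
        ),
    }
```

Separating positive and negative queries keeps the two rates independent, so a skill cannot meet the reliability target simply by activating on everything.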
### **Reporting Standards**
- [ ] Automated test report generation
- [ ] Performance metrics dashboard
- [ ] Historical trend analysis
- [ ] Actionable recommendations
- [ ] Integration with CI/CD pipeline
---
## 🔄 **Continuous Improvement**
### **Feedback Loop Integration**
1. **Collect** activation data from real usage
2. **Analyze** performance metrics and failure patterns
3. **Identify** optimization opportunities
4. **Implement** improvements to keywords/patterns
5. **Validate** improvements with automated testing
6. **Deploy** updated configurations
### **A/B Testing Framework**
- Test different keyword combinations
- Compare pattern performance
- Validate description effectiveness
- Measure user satisfaction impact
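A keyword A/B comparison can be approximated offline by replaying the same should-activate queries against two keyword sets. The sketch below is a simplified illustration: `activation_rate` and `compare_variants` are hypothetical names, and keyword matching is assumed to be a case-insensitive substring test, which may not match the real activation logic.

```python
def activation_rate(keywords, queries):
    """Share of queries containing at least one keyword (case-insensitive)."""
    if not queries:
        return 0.0
    lowered_keywords = [keyword.lower() for keyword in keywords]
    hits = sum(
        1 for query in queries
        if any(keyword in query.lower() for keyword in lowered_keywords)
    )
    return hits / len(queries)


def compare_variants(variant_a, variant_b, queries):
    """Compare two keyword sets on the same should-activate queries."""
    rate_a = activation_rate(variant_a, queries)
    rate_b = activation_rate(variant_b, queries)
    if rate_a == rate_b:
        winner = "tie"
    else:
        winner = "A" if rate_a > rate_b else "B"
    return {"rate_a": rate_a, "rate_b": rate_b, "winner": winner}
```

A higher activation rate is only an improvement if the false positive rate stays within target, so the winning variant should still be run through the negative tests before deployment.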
---
## 📚 **Additional Resources**
- `../activation-testing-guide.md` - Manual testing procedures
- `../activation-patterns-guide.md` - Pattern library
- `../phase4-detection.md` - Detection methodology
- `../synonym-expansion-system.md` - Keyword expansion
---
**Version:** 1.0
**Last Updated:** 2025-10-24
**Maintained By:** Agent-Skill-Creator Team