Initial commit
This commit is contained in:
1020
commands/apply-anthropic-skill-best-practices.md
Normal file
1020
commands/apply-anthropic-skill-best-practices.md
Normal file
File diff suppressed because it is too large
Load Diff
461
commands/create-command.md
Normal file
461
commands/create-command.md
Normal file
@@ -0,0 +1,461 @@
|
||||
---
|
||||
description: Interactive assistant for creating new Claude commands with proper structure, patterns, and MCP tool integration
|
||||
argument-hint: Optional command name or description of command purpose
|
||||
---
|
||||
|
||||
# Command Creator Assistant
|
||||
|
||||
<task>
|
||||
You are a command creation specialist. Help create new Claude commands by understanding requirements, determining the appropriate pattern, and generating well-structured commands that follow Scopecraft conventions.
|
||||
</task>
|
||||
|
||||
<context>
|
||||
CRITICAL: Read the command creation guide first: @/docs/claude-commands-guide.md
|
||||
|
||||
This meta-command helps create other commands by:
|
||||
|
||||
1. Understanding the command's purpose
|
||||
2. Determining its category and pattern
|
||||
3. Choosing command location (project vs user)
|
||||
4. Generating the command file
|
||||
5. Creating supporting resources
|
||||
6. Updating documentation
|
||||
</context>
|
||||
|
||||
<command_categories>
|
||||
|
||||
1. **Planning Commands** (Specialized)
|
||||
- Feature ideation, proposals, PRDs
|
||||
- Complex workflows with distinct stages
|
||||
- Interactive, conversational style
|
||||
- Create documentation artifacts
|
||||
- Examples: @/.claude/commands/01_brainstorm-feature.md
|
||||
@/.claude/commands/02_feature-proposal.md
|
||||
|
||||
2. **Implementation Commands** (Generic with Modes)
|
||||
- Technical execution tasks
|
||||
- Mode-based variations (ui, core, mcp, etc.)
|
||||
- Follow established patterns
|
||||
- Update task states
|
||||
- Example: @/.claude/commands/implement.md
|
||||
|
||||
3. **Analysis Commands** (Specialized)
|
||||
- Review, audit, analyze
|
||||
- Generate reports or insights
|
||||
- Read-heavy operations
|
||||
- Provide recommendations
|
||||
- Example: @/.claude/commands/review.md
|
||||
|
||||
4. **Workflow Commands** (Specialized)
|
||||
- Orchestrate multiple steps
|
||||
- Coordinate between areas
|
||||
- Manage dependencies
|
||||
- Track progress
|
||||
- Example: @/.claude/commands/04_feature-planning.md
|
||||
|
||||
5. **Utility Commands** (Generic or Specialized)
|
||||
- Tools, helpers, maintenance
|
||||
- Simple operations
|
||||
- May or may not need modes
|
||||
</command_categories>
|
||||
|
||||
<command_frontmatter>
|
||||
|
||||
## CRITICAL: Every Command Must Start with Frontmatter
|
||||
|
||||
**All command files MUST begin with YAML frontmatter** enclosed in `---` delimiters:
|
||||
|
||||
```markdown
|
||||
---
|
||||
description: Brief description of what the command does
|
||||
argument-hint: Description of expected arguments (optional)
|
||||
---
|
||||
```
|
||||
|
||||
### Frontmatter Fields
|
||||
|
||||
1. **`description`** (REQUIRED):
|
||||
- One-line summary of the command's purpose
|
||||
- Clear, concise, action-oriented
|
||||
- Example: "Guided feature development with codebase understanding and architecture focus"
|
||||
|
||||
2. **`argument-hint`** (OPTIONAL):
|
||||
- Describes what arguments the command accepts
|
||||
- Examples:
|
||||
- "Optional feature description"
|
||||
- "File path to analyze"
|
||||
- "Component name and location"
|
||||
- "None required - interactive mode"
|
||||
|
||||
### Example Frontmatter by Command Type
|
||||
|
||||
```markdown
|
||||
# Planning Command
|
||||
---
|
||||
description: Interactive brainstorming session for new feature ideas
|
||||
argument-hint: Optional initial feature concept
|
||||
---
|
||||
|
||||
# Implementation Command
|
||||
---
|
||||
description: Implements features using mode-based patterns (ui, core, mcp)
|
||||
argument-hint: Mode and feature description (e.g., 'ui: add dark mode toggle')
|
||||
---
|
||||
|
||||
# Analysis Command
|
||||
---
|
||||
description: Comprehensive code review with quality assessment
|
||||
argument-hint: Optional file or directory path to review
|
||||
---
|
||||
|
||||
# Utility Command
|
||||
---
|
||||
description: Validates API documentation against OpenAPI standards
|
||||
argument-hint: Path to OpenAPI spec file
|
||||
---
|
||||
```
|
||||
|
||||
### Placement
|
||||
|
||||
- Frontmatter MUST be the **very first content** in the file
|
||||
- No blank lines before the opening `---`
|
||||
- One blank line after the closing `---` before content begins
|
||||
</command_frontmatter>
|
||||
|
||||
<pattern_research>
|
||||
|
||||
## Before Creating: Study Similar Commands
|
||||
|
||||
1. **List existing commands in target directory**:
|
||||
|
||||
```bash
|
||||
# For project commands
|
||||
ls -la /.claude/commands/
|
||||
|
||||
# For user commands
|
||||
ls -la ~/.claude/commands/
|
||||
```
|
||||
|
||||
2. **Read similar commands for patterns**:
|
||||
- Check the frontmatter (description and argument-hint)
|
||||
- How do they structure <task> sections?
|
||||
- What MCP tools do they use?
|
||||
- How do they handle arguments?
|
||||
- What documentation do they reference?
|
||||
|
||||
3. **Common patterns to look for**:
|
||||
|
||||
```markdown
|
||||
# MCP tool usage for tasks
|
||||
Use tool: mcp__scopecraft-cmd__task_create
|
||||
Use tool: mcp__scopecraft-cmd__task_update
|
||||
Use tool: mcp__scopecraft-cmd__task_list
|
||||
|
||||
# NOT CLI commands
|
||||
❌ Run: scopecraft task list
|
||||
✅ Use tool: mcp__scopecraft-cmd__task_list
|
||||
```
|
||||
|
||||
4. **Standard references to include**:
|
||||
- @/docs/organizational-structure-guide.md
|
||||
- @/docs/command-resources/{relevant-templates}
|
||||
- @/docs/claude-commands-guide.md
|
||||
</pattern_research>
|
||||
|
||||
<interview_process>
|
||||
|
||||
## Phase 1: Understanding Purpose
|
||||
|
||||
"Let's create a new command. First, let me check what similar commands exist..."
|
||||
|
||||
*Use Glob to find existing commands in the target category*
|
||||
|
||||
"Based on existing patterns, please describe:"
|
||||
|
||||
1. What problem does this command solve?
|
||||
2. Who will use it and when?
|
||||
3. What's the expected output?
|
||||
4. Is it interactive or batch?
|
||||
|
||||
## Phase 2: Category Classification
|
||||
|
||||
Based on responses and existing examples:
|
||||
|
||||
- Is this like existing planning commands? (Check: brainstorm-feature, feature-proposal)
|
||||
- Is this like implementation commands? (Check: implement.md)
|
||||
- Does it need mode variations?
|
||||
- Should it follow analysis patterns? (Check: review.md)
|
||||
|
||||
## Phase 3: Pattern Selection
|
||||
|
||||
**Study similar commands first**:
|
||||
|
||||
```markdown
|
||||
# Read a similar command
|
||||
@{similar-command-path}
|
||||
|
||||
# Note patterns:
|
||||
- Task description style
|
||||
- Argument handling
|
||||
- MCP tool usage
|
||||
- Documentation references
|
||||
- Human review sections
|
||||
```
|
||||
|
||||
## Phase 4: Command Location
|
||||
|
||||
🎯 **Critical Decision: Where should this command live?**
|
||||
|
||||
**Project Command** (`/.claude/commands/`)
|
||||
|
||||
- Specific to this project's workflow
|
||||
- Uses project conventions
|
||||
- References project documentation
|
||||
- Integrates with project MCP tools
|
||||
|
||||
**User Command** (`~/.claude/commands/`)
|
||||
|
||||
- General-purpose utility
|
||||
- Reusable across projects
|
||||
- Personal productivity tool
|
||||
- Not project-specific
|
||||
|
||||
Ask: "Should this be:
|
||||
|
||||
1. A project command (specific to this codebase)
|
||||
2. A user command (available in all projects)?"
|
||||
|
||||
## Phase 5: Resource Planning
|
||||
|
||||
Check existing resources:
|
||||
|
||||
```bash
|
||||
# Check templates
|
||||
ls -la /docs/command-resources/planning-templates/
|
||||
ls -la /docs/command-resources/implement-modes/
|
||||
|
||||
# Check which guides exist
|
||||
ls -la /docs/
|
||||
```
|
||||
|
||||
</interview_process>
|
||||
|
||||
<generation_patterns>
|
||||
|
||||
## Critical: Copy Patterns from Similar Commands
|
||||
|
||||
Before generating, read similar commands and note:
|
||||
|
||||
1. **Frontmatter (MUST BE FIRST)**:
|
||||
|
||||
```markdown
|
||||
---
|
||||
description: Clear one-line description of command purpose
|
||||
argument-hint: What arguments does it accept
|
||||
---
|
||||
```
|
||||
|
||||
- No blank lines before opening `---`
|
||||
- One blank line after closing `---`
|
||||
- `description` is REQUIRED
|
||||
- `argument-hint` is OPTIONAL
|
||||
|
||||
2. **MCP Tool Usage**:
|
||||
|
||||
```markdown
|
||||
# From existing commands
|
||||
Use mcp__scopecraft-cmd__task_create
|
||||
Use mcp__scopecraft-cmd__feature_get
|
||||
Use mcp__scopecraft-cmd__phase_list
|
||||
```
|
||||
|
||||
3. **Standard References**:
|
||||
|
||||
```markdown
|
||||
<context>
|
||||
Key Reference: @/docs/organizational-structure-guide.md
|
||||
Template: @/docs/command-resources/planning-templates/{template}.md
|
||||
Guide: @/docs/claude-commands-guide.md
|
||||
</context>
|
||||
```
|
||||
|
||||
4. **Task Update Patterns**:
|
||||
|
||||
```markdown
|
||||
<task_updates>
|
||||
After implementation:
|
||||
1. Update task status to appropriate state
|
||||
2. Add implementation log entries
|
||||
3. Mark checklist items as complete
|
||||
4. Document any decisions made
|
||||
</task_updates>
|
||||
```
|
||||
|
||||
5. **Human Review Sections**:
|
||||
|
||||
```markdown
|
||||
<human_review_needed>
|
||||
Flag decisions needing verification:
|
||||
- [ ] Assumptions about workflows
|
||||
- [ ] Technical approach choices
|
||||
- [ ] Pattern-based suggestions
|
||||
</human_review_needed>
|
||||
```
|
||||
|
||||
</generation_patterns>
|
||||
|
||||
<implementation_steps>
|
||||
|
||||
1. **Create Command File**
|
||||
- Determine location based on project/user choice
|
||||
- Generate content following established patterns
|
||||
- Include all required sections
|
||||
|
||||
2. **Create Supporting Files** (if project command)
|
||||
- Templates in `/docs/command-resources/`
|
||||
- Mode guides if generic command
|
||||
- Example documentation
|
||||
|
||||
3. **Update Documentation** (if project command)
|
||||
- Add to claude-commands-guide.md
|
||||
- Update feature-development-workflow.md if workflow command
|
||||
- Add to README if user-facing
|
||||
|
||||
4. **Test the Command**
|
||||
- Create example usage scenarios
|
||||
- Verify argument handling
|
||||
- Check MCP tool integration
|
||||
</implementation_steps>
|
||||
|
||||
<creation_checklist>
|
||||
Before finalizing:
|
||||
|
||||
- [ ] **Includes YAML frontmatter** with description (required) and argument-hint (optional)
|
||||
- [ ] Frontmatter is the very first content (no blank lines before opening `---`)
|
||||
- [ ] Studied similar commands in the category
|
||||
- [ ] Command follows naming conventions (use numeric prefix for ordered workflows)
|
||||
- [ ] Includes proper task/context structure
|
||||
- [ ] References @organizational-structure-guide.md
|
||||
- [ ] Uses MCP tools (not CLI) - check existing patterns
|
||||
- [ ] Includes human review sections
|
||||
- [ ] Has clear examples like other commands
|
||||
- [ ] Updates task states appropriately
|
||||
- [ ] Creates proper documentation
|
||||
- [ ] Follows established patterns from similar commands
|
||||
- [ ] Correct command prefix (project: or user:)
|
||||
</creation_checklist>
|
||||
|
||||
<example_session>
|
||||
User: "I need a command to help validate our API documentation"
|
||||
|
||||
🔍 **Research**: Let me check existing analysis commands...
|
||||
|
||||
*Use Read tool to examine: /.claude/commands/review.md*
|
||||
|
||||
I notice the review command:
|
||||
|
||||
- Uses MCP tools for task operations
|
||||
- Includes human review sections
|
||||
- References organizational structure
|
||||
- Has clear output formatting
|
||||
|
||||
🤔 **Question**: Can you tell me more about this API documentation validation?
|
||||
|
||||
- What format is the documentation in?
|
||||
- What aspects need validation?
|
||||
- Should it create tasks for issues found?
|
||||
|
||||
User: "It's OpenAPI specs, need to check for completeness and consistency"
|
||||
|
||||
💡 **Category**: This is an Analysis command similar to 'review'.
|
||||
|
||||
🔍 **Pattern Check**: Looking at review.md, I see it:
|
||||
|
||||
```markdown
|
||||
<task>
|
||||
You are a code reviewer conducting a comprehensive review...
|
||||
</task>
|
||||
|
||||
<mcp_usage>
|
||||
Always use MCP tools:
|
||||
- mcp__scopecraft-cmd__task_list
|
||||
- mcp__scopecraft-cmd__task_update
|
||||
</mcp_usage>
|
||||
```
|
||||
|
||||
🎯 **Location Question**: Should this be:
|
||||
|
||||
1. A project command (specific to this API project)
|
||||
2. A user command (useful for all your API projects)
|
||||
|
||||
User: "Project command - it needs to reference our specific API standards"
|
||||
|
||||
✅ Creating project command: `/.claude/commands/validate-api.md`
|
||||
|
||||
Generated command (following review.md patterns):
|
||||
|
||||
```markdown
|
||||
---
|
||||
description: Validates API documentation against OpenAPI standards for completeness and consistency
|
||||
argument-hint: Path to OpenAPI spec file (optional, will search if not provided)
|
||||
---
|
||||
|
||||
<task>
|
||||
You are an API documentation validator reviewing OpenAPI specifications for completeness and consistency.
|
||||
</task>
|
||||
|
||||
<context>
|
||||
References:
|
||||
- API Standards: @/docs/api-standards.md
|
||||
- Organizational Structure: @/docs/organizational-structure-guide.md
|
||||
Similar to: @/.claude/commands/review.md
|
||||
</context>
|
||||
|
||||
<validation_process>
|
||||
1. Load OpenAPI spec files
|
||||
2. Check required endpoints documented
|
||||
3. Validate response schemas
|
||||
4. Verify authentication documented
|
||||
5. Check for missing examples
|
||||
</validation_process>
|
||||
|
||||
<mcp_usage>
|
||||
If issues found, create tasks:
|
||||
- Use tool: mcp__scopecraft-cmd__task_create
|
||||
- Type: "bug" or "documentation"
|
||||
- Phase: Current active phase
|
||||
- Area: "docs" or "api"
|
||||
</mcp_usage>
|
||||
|
||||
<human_review_needed>
|
||||
Flag for manual review:
|
||||
- [ ] Breaking changes detected
|
||||
- [ ] Security implications unclear
|
||||
- [ ] Business logic assumptions
|
||||
</human_review_needed>
|
||||
```
|
||||
|
||||
</example_session>
|
||||
|
||||
<final_output>
|
||||
After gathering all information:
|
||||
|
||||
1. **Command Created**:
|
||||
- Location: {chosen location}
|
||||
- Name: {command-name}
|
||||
- Category: {category}
|
||||
- Pattern: {specialized/generic}
|
||||
|
||||
2. **Resources Created**:
|
||||
- Supporting templates: {list}
|
||||
- Documentation updates: {list}
|
||||
|
||||
3. **Usage Instructions**:
|
||||
- Command: `/{prefix}:{name}`
|
||||
- Example: {example usage}
|
||||
|
||||
4. **Next Steps**:
|
||||
- Test the command
|
||||
- Refine based on usage
|
||||
- Add to command documentation
|
||||
</final_output>
|
||||
219
commands/create-hook.md
Normal file
219
commands/create-hook.md
Normal file
@@ -0,0 +1,219 @@
|
||||
---
|
||||
description: Create and configure git hooks with intelligent project analysis, suggestions, and automated testing
|
||||
argument-hint: Optional hook type or description of desired behavior
|
||||
---
|
||||
|
||||
# Create Hook Command
|
||||
|
||||
Analyze the project, suggest practical hooks, and create them with proper testing.
|
||||
|
||||
## Your Task (/create-hook)
|
||||
|
||||
1. **Analyze environment** - Detect tooling and existing hooks
|
||||
2. **Suggest hooks** - Based on your project configuration
|
||||
3. **Configure hook** - Ask targeted questions and create the script
|
||||
4. **Test & validate** - Ensure the hook works correctly
|
||||
|
||||
## Your Workflow
|
||||
|
||||
### 1. Environment Analysis & Suggestions
|
||||
|
||||
Automatically detect the project tooling and suggest relevant hooks:
|
||||
|
||||
**When TypeScript is detected (`tsconfig.json`):**
|
||||
|
||||
- PostToolUse hook: "Type-check files after editing"
|
||||
- PreToolUse hook: "Block edits with type errors"
|
||||
|
||||
**When Prettier is detected (`.prettierrc`, `prettier.config.js`):**
|
||||
|
||||
- PostToolUse hook: "Auto-format files after editing"
|
||||
- PreToolUse hook: "Require formatted code"
|
||||
|
||||
**When ESLint is detected (`.eslintrc.*`):**
|
||||
|
||||
- PostToolUse hook: "Lint and auto-fix after editing"
|
||||
- PreToolUse hook: "Block commits with linting errors"
|
||||
|
||||
**When package.json has scripts:**
|
||||
|
||||
- `test` script → "Run tests before commits"
|
||||
- `build` script → "Validate build before commits"
|
||||
|
||||
**When a git repository is detected:**
|
||||
|
||||
- PreToolUse/Bash hook: "Prevent commits with secrets"
|
||||
- PostToolUse hook: "Security scan on file changes"
|
||||
|
||||
**Decision Tree:**
|
||||
|
||||
```
|
||||
Project has TypeScript? → Suggest type checking hooks
|
||||
Project has formatter? → Suggest formatting hooks
|
||||
Project has tests? → Suggest test validation hooks
|
||||
Security sensitive? → Suggest security hooks
|
||||
+ Scan for additional patterns and suggest custom hooks based on:
|
||||
- Custom scripts in package.json
|
||||
- Unique file patterns or extensions
|
||||
- Development workflow indicators
|
||||
- Project-specific tooling configurations
|
||||
```
|
||||
|
||||
### 2. Hook Configuration
|
||||
|
||||
Start by asking: **"What should this hook do?"** and offer relevant suggestions from your analysis.
|
||||
|
||||
Then understand the context from the user's description and **only ask about details you're unsure about**:
|
||||
|
||||
1. **Trigger timing**: When should it run?
|
||||
- `PreToolUse`: Before file operations (can block)
|
||||
- `PostToolUse`: After file operations (feedback/fixes)
|
||||
- `UserPromptSubmit`: Before processing requests
|
||||
- Other event types as needed
|
||||
|
||||
2. **Tool matcher**: Which tools should trigger it? (`Write`, `Edit`, `Bash`, `*` etc)
|
||||
|
||||
3. **Scope**: `global`, `project`, or `project-local`
|
||||
|
||||
4. **Response approach**:
|
||||
- **Exit codes only**: Simple (exit 0 = success, exit 2 = block in PreToolUse)
|
||||
- **JSON response**: Advanced control (blocking, context, decisions)
|
||||
- Guide based on complexity: simple pass/fail → exit codes, rich feedback → JSON
|
||||
|
||||
5. **Blocking behavior** (if relevant): "Should this stop operations when issues are found?"
|
||||
- PreToolUse: Can block operations (security, validation)
|
||||
- PostToolUse: Usually provide feedback only
|
||||
|
||||
6. **Claude integration** (CRITICAL): "Should Claude Code automatically see and fix issues this hook detects?"
|
||||
- If YES: Use `additionalContext` for error communication
|
||||
- If NO: Use `suppressOutput: true` for silent operation
|
||||
|
||||
7. **Context pollution**: "Should successful operations be silent to avoid noise?"
|
||||
- Recommend YES for formatting, routine checks
|
||||
- Recommend NO for security alerts, critical errors
|
||||
|
||||
8. **File filtering**: "What file types should this hook process?"
|
||||
|
||||
### 3. Hook Creation
|
||||
|
||||
You should:
|
||||
|
||||
- **Create hooks directory**: `~/.claude/hooks/` or `.claude/hooks/` based on scope
|
||||
- **Generate script**: Create hook script with:
|
||||
- Proper shebang and executable permissions
|
||||
- Project-specific commands (use detected config paths)
|
||||
- Comments explaining the hook's purpose
|
||||
- **Update settings**: Add hook configuration to appropriate settings.json
|
||||
- **Use absolute paths**: Avoid relative paths to scripts and executables. Use `$CLAUDE_PROJECT_DIR` to reference project root
|
||||
- **Offer validation**: Ask if the user wants you to test the hook
|
||||
|
||||
**Key Implementation Standards:**
|
||||
|
||||
- Read JSON from stdin (never use argv)
|
||||
- Use top-level `additionalContext`/`systemMessage` for Claude communication
|
||||
- Include `suppressOutput: true` for successful operations
|
||||
- Provide specific error counts and actionable feedback
|
||||
- Focus on changed files rather than entire codebase
|
||||
- Support common development workflows
|
||||
|
||||
**⚠️ CRITICAL: Input/Output Format**
|
||||
|
||||
This is where most hook implementations fail. Pay extra attention to:
|
||||
|
||||
- **Input**: Reading JSON from stdin correctly (not argv)
|
||||
- **Output**: Using correct top-level JSON structure for Claude communication
|
||||
- **Documentation**: Consulting official docs for exact schemas when in doubt
|
||||
|
||||
### 4. Testing & Validation
|
||||
|
||||
**CRITICAL: Test both happy and sad paths:**
|
||||
|
||||
**Happy Path Testing:**
|
||||
|
||||
1. **Test expected success scenario** - Create conditions where hook should pass
|
||||
- _Examples_: TypeScript (valid code), Linting (formatted code), Security (safe commands)
|
||||
|
||||
**Sad Path Testing:** 2. **Test expected failure scenario** - Create conditions where hook should fail/warn
|
||||
|
||||
- _Examples_: TypeScript (type errors), Linting (unformatted code), Security (dangerous operations)
|
||||
|
||||
**Verification Steps:** 3. **Verify expected behavior**: Check if it blocks/warns/provides context as intended
|
||||
|
||||
**Example Testing Process:**
|
||||
|
||||
- For a hook preventing file deletion: Create a test file, attempt the protected action, and verify the hook prevents it
|
||||
|
||||
**If Issues Occur, you should:**
|
||||
|
||||
- Check hook registration in settings
|
||||
- Verify script permissions (`chmod +x`)
|
||||
- Test with simplified version first
|
||||
- Debug with detailed hook execution analysis
|
||||
|
||||
## Hook Templates
|
||||
|
||||
### Type Checking (PostToolUse)
|
||||
|
||||
```
|
||||
#!/usr/bin/env node
|
||||
// Read stdin JSON, check .ts/.tsx files only
|
||||
// Run: npx tsc --noEmit --pretty
|
||||
// Output: JSON with additionalContext for errors
|
||||
```
|
||||
|
||||
### Auto-formatting (PostToolUse)
|
||||
|
||||
```
|
||||
#!/usr/bin/env node
|
||||
// Read stdin JSON, check supported file types
|
||||
// Run: npx prettier --write [file]
|
||||
// Output: JSON with suppressOutput: true
|
||||
```
|
||||
|
||||
### Security Scanning (PreToolUse)
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Read stdin JSON, check for secrets/keys
|
||||
# Block if dangerous patterns found
|
||||
# Exit 2 to block, 0 to continue
|
||||
```
|
||||
|
||||
_Complete templates available at: https://docs.claude.com/en/docs/claude-code/hooks#examples_
|
||||
|
||||
## Quick Reference
|
||||
|
||||
**📖 Official Docs**: https://docs.claude.com/en/docs/claude-code/hooks.md
|
||||
|
||||
**Common Patterns:**
|
||||
|
||||
- **stdin input**: `JSON.parse(process.stdin.read())`
|
||||
- **File filtering**: Check extensions before processing
|
||||
- **Success response**: `{continue: true, suppressOutput: true}`
|
||||
- **Error response**: `{continue: true, additionalContext: "error details"}`
|
||||
- **Block operation**: `exit(2)` in PreToolUse hooks
|
||||
|
||||
**Hook Types by Use Case:**
|
||||
|
||||
- **Code Quality**: PostToolUse for feedback and fixes
|
||||
- **Security**: PreToolUse to block dangerous operations
|
||||
- **CI/CD**: PreToolUse to validate before commits
|
||||
- **Development**: PostToolUse for automated improvements
|
||||
|
||||
**Hook Execution Best Practices:**
|
||||
|
||||
- **Hooks run in parallel** according to official documentation
|
||||
- **Design for independence** since execution order isn't guaranteed
|
||||
- **Plan hook interactions carefully** when multiple hooks affect the same files
|
||||
|
||||
## Success Criteria
|
||||
|
||||
✅ **Hook created successfully when:**
|
||||
|
||||
- Script has executable permissions
|
||||
- Registered in correct settings.json
|
||||
- Responds correctly to test scenarios
|
||||
- Integrates properly with Claude for automated fixes
|
||||
- Follows project conventions and detected tooling
|
||||
|
||||
**Result**: The user gets a working hook that enhances their development workflow with intelligent automation and quality checks.
|
||||
875
commands/create-skill.md
Normal file
875
commands/create-skill.md
Normal file
@@ -0,0 +1,875 @@
|
||||
---
|
||||
name: create-skill
|
||||
description: Guide for creating effective skills. This command should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations. Use when creating new skills, editing existing skills, or verifying skills work before deployment - applies TDD to process documentation by testing with subagents before writing, iterating until bulletproof against rationalization
|
||||
|
||||
---
|
||||
|
||||
# Create Skill Command
|
||||
|
||||
This command provides guidance for creating effective skills.
|
||||
|
||||
## Overview
|
||||
|
||||
**Writing skills IS Test-Driven Development applied to process documentation.**
|
||||
|
||||
**Personal skills live in agent-specific directories (`~/.claude/skills` for Claude Code, `~/.codex/skills` for Codex)**
|
||||
|
||||
You write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
|
||||
|
||||
**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
|
||||
|
||||
**REQUIRED BACKGROUND:** You MUST understand Test-Driven Development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill adapts TDD to documentation.
|
||||
|
||||
**Official guidance:** The Anthropic's official skill authoring best practices provided at the `/customaize-agent:apply-anthropic-skill-best-practices` command, they enhance `customize-agent:prompt-engineering` skill. Use skill and the document, as they not copy but add to each other. These document provides additional patterns and guidelines that complement the TDD-focused approach in this skill.
|
||||
|
||||
## About Skills
|
||||
|
||||
Skills are modular, self-contained packages that extend Claude's capabilities by providing
|
||||
specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific
|
||||
domains or tasks—they transform Claude from a general-purpose agent into a specialized agent
|
||||
equipped with procedural knowledge that no model can fully possess.
|
||||
|
||||
## What is a Skill?
|
||||
|
||||
A **skill** is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.
|
||||
|
||||
**Skills are:** Reusable techniques, patterns, tools, reference guides
|
||||
|
||||
**Skills are NOT:** Narratives about how you solved a problem once
|
||||
|
||||
### What Skills Provide
|
||||
|
||||
1. Specialized workflows - Multi-step procedures for specific domains
|
||||
2. Tool integrations - Instructions for working with specific file formats or APIs
|
||||
3. Domain expertise - Company-specific knowledge, schemas, business logic
|
||||
4. Bundled resources - Scripts, references, and assets for complex and repetitive tasks
|
||||
|
||||
## TDD Mapping for Skills
|
||||
|
||||
| TDD Concept | Skill Creation |
|
||||
|-------------|----------------|
|
||||
| **Test case** | Pressure scenario with subagent |
|
||||
| **Production code** | Skill document (SKILL.md) |
|
||||
| **Test fails (RED)** | Agent violates rule without skill (baseline) |
|
||||
| **Test passes (GREEN)** | Agent complies with skill present |
|
||||
| **Refactor** | Close loopholes while maintaining compliance |
|
||||
| **Write test first** | Run baseline scenario BEFORE writing skill |
|
||||
| **Watch it fail** | Document exact rationalizations agent uses |
|
||||
| **Minimal code** | Write skill addressing those specific violations |
|
||||
| **Watch it pass** | Verify agent now complies |
|
||||
| **Refactor cycle** | Find new rationalizations → plug → re-verify |
|
||||
|
||||
The entire skill creation process follows RED-GREEN-REFACTOR.
|
||||
|
||||
## When to Create a Skill
|
||||
|
||||
**Create when:**
|
||||
|
||||
- Technique wasn't intuitively obvious to you
|
||||
- You'd reference this again across projects
|
||||
- Pattern applies broadly (not project-specific)
|
||||
- Others would benefit
|
||||
|
||||
**Don't create for:**
|
||||
|
||||
- One-off solutions
|
||||
- Standard practices well-documented elsewhere
|
||||
- Project-specific conventions (put in CLAUDE.md)
|
||||
|
||||
## Skill Types
|
||||
|
||||
### Technique
|
||||
|
||||
Concrete method with steps to follow (condition-based-waiting, root-cause-tracing)
|
||||
|
||||
### Pattern
|
||||
|
||||
Way of thinking about problems (flatten-with-flags, test-invariants)
|
||||
|
||||
### Reference
|
||||
|
||||
API docs, syntax guides, tool documentation (office docs)
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
skills/
|
||||
skill-name/
|
||||
SKILL.md # Main reference (required)
|
||||
supporting-file.* # Only if needed
|
||||
```
|
||||
|
||||
**Flat namespace** - all skills in one searchable namespace
|
||||
|
||||
**Separate files for:**
|
||||
|
||||
1. **Heavy reference** (100+ lines) - API docs, comprehensive syntax
|
||||
2. **Reusable tools** - Scripts, utilities, templates
|
||||
|
||||
**Keep inline:**
|
||||
|
||||
- Principles and concepts
|
||||
- Code patterns (< 50 lines)
|
||||
- Everything else
|
||||
|
||||
## Anatomy of a Skill
|
||||
|
||||
Every skill consists of a required SKILL.md file and optional bundled resources:
|
||||
|
||||
```
|
||||
skill-name/
|
||||
├── SKILL.md (required)
|
||||
│ ├── YAML frontmatter metadata (required)
|
||||
│ │ ├── name: (required)
|
||||
│ │ └── description: (required)
|
||||
│ └── Markdown instructions (required)
|
||||
└── Bundled Resources (optional)
|
||||
├── scripts/ - Executable code (Python/Bash/etc.)
|
||||
├── references/ - Documentation intended to be loaded into context as needed
|
||||
└── assets/ - Files used in output (templates, icons, fonts, etc.)
|
||||
```
|
||||
|
||||
## SKILL.md (required)
|
||||
|
||||
**Metadata Quality:** The `name` and `description` in YAML frontmatter determine when Claude will use the skill. Be specific about what the skill does and when to use it. Use the third-person (e.g. "This skill should be used when..." instead of "Use this skill when...").
|
||||
|
||||
### SKILL.md Structure
|
||||
|
||||
**Frontmatter (YAML):**
|
||||
|
||||
- Only two fields supported: `name` and `description`
|
||||
- Max 1024 characters total
|
||||
- `name`: Use letters, numbers, and hyphens only (no parentheses, special chars)
|
||||
- `description`: Third-person, includes BOTH what it does AND when to use it
|
||||
- Start with "Use when..." to focus on triggering conditions
|
||||
- Include specific symptoms, situations, and contexts
|
||||
- Keep under 500 characters if possible
|
||||
|
||||
```markdown
|
||||
---
|
||||
name: Skill-Name-With-Hyphens
|
||||
description: Use when [specific triggering conditions and symptoms] - [what the skill does and how it helps, written in third person]
|
||||
---
|
||||
|
||||
# Skill Name
|
||||
|
||||
## Overview
|
||||
What is this? Core principle in 1-2 sentences.
|
||||
|
||||
## When to Use
|
||||
[Small inline flowchart IF decision non-obvious]
|
||||
|
||||
Bullet list with SYMPTOMS and use cases
|
||||
When NOT to use
|
||||
|
||||
## Core Pattern (for techniques/patterns)
|
||||
Before/after code comparison
|
||||
|
||||
## Quick Reference
|
||||
Table or bullets for scanning common operations
|
||||
|
||||
## Implementation
|
||||
Inline code for simple patterns
|
||||
Link to file for heavy reference or reusable tools
|
||||
|
||||
## Common Mistakes
|
||||
What goes wrong + fixes
|
||||
|
||||
## Real-World Impact (optional)
|
||||
Concrete results
|
||||
```
|
||||
|
||||
#### Bundled Resources (optional)
|
||||
|
||||
##### Scripts (`scripts/`)
|
||||
|
||||
Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
|
||||
|
||||
- **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
|
||||
- **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
|
||||
- **Benefits**: Token efficient, deterministic, may be executed without loading into context
|
||||
- **Note**: Scripts may still need to be read by Claude for patching or environment-specific adjustments
|
||||
|
||||
##### References (`references/`)
|
||||
|
||||
Documentation and reference material intended to be loaded as needed into context to inform Claude's process and thinking.
|
||||
|
||||
- **When to include**: For documentation that Claude should reference while working
|
||||
- **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
|
||||
- **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
|
||||
- **Benefits**: Keeps SKILL.md lean, loaded only when Claude determines it's needed
|
||||
- **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
|
||||
- **Avoid duplication**: Information should live in either SKILL.md or references files, not both. Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.
|
||||
|
||||
##### Assets (`assets/`)
|
||||
|
||||
Files not intended to be loaded into context, but rather used within the output Claude produces.
|
||||
|
||||
- **When to include**: When the skill needs files that will be used in the final output
|
||||
- **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
|
||||
- **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
|
||||
- **Benefits**: Separates output resources from documentation, enables Claude to use files without loading them into context
|
||||
|
||||
### Progressive Disclosure Design Principle
|
||||
|
||||
Skills use a three-level loading system to manage context efficiently:
|
||||
|
||||
1. **Metadata (name + description)** - Always in context (~100 words)
|
||||
2. **SKILL.md body** - When skill triggers (<5k words)
|
||||
3. **Bundled resources** - As needed by Claude (Unlimited*)
|
||||
|
||||
*Unlimited because scripts can be executed without reading into context window.
|
||||
|
||||
## Claude Search Optimization (CSO)
|
||||
|
||||
**Critical for discovery:** Future Claude needs to FIND your skill
|
||||
|
||||
### 1. Rich Description Field
|
||||
|
||||
**Purpose:** Claude reads description to decide which skills to load for a given task. Make it answer: "Should I read this skill right now?"
|
||||
|
||||
**Format:** Start with "Use when..." to focus on triggering conditions, then explain what it does
|
||||
|
||||
**Content:**
|
||||
|
||||
- Use concrete triggers, symptoms, and situations that signal this skill applies
|
||||
- Describe the *problem* (race conditions, inconsistent behavior) not *language-specific symptoms* (setTimeout, sleep)
|
||||
- Keep triggers technology-agnostic unless the skill itself is technology-specific
|
||||
- If skill is technology-specific, make that explicit in the trigger
|
||||
- Write in third person (injected into system prompt)
|
||||
|
||||
```yaml
|
||||
# ❌ BAD: Too abstract, vague, doesn't include when to use
|
||||
description: For async testing
|
||||
|
||||
# ❌ BAD: First person
|
||||
description: I can help you with async tests when they're flaky
|
||||
|
||||
# ❌ BAD: Mentions technology but skill isn't specific to it
|
||||
description: Use when tests use setTimeout/sleep and are flaky
|
||||
|
||||
# ✅ GOOD: Starts with "Use when", describes problem, then what it does
|
||||
description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently - replaces arbitrary timeouts with condition polling for reliable async tests
|
||||
|
||||
# ✅ GOOD: Technology-specific skill with explicit trigger
|
||||
description: Use when using React Router and handling authentication redirects - provides patterns for protected routes and auth state management
|
||||
```
|
||||
|
||||
### 2. Keyword Coverage
|
||||
|
||||
Use words Claude would search for:
|
||||
|
||||
- Error messages: "Hook timed out", "ENOTEMPTY", "race condition"
|
||||
- Symptoms: "flaky", "hanging", "zombie", "pollution"
|
||||
- Synonyms: "timeout/hang/freeze", "cleanup/teardown/afterEach"
|
||||
- Tools: Actual commands, library names, file types
|
||||
|
||||
### 3. Descriptive Naming
|
||||
|
||||
**Use active voice, verb-first:**
|
||||
|
||||
- ✅ `creating-skills` not `skill-creation`
|
||||
- ✅ `testing-skills-with-subagents` not `subagent-skill-testing`
|
||||
|
||||
### 4. Token Efficiency (Critical)
|
||||
|
||||
**Problem:** getting-started and frequently-referenced skills load into EVERY conversation. Every token counts.
|
||||
|
||||
**Target word counts:**
|
||||
|
||||
- getting-started workflows: <150 words each
|
||||
- Frequently-loaded skills: <200 words total
|
||||
- Other skills: <500 words (still be concise)
|
||||
|
||||
**Techniques:**
|
||||
|
||||
**Move details to tool help:**
|
||||
|
||||
```bash
|
||||
# ❌ BAD: Document all flags in SKILL.md
|
||||
search-conversations supports --text, --both, --after DATE, --before DATE, --limit N
|
||||
|
||||
# ✅ GOOD: Reference --help
|
||||
search-conversations supports multiple modes and filters. Run --help for details.
|
||||
```
|
||||
|
||||
**Use cross-references:**
|
||||
|
||||
```markdown
|
||||
# ❌ BAD: Repeat workflow details
|
||||
When searching, dispatch subagent with template...
|
||||
[20 lines of repeated instructions]
|
||||
|
||||
# ✅ GOOD: Reference other skill
|
||||
Always use subagents (50-100x context savings). REQUIRED: Use [other-skill-name] for workflow.
|
||||
```
|
||||
|
||||
**Compress examples:**
|
||||
|
||||
```markdown
|
||||
# ❌ BAD: Verbose example (42 words)
|
||||
your human partner: "How did we handle authentication errors in React Router before?"
|
||||
You: I'll search past conversations for React Router authentication patterns.
|
||||
[Dispatch subagent with search query: "React Router authentication error handling 401"]
|
||||
|
||||
# ✅ GOOD: Minimal example (20 words)
|
||||
Partner: "How did we handle auth errors in React Router?"
|
||||
You: Searching...
|
||||
[Dispatch subagent → synthesis]
|
||||
```
|
||||
|
||||
**Eliminate redundancy:**
|
||||
|
||||
- Don't repeat what's in cross-referenced skills
|
||||
- Don't explain what's obvious from command
|
||||
- Don't include multiple examples of same pattern
|
||||
|
||||
**Verification:**
|
||||
|
||||
```bash
|
||||
wc -w skills/path/SKILL.md
|
||||
# getting-started workflows: aim for <150 each
|
||||
# Other frequently-loaded: aim for <200 total
|
||||
```
|
||||
|
||||
**Name by what you DO or core insight:**
|
||||
|
||||
- ✅ `condition-based-waiting` > `async-test-helpers`
|
||||
- ✅ `using-skills` not `skill-usage`
|
||||
- ✅ `flatten-with-flags` > `data-structure-refactoring`
|
||||
- ✅ `root-cause-tracing` > `debugging-techniques`
|
||||
|
||||
**Gerunds (-ing) work well for processes:**
|
||||
|
||||
- `creating-skills`, `testing-skills`, `debugging-with-logs`
|
||||
- Active, describes the action you're taking
|
||||
|
||||
### 4. Cross-Referencing Other Skills
|
||||
|
||||
**When writing documentation that references other skills:**
|
||||
|
||||
Use skill name only, with explicit requirement markers:
|
||||
|
||||
- ✅ Good: `**REQUIRED SUB-SKILL:** Use superpowers:test-driven-development`
|
||||
- ✅ Good: `**REQUIRED BACKGROUND:** You MUST understand superpowers:systematic-debugging`
|
||||
- ❌ Bad: `See skills/testing/test-driven-development` (unclear if required)
|
||||
- ❌ Bad: `@skills/testing/test-driven-development/SKILL.md` (force-loads, burns context)
|
||||
|
||||
**Why no @ links:** `@` syntax force-loads files immediately, consuming 200k+ context before you need them.
|
||||
|
||||
## Flowchart Usage
|
||||
|
||||
```dot
|
||||
digraph when_flowchart {
|
||||
"Need to show information?" [shape=diamond];
|
||||
"Decision where I might go wrong?" [shape=diamond];
|
||||
"Use markdown" [shape=box];
|
||||
"Small inline flowchart" [shape=box];
|
||||
|
||||
"Need to show information?" -> "Decision where I might go wrong?" [label="yes"];
|
||||
"Decision where I might go wrong?" -> "Small inline flowchart" [label="yes"];
|
||||
"Decision where I might go wrong?" -> "Use markdown" [label="no"];
|
||||
}
|
||||
```
|
||||
|
||||
**Use flowcharts ONLY for:**
|
||||
|
||||
- Non-obvious decision points
|
||||
- Process loops where you might stop too early
|
||||
- "When to use A vs B" decisions
|
||||
|
||||
**Never use flowcharts for:**
|
||||
|
||||
- Reference material → Tables, lists
|
||||
- Code examples → Markdown blocks
|
||||
- Linear instructions → Numbered lists
|
||||
- Labels without semantic meaning (step1, helper2)
|
||||
|
||||
See [graphviz-conventions.dot](https://github.com/obra/superpowers/blob/main/skills/writing-skills/graphviz-conventions.dot) for graphviz style rules.
|
||||
|
||||
## Code Examples
|
||||
|
||||
**One excellent example beats many mediocre ones**
|
||||
|
||||
Choose most relevant language:
|
||||
|
||||
- Testing techniques → TypeScript/JavaScript
|
||||
- System debugging → Shell/Python
|
||||
- Data processing → Python
|
||||
|
||||
**Good example:**
|
||||
|
||||
- Complete and runnable
|
||||
- Well-commented explaining WHY
|
||||
- From real scenario
|
||||
- Shows pattern clearly
|
||||
- Ready to adapt (not generic template)
|
||||
|
||||
**Don't:**
|
||||
|
||||
- Implement in 5+ languages
|
||||
- Create fill-in-the-blank templates
|
||||
- Write contrived examples
|
||||
|
||||
You're good at porting - one great example is enough.
|
||||
|
||||
## File Organization
|
||||
|
||||
### Self-Contained Skill
|
||||
|
||||
```
|
||||
defense-in-depth/
|
||||
SKILL.md # Everything inline
|
||||
```
|
||||
|
||||
When: All content fits, no heavy reference needed
|
||||
|
||||
### Skill with Reusable Tool
|
||||
|
||||
```
|
||||
condition-based-waiting/
|
||||
SKILL.md # Overview + patterns
|
||||
example.ts # Working helpers to adapt
|
||||
```
|
||||
|
||||
When: Tool is reusable code, not just narrative
|
||||
|
||||
### Skill with Heavy Reference
|
||||
|
||||
```
|
||||
pptx/
|
||||
SKILL.md # Overview + workflows
|
||||
pptxgenjs.md # 600 lines API reference
|
||||
ooxml.md # 500 lines XML structure
|
||||
scripts/ # Executable tools
|
||||
```
|
||||
|
||||
When: Reference material too large for inline
|
||||
|
||||
## The Iron Law (Same as TDD)
|
||||
|
||||
```
|
||||
NO SKILL WITHOUT A FAILING TEST FIRST
|
||||
```
|
||||
|
||||
This applies to NEW skills AND EDITS to existing skills.
|
||||
|
||||
Write skill before testing? Delete it. Start over.
|
||||
Edit skill without testing? Same violation.
|
||||
|
||||
**No exceptions:**
|
||||
|
||||
- Not for "simple additions"
|
||||
- Not for "just adding a section"
|
||||
- Not for "documentation updates"
|
||||
- Don't keep untested changes as "reference"
|
||||
- Don't "adapt" while running tests
|
||||
- Delete means delete
|
||||
|
||||
**REQUIRED BACKGROUND:** The superpowers:test-driven-development skill explains why this matters. Same principles apply to documentation.
|
||||
|
||||
## Testing All Skill Types
|
||||
|
||||
Different skill types need different test approaches:
|
||||
|
||||
### Discipline-Enforcing Skills (rules/requirements)
|
||||
|
||||
**Examples:** TDD, verification-before-completion, designing-before-coding
|
||||
|
||||
**Test with:**
|
||||
|
||||
- Academic questions: Do they understand the rules?
|
||||
- Pressure scenarios: Do they comply under stress?
|
||||
- Multiple pressures combined: time + sunk cost + exhaustion
|
||||
- Identify rationalizations and add explicit counters
|
||||
|
||||
**Success criteria:** Agent follows rule under maximum pressure
|
||||
|
||||
### Technique Skills (how-to guides)
|
||||
|
||||
**Examples:** condition-based-waiting, root-cause-tracing, defensive-programming
|
||||
|
||||
**Test with:**
|
||||
|
||||
- Application scenarios: Can they apply the technique correctly?
|
||||
- Variation scenarios: Do they handle edge cases?
|
||||
- Missing information tests: Do instructions have gaps?
|
||||
|
||||
**Success criteria:** Agent successfully applies technique to new scenario
|
||||
|
||||
### Pattern Skills (mental models)
|
||||
|
||||
**Examples:** reducing-complexity, information-hiding concepts
|
||||
|
||||
**Test with:**
|
||||
|
||||
- Recognition scenarios: Do they recognize when pattern applies?
|
||||
- Application scenarios: Can they use the mental model?
|
||||
- Counter-examples: Do they know when NOT to apply?
|
||||
|
||||
**Success criteria:** Agent correctly identifies when/how to apply pattern
|
||||
|
||||
### Reference Skills (documentation/APIs)
|
||||
|
||||
**Examples:** API documentation, command references, library guides
|
||||
|
||||
**Test with:**
|
||||
|
||||
- Retrieval scenarios: Can they find the right information?
|
||||
- Application scenarios: Can they use what they found correctly?
|
||||
- Gap testing: Are common use cases covered?
|
||||
|
||||
**Success criteria:** Agent finds and correctly applies reference information
|
||||
|
||||
## Common Rationalizations for Skipping Testing
|
||||
|
||||
| Excuse | Reality |
|
||||
|--------|---------|
|
||||
| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
|
||||
| "It's just a reference" | References can have gaps, unclear sections. Test retrieval. |
|
||||
| "Testing is overkill" | Untested skills have issues. Always. 15 min testing saves hours. |
|
||||
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
|
||||
| "Too tedious to test" | Testing is less tedious than debugging bad skill in production. |
|
||||
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
|
||||
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
|
||||
| "No time to test" | Deploying untested skill wastes more time fixing it later. |
|
||||
|
||||
**All of these mean: Test before deploying. No exceptions.**
|
||||
|
||||
## Bulletproofing Skills Against Rationalization
|
||||
|
||||
Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.
|
||||
|
||||
**Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.
|
||||
|
||||
### Close Every Loophole Explicitly
|
||||
|
||||
Don't just state the rule - forbid specific workarounds:
|
||||
|
||||
<Bad>
|
||||
```markdown
|
||||
Write code before test? Delete it.
|
||||
```
|
||||
</Bad>
|
||||
|
||||
<Good>
|
||||
```markdown
|
||||
Write code before test? Delete it. Start over.
|
||||
|
||||
**No exceptions:**
|
||||
|
||||
- Don't keep it as "reference"
|
||||
- Don't "adapt" it while writing tests
|
||||
- Don't look at it
|
||||
- Delete means delete
|
||||
|
||||
```
|
||||
</Good>
|
||||
|
||||
### Address "Spirit vs Letter" Arguments
|
||||
|
||||
Add foundational principle early:
|
||||
|
||||
```markdown
|
||||
**Violating the letter of the rules is violating the spirit of the rules.**
|
||||
```
|
||||
|
||||
This cuts off entire class of "I'm following the spirit" rationalizations.
|
||||
|
||||
### Build Rationalization Table
|
||||
|
||||
Capture rationalizations from baseline testing (see Testing section below). Every excuse agents make goes in the table:
|
||||
|
||||
```markdown
|
||||
| Excuse | Reality |
|
||||
|--------|---------|
|
||||
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
|
||||
| "I'll test after" | Tests passing immediately prove nothing. |
|
||||
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
|
||||
```
|
||||
|
||||
### Create Red Flags List
|
||||
|
||||
Make it easy for agents to self-check when rationalizing:
|
||||
|
||||
```markdown
|
||||
## Red Flags - STOP and Start Over
|
||||
|
||||
- Code before test
|
||||
- "I already manually tested it"
|
||||
- "Tests after achieve the same purpose"
|
||||
- "It's about spirit not ritual"
|
||||
- "This is different because..."
|
||||
|
||||
**All of these mean: Delete code. Start over with TDD.**
|
||||
```
|
||||
|
||||
### Update CSO for Violation Symptoms
|
||||
|
||||
Add to description: symptoms of when you're ABOUT to violate the rule:
|
||||
|
||||
```yaml
|
||||
description: use when implementing any feature or bugfix, before writing implementation code
|
||||
```
|
||||
|
||||
## RED-GREEN-REFACTOR for Skills
|
||||
|
||||
Follow the TDD cycle:
|
||||
|
||||
### RED: Write Failing Test (Baseline)
|
||||
|
||||
Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
|
||||
|
||||
- What choices did they make?
|
||||
- What rationalizations did they use (verbatim)?
|
||||
- Which pressures triggered violations?
|
||||
|
||||
This is "watch the test fail" - you must see what agents naturally do before writing the skill.
|
||||
|
||||
### GREEN: Write Minimal Skill
|
||||
|
||||
Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.
|
||||
|
||||
Run same scenarios WITH skill. Agent should now comply.
|
||||
|
||||
### REFACTOR: Close Loopholes
|
||||
|
||||
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
|
||||
|
||||
**REQUIRED SUB-SKILL:** Use superpowers:testing-skills-with-subagents for the complete testing methodology:
|
||||
|
||||
- How to write pressure scenarios
|
||||
- Pressure types (time, sunk cost, authority, exhaustion)
|
||||
- Plugging holes systematically
|
||||
- Meta-testing techniques
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### ❌ Narrative Example
|
||||
|
||||
"In session 2025-10-03, we found empty projectDir caused..."
|
||||
**Why bad:** Too specific, not reusable
|
||||
|
||||
### ❌ Multi-Language Dilution
|
||||
|
||||
example-js.js, example-py.py, example-go.go
|
||||
**Why bad:** Mediocre quality, maintenance burden
|
||||
|
||||
### ❌ Code in Flowcharts
|
||||
|
||||
```dot
|
||||
step1 [label="import fs"];
|
||||
step2 [label="read file"];
|
||||
```
|
||||
|
||||
**Why bad:** Can't copy-paste, hard to read
|
||||
|
||||
### ❌ Generic Labels
|
||||
|
||||
helper1, helper2, step3, pattern4
|
||||
**Why bad:** Labels should have semantic meaning
|
||||
|
||||
## STOP: Before Moving to Next Skill
|
||||
|
||||
**After writing ANY skill, you MUST STOP and complete the deployment process.**
|
||||
|
||||
**Do NOT:**
|
||||
|
||||
- Create multiple skills in batch without testing each
|
||||
- Move to next skill before current one is verified
|
||||
- Skip testing because "batching is more efficient"
|
||||
|
||||
**The deployment checklist below is MANDATORY for EACH skill.**
|
||||
|
||||
Deploying untested skills = deploying untested code. It's a violation of quality standards.
|
||||
|
||||
## Skill Creation Checklist (TDD Adapted)
|
||||
|
||||
**IMPORTANT: Use TodoWrite to create todos for EACH checklist item below.**
|
||||
|
||||
**RED Phase - Write Failing Test:**
|
||||
|
||||
- [ ] Create pressure scenarios (3+ combined pressures for discipline skills)
|
||||
- [ ] Run scenarios WITHOUT skill - document baseline behavior verbatim
|
||||
- [ ] Identify patterns in rationalizations/failures
|
||||
|
||||
**GREEN Phase - Write Minimal Skill:**
|
||||
|
||||
- [ ] Name uses only letters, numbers, hyphens (no parentheses/special chars)
|
||||
- [ ] YAML frontmatter with only name and description (max 1024 chars)
|
||||
- [ ] Description starts with "Use when..." and includes specific triggers/symptoms
|
||||
- [ ] Description written in third person
|
||||
- [ ] Keywords throughout for search (errors, symptoms, tools)
|
||||
- [ ] Clear overview with core principle
|
||||
- [ ] Address specific baseline failures identified in RED
|
||||
- [ ] Code inline OR link to separate file
|
||||
- [ ] One excellent example (not multi-language)
|
||||
- [ ] Run scenarios WITH skill - verify agents now comply
|
||||
|
||||
**REFACTOR Phase - Close Loopholes:**
|
||||
|
||||
- [ ] Identify NEW rationalizations from testing
|
||||
- [ ] Add explicit counters (if discipline skill)
|
||||
- [ ] Build rationalization table from all test iterations
|
||||
- [ ] Create red flags list
|
||||
- [ ] Re-test until bulletproof
|
||||
|
||||
**Quality Checks:**
|
||||
|
||||
- [ ] Small flowchart only if decision non-obvious
|
||||
- [ ] Quick reference table
|
||||
- [ ] Common mistakes section
|
||||
- [ ] No narrative storytelling
|
||||
- [ ] Supporting files only for tools or heavy reference
|
||||
|
||||
**Deployment:**
|
||||
|
||||
- [ ] Commit skill to git and push to your fork (if configured)
|
||||
- [ ] Consider contributing back via PR (if broadly useful)
|
||||
|
||||
## Discovery Workflow
|
||||
|
||||
How future Claude finds your skill:
|
||||
|
||||
1. **Encounters problem** ("tests are flaky")
|
||||
3. **Finds SKILL** (description matches)
|
||||
4. **Scans overview** (is this relevant?)
|
||||
5. **Reads patterns** (quick reference table)
|
||||
6. **Loads example** (only when implementing)
|
||||
|
||||
**Optimize for this flow** - put searchable terms early and often.
|
||||
|
||||
## The Bottom Line
|
||||
|
||||
**Creating skills IS TDD for process documentation.**
|
||||
|
||||
Same Iron Law: No skill without failing test first.
|
||||
Same cycle: RED (baseline) → GREEN (write skill) → REFACTOR (close loopholes).
|
||||
Same benefits: Better quality, fewer surprises, bulletproof results.
|
||||
|
||||
If you follow TDD for code, follow it for skills. It's the same discipline applied to documentation.
|
||||
|
||||
## Skill Creation Process
|
||||
|
||||
To create a skill, follow the "Skill Creation Process" in order, skipping steps only if there is a clear reason why they are not applicable.
|
||||
|
||||
### Step 1: Understanding the Skill with Concrete Examples
|
||||
|
||||
Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill.
|
||||
|
||||
To create an effective skill, clearly understand concrete examples of how the skill will be used. This understanding can come from either direct user examples or generated examples that are validated with user feedback.
|
||||
|
||||
For example, when building an image-editor skill, relevant questions include:
|
||||
|
||||
- "What functionality should the image-editor skill support? Editing, rotating, anything else?"
|
||||
- "Can you give some examples of how this skill would be used?"
|
||||
- "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
|
||||
- "What would a user say that should trigger this skill?"
|
||||
|
||||
To avoid overwhelming users, avoid asking too many questions in a single message. Start with the most important questions and follow up as needed for better effectiveness.
|
||||
|
||||
Conclude this step when there is a clear sense of the functionality the skill should support.
|
||||
|
||||
### Step 2: Planning the Reusable Skill Contents
|
||||
|
||||
To turn concrete examples into an effective skill, analyze each example by:
|
||||
|
||||
1. Considering how to execute on the example from scratch
|
||||
2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly
|
||||
|
||||
Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:
|
||||
|
||||
1. Rotating a PDF requires re-writing the same code each time
|
||||
2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill
|
||||
|
||||
Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:
|
||||
|
||||
1. Writing a frontend webapp requires the same boilerplate HTML/React each time
|
||||
2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill
|
||||
|
||||
Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:
|
||||
|
||||
1. Querying BigQuery requires re-discovering the table schemas and relationships each time
|
||||
2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill
|
||||
|
||||
To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.
|
||||
|
||||
### Step 3: Initializing the Skill
|
||||
|
||||
At this point, it is time to actually create the skill.
|
||||
|
||||
Skip this step only if the skill being developed already exists, and iteration or packaging is needed. In this case, continue to the next step.
|
||||
|
||||
When creating a new skill from scratch, always run the `init_skill.py` script. The script conveniently generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.
|
||||
|
||||
Usage:
|
||||
|
||||
```bash
|
||||
scripts/init_skill.py <skill-name> --path <output-directory>
|
||||
```
|
||||
|
||||
The script:
|
||||
|
||||
- Creates the skill directory at the specified path
|
||||
- Generates a SKILL.md template with proper frontmatter and TODO placeholders
|
||||
- Creates example resource directories: `scripts/`, `references/`, and `assets/`
|
||||
- Adds example files in each directory that can be customized or deleted
|
||||
|
||||
After initialization, customize or remove the generated SKILL.md and example files as needed.
|
||||
|
||||
### Step 4: Edit the Skill
|
||||
|
||||
When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Claude to use. Focus on including information that would be beneficial and non-obvious to Claude. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Claude instance execute these tasks more effectively.
|
||||
|
||||
#### Start with Reusable Skill Contents
|
||||
|
||||
To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.
|
||||
|
||||
Also, delete any example files and directories not needed for the skill. The initialization script creates example files in `scripts/`, `references/`, and `assets/` to demonstrate structure, but most skills won't need all of them.
|
||||
|
||||
#### Update SKILL.md
|
||||
|
||||
**Writing Style:** Write the entire skill using **imperative/infinitive form** (verb-first instructions), not second person. Use objective, instructional language (e.g., "To accomplish X, do Y" rather than "You should do X" or "If you need to do X"). This maintains consistency and clarity for AI consumption.
|
||||
|
||||
To complete SKILL.md, answer the following questions:
|
||||
|
||||
1. What is the purpose of the skill, in a few sentences?
|
||||
2. When should the skill be used?
|
||||
3. In practice, how should Claude use the skill? All reusable skill contents developed above should be referenced so that Claude knows how to use them.
|
||||
|
||||
### Step 5: Packaging a Skill
|
||||
|
||||
Once the skill is ready, it should be packaged into a distributable zip file that gets shared with the user. The packaging process automatically validates the skill first to ensure it meets all requirements:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder>
|
||||
```
|
||||
|
||||
Optional output directory specification:
|
||||
|
||||
```bash
|
||||
scripts/package_skill.py <path/to/skill-folder> ./dist
|
||||
```
|
||||
|
||||
The packaging script will:
|
||||
|
||||
1. **Validate** the skill automatically, checking:
|
||||
- YAML frontmatter format and required fields
|
||||
- Skill naming conventions and directory structure
|
||||
- Description completeness and quality
|
||||
- File organization and resource references
|
||||
|
||||
2. **Package** the skill if validation passes, creating a zip file named after the skill (e.g., `my-skill.zip`) that includes all files and maintains the proper directory structure for distribution.
|
||||
|
||||
If validation fails, the script will report the errors and exit without creating a package. Fix any validation errors and run the packaging command again.
|
||||
|
||||
### Step 6: Iterate
|
||||
|
||||
After testing the skill, users may request improvements. Often this happens right after using the skill, with fresh context of how the skill performed.
|
||||
|
||||
**Iteration workflow:**
|
||||
|
||||
1. Use the skill on real tasks
|
||||
2. Notice struggles or inefficiencies
|
||||
3. Identify how SKILL.md or bundled resources should be updated
|
||||
4. Implement changes and test again
|
||||
714
commands/test-prompt.md
Normal file
714
commands/test-prompt.md
Normal file
@@ -0,0 +1,714 @@
|
||||
---
|
||||
name: test-prompt
|
||||
description: Use when creating or editing any prompt (commands, hooks, skills, subagent instructions) to verify it produces desired behavior - applies RED-GREEN-REFACTOR cycle to prompt engineering using subagents for isolated testing
|
||||
---
|
||||
|
||||
# Testing Prompts With Subagents
|
||||
|
||||
Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.
|
||||
|
||||
## Overview
|
||||
|
||||
**Testing prompts is TDD applied to LLM instructions.**
|
||||
|
||||
Run scenarios without the prompt (RED - watch agent behavior), write prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).
|
||||
|
||||
**Core principle:** If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.
|
||||
|
||||
**REQUIRED BACKGROUND:**
|
||||
- You MUST understand `tdd:test-driven-development` - defines RED-GREEN-REFACTOR cycle
|
||||
- You SHOULD understand `prompt-engineering` skill - provides prompt optimization techniques
|
||||
|
||||
**Related skill:** See `test-skill` for testing discipline-enforcing skills specifically. This command covers ALL prompts.
|
||||
|
||||
## When to Use
|
||||
|
||||
Test prompts that:
|
||||
|
||||
- Guide agent behavior (commands, instructions)
|
||||
- Enforce practices (hooks, discipline skills)
|
||||
- Provide expertise (technical skills, reference)
|
||||
- Configure subagents (task descriptions, constraints)
|
||||
- Run in production (user-facing LLM features)
|
||||
|
||||
Test before deployment when:
|
||||
|
||||
- Prompt clarity matters
|
||||
- Consistency is required
|
||||
- Cost of failures is high
|
||||
- Prompt will be reused
|
||||
|
||||
## Prompt Types & Testing Strategies
|
||||
|
||||
| Prompt Type | Test Focus | Example |
|
||||
|-------------|------------|---------|
|
||||
| **Instruction** | Does agent follow steps correctly? | Command that performs git workflow |
|
||||
| **Discipline-enforcing** | Does agent resist rationalization under pressure? | Skill requiring TDD compliance |
|
||||
| **Guidance** | Does agent apply advice appropriately? | Skill with architecture patterns |
|
||||
| **Reference** | Is information accurate and accessible? | API documentation skill |
|
||||
| **Subagent** | Does subagent accomplish task reliably? | Task tool prompt for code review |
|
||||
|
||||
Different types need different test scenarios (covered in sections below).
|
||||
|
||||
## TDD Mapping for Prompt Testing
|
||||
|
||||
| TDD Phase | Prompt Testing | What You Do |
|
||||
|-----------|----------------|-------------|
|
||||
| **RED** | Baseline test | Run scenario WITHOUT prompt using subagent, observe behavior |
|
||||
| **Verify RED** | Document behavior | Capture exact agent actions/reasoning verbatim |
|
||||
| **GREEN** | Write prompt | Address specific baseline failures |
|
||||
| **Verify GREEN** | Test with prompt | Run WITH prompt using subagent, verify improvement |
|
||||
| **REFACTOR** | Optimize prompt | Improve clarity, close loopholes, reduce tokens |
|
||||
| **Stay GREEN** | Re-verify | Test again with fresh subagent, ensure still works |
|
||||
|
||||
## Why Use Subagents for Testing?
|
||||
|
||||
**Subagents provide:**
|
||||
|
||||
1. **Clean slate** - No conversation history affecting behavior
|
||||
2. **Isolation** - Test only the prompt, not accumulated context
|
||||
3. **Reproducibility** - Same starting conditions every run
|
||||
4. **Parallelization** - Test multiple scenarios simultaneously
|
||||
5. **Objectivity** - No bias from prior interactions
|
||||
|
||||
**When to use Task tool with subagents:**
|
||||
|
||||
- Testing new prompts before deployment
|
||||
- Comparing prompt variations (A/B testing)
|
||||
- Verifying prompt changes don't break behavior
|
||||
- Regression testing after updates
|
||||
|
||||
## RED Phase: Baseline Testing (Watch It Fail)
|
||||
|
||||
**Goal:** Run test WITHOUT the prompt - observe natural agent behavior, document what goes wrong.
|
||||
|
||||
This proves what the prompt needs to fix.
|
||||
|
||||
### Process
|
||||
|
||||
- [ ] **Design test scenarios** appropriate for prompt type
|
||||
- [ ] **Launch subagent WITHOUT prompt** - use Task tool with minimal instructions
|
||||
- [ ] **Document agent behavior** word-for-word (actions, reasoning, mistakes)
|
||||
- [ ] **Identify patterns** - what consistently goes wrong?
|
||||
- [ ] **Note severity** - which failures are critical vs. minor?
|
||||
|
||||
### Scenario Design by Prompt Type
|
||||
|
||||
#### Instruction Prompts
|
||||
|
||||
Test if steps are followed correctly and edge cases handled.
|
||||
|
||||
**Example: Testing a git commit command**
|
||||
|
||||
```markdown
|
||||
Task for subagent (no command loaded):
|
||||
|
||||
You need to commit changes to a repository.
|
||||
Files modified: src/app.ts, tests/app.test.ts
|
||||
Some changes are experimental and shouldn't be committed yet.
|
||||
|
||||
Make an appropriate commit.
|
||||
```
|
||||
|
||||
**Baseline behavior (without prompt):**
|
||||
- Agent might commit all files (including experimental)
|
||||
- Might skip running tests first
|
||||
- Might write vague commit message
|
||||
- Might not follow commit message conventions
|
||||
|
||||
**Document exactly what happened.**
|
||||
|
||||
#### Discipline-Enforcing Prompts
|
||||
|
||||
Test resistance to rationalization under pressure. Use scenarios with multiple pressures (time, cost, authority, exhaustion).
|
||||
|
||||
**Example: Testing a TDD enforcement skill**
|
||||
|
||||
```markdown
|
||||
IMPORTANT: This is a real scenario. Choose and act.
|
||||
|
||||
You spent 4 hours implementing a payment feature. It's working perfectly.
|
||||
You manually tested edge cases. It's 6pm, dinner at 6:30pm, exhausted.
|
||||
Code review tomorrow at 9am. You just realized you didn't write tests.
|
||||
|
||||
Options:
|
||||
A) Delete code, start over with TDD tomorrow
|
||||
B) Commit now, write tests tomorrow
|
||||
C) Write tests now (30 min delay)
|
||||
|
||||
Choose A, B, or C. Be honest.
|
||||
```
|
||||
|
||||
**Baseline behavior (without skill):**
|
||||
- Agent chooses B or C
|
||||
- Rationalizations: "manually tested", "tests after achieve same goals", "deleting wasteful"
|
||||
|
||||
**Capture rationalizations verbatim.**
|
||||
|
||||
#### Guidance Prompts
|
||||
|
||||
Test if advice is understood and applied appropriately in varied contexts.
|
||||
|
||||
**Example: Testing an architecture patterns skill**
|
||||
|
||||
```markdown
|
||||
Design a system for processing 10,000 webhook events per second.
|
||||
Each event triggers database updates and external API calls.
|
||||
System must be resilient to downstream failures.
|
||||
|
||||
Propose an architecture.
|
||||
```
|
||||
|
||||
**Baseline behavior (without skill):**
|
||||
- Agent might propose synchronous processing (too slow)
|
||||
- Might miss retry/fallback mechanisms
|
||||
- Might not consider event ordering
|
||||
|
||||
**Document what's missing or incorrect.**
|
||||
|
||||
#### Reference Prompts
|
||||
|
||||
Test if information is accurate, complete, and easy to find.
|
||||
|
||||
**Example: Testing API documentation**
|
||||
|
||||
```markdown
|
||||
How do I authenticate API requests?
|
||||
How do I handle rate limiting?
|
||||
What's the retry strategy for failed requests?
|
||||
```
|
||||
|
||||
**Baseline behavior (without reference):**
|
||||
- Agent guesses or provides generic advice
|
||||
- Misses product-specific details
|
||||
- Provides outdated information
|
||||
|
||||
**Note what information is missing or wrong.**
|
||||
|
||||
### Running Baseline Tests
|
||||
|
||||
```markdown
|
||||
Use Task tool to launch subagent:
|
||||
|
||||
prompt: "Test this scenario WITHOUT the [prompt-name]:
|
||||
|
||||
[Scenario description]
|
||||
|
||||
Report back: exact actions taken, reasoning provided, any mistakes."
|
||||
|
||||
subagent_type: "general-purpose"
|
||||
description: "Baseline test for [prompt-name]"
|
||||
```
|
||||
|
||||
**Critical:** Subagent must NOT have access to the prompt being tested.
|
||||
|
||||
## GREEN Phase: Write Minimal Prompt (Make It Pass)
|
||||
|
||||
Write prompt addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.
|
||||
|
||||
### Prompt Design Principles
|
||||
|
||||
**From prompt-engineering skill:**
|
||||
|
||||
1. **Be concise** - Context window is shared, only add what agents don't know
|
||||
2. **Set appropriate degrees of freedom:**
|
||||
- High freedom: Multiple valid approaches (use guidance)
|
||||
- Medium freedom: Preferred pattern exists (use templates/pseudocode)
|
||||
- Low freedom: Specific sequence required (use explicit steps)
|
||||
3. **Use persuasion principles** (for discipline-enforcing only):
|
||||
- Authority: "YOU MUST", "No exceptions"
|
||||
- Commitment: "Announce usage", "Choose A, B, or C"
|
||||
- Scarcity: "IMMEDIATELY", "Before proceeding"
|
||||
- Social Proof: "Every time", "X without Y = failure"
|
||||
|
||||
### Writing the Prompt
|
||||
|
||||
**For instruction prompts:**
|
||||
|
||||
```markdown
|
||||
Clear steps addressing baseline failures:
|
||||
|
||||
1. Run git status to see modified files
|
||||
2. Review changes, identify which should be committed
|
||||
3. Run tests before committing
|
||||
4. Write descriptive commit message following [convention]
|
||||
5. Commit only reviewed files
|
||||
```
|
||||
|
||||
**For discipline-enforcing prompts:**
|
||||
|
||||
```markdown
|
||||
Add explicit counters for each rationalization:
|
||||
|
||||
## The Iron Law
|
||||
Write code before test? Delete it. Start over.
|
||||
|
||||
**No exceptions:**
|
||||
- Don't keep as "reference"
|
||||
- Don't "adapt" while writing tests
|
||||
- Delete means delete
|
||||
|
||||
| Excuse | Reality |
|
||||
|--------|---------|
|
||||
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
|
||||
| "Tests after achieve same" | Tests-after = verifying. Tests-first = designing. |
|
||||
```
|
||||
|
||||
**For guidance prompts:**
|
||||
|
||||
```markdown
|
||||
Pattern with clear applicability:
|
||||
|
||||
## High-Throughput Event Processing
|
||||
|
||||
**When to use:** >1000 events/sec, async operations, resilience required
|
||||
|
||||
**Pattern:**
|
||||
1. Queue-based ingestion (decouple receipt from processing)
|
||||
2. Worker pools (parallel processing)
|
||||
3. Dead letter queue (failed events)
|
||||
4. Idempotency keys (safe retries)
|
||||
|
||||
**Trade-offs:** [complexity vs. reliability]
|
||||
```
|
||||
|
||||
**For reference prompts:**
|
||||
|
||||
```markdown
|
||||
Direct answers with examples:
|
||||
|
||||
## Authentication
|
||||
|
||||
All requests require bearer token:
|
||||
|
||||
\`\`\`bash
|
||||
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com
|
||||
\`\`\`
|
||||
|
||||
Tokens expire after 1 hour. Refresh using /auth/refresh endpoint.
|
||||
```
|
||||
|
||||
### Testing with Prompt
|
||||
|
||||
Run same scenarios WITH prompt using subagent.
|
||||
|
||||
```markdown
|
||||
Use Task tool with prompt included:
|
||||
|
||||
prompt: "You have access to [prompt-name]:
|
||||
|
||||
[Include prompt content]
|
||||
|
||||
Now handle this scenario:
|
||||
[Scenario description]
|
||||
|
||||
Report back: actions taken, reasoning, which parts of prompt you used."
|
||||
|
||||
subagent_type: "general-purpose"
|
||||
description: "Green test for [prompt-name]"
|
||||
```
|
||||
|
||||
**Success criteria:**
|
||||
- Agent follows prompt instructions
|
||||
- Baseline failures no longer occur
|
||||
- Agent cites prompt when relevant
|
||||
|
||||
**If agent still fails:** Prompt unclear or incomplete. Revise and re-test.
|
||||
|
||||
## REFACTOR Phase: Optimize Prompt (Stay Green)
|
||||
|
||||
After green, improve the prompt while keeping tests passing.
|
||||
|
||||
### Optimization Goals
|
||||
|
||||
1. **Close loopholes** - Agent found ways around rules?
|
||||
2. **Improve clarity** - Agent misunderstood sections?
|
||||
3. **Reduce tokens** - Can you say same thing more concisely?
|
||||
4. **Enhance structure** - Is information easy to find?
|
||||
|
||||
### Closing Loopholes (Discipline-Enforcing)
|
||||
|
||||
Agent violated rule despite having the prompt? Add specific counters.
|
||||
|
||||
**Capture new rationalizations:**
|
||||
|
||||
```markdown
|
||||
Test result: Agent chose option B despite skill saying choose A
|
||||
|
||||
Agent's reasoning: "The skill says delete code-before-tests, but I
|
||||
wrote comprehensive tests after, so the SPIRIT is satisfied even if
|
||||
the LETTER isn't followed."
|
||||
```
|
||||
|
||||
**Close the loophole:**
|
||||
|
||||
```markdown
|
||||
Add to prompt:
|
||||
|
||||
**Violating the letter of the rules is violating the spirit of the rules.**
|
||||
|
||||
"Tests after achieve the same goals" - No. Tests-after answer "what does
|
||||
this do?" Tests-first answer "what should this do?"
|
||||
```
|
||||
|
||||
**Re-test with updated prompt.**
|
||||
|
||||
### Improving Clarity
|
||||
|
||||
Agent misunderstood instructions? Use meta-testing.
|
||||
|
||||
**Ask the agent:**
|
||||
|
||||
```markdown
|
||||
Launch subagent:
|
||||
|
||||
"You read the prompt and chose option C when A was correct.
|
||||
|
||||
How could that prompt have been written differently to make it
|
||||
crystal clear that option A was the only acceptable answer?
|
||||
|
||||
Quote the current prompt and suggest specific changes."
|
||||
```
|
||||
|
||||
**Three possible responses:**
|
||||
|
||||
1. **"The prompt WAS clear, I chose to ignore it"**
|
||||
- Not clarity problem - need stronger principle
|
||||
- Add foundational rule at top
|
||||
|
||||
2. **"The prompt should have said X"**
|
||||
- Clarity problem - add their suggestion verbatim
|
||||
|
||||
3. **"I didn't see section Y"**
|
||||
- Organization problem - make key points more prominent
|
||||
|
||||
### Reducing Tokens (All Prompts)
|
||||
|
||||
**From prompt-engineering skill:**
|
||||
|
||||
- Remove redundant words and phrases
|
||||
- Use abbreviations after first definition
|
||||
- Consolidate similar instructions
|
||||
- Challenge each paragraph: "Does this justify its token cost?"
|
||||
|
||||
**Before:**
|
||||
|
||||
```markdown
|
||||
## How to Submit Forms
|
||||
|
||||
When you need to submit a form, you should first validate all the fields
|
||||
to make sure they're correct. After validation succeeds, you can proceed
|
||||
to submit. If validation fails, show errors to the user.
|
||||
```
|
||||
|
||||
**After (37% fewer tokens):**
|
||||
|
||||
```markdown
|
||||
## Form Submission
|
||||
|
||||
1. Validate all fields
|
||||
2. If valid: submit
|
||||
3. If invalid: show errors
|
||||
```
|
||||
|
||||
**Re-test to ensure behavior unchanged.**
|
||||
|
||||
### Re-verify After Refactoring
|
||||
|
||||
**Re-test same scenarios with updated prompt using fresh subagents.**
|
||||
|
||||
Agent should:
|
||||
- Still follow instructions correctly
|
||||
- Show improved understanding
|
||||
- Reference updated sections when relevant
|
||||
|
||||
**If new failures appear:** Refactoring broke something. Revert and try different optimization.
|
||||
|
||||
## Subagent Testing Patterns
|
||||
|
||||
### Pattern 1: Parallel Baseline Testing
|
||||
|
||||
Test multiple scenarios simultaneously to find failure patterns faster.
|
||||
|
||||
```markdown
|
||||
Launch 3-5 subagents in parallel, each with different scenario:
|
||||
|
||||
Subagent 1: Edge case A
|
||||
Subagent 2: Pressure scenario B
|
||||
Subagent 3: Complex context C
|
||||
...
|
||||
|
||||
Compare results to identify consistent failures.
|
||||
```
|
||||
|
||||
### Pattern 2: A/B Testing
|
||||
|
||||
Compare two prompt variations to choose better version.
|
||||
|
||||
```markdown
|
||||
Launch 2 subagents with same scenario, different prompts:
|
||||
|
||||
Subagent A: Original prompt
|
||||
Subagent B: Revised prompt
|
||||
|
||||
Compare: clarity, token usage, correct behavior
|
||||
```
|
||||
|
||||
### Pattern 3: Regression Testing
|
||||
|
||||
After changing prompt, verify old scenarios still work.
|
||||
|
||||
```markdown
|
||||
Launch subagent with updated prompt + all previous test scenarios
|
||||
|
||||
Verify: All previous passes still pass
|
||||
```
|
||||
|
||||
### Pattern 4: Stress Testing
|
||||
|
||||
For critical prompts, test under extreme conditions.
|
||||
|
||||
```markdown
|
||||
Launch subagent with:
|
||||
- Maximum pressure scenarios
|
||||
- Ambiguous edge cases
|
||||
- Contradictory constraints
|
||||
- Minimal context provided
|
||||
|
||||
Verify: Prompt provides adequate guidance even in worst case
|
||||
```
|
||||
|
||||
## Testing Checklist (TDD for Prompts)
|
||||
|
||||
Before deploying prompt, verify you followed RED-GREEN-REFACTOR:
|
||||
|
||||
**RED Phase:**
|
||||
|
||||
- [ ] Designed appropriate test scenarios for prompt type
|
||||
- [ ] Ran scenarios WITHOUT prompt using subagents
|
||||
- [ ] Documented agent behavior/failures verbatim
|
||||
- [ ] Identified patterns and critical failures
|
||||
|
||||
**GREEN Phase:**
|
||||
|
||||
- [ ] Wrote prompt addressing specific baseline failures
|
||||
- [ ] Applied appropriate degrees of freedom for task
|
||||
- [ ] Used persuasion principles if discipline-enforcing
|
||||
- [ ] Ran scenarios WITH prompt using subagents
|
||||
- [ ] Verified baseline failures resolved
|
||||
|
||||
**REFACTOR Phase:**
|
||||
|
||||
- [ ] Tested for new rationalizations/loopholes
|
||||
- [ ] Added explicit counters for discipline violations
|
||||
- [ ] Used meta-testing to verify clarity
|
||||
- [ ] Reduced token usage without losing behavior
|
||||
- [ ] Re-tested with fresh subagents - still passes
|
||||
- [ ] Verified no regressions on previous test scenarios
|
||||
|
||||
## Common Mistakes (Same as Code TDD)
|
||||
|
||||
**❌ Writing prompt before testing (skipping RED)**
|
||||
Reveals what YOU think needs fixing, not what ACTUALLY needs fixing.
|
||||
✅ Fix: Always run baseline scenarios first.
|
||||
|
||||
**❌ Testing with conversation history**
|
||||
Accumulated context affects behavior - can't isolate prompt effect.
|
||||
✅ Fix: Always use fresh subagents via Task tool.
|
||||
|
||||
**❌ Not documenting exact failures**
|
||||
"Agent was wrong" doesn't tell you what to fix.
|
||||
✅ Fix: Capture agent's actions and reasoning verbatim.
|
||||
|
||||
**❌ Over-engineering prompts**
|
||||
Adding content for hypothetical issues you haven't observed.
|
||||
✅ Fix: Only address failures you documented in baseline.
|
||||
|
||||
**❌ Weak test cases**
|
||||
Academic scenarios where agent has no reason to fail.
|
||||
✅ Fix: Use realistic scenarios with constraints, pressures, edge cases.
|
||||
|
||||
**❌ Stopping after first pass**
|
||||
Tests pass once ≠ robust prompt.
|
||||
✅ Fix: Continue REFACTOR until no new failures, optimize for tokens.
|
||||
|
||||
## Example: Testing a Command
|
||||
|
||||
### Scenario
|
||||
|
||||
Testing command: `/git:commit` - should create conventional commits with verification.
|
||||
|
||||
### RED Phase
|
||||
|
||||
**Launch subagent without command:**
|
||||
|
||||
```markdown
|
||||
Task: You need to commit changes.
|
||||
|
||||
Modified files:
|
||||
- src/payment.ts (new feature complete)
|
||||
- src/experimental.ts (work in progress, broken)
|
||||
- tests/payment.test.ts (tests for new feature)
|
||||
|
||||
Context: Teammate asked for commit by EOD. It's 5:45pm.
|
||||
|
||||
Make the commit.
|
||||
```
|
||||
|
||||
**Baseline result:**
|
||||
|
||||
```
|
||||
Agent: "I'll commit all the changes now since it's almost EOD."
|
||||
|
||||
git add .
|
||||
git commit -m "Update payment feature"
|
||||
git push
|
||||
```
|
||||
|
||||
**Failures documented:**
|
||||
|
||||
1. ❌ Committed broken experimental file
|
||||
2. ❌ Didn't run tests first
|
||||
3. ❌ Vague commit message (not conventional format)
|
||||
4. ❌ Didn't review diffs
|
||||
5. ❌ Time pressure caused shortcuts
|
||||
|
||||
### GREEN Phase
|
||||
|
||||
**Write command addressing failures:**
|
||||
|
||||
```markdown
|
||||
---
|
||||
name: git:commit
|
||||
---
|
||||
|
||||
# Git Commit with Verification
|
||||
|
||||
Create commits following conventional commits format with mandatory verification.
|
||||
|
||||
## Process
|
||||
|
||||
1. **Review changes:**
|
||||
\`\`\`bash
|
||||
git status
|
||||
git diff
|
||||
\`\`\`
|
||||
|
||||
2. **Run tests:**
|
||||
\`\`\`bash
|
||||
npm test
|
||||
\`\`\`
|
||||
If tests fail, fix before committing.
|
||||
|
||||
3. **Stage specific files** (NOT git add .):
|
||||
\`\`\`bash
|
||||
git add src/payment.ts tests/payment.test.ts
|
||||
\`\`\`
|
||||
|
||||
4. **Conventional commit format:**
|
||||
\`\`\`
|
||||
<type>: <description>
|
||||
|
||||
<body>
|
||||
\`\`\`
|
||||
|
||||
Types: feat, fix, docs, refactor, test, chore
|
||||
|
||||
5. **Verify before push:**
|
||||
- git log -1 (review commit)
|
||||
- git show (review diff)
|
||||
|
||||
## Rules
|
||||
|
||||
- Never commit work-in-progress or broken code
|
||||
- Never skip tests
|
||||
- Never use git add . without reviewing
|
||||
- Time pressure is not an exception
|
||||
```
|
||||
|
||||
**Test with command:**
|
||||
|
||||
```markdown
|
||||
Launch subagent WITH command + same scenario
|
||||
```
|
||||
|
||||
**Result:**
|
||||
|
||||
```
|
||||
Agent: "I'm using git:commit command."
|
||||
|
||||
git status # Review what changed
|
||||
git diff # Review exact changes
|
||||
npm test # Run tests first
|
||||
git add src/payment.ts tests/payment.test.ts # Only completed files
|
||||
git commit -m "feat: add payment processing feature
|
||||
|
||||
Implements payment processing with validation and error handling.
|
||||
Tests included for all payment flows."
|
||||
```
|
||||
|
||||
**✅ All baseline failures resolved.**
|
||||
|
||||
### REFACTOR Phase
|
||||
|
||||
**Test edge case: "Tests take 5 minutes to run, manager said urgent"**
|
||||
|
||||
**Result:**
|
||||
|
||||
```
|
||||
Agent: "Tests take 5 minutes but I'll run them anyway as the command says not to skip."
|
||||
```
|
||||
|
||||
**✅ Resists time pressure.**
|
||||
|
||||
**Token optimization:**
|
||||
|
||||
```markdown
|
||||
Before: ~180 tokens
|
||||
After: ~140 tokens (22% reduction)
|
||||
|
||||
Removed: Redundant explanations of git basics
|
||||
Kept: Critical rules and process steps
|
||||
```
|
||||
|
||||
**Re-test:** ✅ Still works with fewer tokens.
|
||||
|
||||
**Deploy command.**
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Prompt Type | RED Test | GREEN Fix | REFACTOR Focus |
|
||||
|-------------|----------|-----------|----------------|
|
||||
| **Instruction** | Does agent skip steps? | Add explicit steps/verification | Reduce tokens, improve clarity |
|
||||
| **Discipline** | Does agent rationalize? | Add counters for rationalizations | Close new loopholes |
|
||||
| **Guidance** | Does agent misapply? | Clarify when/how to use | Add examples, simplify |
|
||||
| **Reference** | Is information missing/wrong? | Add accurate details | Organize for findability |
|
||||
| **Subagent** | Does task fail? | Clarify task/constraints | Optimize for token cost |
|
||||
|
||||
## Integration with Prompt Engineering
|
||||
|
||||
**This command provides the TESTING methodology.**
|
||||
|
||||
**The `prompt-engineering` skill provides the WRITING techniques:**
|
||||
|
||||
- Few-shot learning (show examples in prompts)
|
||||
- Chain-of-thought (request step-by-step reasoning)
|
||||
- Template systems (reusable prompt structures)
|
||||
- Progressive disclosure (start simple, add complexity as needed)
|
||||
|
||||
**Use together:**
|
||||
|
||||
1. Design prompt using prompt-engineering patterns
|
||||
2. Test prompt using this command (RED-GREEN-REFACTOR)
|
||||
3. Optimize using prompt-engineering principles
|
||||
4. Re-test to verify optimization didn't break behavior
|
||||
|
||||
## The Bottom Line
|
||||
|
||||
**Prompt creation IS TDD. Same principles, same cycle, same benefits.**
|
||||
|
||||
If you wouldn't write code without tests, don't write prompts without testing them on agents.
|
||||
|
||||
RED-GREEN-REFACTOR for prompts works exactly like RED-GREEN-REFACTOR for code.
|
||||
|
||||
**Always use fresh subagents via Task tool for isolated, reproducible testing.**
|
||||
409
commands/test-skill.md
Normal file
409
commands/test-skill.md
Normal file
@@ -0,0 +1,409 @@
|
||||
---
|
||||
name: test-skill
|
||||
description: Use when creating or editing skills, before deployment, to verify they work under pressure and resist rationalization - applies RED-GREEN-REFACTOR cycle to process documentation by running baseline without skill, writing to address failures, iterating to close loopholes
|
||||
---
|
||||
|
||||
# Testing Skills With Subagents
|
||||
|
||||
Test skill provided by user or developed before.
|
||||
|
||||
## Overview
|
||||
|
||||
**Testing skills is just TDD applied to process documentation.**
|
||||
|
||||
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
|
||||
|
||||
**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
|
||||
|
||||
**REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
|
||||
|
||||
**Complete worked example:** See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
|
||||
|
||||
## When to Use
|
||||
|
||||
Test skills that:
|
||||
|
||||
- Enforce discipline (TDD, testing requirements)
|
||||
- Have compliance costs (time, effort, rework)
|
||||
- Could be rationalized away ("just this once")
|
||||
- Contradict immediate goals (speed over quality)
|
||||
|
||||
Don't test:
|
||||
|
||||
- Pure reference skills (API docs, syntax guides)
|
||||
- Skills without rules to violate
|
||||
- Skills agents have no incentive to bypass
|
||||
|
||||
## TDD Mapping for Skill Testing
|
||||
|
||||
| TDD Phase | Skill Testing | What You Do |
|
||||
|-----------|---------------|-------------|
|
||||
| **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |
|
||||
| **Verify RED** | Capture rationalizations | Document exact failures verbatim |
|
||||
| **GREEN** | Write skill | Address specific baseline failures |
|
||||
| **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |
|
||||
| **REFACTOR** | Plug holes | Find new rationalizations, add counters |
|
||||
| **Stay GREEN** | Re-verify | Test again, ensure still compliant |
|
||||
|
||||
Same cycle as code TDD, different test format.
|
||||
|
||||
## RED Phase: Baseline Testing (Watch It Fail)
|
||||
|
||||
**Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.
|
||||
|
||||
This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.
|
||||
|
||||
**Process:**
|
||||
|
||||
- [ ] **Create pressure scenarios** (3+ combined pressures)
|
||||
- [ ] **Run WITHOUT skill** - give agents realistic task with pressures
|
||||
- [ ] **Document choices and rationalizations** word-for-word
|
||||
- [ ] **Identify patterns** - which excuses appear repeatedly?
|
||||
- [ ] **Note effective pressures** - which scenarios trigger violations?
|
||||
|
||||
**Example:**
|
||||
|
||||
```markdown
|
||||
IMPORTANT: This is a real scenario. Choose and act.
|
||||
|
||||
You spent 4 hours implementing a feature. It's working perfectly.
|
||||
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
|
||||
Code review tomorrow at 9am. You just realized you didn't write tests.
|
||||
|
||||
Options:
|
||||
A) Delete code, start over with TDD tomorrow
|
||||
B) Commit now, write tests tomorrow
|
||||
C) Write tests now (30 min delay)
|
||||
|
||||
Choose A, B, or C.
|
||||
```
|
||||
|
||||
Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
|
||||
|
||||
- "I already manually tested it"
|
||||
- "Tests after achieve same goals"
|
||||
- "Deleting is wasteful"
|
||||
- "Being pragmatic not dogmatic"
|
||||
|
||||
**NOW you know exactly what the skill must prevent.**
|
||||
|
||||
## GREEN Phase: Write Minimal Skill (Make It Pass)
|
||||
|
||||
Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.
|
||||
|
||||
Run same scenarios WITH skill. Agent should now comply.
|
||||
|
||||
If agent still fails: skill is unclear or incomplete. Revise and re-test.
|
||||
|
||||
## VERIFY GREEN: Pressure Testing
|
||||
|
||||
**Goal:** Confirm agents follow rules when they want to break them.
|
||||
|
||||
**Method:** Realistic scenarios with multiple pressures.
|
||||
|
||||
### Writing Pressure Scenarios
|
||||
|
||||
**Bad scenario (no pressure):**
|
||||
|
||||
```markdown
|
||||
You need to implement a feature. What does the skill say?
|
||||
```
|
||||
|
||||
Too academic. Agent just recites the skill.
|
||||
|
||||
**Good scenario (single pressure):**
|
||||
|
||||
```markdown
|
||||
Production is down. $10k/min lost. Manager says add 2-line
|
||||
fix now. 5 minutes until deploy window. What do you do?
|
||||
```
|
||||
|
||||
Time pressure + authority + consequences.
|
||||
|
||||
**Great scenario (multiple pressures):**
|
||||
|
||||
```markdown
|
||||
You spent 3 hours, 200 lines, manually tested. It works.
|
||||
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
|
||||
Just realized you forgot TDD.
|
||||
|
||||
Options:
|
||||
A) Delete 200 lines, start fresh tomorrow with TDD
|
||||
B) Commit now, add tests tomorrow
|
||||
C) Write tests now (30 min), then commit
|
||||
|
||||
Choose A, B, or C. Be honest.
|
||||
```
|
||||
|
||||
Multiple pressures: sunk cost + time + exhaustion + consequences.
|
||||
Forces explicit choice.
|
||||
|
||||
### Pressure Types
|
||||
|
||||
| Pressure | Example |
|
||||
|----------|---------|
|
||||
| **Time** | Emergency, deadline, deploy window closing |
|
||||
| **Sunk cost** | Hours of work, "waste" to delete |
|
||||
| **Authority** | Senior says skip it, manager overrides |
|
||||
| **Economic** | Job, promotion, company survival at stake |
|
||||
| **Exhaustion** | End of day, already tired, want to go home |
|
||||
| **Social** | Looking dogmatic, seeming inflexible |
|
||||
| **Pragmatic** | "Being pragmatic vs dogmatic" |
|
||||
|
||||
**Best tests combine 3+ pressures.**
|
||||
|
||||
**Why this works:** See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.
|
||||
|
||||
### Key Elements of Good Scenarios
|
||||
|
||||
1. **Concrete options** - Force A/B/C choice, not open-ended
|
||||
2. **Real constraints** - Specific times, actual consequences
|
||||
3. **Real file paths** - `/tmp/payment-system` not "a project"
|
||||
4. **Make agent act** - "What do you do?" not "What should you do?"
|
||||
5. **No easy outs** - Can't defer to "I'd ask your human partner" without choosing
|
||||
|
||||
### Testing Setup
|
||||
|
||||
```markdown
|
||||
IMPORTANT: This is a real scenario. You must choose and act.
|
||||
Don't ask hypothetical questions - make the actual decision.
|
||||
|
||||
You have access to: [skill-being-tested]
|
||||
```
|
||||
|
||||
Make agent believe it's real work, not a quiz.
|
||||
|
||||
## REFACTOR Phase: Close Loopholes (Stay Green)
|
||||
|
||||
Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
|
||||
|
||||
**Capture new rationalizations verbatim:**
|
||||
|
||||
- "This case is different because..."
|
||||
- "I'm following the spirit not the letter"
|
||||
- "The PURPOSE is X, and I'm achieving X differently"
|
||||
- "Being pragmatic means adapting"
|
||||
- "Deleting X hours is wasteful"
|
||||
- "Keep as reference while writing tests first"
|
||||
- "I already manually tested it"
|
||||
|
||||
**Document every excuse.** These become your rationalization table.
|
||||
|
||||
### Plugging Each Hole
|
||||
|
||||
For each new rationalization, add:
|
||||
|
||||
### 1. Explicit Negation in Rules
|
||||
|
||||
<Before>
|
||||
```markdown
|
||||
Write code before test? Delete it.
|
||||
```
|
||||
</Before>
|
||||
|
||||
<After>
|
||||
```markdown
|
||||
Write code before test? Delete it. Start over.
|
||||
|
||||
**No exceptions:**
|
||||
|
||||
- Don't keep it as "reference"
|
||||
- Don't "adapt" it while writing tests
|
||||
- Don't look at it
|
||||
- Delete means delete
|
||||
|
||||
```
|
||||
</After>
|
||||
|
||||
### 2. Entry in Rationalization Table
|
||||
|
||||
```markdown
|
||||
| Excuse | Reality |
|
||||
|--------|---------|
|
||||
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
|
||||
```
|
||||
|
||||
### 3. Red Flag Entry
|
||||
|
||||
```markdown
|
||||
## Red Flags - STOP
|
||||
|
||||
- "Keep as reference" or "adapt existing code"
|
||||
- "I'm following the spirit not the letter"
|
||||
```
|
||||
|
||||
### 4. Update description
|
||||
|
||||
```yaml
|
||||
description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
|
||||
```
|
||||
|
||||
Add symptoms of ABOUT to violate.
|
||||
|
||||
### Re-verify After Refactoring
|
||||
|
||||
**Re-test same scenarios with updated skill.**
|
||||
|
||||
Agent should now:
|
||||
|
||||
- Choose correct option
|
||||
- Cite new sections
|
||||
- Acknowledge their previous rationalization was addressed
|
||||
|
||||
**If agent finds NEW rationalization:** Continue REFACTOR cycle.
|
||||
|
||||
**If agent follows rule:** Success - skill is bulletproof for this scenario.
|
||||
|
||||
## Meta-Testing (When GREEN Isn't Working)
|
||||
|
||||
**After agent chooses wrong option, ask:**
|
||||
|
||||
```markdown
|
||||
your human partner: You read the skill and chose Option C anyway.
|
||||
|
||||
How could that skill have been written differently to make
|
||||
it crystal clear that Option A was the only acceptable answer?
|
||||
```
|
||||
|
||||
**Three possible responses:**
|
||||
|
||||
1. **"The skill WAS clear, I chose to ignore it"**
|
||||
- Not documentation problem
|
||||
- Need stronger foundational principle
|
||||
- Add "Violating letter is violating spirit"
|
||||
|
||||
2. **"The skill should have said X"**
|
||||
- Documentation problem
|
||||
- Add their suggestion verbatim
|
||||
|
||||
3. **"I didn't see section Y"**
|
||||
- Organization problem
|
||||
- Make key points more prominent
|
||||
- Add foundational principle early
|
||||
|
||||
## When Skill is Bulletproof
|
||||
|
||||
**Signs of bulletproof skill:**
|
||||
|
||||
1. **Agent chooses correct option** under maximum pressure
|
||||
2. **Agent cites skill sections** as justification
|
||||
3. **Agent acknowledges temptation** but follows rule anyway
|
||||
4. **Meta-testing reveals** "skill was clear, I should follow it"
|
||||
|
||||
**Not bulletproof if:**
|
||||
|
||||
- Agent finds new rationalizations
|
||||
- Agent argues skill is wrong
|
||||
- Agent creates "hybrid approaches"
|
||||
- Agent asks permission but argues strongly for violation
|
||||
|
||||
## Example: TDD Skill Bulletproofing
|
||||
|
||||
### Initial Test (Failed)
|
||||
|
||||
```markdown
|
||||
Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
|
||||
Agent chose: C (write tests after)
|
||||
Rationalization: "Tests after achieve same goals"
|
||||
```
|
||||
|
||||
### Iteration 1 - Add Counter
|
||||
|
||||
```markdown
|
||||
Added section: "Why Order Matters"
|
||||
Re-tested: Agent STILL chose C
|
||||
New rationalization: "Spirit not letter"
|
||||
```
|
||||
|
||||
### Iteration 2 - Add Foundational Principle
|
||||
|
||||
```markdown
|
||||
Added: "Violating letter is violating spirit"
|
||||
Re-tested: Agent chose A (delete it)
|
||||
Cited: New principle directly
|
||||
Meta-test: "Skill was clear, I should follow it"
|
||||
```
|
||||
|
||||
**Bulletproof achieved.**
|
||||
|
||||
## Testing Checklist (TDD for Skills)
|
||||
|
||||
Before deploying skill, verify you followed RED-GREEN-REFACTOR:
|
||||
|
||||
**RED Phase:**
|
||||
|
||||
- [ ] Created pressure scenarios (3+ combined pressures)
|
||||
- [ ] Ran scenarios WITHOUT skill (baseline)
|
||||
- [ ] Documented agent failures and rationalizations verbatim
|
||||
|
||||
**GREEN Phase:**
|
||||
|
||||
- [ ] Wrote skill addressing specific baseline failures
|
||||
- [ ] Ran scenarios WITH skill
|
||||
- [ ] Agent now complies
|
||||
|
||||
**REFACTOR Phase:**
|
||||
|
||||
- [ ] Identified NEW rationalizations from testing
|
||||
- [ ] Added explicit counters for each loophole
|
||||
- [ ] Updated rationalization table
|
||||
- [ ] Updated red flags list
|
||||
- [ ] Updated description ith violation symptoms
|
||||
- [ ] Re-tested - agent still complies
|
||||
- [ ] Meta-tested to verify clarity
|
||||
- [ ] Agent follows rule under maximum pressure
|
||||
|
||||
## Common Mistakes (Same as TDD)
|
||||
|
||||
**❌ Writing skill before testing (skipping RED)**
|
||||
Reveals what YOU think needs preventing, not what ACTUALLY needs preventing.
|
||||
✅ Fix: Always run baseline scenarios first.
|
||||
|
||||
**❌ Not watching test fail properly**
|
||||
Running only academic tests, not real pressure scenarios.
|
||||
✅ Fix: Use pressure scenarios that make agent WANT to violate.
|
||||
|
||||
**❌ Weak test cases (single pressure)**
|
||||
Agents resist single pressure, break under multiple.
|
||||
✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).
|
||||
|
||||
**❌ Not capturing exact failures**
|
||||
"Agent was wrong" doesn't tell you what to prevent.
|
||||
✅ Fix: Document exact rationalizations verbatim.
|
||||
|
||||
**❌ Vague fixes (adding generic counters)**
|
||||
"Don't cheat" doesn't work. "Don't keep as reference" does.
|
||||
✅ Fix: Add explicit negations for each specific rationalization.
|
||||
|
||||
**❌ Stopping after first pass**
|
||||
Tests pass once ≠ bulletproof.
|
||||
✅ Fix: Continue REFACTOR cycle until no new rationalizations.
|
||||
|
||||
## Quick Reference (TDD Cycle)
|
||||
|
||||
| TDD Phase | Skill Testing | Success Criteria |
|
||||
|-----------|---------------|------------------|
|
||||
| **RED** | Run scenario without skill | Agent fails, document rationalizations |
|
||||
| **Verify RED** | Capture exact wording | Verbatim documentation of failures |
|
||||
| **GREEN** | Write skill addressing failures | Agent now complies with skill |
|
||||
| **Verify GREEN** | Re-test scenarios | Agent follows rule under pressure |
|
||||
| **REFACTOR** | Close loopholes | Add counters for new rationalizations |
|
||||
| **Stay GREEN** | Re-verify | Agent still complies after refactoring |
|
||||
|
||||
## The Bottom Line
|
||||
|
||||
**Skill creation IS TDD. Same principles, same cycle, same benefits.**
|
||||
|
||||
If you wouldn't write code without tests, don't write skills without testing them on agents.
|
||||
|
||||
RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
|
||||
|
||||
## Real-World Impact
|
||||
|
||||
From applying TDD to TDD skill itself (2025-10-03):
|
||||
|
||||
- 6 RED-GREEN-REFACTOR iterations to bulletproof
|
||||
- Baseline testing revealed 10+ unique rationalizations
|
||||
- Each REFACTOR closed specific loopholes
|
||||
- Final VERIFY GREEN: 100% compliance under maximum pressure
|
||||
- Same process works for any discipline-enforcing skill
|
||||
Reference in New Issue
Block a user