<overview>
Skills improve through iteration and testing. This reference covers evaluation-driven development, the Claude A/Claude B iteration pattern, and XML structure validation during testing.
</overview>

<evaluation_driven_development>

<principle>
Create evaluations BEFORE writing extensive documentation. This ensures your skill solves real problems rather than documenting imagined ones.
</principle>

<workflow>

<step_1>
**Identify gaps**: Run Claude on representative tasks without a skill. Document specific failures or missing context.
</step_1>

<step_2>
**Create evaluations**: Build three scenarios that test these gaps.
</step_2>

<step_3>
**Establish baseline**: Measure Claude's performance without the skill.
</step_3>

<step_4>
**Write minimal instructions**: Create just enough content to address the gaps and pass evaluations.
</step_4>

<step_5>
**Iterate**: Execute evaluations, compare against baseline, and refine.
</step_5>

</workflow>

<evaluation_structure>

```json
{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using appropriate library",
    "Extracts text content from all pages without missing any",
    "Saves extracted text to output.txt in clear, readable format"
  ]
}
```
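
These cases can be executed with a small harness. Below is a minimal sketch, assuming cases are stored as individual `*.json` files in an `evals/` directory; `run_task` is a hypothetical stand-in for however you invoke Claude (SDK, CLI, or by hand), and grading against `expected_behavior` is left manual.

```python
# eval_runner.py - minimal harness sketch for the evaluation format above.
# Assumptions: cases are *.json files in evals/, and run_task is a
# hypothetical stand-in for however you invoke Claude.
import json
from pathlib import Path
from typing import Callable

def load_evals(eval_dir: str = "evals") -> list[dict]:
    """Read every evaluation case from the directory."""
    return [json.loads(p.read_text()) for p in sorted(Path(eval_dir).glob("*.json"))]

def run_suite(run_task: Callable[[dict], str], eval_dir: str = "evals") -> None:
    """Run each case and print a grading checklist for expected behaviors."""
    for case in load_evals(eval_dir):
        output = run_task(case)  # transcript or final answer from Claude
        print(f"\n=== {case['query']} ===")
        for behavior in case["expected_behavior"]:
            print(f"[ ] {behavior}")  # grade by hand or with an LLM judge
        print(f"--- output ---\n{output}")
```

Run the suite once without the skill to capture the baseline (step 3 above), then re-run it after each change.
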
</evaluation_structure>

<why_evaluations_first>
- Prevents documenting imagined problems
- Forces clarity about what success looks like
- Provides objective measurement of skill effectiveness
- Keeps skill focused on actual needs
- Enables quantitative improvement tracking
</why_evaluations_first>

</evaluation_driven_development>

<iterative_development_with_claude>

<principle>
The most effective skill development uses Claude itself. Work with "Claude A" (the expert who helps refine) to create skills used by "Claude B" (the agent executing tasks).
</principle>

<creating_skills>

<workflow>

<step_1>
**Complete task without skill**: Work through the problem with Claude A, noting what context you repeatedly provide.
</step_1>

<step_2>
**Ask Claude A to create skill**: "Create a skill that captures this pattern we just used"
</step_2>

<step_3>
**Review for conciseness**: Remove unnecessary explanations.
</step_3>

<step_4>
**Improve architecture**: Organize content with progressive disclosure.
</step_4>

<step_5>
**Test with Claude B**: Use a fresh instance to test on real tasks.
</step_5>

<step_6>
**Iterate based on observation**: Return to Claude A with specific issues observed.
</step_6>

</workflow>

<insight>
Claude models understand the skill format natively. Simply ask Claude to create a skill and it will generate properly structured SKILL.md content.
</insight>

</creating_skills>

<improving_skills>

<workflow>

<step_1>
**Use skill in real workflows**: Give Claude B actual tasks.
</step_1>

<step_2>
**Observe behavior**: Where does it struggle, succeed, or make unexpected choices?
</step_2>

<step_3>
**Return to Claude A**: Share observations and the current SKILL.md.
</step_3>

<step_4>
**Review suggestions**: Claude A might suggest reorganization, stronger language, or workflow restructuring.
</step_4>

<step_5>
**Apply and test**: Update the skill and test again.
</step_5>

<step_6>
**Repeat**: Continue based on real usage, not assumptions.
</step_6>

</workflow>

<what_to_watch_for>
- **Unexpected exploration paths**: Structure might not be intuitive
- **Missed connections**: Links might need to be more explicit
- **Overreliance on sections**: Consider moving frequently-read content to main SKILL.md
- **Ignored content**: Poorly signaled or unnecessary files
- **Metadata quality**: The name and description in your skill's metadata drive discovery
</what_to_watch_for>

</improving_skills>

</iterative_development_with_claude>

<model_testing>

<principle>
Test with all models you plan to use. Different models have different strengths and need different levels of detail.
</principle>

<haiku_testing>
**Claude Haiku** (fast, economical)

Questions to ask:
- Does the skill provide enough guidance?
- Are examples clear and complete?
- Are implicit assumptions made explicit?
- Does Haiku need more structure?

Haiku benefits from:
- More explicit instructions
- Complete examples (no partial code)
- Clear success criteria
- Step-by-step workflows
</haiku_testing>

<sonnet_testing>
**Claude Sonnet** (balanced)

Questions to ask:
- Is the skill clear and efficient?
- Does it avoid over-explanation?
- Are workflows well-structured?
- Does progressive disclosure work?

Sonnet benefits from:
- Balanced detail level
- XML structure for clarity
- Progressive disclosure
- Concise but complete guidance
</sonnet_testing>

<opus_testing>
**Claude Opus** (powerful reasoning)

Questions to ask:
- Does the skill avoid over-explaining?
- Can Opus infer obvious steps?
- Are constraints clear?
- Is context minimal but sufficient?

Opus benefits from:
- Concise instructions
- Principles over procedures
- High degrees of freedom
- Trust in reasoning capabilities
</opus_testing>

<balancing_across_models>
What works for Opus might need more detail for Haiku. Aim for instructions that work well across all target models, striking the balance that serves your target audience.

See [core-principles.md](core-principles.md) for model testing examples.
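
As a rough illustration, the sketch below sends one evaluation prompt to each target model through the Anthropic Python SDK and prints outputs and token usage side by side. The model aliases are examples (substitute the IDs you actually target), and a bare API call does not load skills, so swap in your skill-enabled runner for real skill testing.

```python
# compare_models.py - sketch comparing one prompt across target models.
# Model aliases below are examples; substitute current IDs. A bare API
# call does not load skills, so use your skill-enabled runner for real tests.
import anthropic

MODELS = [
    "claude-3-5-haiku-latest",   # fast, economical
    "claude-3-5-sonnet-latest",  # balanced
    "claude-3-opus-latest",      # powerful reasoning
]
PROMPT = "Extract all text from this PDF file and save it to output.txt"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

for model in MODELS:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {model}: {response.usage.output_tokens} output tokens ---")
    print(response.content[0].text)
```
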
</balancing_across_models>

</model_testing>

<xml_structure_validation>

<principle>
During testing, validate that your skill's XML structure is correct and complete.
</principle>

<validation_checklist>

After updating a skill, verify:

<required_tags_present>
- ✅ `<objective>` tag exists and defines what the skill does
- ✅ `<quick_start>` tag exists with immediate guidance
- ✅ `<success_criteria>` or `<when_successful>` tag exists
</required_tags_present>

<no_markdown_headings>
- ✅ No `#`, `##`, or `###` headings in the skill body
- ✅ All sections use XML tags instead
- ✅ Markdown formatting within tags is preserved (bold, italic, lists, code blocks)
</no_markdown_headings>

<proper_xml_nesting>
- ✅ All XML tags properly closed
- ✅ Nested tags have correct hierarchy
- ✅ No unclosed tags
</proper_xml_nesting>

<conditional_tags_appropriate>
- ✅ Conditional tags match skill complexity
- ✅ Simple skills use required tags only
- ✅ Complex skills add appropriate conditional tags
- ✅ No over-engineering or under-specifying
</conditional_tags_appropriate>

<reference_files_check>
- ✅ Reference files also use pure XML structure
- ✅ Links to reference files are correct
- ✅ References are one level deep from SKILL.md
</reference_files_check>
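
Much of this checklist can be automated. The sketch below is a rough structural check, not a strict XML parser (SKILL.md bodies mix Markdown inside the tags, so a real parser would reject them); the tag names it requires are the ones listed above.

```python
# validate_skill.py - rough structural check sketch, not a full XML parser.
import re
import sys

def validate(body: str) -> list[str]:
    problems = []
    # Required tags (success criteria may use either name).
    for tag in ("objective", "quick_start"):
        if f"<{tag}>" not in body:
            problems.append(f"missing <{tag}>")
    if "<success_criteria>" not in body and "<when_successful>" not in body:
        problems.append("missing <success_criteria> or <when_successful>")
    # No Markdown headings in the body (ignore lines inside code fences).
    in_fence = False
    for i, line in enumerate(body.splitlines(), 1):
        if line.startswith("```"):
            in_fence = not in_fence
        elif not in_fence and re.match(r"#{1,3} ", line):
            problems.append(f"line {i}: Markdown heading instead of an XML tag")
    # Every tag opened at the start of a line should close exactly as often.
    for tag in set(re.findall(r"^<([a-z_]+)[\s>]", body, flags=re.M)):
        opens = len(re.findall(rf"^<{tag}[\s>]", body, flags=re.M))
        closes = len(re.findall(rf"^</{tag}>", body, flags=re.M))
        if opens != closes:
            problems.append(f"unbalanced <{tag}> ({opens} opened, {closes} closed)")
    return problems

if __name__ == "__main__":
    issues = validate(open(sys.argv[1], encoding="utf-8").read())
    print("\n".join(issues) or "structure looks OK")
```
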
</validation_checklist>

<testing_xml_during_iteration>
When iterating on a skill:

1. Make changes to the XML structure
2. **Validate XML structure** (check tags, nesting, completeness)
3. Test with Claude on representative tasks
4. Observe whether the XML structure aids or hinders Claude's understanding
5. Iterate on the structure based on actual performance
</testing_xml_during_iteration>

</xml_structure_validation>

<observation_based_iteration>

<principle>
Iterate based on what you observe, not what you assume. Real usage reveals issues assumptions miss.
</principle>

<observation_categories>

<what_claude_reads>
Which sections does Claude actually read? Which are ignored? This reveals:
- Relevance of content
- Effectiveness of progressive disclosure
- Whether section names are clear
</what_claude_reads>

<where_claude_struggles>
Which tasks cause confusion or errors? This reveals:
- Missing context
- Unclear instructions
- Insufficient examples
- Ambiguous requirements
</where_claude_struggles>

<where_claude_succeeds>
Which tasks go smoothly? This reveals:
- Effective patterns
- Good examples
- Clear instructions
- Appropriate detail level
</where_claude_succeeds>

<unexpected_behaviors>
What does Claude do that surprises you? This reveals:
- Unstated assumptions
- Ambiguous phrasing
- Missing constraints
- Alternative interpretations
</unexpected_behaviors>

</observation_categories>

<iteration_pattern>
1. **Observe**: Run Claude on real tasks with the current skill
2. **Document**: Note specific issues, not general feelings
3. **Hypothesize**: Why did this issue occur?
4. **Fix**: Make targeted changes to address specific issues
5. **Test**: Verify the fix works on the same scenario
6. **Validate**: Ensure the fix doesn't break other scenarios
7. **Repeat**: Continue with the next observed issue
</iteration_pattern>

</observation_based_iteration>

<progressive_refinement>

<principle>
Skills don't need to be perfect initially. Start minimal, observe usage, add what's missing.
</principle>

<initial_version>
Start with:
- Valid YAML frontmatter (see the sketch below)
- Required XML tags: objective, quick_start, success_criteria
- Minimal working example
- Basic success criteria

Skip initially:
- Extensive examples
- Edge case documentation
- Advanced features
- Detailed reference files
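
The frontmatter check is easy to automate. A minimal sketch, assuming PyYAML is available (`pip install pyyaml`) and that `name` and `description` are the required fields; verify both assumptions against your skill format.

```python
# check_frontmatter.py - sketch verifying SKILL.md opens with YAML frontmatter
# containing the name and description fields that drive discovery.
# Assumes PyYAML is installed: pip install pyyaml
import sys
import yaml

def check_frontmatter(path: str) -> list[str]:
    text = open(path, encoding="utf-8").read()
    if not text.startswith("---\n") or text.count("---\n") < 2:
        return ["no YAML frontmatter block found"]
    _, frontmatter, _ = text.split("---\n", 2)
    try:
        meta = yaml.safe_load(frontmatter) or {}
    except yaml.YAMLError as err:
        return [f"invalid YAML: {err}"]
    return [f"missing '{key}' in frontmatter"
            for key in ("name", "description") if key not in meta]

if __name__ == "__main__":
    problems = check_frontmatter(sys.argv[1])
    print("\n".join(problems) or "frontmatter looks valid")
```
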
</initial_version>

<iteration_additions>
Add through iteration:
- Examples when patterns aren't clear from description
- Edge cases when observed in real usage
- Advanced features when users need them
- Reference files when SKILL.md approaches 500 lines
- Validation scripts when errors are common
</iteration_additions>

<benefits>
- Faster to initial working version
- Additions solve real needs, not imagined ones
- Keeps skills focused and concise
- Progressive disclosure emerges naturally
- Documentation stays aligned with actual usage
</benefits>

</progressive_refinement>

<testing_discovery>

<principle>
Test that Claude can discover and use your skill when appropriate.
</principle>

<discovery_testing>

<test_description>
Test whether Claude loads your skill when it should:

1. Start a fresh conversation (Claude B)
2. Ask a question that should trigger the skill
3. Check whether the skill was loaded
4. Verify the skill was used appropriately
</test_description>

<description_quality>
If the skill isn't discovered:
- Check that the description includes trigger keywords
- Verify the description is specific, not vague
- Ensure the description explains when to use the skill
- Test with different phrasings of the same request

The description is Claude's primary discovery mechanism.
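
Discovery itself is model-driven and can only be verified end to end, but a crude keyword-overlap check can flag test queries that share no vocabulary with your description before you spend a full run on them. A heuristic sketch (the example description is hypothetical):

```python
# discovery_check.py - crude heuristic sketch, NOT how Claude selects skills:
# it flags test queries sharing no keywords with the description, which often
# predicts discovery misses worth testing first.
import re

STOPWORDS = {"a", "an", "the", "to", "of", "and", "or", "in", "for", "this", "with", "from"}

def keywords(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def check_description(description: str, test_queries: list[str]) -> None:
    desc_words = keywords(description)
    for query in test_queries:
        overlap = keywords(query) & desc_words
        status = "ok" if overlap else "NO OVERLAP - likely miss"
        print(f"{status}: {query!r} (shared: {sorted(overlap)})")

check_description(
    "Extracts text and tables from PDF files, fills forms, merges documents",
    ["Extract all the text from this PDF", "Pull the tables out of report.pdf"],
)
```
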
</description_quality>

</discovery_testing>

</testing_discovery>

<common_iteration_patterns>

<pattern name="too_verbose">
**Observation**: Skill works but consumes excessive tokens

**Fix**:
- Remove obvious explanations
- Assume Claude knows common concepts
- Use examples instead of lengthy descriptions
- Move advanced content to reference files
</pattern>

<pattern name="too_minimal">
**Observation**: Claude makes incorrect assumptions or misses steps

**Fix**:
- Add explicit instructions where assumptions fail
- Provide complete working examples
- Define edge cases
- Add validation steps
</pattern>

<pattern name="poor_discovery">
**Observation**: Skill exists but Claude doesn't load it when needed

**Fix**:
- Improve the description with specific triggers
- Add relevant keywords
- Test the description against actual user queries
- Make the description more specific about use cases
</pattern>

<pattern name="unclear_structure">
**Observation**: Claude reads the wrong sections or misses relevant content

**Fix**:
- Use clearer XML tag names
- Reorganize content hierarchy
- Move frequently-needed content earlier
- Add explicit links to relevant sections
</pattern>

<pattern name="incomplete_examples">
**Observation**: Claude produces outputs that don't match the expected pattern

**Fix**:
- Add more examples showing the pattern
- Make examples more complete
- Show edge cases in examples
- Add anti-pattern examples (what not to do)
</pattern>

</common_iteration_patterns>

<iteration_velocity>

<principle>
Small, frequent iterations beat large, infrequent rewrites.
</principle>

<fast_iteration>
**Good approach**:
1. Make one targeted change
2. Test on a specific scenario
3. Verify improvement
4. Commit the change
5. Move to the next issue

Total time: Minutes per iteration
Iterations per day: 10-20
Learning rate: High
</fast_iteration>

<slow_iteration>
**Problematic approach**:
1. Accumulate many issues
2. Make a large refactor
3. Test everything at once
4. Debug multiple issues simultaneously
5. Struggle to know what fixed what

Total time: Hours per iteration
Iterations per day: 1-2
Learning rate: Low
</slow_iteration>

<benefits_of_fast_iteration>
- Isolates cause and effect
- Builds pattern recognition faster
- Wastes less work on wrong directions
- Easier to revert if needed
- Maintains momentum
</benefits_of_fast_iteration>

</iteration_velocity>

<success_metrics>

<principle>
Define how you'll measure whether the skill is working. Quantify success.
</principle>

<objective_metrics>
- **Success rate**: Percentage of tasks completed correctly
- **Token usage**: Average tokens consumed per task
- **Iteration count**: How many attempts it takes to get correct output
- **Error rate**: Percentage of tasks with errors
- **Discovery rate**: How often the skill loads when it should
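
A sketch of how these metrics might be computed from logged runs; the `RunRecord` fields are assumptions, so adapt them to whatever your harness actually records.

```python
# metrics.py - sketch computing the objective metrics above from run logs.
# The RunRecord fields are assumptions; adapt them to your harness.
from dataclasses import dataclass

@dataclass
class RunRecord:
    task: str
    succeeded: bool
    errored: bool
    tokens_used: int
    attempts: int        # tries before a correct output
    skill_loaded: bool   # did the skill trigger when it should have?

def summarize(runs: list[RunRecord]) -> dict[str, float]:
    n = len(runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "error_rate": sum(r.errored for r in runs) / n,
        "avg_tokens": sum(r.tokens_used for r in runs) / n,
        "avg_attempts": sum(r.attempts for r in runs) / n,
        "discovery_rate": sum(r.skill_loaded for r in runs) / n,
    }
```
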
</objective_metrics>

<subjective_metrics>
- **Output quality**: Does output meet requirements?
- **Appropriate detail**: Too verbose or too minimal?
- **Claude confidence**: Does Claude seem uncertain?
- **User satisfaction**: Does the skill solve the actual problem?
</subjective_metrics>

<tracking_improvement>
Compare metrics before and after changes:
- Baseline: Measure without the skill
- Initial: Measure with the first version
- Iteration N: Measure after each change

Track which changes improve which metrics. Double down on effective patterns.
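
Continuing the `summarize` sketch from the objective metrics above, a before-and-after comparison might look like this (`baseline_runs` and `iteration_runs` are hypothetical lists of logged `RunRecord`s):

```python
# Compare the baseline (no skill) against the current iteration.
baseline = summarize(baseline_runs)    # hypothetical: runs logged without the skill
current = summarize(iteration_runs)    # hypothetical: runs with the current version

for metric, base_value in baseline.items():
    delta = current[metric] - base_value
    print(f"{metric}: {base_value:.2f} -> {current[metric]:.2f} ({delta:+.2f})")
```
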
</tracking_improvement>

</success_metrics>