Skills improve through iteration and testing. This reference covers evaluation-driven development, Claude A/B testing patterns, and XML structure validation during testing.

**Evaluation-driven development**

Create evaluations BEFORE writing extensive documentation. This ensures your skill solves real problems rather than documenting imagined ones.

1. **Identify gaps**: Run Claude on representative tasks without a skill. Document specific failures or missing context.
2. **Create evaluations**: Build three scenarios that test these gaps.
3. **Establish baseline**: Measure Claude's performance without the skill.
4. **Write minimal instructions**: Create just enough content to address the gaps and pass evaluations.
5. **Iterate**: Execute evaluations, compare against baseline, and refine.

Example evaluation scenario:

```json
{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using appropriate library",
    "Extracts text content from all pages without missing any",
    "Saves extracted text to output.txt in clear, readable format"
  ]
}
```

Why evaluation-first works:

- Prevents documenting imagined problems
- Forces clarity about what success looks like
- Provides objective measurement of skill effectiveness
- Keeps the skill focused on actual needs
- Enables quantitative improvement tracking

**Claude A/B testing**

The most effective skill development uses Claude itself. Work with "Claude A" (an expert who helps refine the skill) to create skills used by "Claude B" (the agent executing tasks).

1. **Complete the task without the skill**: Work through the problem with Claude A, noting what context you repeatedly provide.
2. **Ask Claude A to create the skill**: "Create a skill that captures this pattern we just used."
3. **Review for conciseness**: Remove unnecessary explanations.
4. **Improve architecture**: Organize content with progressive disclosure.
5. **Test with Claude B**: Use a fresh instance to test on real tasks.
6. **Iterate based on observation**: Return to Claude A with specific issues observed.

Claude models understand the skill format natively. Simply ask Claude to create a skill and it will generate properly structured SKILL.md content.

The A/B iteration loop:

1. **Use the skill in real workflows**: Give Claude B actual tasks.
2. **Observe behavior**: Where does it struggle, succeed, or make unexpected choices?
3. **Return to Claude A**: Share your observations and the current SKILL.md.
4. **Review suggestions**: Claude A might suggest reorganization, stronger language, or workflow restructuring.
5. **Apply and test**: Update the skill and test again.
6. **Repeat**: Continue based on real usage, not assumptions.

What to watch for:

- **Unexpected exploration paths**: Structure might not be intuitive
- **Missed connections**: Links might need to be more explicit
- **Overreliance on sections**: Consider moving frequently-read content to main SKILL.md
- **Ignored content**: Poorly signaled or unnecessary files
- **Critical metadata**: The name and description in your skill's metadata are critical for discovery

**Testing across models**

Test with all models you plan to use. Different models have different strengths and need different levels of detail.

**Claude Haiku** (fast, economical)

Questions to ask:
- Does the skill provide enough guidance?
- Are examples clear and complete?
- Do implicit assumptions become explicit?
- Does Haiku need more structure?

Haiku benefits from:
- More explicit instructions
- Complete examples (no partial code)
- Clear success criteria
- Step-by-step workflows

**Claude Sonnet** (balanced)

Questions to ask:
- Is the skill clear and efficient?
- Does it avoid over-explanation?
- Are workflows well-structured?
- Does progressive disclosure work?

Sonnet benefits from:
- Balanced detail level
- XML structure for clarity
- Progressive disclosure
- Concise but complete guidance

**Claude Opus** (powerful reasoning)

Questions to ask:
- Does the skill avoid over-explaining?
- Can Opus infer obvious steps?
- Are constraints clear?
- Is context minimal but sufficient?

Opus benefits from:
- Concise instructions
- Principles over procedures
- High degrees of freedom
- Trust in reasoning capabilities

What works for Opus might need more detail for Haiku. Aim for instructions that work well across all target models, and find the balance that serves your target audience. See [core-principles.md](core-principles.md) for model testing examples.
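Scripting the evaluation run makes cross-model comparison repeatable. Below is a minimal sketch assuming the Anthropic Python SDK and the evaluation JSON format shown earlier; the model IDs and scenario path are illustrative substitutes for your targets, attaching `files` is omitted, and scoring against `expected_behavior` is left as a manual review.

```python
# Minimal sketch: run one evaluation scenario against several models and
# save transcripts for review. Assumes the Anthropic Python SDK
# (pip install anthropic); model IDs are examples -- substitute your targets.
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = [
    "claude-3-haiku-20240307",     # fast, economical
    "claude-3-5-sonnet-20240620",  # balanced
    "claude-3-opus-20240229",      # powerful reasoning
]

def run_scenario(scenario_path: str) -> None:
    scenario = json.loads(Path(scenario_path).read_text())
    Path("results").mkdir(exist_ok=True)
    for model in MODELS:
        # Attaching scenario["files"] is omitted in this sketch.
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=[{"role": "user", "content": scenario["query"]}],
        )
        # Save one transcript per model, then compare each against
        # scenario["expected_behavior"] by hand or with a grader.
        Path(f"results/{model}.txt").write_text(response.content[0].text)

run_scenario("evals/pdf-extraction.json")  # illustrative path
```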
**XML structure validation**

During testing, validate that your skill's XML structure is correct and complete. After updating a skill, verify:

- ✅ `<objective>` tag exists and defines what the skill does
- ✅ `<quick_start>` tag exists with immediate guidance
- ✅ `<success_criteria>` (or an equivalent validation tag) exists
- ✅ No `#`, `##`, or `###` headings in the skill body
- ✅ All sections use XML tags instead
- ✅ Markdown formatting within tags is preserved (bold, italic, lists, code blocks)
- ✅ All XML tags properly closed
- ✅ Nested tags have correct hierarchy
- ✅ No unclosed tags
- ✅ Conditional tags match skill complexity
- ✅ Simple skills use required tags only
- ✅ Complex skills add appropriate conditional tags
- ✅ No over-engineering or under-specifying
- ✅ Reference files also use pure XML structure
- ✅ Links to reference files are correct
- ✅ References are one level deep from SKILL.md

When iterating on a skill:

1. Make changes to the XML structure
2. **Validate the XML structure** (check tags, nesting, completeness)
3. Test with Claude on representative tasks
4. Observe whether the XML structure aids or hinders Claude's understanding
5. Iterate structure based on actual performance
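Step 2 of that loop is easy to script. Below is a rough, regex-based sketch, not an official validator: the required tag names come from this guide's own list, and the balance check is a simplification that ignores edge cases such as tags quoted inside fenced code blocks.

```python
# Rough sketch of an XML structure check for SKILL.md. Required tag names
# are taken from this guide; regex-based, so tags inside code fences can
# produce false positives.
import re
import sys
from pathlib import Path

REQUIRED_TAGS = ["objective", "quick_start", "success_criteria"]

def check_skill(path: str) -> list[str]:
    body = Path(path).read_text()
    problems = []
    # Required tags must be present.
    for tag in REQUIRED_TAGS:
        if f"<{tag}>" not in body:
            problems.append(f"missing required <{tag}> tag")
    # Markdown headings should not appear in the skill body.
    if re.search(r"^#{1,3} ", body, flags=re.MULTILINE):
        problems.append("markdown heading found; use XML tags instead")
    # Every opening tag should have a matching closing tag.
    for tag in set(re.findall(r"<([a-z_]+)>", body)):
        if body.count(f"<{tag}>") != body.count(f"</{tag}>"):
            problems.append(f"unbalanced <{tag}> tag")
    return problems

if __name__ == "__main__":
    issues = check_skill(sys.argv[1])
    print("\n".join(issues) if issues else "structure looks OK")
```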
**Observation-driven iteration**

Iterate based on what you observe, not what you assume. Real usage reveals issues assumptions miss.

Which sections does Claude actually read? Which are ignored? This reveals:
- Relevance of content
- Effectiveness of progressive disclosure
- Whether section names are clear

Which tasks cause confusion or errors? This reveals:
- Missing context
- Unclear instructions
- Insufficient examples
- Ambiguous requirements

Which tasks go smoothly? This reveals:
- Effective patterns
- Good examples
- Clear instructions
- Appropriate detail level

What does Claude do that surprises you? This reveals:
- Unstated assumptions
- Ambiguous phrasing
- Missing constraints
- Alternative interpretations

The iteration loop:

1. **Observe**: Run Claude on real tasks with the current skill
2. **Document**: Note specific issues, not general feelings
3. **Hypothesize**: Why did this issue occur?
4. **Fix**: Make targeted changes to address specific issues
5. **Test**: Verify the fix works on the same scenario
6. **Validate**: Ensure the fix doesn't break other scenarios
7. **Repeat**: Continue with the next observed issue

**Start minimal**

Skills don't need to be perfect initially. Start minimal, observe usage, add what's missing.

Start with:
- Valid YAML frontmatter
- Required XML tags: objective, quick_start, success_criteria
- A minimal working example
- Basic success criteria

Skip initially:
- Extensive examples
- Edge case documentation
- Advanced features
- Detailed reference files

Add through iteration:
- Examples when patterns aren't clear from the description
- Edge cases when observed in real usage
- Advanced features when users need them
- Reference files when SKILL.md approaches 500 lines
- Validation scripts when errors are common

Why start minimal:
- Faster to an initial working version
- Additions solve real needs, not imagined ones
- Keeps skills focused and concise
- Progressive disclosure emerges naturally
- Documentation stays aligned with actual usage

**Discovery testing**

Test that Claude can discover and use your skill when appropriate. To test if Claude loads your skill when it should:

1. Start a fresh conversation (Claude B)
2. Ask a question that should trigger the skill
3. Check if the skill was loaded
4. Verify the skill was used appropriately

If the skill isn't discovered:
- Check that the description includes trigger keywords
- Verify the description is specific, not vague
- Ensure the description explains when to use the skill
- Test with different phrasings of the same request

The description is Claude's primary discovery mechanism.

**Troubleshooting common issues**

**Observation**: Skill works but uses lots of tokens.
**Fix**:
- Remove obvious explanations
- Assume Claude knows common concepts
- Use examples instead of lengthy descriptions
- Move advanced content to reference files

**Observation**: Claude makes incorrect assumptions or misses steps.
**Fix**:
- Add explicit instructions where assumptions fail
- Provide complete working examples
- Define edge cases
- Add validation steps

**Observation**: Skill exists but Claude doesn't load it when needed.
**Fix**:
- Improve the description with specific triggers
- Add relevant keywords
- Test the description against actual user queries
- Make the description more specific about use cases

**Observation**: Claude reads wrong sections or misses relevant content.
**Fix**:
- Use clearer XML tag names
- Reorganize the content hierarchy
- Move frequently-needed content earlier
- Add explicit links to relevant sections

**Observation**: Claude produces outputs that don't match the expected pattern.
**Fix**:
- Add more examples showing the pattern
- Make examples more complete
- Show edge cases in examples
- Add anti-pattern examples (what not to do)

**Iterate in small steps**

Small, frequent iterations beat large, infrequent rewrites.

**Good approach**:
1. Make one targeted change
2. Test on a specific scenario
3. Verify the improvement
4. Commit the change
5. Move to the next issue

Total time: minutes per iteration. Iterations per day: 10-20. Learning rate: high.

**Problematic approach**:
1. Accumulate many issues
2. Make a large refactor
3. Test everything at once
4. Debug multiple issues simultaneously
5. Hard to know what fixed what

Total time: hours per iteration. Iterations per day: 1-2. Learning rate: low.

Small iterations:
- Isolate cause and effect
- Build pattern recognition faster
- Waste less work on wrong directions
- Are easier to revert if needed
- Maintain momentum

**Measure effectiveness**

Define how you'll measure if the skill is working. Quantify success.

Quantitative metrics:
- **Success rate**: Percentage of tasks completed correctly
- **Token usage**: Average tokens consumed per task
- **Iteration count**: How many tries to get correct output
- **Error rate**: Percentage of tasks with errors
- **Discovery rate**: How often the skill loads when it should

Qualitative signals:
- **Output quality**: Does output meet requirements?
- **Appropriate detail**: Too verbose or too minimal?
- **Claude confidence**: Does Claude seem uncertain?
- **User satisfaction**: Does the skill solve the actual problem?
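It helps to record one snapshot of these metrics per iteration. The structure below is an illustrative sketch: the field names mirror the quantitative metrics above and the values are made-up examples, not a standard format or real measurements.

```python
# Illustrative sketch: one metrics snapshot per skill iteration.
# Field names mirror the quantitative metrics above; values are
# made-up example numbers.
import json
from dataclasses import asdict, dataclass

@dataclass
class IterationMetrics:
    label: str             # "baseline", "v1", "v2", ...
    success_rate: float    # fraction of tasks completed correctly
    avg_tokens: float      # average tokens consumed per task
    error_rate: float      # fraction of tasks with errors
    discovery_rate: float  # how often the skill loads when it should

history = [
    IterationMetrics("baseline", 0.40, 5200, 0.35, 0.0),
    IterationMetrics("v1", 0.70, 4100, 0.15, 0.8),
]

# Persist the history so each change can be tied to the metrics it moved.
with open("metrics.json", "w") as f:
    json.dump([asdict(m) for m in history], f, indent=2)
```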
Compare metrics before and after changes:
- **Baseline**: Measure without the skill
- **Initial**: Measure with the first version
- **Iteration N**: Measure after each change

Track which changes improve which metrics. Double down on effective patterns.
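One way to do that tracking is to diff each iteration's snapshot against the baseline. A minimal sketch, assuming metrics are kept as plain dicts with made-up example values:

```python
# Minimal sketch: compare a later iteration against the baseline.
# Keys and values are illustrative examples, not real measurements.
baseline = {"success_rate": 0.40, "avg_tokens": 5200, "error_rate": 0.35}
current = {"success_rate": 0.70, "avg_tokens": 4100, "error_rate": 0.15}

for metric in baseline:
    delta = current[metric] - baseline[metric]
    print(f"{metric}: {baseline[metric]} -> {current[metric]} ({delta:+g})")
```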