<overview>
Skills improve through iteration and testing. This reference covers evaluation-driven development, the Claude A/Claude B development pattern, and XML structure validation during testing.
</overview>

<evaluation_driven_development>
<principle>
Create evaluations BEFORE writing extensive documentation. This ensures your skill solves real problems rather than documenting imagined ones.
</principle>

<workflow>
<step_1>
**Identify gaps**: Run Claude on representative tasks without a skill. Document specific failures or missing context.
</step_1>

<step_2>
**Create evaluations**: Build three scenarios that test these gaps.
</step_2>

<step_3>
**Establish baseline**: Measure Claude's performance without the skill.
</step_3>

<step_4>
**Write minimal instructions**: Create just enough content to address the gaps and pass evaluations.
</step_4>

<step_5>
**Iterate**: Execute evaluations, compare against baseline, and refine.
</step_5>
</workflow>

<evaluation_structure>
```json
{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using appropriate library",
    "Extracts text content from all pages without missing any",
    "Saves extracted text to output.txt in clear, readable format"
  ]
}
```
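
One way to exercise a scenario like this is a small runner that loads the evaluation file and records whether each expected behavior was satisfied. The sketch below is illustrative only: the `evals/` path is a hypothetical location, and grading happens by hand at a prompt; swap in an automated or rubric-based grader if you have one.

```python
import json
from pathlib import Path

def load_evaluation(path: str) -> dict:
    """Load one evaluation scenario from a JSON file in the format shown above."""
    return json.loads(Path(path).read_text())

def grade(evaluation: dict) -> dict:
    """Prompt a reviewer to mark each expected behavior pass/fail after reading Claude's transcript."""
    results = {}
    for behavior in evaluation["expected_behavior"]:
        answer = input(f"Satisfied? {behavior!r} [y/n] ")
        results[behavior] = answer.strip().lower().startswith("y")
    return results

if __name__ == "__main__":
    # Hypothetical path; point this at wherever you store evaluation scenarios.
    evaluation = load_evaluation("evals/pdf-processing/extract-text.json")
    print("Run this query against Claude (with and without the skill), then grade the transcript:")
    print("  query:", evaluation["query"])
    print("  files:", ", ".join(evaluation.get("files", [])))
    results = grade(evaluation)
    print(f"{sum(results.values())}/{len(results)} expected behaviors satisfied")
```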
</evaluation_structure>

<why_evaluations_first>
- Prevents documenting imagined problems
- Forces clarity about what success looks like
- Provides objective measurement of skill effectiveness
- Keeps skill focused on actual needs
- Enables quantitative improvement tracking
</why_evaluations_first>
</evaluation_driven_development>

<iterative_development_with_claude>
<principle>
The most effective skill development uses Claude itself. Work with "Claude A" (expert who helps refine) to create skills used by "Claude B" (agent executing tasks).
</principle>

<creating_skills>
<workflow>
<step_1>
**Complete task without skill**: Work through problem with Claude A, noting what context you repeatedly provide.
</step_1>

<step_2>
**Ask Claude A to create skill**: "Create a skill that captures this pattern we just used"
</step_2>

<step_3>
**Review for conciseness**: Remove unnecessary explanations.
</step_3>

<step_4>
**Improve architecture**: Organize content with progressive disclosure.
</step_4>

<step_5>
**Test with Claude B**: Use fresh instance to test on real tasks.
</step_5>

<step_6>
**Iterate based on observation**: Return to Claude A with specific issues observed.
</step_6>
</workflow>

<insight>
Claude models understand the skill format natively. Simply ask Claude to create a skill and it will generate properly structured SKILL.md content.
</insight>
</creating_skills>

<improving_skills>
<workflow>
<step_1>
**Use skill in real workflows**: Give Claude B actual tasks.
</step_1>

<step_2>
**Observe behavior**: Where does it struggle, succeed, or make unexpected choices?
</step_2>

<step_3>
**Return to Claude A**: Share observations and current SKILL.md.
</step_3>

<step_4>
**Review suggestions**: Claude A might suggest reorganization, stronger language, or workflow restructuring.
</step_4>

<step_5>
**Apply and test**: Update skill and test again.
</step_5>

<step_6>
**Repeat**: Continue based on real usage, not assumptions.
</step_6>
</workflow>

<what_to_watch_for>
- **Unexpected exploration paths**: Structure might not be intuitive
- **Missed connections**: Links might need to be more explicit
- **Overreliance on sections**: Consider moving frequently-read content to main SKILL.md
- **Ignored content**: Poorly signaled or unnecessary files
- **Metadata quality**: The name and description in your skill's metadata are critical for discovery
</what_to_watch_for>
</improving_skills>
</iterative_development_with_claude>

<model_testing>
<principle>
Test with all models you plan to use. Different models have different strengths and need different levels of detail.
</principle>

<haiku_testing>
**Claude Haiku** (fast, economical)

Questions to ask:
- Does the skill provide enough guidance?
- Are examples clear and complete?
- Are implicit assumptions made explicit?
- Does Haiku need more structure?

Haiku benefits from:
- More explicit instructions
- Complete examples (no partial code)
- Clear success criteria
- Step-by-step workflows
</haiku_testing>

<sonnet_testing>
**Claude Sonnet** (balanced)

Questions to ask:
- Is the skill clear and efficient?
- Does it avoid over-explanation?
- Are workflows well-structured?
- Does progressive disclosure work?

Sonnet benefits from:
- Balanced detail level
- XML structure for clarity
- Progressive disclosure
- Concise but complete guidance
</sonnet_testing>

<opus_testing>
**Claude Opus** (powerful reasoning)

Questions to ask:
- Does the skill avoid over-explaining?
- Can Opus infer obvious steps?
- Are constraints clear?
- Is context minimal but sufficient?

Opus benefits from:
- Concise instructions
- Principles over procedures
- High degrees of freedom
- Trust in reasoning capabilities
</opus_testing>

<balancing_across_models>
What works for Opus might need more detail for Haiku. Aim for instructions that work well across all target models, and find the balance that serves your target audience.

See [core-principles.md](core-principles.md) for model testing examples.
</balancing_across_models>
</model_testing>

<xml_structure_validation>
<principle>
During testing, validate that your skill's XML structure is correct and complete.
</principle>

<validation_checklist>
After updating a skill, verify the following (a validation sketch follows this checklist):

<required_tags_present>
- ✅ `<objective>` tag exists and defines what skill does
- ✅ `<quick_start>` tag exists with immediate guidance
- ✅ `<success_criteria>` or `<when_successful>` tag exists
</required_tags_present>

<no_markdown_headings>
- ✅ No `#`, `##`, or `###` headings in skill body
- ✅ All sections use XML tags instead
- ✅ Markdown formatting within tags is preserved (bold, italic, lists, code blocks)
</no_markdown_headings>

<proper_xml_nesting>
- ✅ All XML tags properly closed
- ✅ Nested tags have correct hierarchy
- ✅ No unclosed tags
</proper_xml_nesting>

<conditional_tags_appropriate>
- ✅ Conditional tags match skill complexity
- ✅ Simple skills use required tags only
- ✅ Complex skills add appropriate conditional tags
- ✅ No over-engineering or under-specifying
</conditional_tags_appropriate>

<reference_files_check>
- ✅ Reference files also use pure XML structure
- ✅ Links to reference files are correct
- ✅ References are one level deep from SKILL.md
</reference_files_check>
</validation_checklist>
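
Most of the checklist above can be pre-checked with a small script before testing with Claude. The sketch below uses simple pattern matching rather than a real XML parser, and the required-tag list mirrors this reference; treat it as a heuristic starting point, not a definitive validator.

```python
import re
from pathlib import Path

REQUIRED_TAGS = ["objective", "quick_start"]
SUCCESS_TAGS = ["success_criteria", "when_successful"]  # either one satisfies the checklist

def validate_skill(path: str) -> list[str]:
    """Return a list of problems found in a SKILL.md file (empty list = checks passed)."""
    text = Path(path).read_text()
    body = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)  # drop YAML frontmatter if present
    problems = []

    for tag in REQUIRED_TAGS:
        if f"<{tag}>" not in body:
            problems.append(f"missing required tag <{tag}>")
    if not any(f"<{tag}>" in body for tag in SUCCESS_TAGS):
        problems.append("missing <success_criteria> or <when_successful>")

    # Markdown headings should not appear in the skill body (code blocks are not excluded here).
    if re.search(r"^#{1,3} ", body, flags=re.MULTILINE):
        problems.append("markdown heading found in skill body; use XML tags instead")

    # Crude balance check: every opening tag should have a matching closing tag.
    opened = re.findall(r"<([a-z_]\w*)>", body)
    closed = re.findall(r"</([a-z_]\w*)>", body)
    for tag in set(opened):
        if opened.count(tag) != closed.count(tag):
            problems.append(f"unbalanced tag <{tag}>")
    return problems

if __name__ == "__main__":
    # Hypothetical path; point this at the skill you are iterating on.
    for problem in validate_skill("skills/pdf-processing/SKILL.md"):
        print("FAIL:", problem)
```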
<testing_xml_during_iteration>
When iterating on a skill:

1. Make changes to XML structure
2. **Validate XML structure** (check tags, nesting, completeness)
3. Test with Claude on representative tasks
4. Observe if XML structure aids or hinders Claude's understanding
5. Iterate structure based on actual performance
</testing_xml_during_iteration>
</xml_structure_validation>

<observation_based_iteration>
<principle>
Iterate based on what you observe, not what you assume. Real usage reveals issues assumptions miss.
</principle>

<observation_categories>
<what_claude_reads>
Which sections does Claude actually read? Which are ignored? This reveals:
- Relevance of content
- Effectiveness of progressive disclosure
- Whether section names are clear
</what_claude_reads>

<where_claude_struggles>
Which tasks cause confusion or errors? This reveals:
- Missing context
- Unclear instructions
- Insufficient examples
- Ambiguous requirements
</where_claude_struggles>

<where_claude_succeeds>
Which tasks go smoothly? This reveals:
- Effective patterns
- Good examples
- Clear instructions
- Appropriate detail level
</where_claude_succeeds>

<unexpected_behaviors>
What does Claude do that surprises you? This reveals:
- Unstated assumptions
- Ambiguous phrasing
- Missing constraints
- Alternative interpretations
</unexpected_behaviors>
</observation_categories>

<iteration_pattern>
1. **Observe**: Run Claude on real tasks with current skill
2. **Document**: Note specific issues, not general feelings
3. **Hypothesize**: Why did this issue occur?
4. **Fix**: Make targeted changes to address specific issues
5. **Test**: Verify fix works on same scenario
6. **Validate**: Ensure fix doesn't break other scenarios
7. **Repeat**: Continue with next observed issue
</iteration_pattern>
</observation_based_iteration>

<progressive_refinement>
<principle>
Skills don't need to be perfect initially. Start minimal, observe usage, add what's missing.
</principle>

<initial_version>
Start with:
- Valid YAML frontmatter
- Required XML tags: objective, quick_start, success_criteria
- Minimal working example
- Basic success criteria

Skip initially:
- Extensive examples
- Edge case documentation
- Advanced features
- Detailed reference files

A scaffold sketch of this minimal starting point follows this subsection.
</initial_version>
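
As an illustration of that minimal starting point, the sketch below writes a SKILL.md scaffold containing frontmatter plus the three required tags. The directory path, name, and description are placeholders, and the exact frontmatter fields should follow whatever your skill format expects.

```python
from pathlib import Path
from textwrap import dedent

def scaffold_skill(directory: str, name: str, description: str) -> Path:
    """Write a minimal SKILL.md: frontmatter plus the required XML tags, nothing more."""
    skill_md = dedent(f"""\
        ---
        name: {name}
        description: {description}
        ---

        <objective>
        Describe what this skill does and the problem it solves.
        </objective>

        <quick_start>
        Give the shortest path to a correct result, with one minimal working example.
        </quick_start>

        <success_criteria>
        State how to tell the output is correct.
        </success_criteria>
        """)
    path = Path(directory) / "SKILL.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(skill_md)
    return path

if __name__ == "__main__":
    # Placeholder name and description; write a specific, trigger-rich description for discovery.
    scaffold_skill(
        "skills/pdf-processing",
        "pdf-processing",
        "Extract text and data from PDF files. Use when reading, converting, or summarizing PDFs.",
    )
```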
<iteration_additions>
Add through iteration:
- Examples when patterns aren't clear from description
- Edge cases when observed in real usage
- Advanced features when users need them
- Reference files when SKILL.md approaches 500 lines
- Validation scripts when errors are common
</iteration_additions>

<benefits>
- Faster to initial working version
- Additions solve real needs, not imagined ones
- Keeps skills focused and concise
- Progressive disclosure emerges naturally
- Documentation stays aligned with actual usage
</benefits>
</progressive_refinement>

<testing_discovery>
<principle>
Test that Claude can discover and use your skill when appropriate.
</principle>

<discovery_testing>
<test_description>
Test if Claude loads your skill when it should:

1. Start fresh conversation (Claude B)
2. Ask question that should trigger skill
3. Check if skill was loaded
4. Verify skill was used appropriately
</test_description>

<description_quality>
If skill isn't discovered:
- Check description includes trigger keywords
- Verify description is specific, not vague
- Ensure description explains when to use skill
- Test with different phrasings of the same request

The description is Claude's primary discovery mechanism. A quick keyword-overlap check is sketched below.
</description_quality>
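
Before running live discovery tests, a cheap pre-check is to compare the description against the phrasings users actually send. The sketch below flags sample queries that share no content words with the description; the description, queries, and stopword list are illustrative assumptions, and a real discovery test still needs a fresh Claude instance.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "to", "from", "this", "that", "of", "in", "for", "use", "when", "or"}

def content_words(text: str) -> set[str]:
    """Lowercase words minus stopwords, so only meaningful trigger terms are compared."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

# Illustrative description and queries; substitute your skill's description and real user phrasings.
description = (
    "Extract text and data from PDF files. "
    "Use when reading, converting, or summarizing PDFs."
)
sample_queries = [
    "Pull all the text out of this PDF",
    "Summarize the attached report.pdf",
    "Convert these scanned pages to plain text",
]

for query in sample_queries:
    shared = content_words(description) & content_words(query)
    status = "ok" if shared else "no overlap - consider adding trigger keywords"
    print(f"{query!r}: {status} {sorted(shared)}")
```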
</discovery_testing>
</testing_discovery>

<common_iteration_patterns>
<pattern name="too_verbose">
**Observation**: Skill works but uses lots of tokens

**Fix**:
- Remove obvious explanations
- Assume Claude knows common concepts
- Use examples instead of lengthy descriptions
- Move advanced content to reference files
</pattern>

<pattern name="too_minimal">
**Observation**: Claude makes incorrect assumptions or misses steps

**Fix**:
- Add explicit instructions where assumptions fail
- Provide complete working examples
- Define edge cases
- Add validation steps
</pattern>

<pattern name="poor_discovery">
**Observation**: Skill exists but Claude doesn't load it when needed

**Fix**:
- Improve description with specific triggers
- Add relevant keywords
- Test description against actual user queries
- Make description more specific about use cases
</pattern>

<pattern name="unclear_structure">
**Observation**: Claude reads wrong sections or misses relevant content

**Fix**:
- Use clearer XML tag names
- Reorganize content hierarchy
- Move frequently-needed content earlier
- Add explicit links to relevant sections
</pattern>

<pattern name="incomplete_examples">
**Observation**: Claude produces outputs that don't match expected pattern

**Fix**:
- Add more examples showing pattern
- Make examples more complete
- Show edge cases in examples
- Add anti-pattern examples (what not to do)
</pattern>
</common_iteration_patterns>

<iteration_velocity>
<principle>
Small, frequent iterations beat large, infrequent rewrites.
</principle>

<fast_iteration>
**Good approach**:
1. Make one targeted change
2. Test on specific scenario
3. Verify improvement
4. Commit change
5. Move to next issue

Total time: Minutes per iteration
Iterations per day: 10-20
Learning rate: High
</fast_iteration>

<slow_iteration>
**Problematic approach**:
1. Accumulate many issues
2. Make large refactor
3. Test everything at once
4. Debug multiple issues simultaneously
5. Hard to know what fixed what

Total time: Hours per iteration
Iterations per day: 1-2
Learning rate: Low
</slow_iteration>

<benefits_of_fast_iteration>
- Isolate cause and effect
- Build pattern recognition faster
- Less wasted work from wrong directions
- Easier to revert if needed
- Maintains momentum
</benefits_of_fast_iteration>
</iteration_velocity>

<success_metrics>
<principle>
Define how you'll measure if the skill is working. Quantify success.
</principle>

<objective_metrics>
- **Success rate**: Percentage of tasks completed correctly
- **Token usage**: Average tokens consumed per task
- **Iteration count**: How many tries to get correct output
- **Error rate**: Percentage of tasks with errors
- **Discovery rate**: How often skill loads when it should
</objective_metrics>

<subjective_metrics>
- **Output quality**: Does output meet requirements?
- **Appropriate detail**: Too verbose or too minimal?
- **Claude confidence**: Does Claude seem uncertain?
- **User satisfaction**: Does skill solve the actual problem?
</subjective_metrics>

<tracking_improvement>
Compare metrics before and after changes:
- Baseline: Measure without skill
- Initial: Measure with first version
- Iteration N: Measure after each change

Track which changes improve which metrics. Double down on effective patterns. A minimal tracking sketch follows below.
</tracking_improvement>
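
A tiny record-keeping layer makes these comparisons concrete. The sketch below stores one snapshot per skill version and prints each metric's change against the baseline; the metric names follow the objective metrics above, and the sample numbers are placeholders.

```python
from dataclasses import dataclass, asdict

@dataclass
class MetricsSnapshot:
    label: str             # "baseline", "v1", "v2", ...
    success_rate: float    # fraction of tasks completed correctly
    discovery_rate: float  # fraction of relevant tasks where the skill loaded
    avg_tokens: int        # average tokens consumed per task

def compare(baseline: MetricsSnapshot, current: MetricsSnapshot) -> None:
    """Print each metric next to its change relative to the baseline."""
    for field, value in asdict(current).items():
        if field == "label":
            continue
        delta = value - getattr(baseline, field)
        print(f"{current.label} {field}: {value} ({delta:+g})")

# Placeholder numbers; replace with measurements from your own evaluation runs.
baseline = MetricsSnapshot("baseline", success_rate=0.40, discovery_rate=0.00, avg_tokens=5200)
v1 = MetricsSnapshot("v1", success_rate=0.75, discovery_rate=0.80, avg_tokens=3900)
compare(baseline, v1)
```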
</success_metrics>