<evaluation_driven_development> Create evaluations BEFORE writing extensive documentation. This ensures your skill solves real problems rather than documenting imagined ones.
<step_1> Identify gaps: Run Claude on representative tasks without a skill. Document specific failures or missing context. </step_1>
<step_2> Create evaluations: Build three scenarios that test these gaps. </step_2>
<step_3> Establish baseline: Measure Claude's performance without the skill. </step_3>
<step_4> Write minimal instructions: Create just enough content to address the gaps and pass evaluations. </step_4>
<step_5> Iterate: Execute evaluations, compare against baseline, and refine. </step_5>
<evaluation_structure>
{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using appropriate library",
    "Extracts text content from all pages without missing any",
    "Saves extracted text to output.txt in clear, readable format"
  ]
}
</evaluation_structure>
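A case like this can be exercised with a small harness. Below is a minimal sketch, assuming two hypothetical callables you would supply from your own tooling: `run_claude_task`, which executes the query with the given files and skills and returns a transcript, and `grade`, which judges one expected behavior against that transcript (in practice this is often manual review or a grader prompt).

```python
import json
from pathlib import Path

def run_evaluation(eval_path: str, run_claude_task, grade) -> dict:
    """Run one evaluation case and grade it against its expected behaviors.

    Both callables are placeholders for whatever harness you use:
      run_claude_task(query, files, skills) -> transcript string
      grade(transcript, expected_behavior)  -> True or False per behavior
    """
    case = json.loads(Path(eval_path).read_text())
    transcript = run_claude_task(case["query"], case["files"], case["skills"])
    results = [grade(transcript, behavior) for behavior in case["expected_behavior"]]
    return {
        "case": eval_path,
        "passed": all(results),
        "behaviors": dict(zip(case["expected_behavior"], results)),
    }

# Baseline comparison: run the same cases once with skills=[] and once with the
# skill enabled, then compare pass rates across the two runs.
```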
<why_evaluations_first>
- Prevents documenting imagined problems
- Forces clarity about what success looks like
- Provides objective measurement of skill effectiveness
- Keeps skill focused on actual needs
- Enables quantitative improvement tracking </why_evaluations_first> </evaluation_driven_development>
<iterative_development_with_claude> The most effective skill development uses Claude itself. Work with "Claude A" (expert who helps refine) to create skills used by "Claude B" (agent executing tasks).
<creating_skills> <step_1> Complete task without skill: Work through problem with Claude A, noting what context you repeatedly provide. </step_1>
<step_2> Ask Claude A to create skill: "Create a skill that captures this pattern we just used" </step_2>
<step_3> Review for conciseness: Remove unnecessary explanations. </step_3>
<step_4> Improve architecture: Organize content with progressive disclosure. </step_4>
<step_5> Test with Claude B: Use fresh instance to test on real tasks. </step_5>
<step_6> Iterate based on observation: Return to Claude A with specific issues observed. </step_6>
Claude models understand the skill format natively. Simply ask Claude to create a skill and it will generate properly structured SKILL.md content. </creating_skills>
<improving_skills> <step_1> Use skill in real workflows: Give Claude B actual tasks. </step_1>
<step_2> Observe behavior: Where does it struggle, succeed, or make unexpected choices? </step_2>
<step_3> Return to Claude A: Share observations and current SKILL.md. </step_3>
<step_4> Review suggestions: Claude A might suggest reorganization, stronger language, or workflow restructuring. </step_4>
<step_5> Apply and test: Update skill and test again. </step_5>
<step_6> Repeat: Continue based on real usage, not assumptions. </step_6>
<what_to_watch_for>
- Unexpected exploration paths: Structure might not be intuitive
- Missed connections: Links might need to be more explicit
- Overreliance on reference files: Consider moving frequently-read content into the main SKILL.md
- Ignored content: Poorly signaled or unnecessary files
- Critical metadata: The name and description in your skill's metadata are essential for discovery </what_to_watch_for> </improving_skills> </iterative_development_with_claude>
<model_testing> Test with all models you plan to use. Different models have different strengths and need different levels of detail.
<haiku_testing> Claude Haiku (fast, economical)
Questions to ask:
- Does the skill provide enough guidance?
- Are examples clear and complete?
- Are implicit assumptions made explicit?
- Does Haiku need more structure?
Haiku benefits from:
- More explicit instructions
- Complete examples (no partial code)
- Clear success criteria
- Step-by-step workflows </haiku_testing>
<sonnet_testing> Claude Sonnet (balanced)
Questions to ask:
- Is the skill clear and efficient?
- Does it avoid over-explanation?
- Are workflows well-structured?
- Does progressive disclosure work?
Sonnet benefits from:
- Balanced detail level
- XML structure for clarity
- Progressive disclosure
- Concise but complete guidance </sonnet_testing>
<opus_testing> Claude Opus (powerful reasoning)
Questions to ask:
- Does the skill avoid over-explaining?
- Can Opus infer obvious steps?
- Are constraints clear?
- Is context minimal but sufficient?
Opus benefits from:
- Concise instructions
- Principles over procedures
- High degrees of freedom
- Trust in reasoning capabilities </opus_testing>
<balancing_across_models> What works for Opus might need more detail for Haiku. Aim for instructions that work well across all target models. Find the balance that serves your target audience.
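As an illustrative sketch (separate from the examples in core-principles.md), the same instruction can be written at two levels of detail; the script path here is hypothetical:

```text
Concise (often enough for Opus):
  Validate the extracted data before saving it.

Explicit (often needed for Haiku):
  1. Run `python scripts/validate.py output.json` to check the extracted data.
  2. If validation reports errors, fix them and re-run the script.
  3. Save output.json only after validation passes with no errors.
```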
See core-principles.md for model testing examples. </balancing_across_models> </model_testing>
<xml_structure_validation> During testing, validate that your skill's XML structure is correct and complete.
<validation_checklist> After updating a skill, verify:
<required_tags_present>
- ✅ <objective> tag exists and defines what skill does
- ✅ <quick_start> tag exists with immediate guidance
- ✅ <success_criteria> or <when_successful> tag exists </required_tags_present>
<no_markdown_headings>
- ✅ No #, ##, or ### headings in skill body
- ✅ All sections use XML tags instead
- ✅ Markdown formatting within tags is preserved (bold, italic, lists, code blocks) </no_markdown_headings>
<proper_xml_nesting>
- ✅ All XML tags properly closed
- ✅ Nested tags have correct hierarchy
- ✅ No unclosed tags </proper_xml_nesting>
<conditional_tags_appropriate>
- ✅ Conditional tags match skill complexity
- ✅ Simple skills use required tags only
- ✅ Complex skills add appropriate conditional tags
- ✅ No over-engineering or under-specifying </conditional_tags_appropriate>
<reference_files_check>
- ✅ Reference files also use pure XML structure
- ✅ Links to reference files are correct
- ✅ References are one level deep from SKILL.md </reference_files_check> </validation_checklist>
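Several of these checks can be automated. A minimal sketch follows, assuming the skill body is plain text with XML-style section tags; it does not parse YAML frontmatter or skip code blocks, so treat any reported problem as a prompt for manual review rather than a definitive failure:

```python
import re
from collections import Counter
from pathlib import Path

REQUIRED_TAGS = ["objective", "quick_start"]  # plus success_criteria or when_successful

def validate_skill(path: str) -> list[str]:
    """Return a list of problems found in a SKILL.md body (empty list = checks passed)."""
    body = Path(path).read_text()
    problems = []

    # Required tags present
    for tag in REQUIRED_TAGS:
        if f"<{tag}>" not in body:
            problems.append(f"missing <{tag}> tag")
    if "<success_criteria>" not in body and "<when_successful>" not in body:
        problems.append("missing <success_criteria> or <when_successful> tag")

    # No markdown headings in the body (naive: also flags '#' lines inside code blocks)
    if re.search(r"^#{1,6}\s", body, flags=re.MULTILINE):
        problems.append("markdown heading found; use XML tags for sections instead")

    # Every opening tag has a matching closing tag
    opened = Counter(re.findall(r"<([a-z0-9_]+)>", body))
    closed = Counter(re.findall(r"</([a-z0-9_]+)>", body))
    for tag, count in opened.items():
        if closed[tag] != count:
            problems.append(f"<{tag}> opened {count} time(s) but closed {closed[tag]} time(s)")

    return problems
```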
<testing_xml_during_iteration> When iterating on a skill:
- Make changes to XML structure
- Validate XML structure (check tags, nesting, completeness)
- Test with Claude on representative tasks
- Observe if XML structure aids or hinders Claude's understanding
- Iterate structure based on actual performance </testing_xml_during_iteration> </xml_structure_validation>
<observation_based_iteration> Iterate based on what you observe, not what you assume. Real usage reveals issues assumptions miss.
<observation_categories> <what_claude_reads> Which sections does Claude actually read? Which are ignored? This reveals:
- Relevance of content
- Effectiveness of progressive disclosure
- Whether section names are clear </what_claude_reads>
<where_claude_struggles> Which tasks cause confusion or errors? This reveals:
- Missing context
- Unclear instructions
- Insufficient examples
- Ambiguous requirements </where_claude_struggles>
<where_claude_succeeds> Which tasks go smoothly? This reveals:
- Effective patterns
- Good examples
- Clear instructions
- Appropriate detail level </where_claude_succeeds>
<unexpected_behaviors> What does Claude do that surprises you? This reveals:
- Unstated assumptions
- Ambiguous phrasing
- Missing constraints
- Alternative interpretations </unexpected_behaviors> </observation_categories>
<iteration_pattern>
- Observe: Run Claude on real tasks with current skill
- Document: Note specific issues, not general feelings
- Hypothesize: Why did this issue occur?
- Fix: Make targeted changes to address specific issues
- Test: Verify fix works on same scenario
- Validate: Ensure fix doesn't break other scenarios
- Repeat: Continue with next observed issue </iteration_pattern> </observation_based_iteration>
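As a hedged illustration of "specific issues, not general feelings", an observation log entry might look like the following; the task, file names, and field names are hypothetical, not a required schema:

```json
{
  "task": "Extract tables from quarterly-report.pdf",
  "observation": "Claude read reference/forms.md instead of reference/tables.md",
  "hypothesis": "The link text 'form handling' does not signal table extraction",
  "fix": "Rename the link and add one line describing when to open tables.md",
  "result": "Correct file chosen on retest"
}
```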
<progressive_refinement> Skills don't need to be perfect initially. Start minimal, observe usage, add what's missing.
<initial_version> Start with:
- Valid YAML frontmatter
- Required XML tags: objective, quick_start, success_criteria
- Minimal working example
- Basic success criteria
Skip initially:
- Extensive examples
- Edge case documentation
- Advanced features
- Detailed reference files </initial_version>
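A minimal sketch of what that starting point can look like; the name, description, and wording are placeholders, not prescribed content:

```markdown
---
name: pdf-processing
description: Extract text and tables from PDF files. Use when the user asks to read, parse, or convert PDF documents.
---

<objective> Extract content from PDF files reliably and save it in the format the user asks for. </objective>

<quick_start> Open the file with a PDF library, extract text page by page, and write the result to the requested output file. </quick_start>

<success_criteria> All pages are processed, the output file exists, and the text is readable. </success_criteria>
```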
<iteration_additions> Add through iteration:
- Examples when patterns aren't clear from description
- Edge cases when observed in real usage
- Advanced features when users need them
- Reference files when SKILL.md approaches 500 lines
- Validation scripts when errors are common </iteration_additions> </progressive_refinement>
<testing_discovery> Test that Claude can discover and use your skill when appropriate.
<discovery_testing> <test_description> Test if Claude loads your skill when it should:
- Start fresh conversation (Claude B)
- Ask question that should trigger skill
- Check if skill was loaded
- Verify skill was used appropriately </test_description>
<description_quality> If skill isn't discovered:
- Check description includes trigger keywords
- Verify description is specific, not vague
- Ensure description explains when to use skill
- Test with different phrasings of the same request
The description is Claude's primary discovery mechanism. </description_quality> </discovery_testing> </testing_discovery>
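For example, compare a vague description with one that carries its own trigger keywords (both hypothetical):

```yaml
# Too vague: no trigger keywords, no sense of when to load the skill
description: Helps with documents.

# Specific: names the file type, the operations, and when to use the skill
description: Extract text and tables from PDF files. Use when the user asks to read, parse, split, or convert PDF documents.
```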
<common_iteration_patterns> Observation: Skill works but uses lots of tokens
Fix:
- Remove obvious explanations
- Assume Claude knows common concepts
- Use examples instead of lengthy descriptions
- Move advanced content to reference files
Observation: Claude makes mistakes or skips steps when using the skill
Fix:
- Add explicit instructions where assumptions fail
- Provide complete working examples
- Define edge cases
- Add validation steps
Observation: Skill isn't discovered when it should be
Fix:
- Improve description with specific triggers
- Add relevant keywords
- Test description against actual user queries
- Make description more specific about use cases
Observation: Claude can't find the content it needs or navigates the skill in unexpected ways
Fix:
- Use clearer XML tag names
- Reorganize content hierarchy
- Move frequently-needed content earlier
- Add explicit links to relevant sections
Observation: Claude's output doesn't follow the intended patterns
Fix:
- Add more examples showing pattern
- Make examples more complete
- Show edge cases in examples
- Add anti-pattern examples (what not to do) </common_iteration_patterns>
<iteration_velocity> Small, frequent iterations beat large, infrequent rewrites.
<fast_iteration> Good approach:
- Make one targeted change
- Test on specific scenario
- Verify improvement
- Commit change
- Move to next issue
Total time: Minutes per iteration
Iterations per day: 10-20
Learning rate: High </fast_iteration>
<slow_iteration> Problematic approach:
- Accumulate many issues
- Make large refactor
- Test everything at once
- Debug multiple issues simultaneously
- Hard to know what fixed what
Total time: Hours per iteration
Iterations per day: 1-2
Learning rate: Low </slow_iteration>
<benefits_of_fast_iteration>
- Isolate cause and effect
- Build pattern recognition faster
- Less wasted work from wrong directions
- Easier to revert if needed
- Maintains momentum </benefits_of_fast_iteration> </iteration_velocity>
<success_metrics> Define how you'll measure if the skill is working. Quantify success.
<objective_metrics>
- Success rate: Percentage of tasks completed correctly
- Token usage: Average tokens consumed per task
- Iteration count: How many tries to get correct output
- Error rate: Percentage of tasks with errors
- Discovery rate: How often skill loads when it should </objective_metrics>
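A minimal sketch of how these objective metrics can be aggregated from per-task evaluation records, such as those produced by the harness sketched earlier; the record fields are assumptions, not a fixed schema:

```python
def summarize(runs: list[dict]) -> dict:
    """Aggregate objective metrics from evaluation run records.

    Each record is assumed to look like:
      {"passed": bool, "tokens": int, "iterations": int,
       "errors": int, "skill_loaded": bool, "skill_expected": bool}
    """
    if not runs:
        return {}
    total = len(runs)
    expected = [r for r in runs if r["skill_expected"]]
    return {
        "success_rate": sum(r["passed"] for r in runs) / total,
        "avg_tokens": sum(r["tokens"] for r in runs) / total,
        "avg_iterations": sum(r["iterations"] for r in runs) / total,
        "error_rate": sum(r["errors"] > 0 for r in runs) / total,
        "discovery_rate": (
            sum(r["skill_loaded"] for r in expected) / len(expected) if expected else None
        ),
    }
```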
<subjective_metrics>
- Output quality: Does output meet requirements?
- Appropriate detail: Too verbose or too minimal?
- Claude confidence: Does Claude seem uncertain?
- User satisfaction: Does skill solve the actual problem? </subjective_metrics>
<tracking_improvement> Compare metrics before and after changes:
- Baseline: Measure without skill
- Initial: Measure with first version
- Iteration N: Measure after each change
Track which changes improve which metrics. Double down on effective patterns. </tracking_improvement> </success_metrics>