name: skill-evaluator
description: Evaluate Claude Code skills against best practices including size, structure, examples, and prompt engineering quality. Provides comprehensive analysis with actionable suggestions.

Claude Code Skill Evaluator

Systematically evaluate Claude Code skills for quality, compliance with best practices, and optimization opportunities. Provides detailed assessment with actionable suggestions for improvement.

Table of Contents

  • Instructions
  • Important Guidelines
  • Requirements
  • Context & Standards

Instructions

1. Find Skill

Identify the skill in the directory provided to you, or find all skills in the user's ~/.claude/skills/ directory. For each directory (excluding hidden directories), verify it contains a SKILL.md file.

Present the user with:

  • A list of available skills
  • A prompt asking which skill to evaluate (or accept the skill name as input)
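The discovery step above can be sketched as follows (a minimal illustration, assuming Python and the ~/.claude/skills/ layout described here; `discover_skills` is a hypothetical helper name, not part of any official API):

```python
from pathlib import Path

def discover_skills(skills_dir: str = "~/.claude/skills") -> list[str]:
    """Return names of directories that contain a SKILL.md file."""
    root = Path(skills_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir()
        and not d.name.startswith(".")      # skip hidden directories
        and (d / "SKILL.md").is_file()      # only directories with a skill file
    )
```

Directories without a SKILL.md are silently skipped, matching the verification rule above.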

2. Read the Skill File

Once a skill is selected, read its SKILL.md file and extract:

  • Frontmatter metadata (name, description)
  • Total line count
  • Word count
  • Character count
  • Structure and sections
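A rough sketch of this extraction step (an illustrative helper, assuming a simple `---`-delimited YAML-style frontmatter with one `key: value` per line; `read_skill` is a hypothetical name):

```python
import re

def read_skill(text: str) -> dict:
    """Extract frontmatter fields and basic size metrics from SKILL.md text."""
    meta = {}
    body = text
    # Frontmatter is the block between the leading pair of --- delimiters.
    m = re.match(r"^---\n(.*?)\n---\n?", text, re.DOTALL)
    if m:
        body = text[m.end():]
        for line in m.group(1).splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return {
        "name": meta.get("name", ""),
        "description": meta.get("description", ""),
        "lines": len(body.splitlines()),
        "words": len(body.split()),
        "chars": len(body),
    }
```

A real frontmatter parser (e.g. a YAML library) would be more robust; this sketch only shows which fields and metrics the evaluation needs.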

3. Analyze Against Best Practices

Evaluate the skill across 8 dimensions:

Dimension 1: Size & Length

Guidelines:

  • Body: Under 500 lines (hard maximum)
  • Name: Maximum 64 characters
  • Description: Maximum 1024 characters (200 char summary preferred)
  • Table of Contents: Include if over 100 lines

Assessment:

  • Count total lines in SKILL.md body
  • Flag if over 500 lines
  • Commend if well-sized (ideal: 100-300 lines for medium skills)
  • Check if TOC exists (expected for 100+ line skills)
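The size checks above can be expressed directly against the stated limits (a minimal sketch; `check_size` is a hypothetical helper name):

```python
def check_size(name: str, description: str, body_lines: int) -> list[str]:
    """Flag violations of the size guidelines; empty list means compliant."""
    issues = []
    if body_lines > 500:
        issues.append(f"body is {body_lines} lines (hard maximum is 500)")
    if len(name) > 64:
        issues.append(f"name is {len(name)} chars (maximum is 64)")
    if len(description) > 1024:
        issues.append(f"description is {len(description)} chars (maximum is 1024)")
    elif len(description) > 200:
        issues.append("description exceeds the preferred 200-char summary")
    return issues
```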

Dimension 2: Scope Definition

Guidelines:

  • Narrow focus (one skill = one capability)
  • Clear boundary of what the skill does and doesn't do
  • No scope creep (prefer a narrow scope like "PDF form filling" over a broad one like "document processing")

Assessment:

  • Does the description clearly state what the skill does?
  • Are there multiple conflicting capabilities within one skill?
  • Is the boundary clear to a new user?

Dimension 3: Description Quality

Guidelines:

  • Third-person voice (avoid "I can" or "you can")
  • Include both WHAT and WHEN TO USE
  • Specific, searchable terminology
  • 200 character summary ideal

Assessment:

  • Voice and tone appropriate?
  • Discovery terms clear? (Would users search for these terms?)
  • Is "when to use" explained?
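The voice and "when to use" checks lend themselves to simple heuristics (an illustrative sketch only; the regexes are rough approximations of these guidelines, and `check_description` is a hypothetical helper name):

```python
import re

def check_description(description: str) -> list[str]:
    """Heuristic checks for voice and 'when to use' coverage."""
    issues = []
    # Third-person voice: flag common first/second-person constructions.
    if re.search(r"\b(I can|I will|you can|you will)\b", description, re.IGNORECASE):
        issues.append("uses first/second person; prefer third-person voice")
    # 'When to use': look for trigger words that introduce usage conditions.
    if not re.search(r"\b(use when|when|for)\b", description, re.IGNORECASE):
        issues.append("does not explain when to use the skill")
    return issues
```

Heuristics like these only surface candidates; the final judgment on tone and discoverability stays with the evaluator.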

Dimension 4: Structure & Organization

Guidelines:

  • Clear section hierarchy (headings, subsections)
  • Logical flow (progressive disclosure)
  • Step-by-step instructions preferred for workflows
  • Rules/constraints clearly stated

Assessment:

  • Is structure logical?
  • Can a user easily navigate?
  • Are instructions sequential or scattered?

Dimension 5: Examples

Guidelines:

  • Quality over quantity
  • Typical: 2-3 examples for basic skills, more for format-heavy skills
  • Concrete (not abstract)
  • Show patterns and edge cases

Assessment:

  • How many examples? (count them)
  • Are examples concrete and realistic?
  • Do they demonstrate key patterns?
  • Are there enough to show variations?
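Counting examples can be approximated by counting fenced code blocks (a rough proxy, since examples may also appear as prose; `count_examples` is a hypothetical helper name):

```python
import re

def count_examples(body: str) -> int:
    """Count fenced code blocks as a rough proxy for example count."""
    fences = re.findall(r"^```", body, re.MULTILINE)
    return len(fences) // 2   # one example = one opening fence + one closing fence
```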

Dimension 6: Anti-Pattern Detection

Red flags (check for these):

  • Windows-style paths (should use forward slashes)
  • Magic numbers without justification
  • Vague terminology (inconsistent synonyms)
  • Time-sensitive instructions (date-dependent)
  • Deeply nested file references (over 2 levels)
  • Vague descriptions (missing WHAT or WHEN)
  • Scope creep (trying to do too much)
  • No error handling or validation steps
  • No user feedback loops (for complex workflows)
  • Multiple conflicting approaches for same task
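Some of these red flags are mechanically detectable; others (scope creep, conflicting approaches) need judgment. A sketch of the mechanical part (the patterns shown are illustrative, not exhaustive, and `scan_anti_patterns` is a hypothetical helper name):

```python
import re

# Pattern label -> regex; each match is one red-flag occurrence.
ANTI_PATTERNS = {
    "Windows-style path": r"[A-Za-z]:\\[\w\\]",
    "time-sensitive wording": r"\b(as of|currently|this year|recently)\b",
}

def scan_anti_patterns(body: str) -> dict[str, int]:
    """Return a count of matches per red-flag pattern (zero counts omitted)."""
    hits = {}
    for label, pattern in ANTI_PATTERNS.items():
        count = len(re.findall(pattern, body, re.IGNORECASE))
        if count:
            hits[label] = count
    return hits
```

Each hit should still be verified in context before it is reported as a violation.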

Assessment:

  • Count violations
  • Severity of each violation
  • Impact on usability

Dimension 7: Prompt Engineering Quality

Guidelines:

  • Imperative language (verb-first instructions)
  • Explicit rules with clear boundaries
  • Validation loops where appropriate (especially for destructive ops)
  • Clear error handling
  • Assumes user is intelligent (don't over-explain)

Assessment:

  • Is language imperative?
  • Are there validation steps?
  • How clear are the rules?
  • Is error handling explicit?

Dimension 8: Completeness

Guidelines:

  • Requirements listed (what's needed to use the skill)
  • Edge cases acknowledged
  • Limitations stated where relevant

Assessment:

  • Are prerequisites clear?
  • Are limitations or edge cases mentioned?
  • Is scope of responsibility clear?

4. Generate Comprehensive Evaluation Report

Create a detailed evaluation report with these components:

  1. Executive Summary: 1-2 paragraphs covering overall assessment, key strengths, and critical issues

  2. Metrics: Present line count, word count, character count, and guideline compliance assessment

  3. Dimensional Analysis: For each of the 8 dimensions:

    • Status indicator (✓ Pass / ⚠ Warning / ✗ Fail)
    • 1-2 sentence assessment explaining the rating
  4. Detected Issues: Organize by severity:

    • Critical Issues (must fix) - any ✗ Fail items with explanation
    • Warnings (should address) - any ⚠ Warning items with explanation
    • Observations (minor items worth noting)
  5. Comparative Analysis: Compare the skill against official skills repository patterns with examples and rationale

  6. Actionable Suggestions: Numbered list of specific improvements, prioritized by impact:

    • High Priority (do this first)
    • Medium Priority (nice to have)
    • Low Priority (optional refinements)

    Each suggestion should include concrete rationale, not vague guidance.

  7. Overall Assessment:

    • Professional verdict on production-readiness
    • Clear recommendation (Keep as-is / Minor tweaks / Significant refactor / Major restructure)

5. Deliver Report to User

Present the complete evaluation report to the user in a clear, formatted structure. Ensure:

  • Status indicators are visible (✓ Pass / ⚠ Warning / ✗ Fail)
  • Actionable suggestions are specific (not vague)
  • Rationale is explained for each issue
  • Prioritization is clear

Important Guidelines

  • Be brutally honest: Point out real issues, don't sugarcoat
  • Specific over vague: "The examples don't show error handling" not "examples could be better"
  • Professional tone: Constructive criticism, not harsh
  • Evidence-based: Reference specific lines or patterns from the skill
  • Proportional feedback: Don't over-critique minor issues
  • Future-focused: Suggest improvements rather than passing judgment

Requirements

  • User has installed skills in ~/.claude/skills/
  • Target skill has a valid SKILL.md file with frontmatter
  • User accepts the detailed, honest evaluation

Context & Standards

This evaluator uses best practices from:

  • Official Anthropic Claude Code Skills documentation
  • Analysis of official skills repository patterns
  • Professional technical writing standards
  • Prompt engineering best practices for LLM interactions

All assessments are comparative to official guidelines, not arbitrary standards.