Quality Assurance Loops
How skill-factory ensures every skill meets minimum quality standards.
Quality Scoring (Anthropic Best Practices)
Scoring follows official Anthropic guidelines. The criterion weights below sum to a maximum of 10.0 points.
Scoring Criteria
| Criterion | Weight | What to Check |
|---|---|---|
| Description Quality | 2.0 | Specific, includes when_to_use, third-person |
| Name Convention | 0.5 | Lowercase, hyphens, descriptive |
| Conciseness | 1.5 | <500 lines OR progressive disclosure |
| Progressive Disclosure | 1.0 | Reference files for details |
| Examples & Workflows | 1.0 | Concrete code samples |
| Degree of Freedom | 0.5 | Appropriate for task type |
| Dependencies | 0.5 | Documented and verified |
| Structure | 1.0 | Well-organized sections |
| Error Handling | 0.5 | Scripts handle errors |
| Anti-Patterns | 1.0 | No time-sensitive info, consistent terminology |
| Testing | 0.5 | Evidence of testing |
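As a minimal sketch of how the table could be turned into a number: the weights below are copied from the table, while the dictionary keys, the `total_score` function, and the idea of a per-criterion fraction between 0 and 1 are assumptions made for illustration, not skill-factory's actual scorer.

```python
# Weights copied from the scoring table; the key names are hypothetical.
WEIGHTS = {
    "description_quality": 2.0,
    "name_convention": 0.5,
    "conciseness": 1.5,
    "progressive_disclosure": 1.0,
    "examples_workflows": 1.0,
    "degree_of_freedom": 0.5,
    "dependencies": 0.5,
    "structure": 1.0,
    "error_handling": 0.5,
    "anti_patterns": 1.0,
    "testing": 0.5,
}


def total_score(fractions: dict[str, float]) -> float:
    """Sum weight * fraction over all criteria.

    Each fraction says how fully a criterion is met (0.0 to 1.0).
    Criteria missing from `fractions` count as 0.
    """
    score = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp to [0, 1] so a bad input cannot exceed the criterion's weight.
        fraction = min(max(fractions.get(name, 0.0), 0.0), 1.0)
        score += weight * fraction
    return round(score, 1)
```

A skill that fully meets every criterion scores 10.0 under this sketch, and one that meets everything except half of Anti-Patterns scores 9.5.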
Enhancement Loop Algorithm
def quality_assurance_loop(skill_path: str, min_score: float = 8.0) -> Skill:
    """
    Iteratively improve skill until it meets quality threshold.
    Max iterations: 5 (prevents infinite loops)
    """
    max_iterations = 5
    iteration = 0
    while iteration < max_iterations:
        # Score skill
        score, issues = score_skill(skill_path)
        print(f"📊 Quality check: {score}/10")
        if score >= min_score:
            print(f"✅ Quality threshold met ({score} >= {min_score})")
            return load_skill(skill_path)
        # Report issues
        print("   ⚠️ Issues found:")
        for issue in issues:
            print(f"      - {issue.description}")
        # Apply fixes (modifies the skill on disk)
        print("🔧 Enhancing skill...")
        apply_fixes(skill_path, issues)
        iteration += 1
    # Re-score so the final report reflects the last round of fixes
    score, _ = score_skill(skill_path)
    if score < min_score:
        print(f"⚠️ Quality score {score} below threshold after {max_iterations} iterations")
        print("   Manual review recommended")
    return load_skill(skill_path)
Fix Strategies
Issue: Description Too Generic
Detection:
def check_description(skill):
    desc = skill.frontmatter.description
    if len(desc) < 50:
        return Issue("Description too short (< 50 chars)")
    if not contains_specifics(desc):
        return Issue("Description lacks specifics")
    if "help" in desc.lower() or "tool" in desc.lower():
        return Issue("Description too vague")
    return None
Fix:
def fix_description(skill):
    # Extract key topics from skill content
    topics = extract_topics(skill.content)
    if not topics:
        # Nothing to build a description from; leave it for manual review
        return skill
    # Generate specific description
    desc = f"Comprehensive guide for {skill.name} covering "
    desc += ", ".join(topics[:3])
    desc += f". Use when working with {topics[0]}"
    if len(topics) > 1:
        desc += f" and need {', '.join(topics[1:3])}"
    skill.frontmatter.description = desc
    return skill
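One caveat about the detection step: a plain substring test for "help" or "tool" also fires on words such as "helper" or "toolkit". The following is a sketch of a word-boundary variant; `has_vague_word` and `VAGUE_WORDS` are hypothetical names, not part of skill-factory.

```python
import re

# Hypothetical list; the original check only looks for "help" and "tool".
VAGUE_WORDS = ("help", "tool")

# \b anchors keep "toolkit" and "helper" from matching; the optional "s"
# still catches "tools" and "helps".
_VAGUE_RE = re.compile(r"\b(?:" + "|".join(VAGUE_WORDS) + r")s?\b", re.IGNORECASE)


def has_vague_word(description: str) -> bool:
    """Return True if the description uses a vague word as a standalone word."""
    return _VAGUE_RE.search(description) is not None
```

With this variant, "A tool to help you" is flagged while "Configures the Helm toolkit helper" is not.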
Issue: Missing Examples
Detection:
def check_examples(skill):
    code_blocks = count_code_blocks(skill.content)
    if code_blocks < 3:
        return Issue(f"Only {code_blocks} code examples (recommend 5+)")
    return None
Fix:
def add_examples(skill, source_docs=None):
    if source_docs:
        # Extract from documentation
        examples = extract_code_examples(source_docs)
    else:
        # Generate from skill content
        examples = generate_examples_from_topics(skill)
    # Add examples section
    if "## Examples" not in skill.content:
        skill.content += "\n\n## Examples\n\n"
    for ex in examples[:5]:  # Add top 5 examples
        skill.content += f"### {ex.title}\n\n"
        skill.content += f"```{ex.language}\n{ex.code}\n```\n\n"
        if ex.explanation:
            skill.content += f"{ex.explanation}\n\n"
    return skill
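The detection step above calls `count_code_blocks`, which is not defined in this document. One possible implementation, assuming examples are written as backtick-fenced blocks (tilde fences and nested fences are not handled), is sketched here:

```python
import re

# A fence line: up to three spaces of indentation, then three or more backticks.
_FENCE_RE = re.compile(r"^ {0,3}`{3,}", re.MULTILINE)


def count_code_blocks(markdown: str) -> int:
    """Count fenced code blocks by pairing opening and closing fence lines."""
    fence_lines = len(_FENCE_RE.findall(markdown))
    # Each complete block contributes one opening and one closing fence;
    # an unclosed trailing fence is not counted.
    return fence_lines // 2
```

Counting fence lines rather than parsing the full document keeps the check cheap, at the cost of miscounting unusual fence styles.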
Issue: Too Long (> 500 lines)
Detection:
def check_length(skill):
    line_count = count_lines(skill.content)
    if line_count > 500:
        return Issue(f"SKILL.md is {line_count} lines (recommend <500)")
    return None
Fix:
def apply_progressive_disclosure(skill):
    # Identify sections that can be moved to references
    movable_sections = find_detail_sections(skill.content)
    skill.references = {}
    for section in movable_sections:
        # Create reference file
        ref_name = slugify(section.title)
        ref_path = f"references/{ref_name}.md"
        # Move content
        skill.references[ref_name] = section.content
        # Replace with reference
        skill.content = skill.content.replace(
            section.full_text,
            f"See {ref_path} for detailed {section.title.lower()}."
        )
    return skill
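The fix above depends on a `slugify` helper to derive reference file names. Its behavior is not specified here, so the version below is only an assumption about what it might do: lowercase the title, strip accents, and join words with hyphens, in line with the lowercase-and-hyphens naming convention from the scoring table.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a section title into a lowercase, hyphen-separated file stem."""
    # Decompose accented characters, then drop the non-ASCII combining marks,
    # so that "é" becomes "e" instead of disappearing.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Fall back to a fixed name when nothing usable is left (assumed behavior).
    return slug or "section"
```

Under this sketch, a section titled "API Reference" would be moved to `references/api-reference.md`.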
Issue: Poor Structure
Detection:
def check_structure(skill):
    issues = []
    # Check for required sections
    required = ["## Overview", "## Usage", "## Examples"]
    for section in required:
        if section not in skill.content:
            issues.append(f"Missing {section}")
    # Check heading hierarchy
    if has_heading_skips(skill.content):
        issues.append("Heading hierarchy skips levels")
    # Check for TOC if long
    if count_lines(skill.content) > 200 and "## Table of Contents" not in skill.content:
        issues.append("Long skill missing table of contents")
    return issues if issues else None
Fix:
def fix_structure(skill, issues):
    # Add missing sections
    if "Missing ## Overview" in issues:
        overview = generate_overview(skill)
        skill.content = insert_after_frontmatter(skill.content, overview)
    if "Missing ## Usage" in issues:
        usage = generate_usage_section(skill)
        skill.content = insert_before_examples(skill.content, usage)
    # Fix heading hierarchy
    if "Heading hierarchy" in str(issues):
        skill.content = normalize_headings(skill.content)
    # Add TOC if needed
    if "missing table of contents" in str(issues):
        toc = generate_toc(skill.content)
        skill.content = insert_toc(skill.content, toc)
    return skill
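The structure check relies on `has_heading_skips`, which is left undefined in this document. A minimal sketch follows, assuming ATX-style headings (`#` through `######`) and treating only jumps of more than one level deeper as skips; the choice to accept any level for the first heading is also an assumption.

```python
import re

_HEADING_RE = re.compile(r"^(#{1,6})\s+\S")
_FENCE = "`" * 3  # marker that opens or closes a fenced code block


def has_heading_skips(markdown: str) -> bool:
    """Return True if any heading is more than one level deeper than the previous one."""
    previous_level = 0
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on fence lines so '#' comments inside code are ignored.
        if line.lstrip().startswith(_FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = _HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Going deeper by more than one level (e.g. ## to ####) is a skip;
        # moving back up by any amount is fine.
        if previous_level and level > previous_level + 1:
            return True
        previous_level = level
    return False
```

Tracking fence state matters for skills with shell or Python examples, where comment lines starting with `#` would otherwise be misread as headings.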
Issue: Vague/Generic Content
Detection:
def check_specificity(skill):
    vague_phrases = [
        "you can", "might want to", "it's possible",
        "there are various", "several options",
        "many ways to", "different approaches"
    ]
    content_lower = skill.content.lower()
    # Count every occurrence; counting only which phrases appear at least once
    # could never exceed the number of phrases (7) and so never trip the limit
    vague_count = sum(content_lower.count(phrase) for phrase in vague_phrases)
    if vague_count > 10:
        return Issue(f"Too many vague phrases ({vague_count})")
    return None
Fix:
import re

def improve_specificity(skill):
    # Replace vague with specific
    replacements = {
        "you can": "Use",
        "might want to": "Should",
        "there are various": "Three main approaches:",
        "several options": "Options:",
        "many ways to": "Primary methods:",
    }
    for vague, specific in replacements.items():
        # Match case-insensitively, as the detection step lowercases the content
        skill.content = re.sub(re.escape(vague), specific, skill.content, flags=re.IGNORECASE)
    return skill
Testing Integration
After each enhancement, run tests:
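def enhance_and_test(skill, min_score=8.0, max_iterations=5):
    # Score once up front so the loop condition is defined
    score = calculate_score(skill)
    iteration = 0
    while score < min_score and iteration < max_iterations:
        # Enhance
        skill = apply_enhancements(skill)
        # Test
        test_results = run_tests(skill)
        if not test_results.all_passed():
            # Tests revealed new issues
            issues = test_results.get_failures()
            skill = fix_test_failures(skill, issues)
        # Score after enhancements and test fixes
        score = calculate_score(skill)
        iteration += 1
    return skill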
Progress Reporting
User sees:
📊 Quality check: 7.4/10
⚠️ Issues found:
- Description too generic
- Missing examples in 4 sections
- Some outdated patterns detected
🔧 Enhancing skill...
✏️ Improving description... ✅
📝 Adding code examples... ✅
🔄 Updating patterns... ✅
📊 Quality check: 8.9/10 ✅
Internal execution:
issues = [
    Issue("description_generic", fix=fix_description),
    Issue("missing_examples", fix=add_examples, count=4),
    Issue("outdated_patterns", fix=update_patterns)
]
for issue in issues:
    print(f"   {issue.icon} {issue.action}... ", end="")
    skill = issue.fix(skill)
    print("✅")
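The `Issue` objects used throughout carry a description, an optional fix function, and the `icon` and `action` strings printed during progress reporting. Its definition is not shown, so the dataclass below is a hypothetical shape inferred from how it is used in this document; the default values and the `apply_all` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Issue:
    """One detected problem plus, optionally, the function that repairs it."""
    description: str
    fix: Optional[Callable[[Any], Any]] = None  # takes a skill, returns a skill
    icon: str = "🔧"                             # assumed default
    action: str = "Applying fix"                # assumed default
    count: Optional[int] = None                 # e.g. number of affected sections


def apply_all(skill: Any, issues: list[Issue]) -> Any:
    """Run each issue's fix in order, printing one progress line per issue."""
    for issue in issues:
        if issue.fix is None:
            # Detection-only issues have nothing to apply automatically.
            continue
        print(f"   {issue.icon} {issue.action}... ", end="")
        skill = issue.fix(skill)
        print("✅")
    return skill
```

Keeping the fix function on the issue itself lets the reporting loop stay generic: it needs no knowledge of which checks exist.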
Quality Metrics Dashboard
After completion:
📊 Final Quality Report
Anthropic Best Practices Score: 9.4/10
Breakdown:
✅ Description Quality: 2.0/2.0 (Excellent)
✅ Name Convention: 0.5/0.5 (Correct)
✅ Conciseness: 1.4/1.5 (Good - 420 lines)
✅ Progressive Disclosure: 1.0/1.0 (Excellent - 3 reference files)
✅ Examples & Workflows: 1.0/1.0 (12 code examples)
✅ Degree of Freedom: 0.5/0.5 (Appropriate)
✅ Dependencies: 0.5/0.5 (Documented)
✅ Structure: 1.0/1.0 (Well-organized)
✅ Error Handling: 0.5/0.5 (N/A for doc skill)
⚠️ Anti-Patterns: 0.5/1.0 (Minor: 2 time refs)
✅ Testing: 0.5/0.5 (15/15 tests passing)
Recommendations:
⚠️ Remove 2 time-sensitive references for 1.0/1.0 on anti-patterns
Failure Modes
Can't Reach Threshold
If the score is still below 8.0 after 5 iterations:
⚠️ Quality score 7.8 after 5 iterations
Blocking issues:
- Source documentation lacks code examples
- Framework has limited reference material
Recommendations:
1. Manual examples needed (auto-generation limited)
2. Consider hybrid approach with custom content
3. Lower quality threshold to 7.5 for this specific case
Continue with current skill? (y/n)
Conflicting Requirements
⚠️ Conflicting requirements detected
Issue: Comprehensive coverage (800 lines) vs Conciseness (<500 lines)
Resolution: Applying progressive disclosure
- Main SKILL.md: 380 lines (overview + quick ref)
- Reference files: 5 files with detailed content
Summary
Quality loops ensure:
- Every skill scores >= threshold (default 8.0) or is flagged for manual review
- Anthropic best practices followed
- Automatic fixes applied
- Tests pass
- User sees progress, not complexity