C7Score Metrics Reference

Overview

c7score evaluates documentation quality for Context7 using 5 metrics divided into two groups:

  • LLM Analysis (Metrics 1-2): AI-powered evaluation
  • Text Analysis (Metrics 3-5): Rule-based checks

Metric 1: Question-Snippet Comparison (LLM)

What it measures: How well code snippets answer common developer questions about the library.

Scoring approach:

  • LLM generates 15 common questions developers might ask about the library
  • Each snippet is evaluated on how well it answers these questions
  • Higher scores for snippets that directly address practical usage questions

Optimization strategies:

  • Include code examples that answer "how do I..." questions
  • Provide working code snippets for common use cases
  • Address setup, configuration, and basic operations
  • Show real-world usage patterns, not just API signatures
  • Include examples that demonstrate the library's main features

What scores well:

  • "How do I initialize the client?" with full working example
  • "How do I handle authentication?" with complete code
  • "How do I make a basic query?" with error handling included

What scores poorly:

  • Partial code that doesn't run standalone
  • API reference without usage examples
  • Theoretical explanations without practical code

Metric 2: LLM Evaluation (LLM)

What it measures: Overall snippet quality including relevancy, clarity, and correctness.

Scoring criteria:

  • Relevancy: Does the snippet provide useful information about the library?
  • Clarity: Is the code and explanation easy to understand?
  • Correctness: Is the code syntactically correct and using proper APIs?
  • Uniqueness: Are snippets providing unique information or duplicating content?

Optimization strategies:

  • Ensure each snippet provides distinct, valuable information
  • Use clear variable names and structure
  • Add brief explanatory comments where helpful
  • Verify all code is syntactically correct
  • Remove or consolidate duplicate snippets
  • Test code examples to ensure they work
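
As a minimal illustration of the clarity points above (again using a hypothetical `library` package and a hypothetical `search` API), descriptive names and one brief comment make a snippet far easier to evaluate than a terse fragment:

```python
from library import Client

# Descriptive names and a short comment make the snippet self-explanatory
search_client = Client(api_key="your_api_key")
search_results = search_client.search(query="getting started", limit=5)

for result in search_results:
    print(result.title)
```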

What causes low scores:

  • High rate of duplicate snippets (>25% identical copies)
  • Unclear or confusing code structure
  • Syntax errors or incorrect API usage
  • Snippets that don't add new information

Metric 3: Formatting (Text Analysis)

What it measures: Whether snippets have the expected format and structure.

Checks performed:

  • Are categories missing? (e.g., no title, description, or code)
  • Are code snippets too short or too long?
  • Are language tags actually descriptions? (e.g., "FORTE Build System Configuration")
  • Are languages set to "none" or showing console output?
  • Is the code just a list or argument descriptions?

Optimization strategies:

  • Follow consistent snippet structure: TITLE / DESCRIPTION / CODE
  • Use 40-dash delimiters between snippets (----------------------------------------)
  • Set proper language tags (python, javascript, typescript, bash, etc.)
  • Avoid very short snippets (<3 lines) unless absolutely necessary
  • Avoid very long snippets (>100 lines) - break into focused examples
  • Don't use lists in place of code

Example good format:

Getting Started with Authentication
----------------------------------------
Initialize the client with your API key and authenticate requests.

```python
from library import Client

client = Client(api_key="your_api_key")
client.authenticate()
```

**What to avoid:**
- Language tags like "CLI Arguments" or "Configuration File"
- Pretty-printed tables instead of code
- Numbered/bulleted lists masquerading as code
- Missing titles or descriptions
- Inconsistent formatting
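
These checks are rule-based rather than LLM-driven. A rough sketch of this kind of check, not the actual c7score implementation and assuming snippets parsed into dicts with title/description/code/language fields, might look like:

```python
KNOWN_LANGUAGES = {"python", "javascript", "typescript", "bash", "json", "yaml"}

def formatting_issues(snippet):
    """Return a list of formatting problems for one parsed snippet (illustrative only)."""
    issues = []
    if not snippet.get("title") or not snippet.get("description") or not snippet.get("code"):
        issues.append("missing title, description, or code")
    language = (snippet.get("language") or "").lower()
    if language not in KNOWN_LANGUAGES:
        issues.append(f"language tag '{snippet.get('language')}' is not a real language")
    code_lines = snippet.get("code", "").splitlines()
    if len(code_lines) < 3:
        issues.append("code is very short (<3 lines)")
    if len(code_lines) > 100:
        issues.append("code is very long (>100 lines)")
    return issues
```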

## Metric 4: Project Metadata (Text Analysis)

**What it measures:** Presence of irrelevant project information that doesn't help developers use the library.

**Checks performed:**
- BibTeX citations (would have language tag "Bibtex")
- Licensing information
- Directory structure listings
- Project governance or administrative content

**Optimization strategies:**
- Remove or minimize licensing snippets
- Avoid directory tree representations
- Don't include citation information
- Focus on usage, not project management
- Keep administrative content out of code documentation

**What to remove or relocate:**
- LICENSE files or license text
- CONTRIBUTING.md guidelines
- Directory listings or project structure
- Academic citations (BibTeX, APA, etc.)
- Governance policies

**Exception:** Brief installation or setup instructions that mention directories are okay if needed for library usage.
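
Like formatting, this is a rule-based text check. One illustrative way to flag metadata-only snippets, again not the actual c7score logic and assuming the same parsed-snippet structure as the Metric 3 sketch:

```python
METADATA_MARKERS = ("bibtex", "license", "citation")

def looks_like_project_metadata(snippet):
    """Heuristic, illustrative check for project-metadata snippets."""
    language = (snippet.get("language") or "").lower()
    text = (snippet.get("title", "") + " " + snippet.get("description", "")).lower()
    return language == "bibtex" or any(marker in text for marker in METADATA_MARKERS)
```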

## Metric 5: Initialization (Text Analysis)

**What it measures:** Snippets that are only imports or installations without meaningful content.

**Checks performed:**
- Snippets that are just import statements
- Snippets that are just installation commands (pip install, npm install)
- No additional context or usage examples

**Optimization strategies:**
- Combine imports with usage examples
- Show installation in context of setup process
- Always follow imports with actual usage code
- Make installation snippets include next steps

**Good approach:**
```python
# Installation and basic usage
# First install: pip install library-name

from library import Client

# Initialize and make your first request
client = Client()
result = client.get_data()
```

**Poor approach:**

```python
# Just imports
import library
from library import Client
```

```bash
# Just installation
pip install library-name
```
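
This is also a rule-based check. A rough, illustrative heuristic (not the actual c7score logic) for spotting setup-only snippets:

```python
def is_setup_only(code):
    """Illustrative check: True when every non-blank, non-comment line is an import or install command."""
    lines = [line.strip() for line in code.splitlines()
             if line.strip() and not line.strip().startswith("#")]
    return bool(lines) and all(
        line.startswith(("import ", "from ", "pip install", "npm install"))
        for line in lines
    )
```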

Scoring Weights

Default c7score weights (can be customized):

  • Question-Snippet Comparison: 0.8 (80%)
  • LLM Evaluation: 0.05 (5%)
  • Formatting: 0.05 (5%)
  • Project Metadata: 0.05 (5%)
  • Initialization: 0.05 (5%)

The question-answer metric dominates because Context7's primary goal is helping developers answer practical questions about library usage.

Overall Best Practices

  1. Focus on answering questions: Think "How would a developer actually use this?"
  2. Provide complete, working examples: Not just fragments
  3. Ensure uniqueness: Each snippet should teach something new
  4. Structure consistently: TITLE / DESCRIPTION / CODE format
  5. Use proper language tags: python, javascript, typescript, etc.
  6. Remove noise: No licensing, directory trees, or pure imports
  7. Test your code: All examples should be syntactically correct
  8. Keep it practical: Real-world usage beats theoretical explanation

Self-Evaluation Rubrics

When evaluating documentation quality using c7score methodology, use these detailed rubrics:

1. Question-Snippet Matching Rubric (80% weight)

Score: 90-100 (Excellent)

  • All major developer questions have complete answers
  • Code examples are self-contained and runnable
  • Examples include imports, setup, and usage context
  • Common use cases are clearly demonstrated
  • Error handling is shown where relevant
  • Examples progress from simple to advanced

Score: 70-89 (Good)

  • Most questions are answered with working code
  • Examples are mostly complete but may miss minor details
  • Some context or imports may be implicit
  • Common use cases covered
  • Minor gaps in error handling

Score: 50-69 (Fair)

  • Some questions answered, others partially addressed
  • Examples require significant external knowledge
  • Missing imports or setup context
  • Limited use case coverage
  • Error handling largely absent

Score: 30-49 (Poor)

  • Few questions fully answered
  • Examples are fragments without context
  • Unclear how to actually use the code
  • Major use cases not covered
  • No error handling

Score: 0-29 (Very Poor)

  • Questions not addressed in documentation
  • No practical examples
  • Only API signatures without usage
  • Cannot determine how to use the library

2. LLM Evaluation Rubric (10% weight)

Unique Information (30% of metric):

  • 100%: Every snippet provides unique value, no duplicates
  • 75%: Minimal duplication, mostly unique content
  • 50%: Some repeated information across snippets
  • 25%: Significant duplication
  • 0%: Many duplicate snippets

Clarity (30% of metric):

  • 100%: Well-worded, professional, no errors
  • 75%: Clear with minor grammar/wording issues
  • 50%: Understandable but awkward phrasing
  • 25%: Confusing or poorly worded
  • 0%: Unclear, incomprehensible

Correct Syntax (40% of metric):

  • 100%: All code syntactically perfect
  • 75%: Minor syntax issues (missing semicolons, etc.)
  • 50%: Some syntax errors but code is recognizable
  • 25%: Multiple syntax errors
  • 0%: Code is not valid

Final LLM Evaluation Score = (Unique×0.3) + (Clarity×0.3) + (Syntax×0.4)
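
For example, documentation with mostly unique snippets (75), clear wording (100), and fully valid syntax (100) would score 75×0.3 + 100×0.3 + 100×0.4 = 92.5.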

3. Formatting Rubric (5% weight)

Score: 100 (Perfect)

  • All snippets have proper language tags (python, javascript, etc.)
  • Language tags are actual languages, not descriptions
  • All code blocks use triple backticks with language
  • Code blocks are properly closed
  • No lists within CODE sections
  • Minimum length requirements met (5+ words)

Score: 80-99 (Minor Issues)

  • 1-2 snippets missing language tags
  • One or two incorrectly formatted blocks
  • Minor inconsistencies

Score: 50-79 (Multiple Problems)

  • Several snippets missing language tags
  • Some use descriptive strings instead of language names
  • Inconsistent formatting

Score: 0-49 (Significant Issues)

  • Many snippets improperly formatted
  • Widespread use of wrong language tags
  • Code not in proper blocks

4. Metadata Removal Rubric (2.5% weight)

Score: 100 (Clean)

  • No license text in code examples
  • No citation formats (BibTeX, RIS)
  • No directory structure listings
  • No project metadata
  • Pure code and usage examples

Score: 75-99 (Minimal Metadata)

  • One or two snippets with minor metadata
  • Brief license mentions that don't dominate

Score: 50-74 (Some Metadata)

  • Several snippets include project metadata
  • Directory structures present
  • Some citation content

Score: 0-49 (Heavy Metadata)

  • Significant license/citation content
  • Multiple directory listings
  • Project metadata dominates

5. Initialization Rubric (2.5% weight)

Score: 100 (Excellent)

  • All examples show usage beyond setup
  • Installation combined with first usage
  • Imports followed by practical examples
  • No standalone import/install snippets

Score: 75-99 (Mostly Good)

  • 1-2 snippets are setup-only
  • Most examples show actual usage

Score: 50-74 (Some Init-Only)

  • Several snippets are just imports/installation
  • Mixed quality

Score: 0-49 (Many Init-Only)

  • Many snippets are only imports
  • Many snippets are only installation
  • Lack of usage examples

Scoring Best Practices

When evaluating:

  1. Read entire documentation before scoring
  2. Count specific examples (e.g., "7 out of 10 snippets...")
  3. Be consistent between before/after evaluations
  4. Explain scores with concrete evidence
  5. Use percentages when quantifying (e.g., "80% of examples...")
  6. Identify improvements specifically
  7. Calculate weighted average: (Q×0.8) + (L×0.1) + (F×0.05) + (M×0.025) + (I×0.025)

Example Calculation:

  • Question-Snippet: 85/100 × 0.8 = 68
  • LLM Evaluation: 90/100 × 0.1 = 9
  • Formatting: 100/100 × 0.05 = 5
  • Metadata: 100/100 × 0.025 = 2.5
  • Initialization: 95/100 × 0.025 = 2.375
  • Total: 86.875 ≈ 87/100
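
The same calculation as a quick Python sketch, using the weights and scores from the example above:

```python
# Rubric weights and example scores from the calculation above
weights = {"question": 0.8, "llm": 0.1, "formatting": 0.05, "metadata": 0.025, "initialization": 0.025}
scores = {"question": 85, "llm": 90, "formatting": 100, "metadata": 100, "initialization": 95}

total = sum(scores[name] * weight for name, weight in weights.items())
print(total)  # 86.875, reported as ~87/100
```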

Common Scoring Mistakes to Avoid

  • Being too generous: Score based on evidence, not potential
  • Ignoring weights: Question-answer matters most (80%)
  • Vague explanations: Say "5 of 8 examples lack imports," not "some issues"
  • Inconsistent standards: Apply the same rubric to before and after
  • Forgetting context: Consider the project type and audience

Be specific, objective, and consistent.