
# Assessing Code Quality

## Purpose

Analyze code quality indicators beyond architecture and produce an evidence-based quality scorecard: concrete examples, file/line references, and actionable recommendations.

## When to Use

- Coordinator delegates quality assessment after subsystem catalog completion
- Need to identify specific, actionable quality issues beyond architectural patterns
- Output feeds into architect handover or improvement planning

## Core Principle: Evidence Over Speculation

Good quality analysis provides concrete evidence. Poor quality analysis makes educated guesses.

    ❌ BAD:  "Payment service likely lacks error handling"
    ✅ GOOD: "Payment service process_payment() (line 127) raises generic Exception, loses error context"

    ❌ BAD:  "Functions may be too long"
    ✅ GOOD: "src/api/orders.py:process_order() is 145 lines (lines 89-234)"

    ❌ BAD:  "Error handling unclear"
    ✅ GOOD: "3 different error patterns: orders.py uses exceptions, payment.py returns codes, user.py mixes both"

Your goal: Document WHAT YOU OBSERVED, not what you guess might be true.
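
To make the contrast concrete, here is a hypothetical Python reconstruction of the code behind the first GOOD finding above. The function, file, and line number come from the example, not a real codebase, and `charge` is an invented stand-in for a gateway call:

```python
# Hypothetical sketch of the finding "process_payment() (line 127) raises
# generic Exception, loses error context". Nothing here is real code.

class GatewayTimeout(Exception):
    """Stand-in for a payment-gateway error type."""

def charge(order):
    """Invented helper representing the gateway call."""
    raise GatewayTimeout("gateway did not respond")

# What the finding observed: the original error type and traceback are lost.
def process_payment(order):
    try:
        charge(order)
    except Exception:
        raise Exception("payment failed")  # generic, context discarded

# What the recommendation points toward: keep the cause attached.
class PaymentError(Exception):
    """Domain error that preserves the underlying cause via __cause__."""

def process_payment_fixed(order):
    try:
        charge(order)
    except Exception as exc:
        raise PaymentError("payment failed") from exc
```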

## The Speculation Trap (from baseline testing)

Common rationalization: "I don't have full context, so I'll make educated guesses to provide value."

Reality: Speculation masquerading as analysis is worse than saying "insufficient information."

Baseline failure mode:

- Document says "Error handling likely mixes patterns"
- Based on: Filename alone, not actual code review
- Should say: "Reviewed file structure only - implementation details not analyzed"

## Weasel Words - BANNED

If you catch yourself writing these, STOP:

| Banned Phrase | Why Banned | Replace With |
| --- | --- | --- |
| "Likely" | Speculation | "Observed in file.py:line" OR "Not analyzed" |
| "May" | Hedge | "Found X" OR "Did not review X" |
| "Unclear if" | Evasion | "Code shows X" OR "Insufficient information to assess X" |
| "Appears to" | Guessing | "Lines 12-45 implement X" OR "Not examined" |
| "Probably" | Assumption | Concrete observation OR admit limitation |
| "Should" | Inference | "Currently does X" (observation) |
| "Suggests" | Reading between lines | Quote actual code OR acknowledge gap |

From baseline testing: Agents will use these to fill gaps. Skill must ban them explicitly.
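
One way to operationalize the ban is to lint draft reports for these phrases before hand-off. A minimal sketch, assuming the report is a plain-text file; the phrase list mirrors the table above, and the script itself is not part of the skill:

```python
import re
import sys

# Mirrors the banned-phrase table above. Word boundaries avoid false
# positives such as "dismay" matching "may".
BANNED = ["likely", "may", "unclear if", "appears to",
          "probably", "should", "suggests"]
PATTERN = re.compile(r"\b(" + "|".join(BANNED) + r")\b", re.IGNORECASE)

def lint(path: str) -> int:
    """Print every line of a draft report that uses a banned phrase."""
    hits = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for match in PATTERN.finditer(line):
                print(f"{path}:{lineno}: banned phrase {match.group(0)!r}")
                hits += 1
    return hits

if __name__ == "__main__":
    sys.exit(1 if lint(sys.argv[1]) else 0)
```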

## Evidence Requirements

Every finding MUST include:

1. File path - Exact location
   - ✅ `src/services/payment.py`
   - ❌ "payment service"
2. Line numbers - Specific range
   - ✅ "Lines 127-189 (63 lines)"
   - ❌ "Long function"
3. Code example or description - What you saw
   - ✅ "Function has 8 nested if statements"
   - ✅ Quoted: `except Exception as e: pass`
   - ❌ "Complex logic"
4. Severity with reasoning - Why this level
   - ✅ "Critical: Swallows exceptions in payment processing"
   - ❌ "High priority issue"

If you didn't examine implementation:

- ✅ "Reviewed imports and structure only - implementation not analyzed"
- ❌ "Function likely does X"

## Severity Framework (REQUIRED)

Define criteria BEFORE rating anything:

Critical:

- Security vulnerability (injection, exposed secrets, auth bypass)
- Data loss/corruption risk
- Blocks deployment
- Example: Hardcoded credentials, SQL injection, unhandled exceptions in critical path

High:

- Frequent source of bugs
- High maintenance burden
- Performance degradation
- Example: 200-line functions, 15% code duplication, N+1 queries

Medium:

- Moderate maintainability concern
- Technical debt accumulation
- Example: Missing docstrings, inconsistent error handling, magic numbers

Low:

- Minor improvement
- Style/cosmetic
- Example: Verbose naming, minor duplication (< 5%)

From baseline testing: Agents will use severity labels without defining them. Enforce framework first.
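
One way to enforce framework-first rating is to encode the criteria as data and refuse any label that does not cite one. A sketch, with criteria paraphrased from the framework above:

```python
# Criteria paraphrased from the framework above; the structure is a sketch.
SEVERITY_CRITERIA = {
    "Critical": ["security vulnerability", "data loss/corruption risk",
                 "blocks deployment"],
    "High": ["frequent source of bugs", "high maintenance burden",
             "performance degradation"],
    "Medium": ["moderate maintainability concern",
               "technical debt accumulation"],
    "Low": ["minor improvement", "style/cosmetic"],
}

def rate(severity: str, criterion: str) -> str:
    """Accept a severity label only when it cites a defined criterion."""
    if criterion not in SEVERITY_CRITERIA.get(severity, []):
        raise ValueError(f"{severity!r} must cite a criterion from the framework")
    return f"{severity}: {criterion}"

print(rate("Critical", "blocks deployment"))  # OK
# rate("High", "feels risky")                 # ValueError: undefined criterion
```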

## Observation vs. Inference

Distinguish clearly:

Observation (what you saw):

- "Function process_order() is 145 lines"
- "3 files contain identical validation logic (lines quoted)"
- "No docstrings found in src/api/ (0/12 functions)"

Inference (what it might mean):

- "145-line function suggests complexity - recommend review"
- "Identical validation in 3 files indicates duplication - recommend extraction"
- "Missing docstrings may hinder onboarding - recommend adding"

Always lead with the observation; the inference is optional:

    **Observed:** `src/api/orders.py:process_order()` is 145 lines (lines 89-234) with 12 decision points
    **Assessment:** High complexity - recommend splitting into smaller functions

"Insufficient Information" is Valid

When you haven't examined something, say so:

Honest limitation:

## Testing Coverage

**Not analyzed** - Test files not reviewed during subsystem cataloging
**Recommendation:** Review test/ directory to assess coverage

Speculation:

## Testing Coverage

Testing likely exists but coverage unclear. May have integration tests. Probably needs more unit tests.

From baseline testing: Agents filled every section with speculation. Skill must make "not analyzed" acceptable.

## Minimal Viable Quality Assessment

If time/sample constraints prevent comprehensive analysis:

1. Acknowledge scope explicitly
   - "Reviewed 5 files from 8 subsystems"
   - "Implementation details not examined (imports/structure only)"
2. Document what you CAN observe - see the sketch after this list
   - File sizes, function counts (from grep/wc)
   - Imports (coupling indicators)
   - Structure (directory organization)
3. List gaps explicitly
   - "Not analyzed: Test coverage, runtime behavior, security patterns"
4. Mark confidence appropriately
   - "Low confidence: Small sample, structural review only"

Better to be honest about limitations than to guess.
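
For the "document what you CAN observe" step, here is a sketch of a structural pass using only the Python standard library. The `src` directory and the sample size of five files are assumptions, and the pass deliberately reports nothing it cannot measure:

```python
import ast
from pathlib import Path

def structural_metrics(path: Path) -> list[dict]:
    """Collect observable facts (lengths, docstrings) from one Python file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    facts = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            facts.append({
                "file": str(path),
                "function": node.name,
                "lines": (node.lineno, node.end_lineno),
                "length": node.end_lineno - node.lineno + 1,
                "has_docstring": ast.get_docstring(node) is not None,
            })
    return facts

# Sample a few files rather than claiming full coverage; say so in the report.
for source in sorted(Path("src").rglob("*.py"))[:5]:  # "src" is an assumption
    for fact in structural_metrics(source):
        print(f"{fact['file']}:{fact['lines'][0]}-{fact['lines'][1]} "
              f"{fact['function']} ({fact['length']} lines, "
              f"docstring={'yes' if fact['has_docstring'] else 'no'})")
```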

## Output Contract

Write to `05-quality-assessment.md`:

    # Code Quality Assessment

    **Analysis Date:** YYYY-MM-DD
    **Scope:** [What you reviewed - be specific]
    **Methodology:** [How you analyzed - tools, sampling strategy]
    **Confidence:** [High/Medium/Low with reasoning]

    ## Severity Framework

    [Define Critical/High/Medium/Low with examples]

    ## Findings

    ### [Category: Complexity/Duplication/etc.]

    **Observed:**
    - [File path:line numbers] [Concrete observation]
    - [Quote code or describe specifically]

    **Severity:** [Level] - [Reasoning based on framework]

    **Recommendation:** [Specific action]

    ### [Repeat for each finding]

    ## What Was NOT Analyzed

    - [Explicitly list gaps]
    - [No speculation about these areas]

    ## Recommendations

    [Prioritized by severity]

    ## Limitations

    - Sample size: [N files from M subsystems]
    - Methodology constraints: [No tools, time pressure, etc.]
    - Confidence assessment: [Why Medium/Low]
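
A draft can be checked against this contract mechanically. A tiny sketch; the heading list is copied from the template above:

```python
# Headings copied from the contract above; the checker itself is a sketch.
REQUIRED_SECTIONS = [
    "## Severity Framework",
    "## Findings",
    "## What Was NOT Analyzed",
    "## Recommendations",
    "## Limitations",
]

def missing_sections(report_text: str) -> list[str]:
    """Return the required headings absent from a draft report."""
    return [h for h in REQUIRED_SECTIONS if h not in report_text]

# Usage:
# missing = missing_sections(Path("05-quality-assessment.md").read_text())
# if missing: print("Contract violations:", missing)
```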

## Success Criteria

You succeeded when:

- Every finding has file:line evidence
- No weasel words ("likely", "may", "unclear if")
- Severity framework defined before use
- Observations distinguished from inferences
- Gaps acknowledged as "not analyzed"
- Output follows contract

You failed when:

- Speculation instead of evidence
- Vague findings ("functions may be long")
- Undefined severity ratings
- "Unclear if" everywhere
- Professional-looking guesswork

## Common Rationalizations (STOP SIGNALS)

| Excuse | Reality |
| --- | --- |
| "I'll provide value with educated guesses" | Speculation is not analysis. Be honest about gaps. |
| "I can infer from file names" | File names ≠ implementation. Acknowledge limitation. |
| "Stakeholders expect complete analysis" | They expect accurate analysis. Partial truth > full speculation. |
| "I reviewed it during cataloging" | Catalog focus ≠ quality focus. If you didn't examine for quality, say so. |
| "Professional reports don't say 'I don't know'" | Professional reports distinguish facts from limitations. |

## Integration with Workflow

Typically invoked after subsystem catalog completion, before or alongside diagram generation. Feeds findings into final report and architect handover.