# Assessing Code Quality

## Purpose
Analyze code quality indicators beyond architecture and report EVIDENCE-BASED findings: a quality scorecard with concrete examples, file/line references, and actionable recommendations.
## When to Use
- Coordinator delegates quality assessment after subsystem catalog completion
- Need to identify specific, actionable quality issues beyond architectural patterns
- Output feeds into architect handover or improvement planning
## Core Principle: Evidence Over Speculation
Good quality analysis provides concrete evidence. Poor quality analysis makes educated guesses.

- ❌ BAD: "Payment service likely lacks error handling"
- ✅ GOOD: "Payment service process_payment() (line 127) raises generic Exception, loses error context"
- ❌ BAD: "Functions may be too long"
- ✅ GOOD: "src/api/orders.py:process_order() is 145 lines (lines 89-234)"
- ❌ BAD: "Error handling unclear"
- ✅ GOOD: "3 different error patterns: orders.py uses exceptions, payment.py returns codes, user.py mixes both"

**Your goal:** Document WHAT YOU OBSERVED, not what you guess might be true.
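
For illustration only, here is a minimal Python sketch of the two error-handling styles contrasted above; the names (`charge_card`, `process_payment_bad`, `process_payment_better`) are invented and do not come from any codebase under review.

```python
class PaymentError(Exception):
    """Domain-specific error that keeps the failing order's context."""

def charge_card(order_id):
    # Stand-in for a real payment-gateway call; it fails so the handlers below have something to catch.
    raise ConnectionError("gateway unreachable")

def process_payment_bad(order_id):
    try:
        charge_card(order_id)
    except Exception:
        # The pattern behind the GOOD finding above: the specific failure is replaced
        # by a bare, generic Exception, so callers can no longer tell what went wrong.
        raise Exception("payment failed")

def process_payment_better(order_id):
    try:
        charge_card(order_id)
    except ConnectionError as exc:
        # Raise a domain error and chain the original explicitly so the root cause
        # stays attached to the traceback.
        raise PaymentError(f"card charge failed for order {order_id}") from exc
```

When reporting this kind of issue, the evidence is the `except Exception` block and its file:line location, not a guess about intent.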
## The Speculation Trap (from baseline testing)
**Common rationalization:** "I don't have full context, so I'll make educated guesses to provide value."
**Reality:** Speculation masquerading as analysis is worse than saying "insufficient information."
Baseline failure mode:
- Document says "Error handling likely mixes patterns"
- Based on: Filename alone, not actual code review
- Should say: "Reviewed file structure only - implementation details not analyzed"
## Weasel Words - BANNED
If you catch yourself writing these, STOP:
| Banned Phrase | Why Banned | Replace With |
|---|---|---|
| "Likely" | Speculation | "Observed in file.py:line" OR "Not analyzed" |
| "May" | Hedge | "Found X" OR "Did not review X" |
| "Unclear if" | Evasion | "Code shows X" OR "Insufficient information to assess X" |
| "Appears to" | Guessing | "Lines 12-45 implement X" OR "Not examined" |
| "Probably" | Assumption | Concrete observation OR admit limitation |
| "Should" | Inference | "Currently does X" (observation) |
| "Suggests" | Reading between lines | Quote actual code OR acknowledge gap |
From baseline testing: Agents will use these to fill gaps. Skill must ban them explicitly.
## Evidence Requirements
Every finding MUST include:

- **File path** - Exact location
  - ✅ `src/services/payment.py`
  - ❌ "payment service"
- **Line numbers** - Specific range
  - ✅ "Lines 127-189 (63 lines)"
  - ❌ "Long function"
- **Code example or description** - What you saw
  - ✅ "Function has 8 nested if statements"
  - ✅ Quoted: `except Exception as e: pass`
  - ❌ "Complex logic"
- **Severity with reasoning** - Why this level
  - ✅ "Critical: Swallows exceptions in payment processing"
  - ❌ "High priority issue"

If you didn't examine implementation:
- ✅ "Reviewed imports and structure only - implementation not analyzed"
- ❌ "Function likely does X"
## Severity Framework (REQUIRED)
Define criteria BEFORE rating anything:

**Critical:**
- Security vulnerability (injection, exposed secrets, auth bypass)
- Data loss/corruption risk
- Blocks deployment
- Example: Hardcoded credentials, SQL injection, unhandled exceptions in critical path

**High:**
- Frequent source of bugs
- High maintenance burden
- Performance degradation
- Example: 200-line functions, 15% code duplication, N+1 queries

**Medium:**
- Moderate maintainability concern
- Technical debt accumulation
- Example: Missing docstrings, inconsistent error handling, magic numbers

**Low:**
- Minor improvement
- Style/cosmetic
- Example: Verbose naming, minor duplication (< 5%)
From baseline testing: Agents will use severity labels without defining them. Enforce framework first.
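
To ground the Critical examples, here is a hypothetical snippet of the kind of code such a finding points at; the table, query, and credential are invented for illustration and are not from the project being assessed.

```python
import sqlite3

DB_PASSWORD = "hunter2"  # Critical: credential hardcoded in source control

def find_user_bad(conn: sqlite3.Connection, username: str):
    # Critical: string interpolation lets a crafted username rewrite the SQL statement.
    return conn.execute(
        f"SELECT id, email FROM users WHERE name = '{username}'"
    ).fetchone()

def find_user_better(conn: sqlite3.Connection, username: str):
    # Parameter binding keeps the username as data rather than executable SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchone()
```

A finding citing `find_user_bad` would quote the interpolated query with its file:line reference, then rate it Critical under the framework above.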
## Observation vs. Inference
Distinguish clearly:

**Observation (what you saw):**
- "Function process_order() is 145 lines"
- "3 files contain identical validation logic (lines quoted)"
- "No docstrings found in src/api/ (0/12 functions)"

**Inference (what it might mean):**
- "145-line function suggests complexity - recommend review"
- "Identical validation in 3 files indicates duplication - recommend extraction"
- "Missing docstrings may hinder onboarding - recommend adding"

Always lead with observation, inference optional:

**Observed:** `src/api/orders.py:process_order()` is 145 lines (lines 89-234) with 12 decision points
**Assessment:** High complexity - recommend splitting into smaller functions

## "Insufficient Information" is Valid
When you haven't examined something, say so:

✅ Honest limitation:

    ## Testing Coverage
    **Not analyzed** - Test files not reviewed during subsystem cataloging
    **Recommendation:** Review test/ directory to assess coverage

❌ Speculation:

    ## Testing Coverage
    Testing likely exists but coverage unclear. May have integration tests. Probably needs more unit tests.

From baseline testing: Agents filled every section with speculation. Skill must make "not analyzed" acceptable.
## Minimal Viable Quality Assessment
If time/sample constraints prevent comprehensive analysis:

- **Acknowledge scope explicitly**
  - "Reviewed 5 files from 8 subsystems"
  - "Implementation details not examined (imports/structure only)"
- **Document what you CAN observe** (see the sketch at the end of this section)
  - File sizes, function counts (from grep/wc)
  - Imports (coupling indicators)
  - Structure (directory organization)
- **List gaps explicitly**
  - "Not analyzed: Test coverage, runtime behavior, security patterns"
- **Mark confidence appropriately**
  - "Low confidence: Small sample, structural review only"
Better to be honest about limitations than to guess.
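
A minimal sketch, assuming a Python codebase under `src/`, of how the structural facts above (file sizes, function counts, function lengths) can be collected without reading implementations; the 50-line threshold is an arbitrary example, not a rule from this skill.

```python
import ast
import pathlib

def structural_survey(root: str = "src", long_function: int = 50) -> None:
    """Print file sizes and function line spans - structure only, no semantic review."""
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        source = path.read_text()
        tree = ast.parse(source)
        functions = [
            node for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        print(f"{path}: {len(source.splitlines())} lines, {len(functions)} functions")
        for node in functions:
            length = node.end_lineno - node.lineno + 1
            if length > long_function:
                # Candidate for a complexity finding, citable as file:line evidence.
                print(f"  {node.name}: lines {node.lineno}-{node.end_lineno} ({length} lines)")

if __name__ == "__main__":
    structural_survey()
```

Each printed line is an observation that can be cited verbatim as evidence; whether it matters is the inference step.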
## Output Contract
Write to `05-quality-assessment.md`:

    # Code Quality Assessment
    **Analysis Date:** YYYY-MM-DD
    **Scope:** [What you reviewed - be specific]
    **Methodology:** [How you analyzed - tools, sampling strategy]
    **Confidence:** [High/Medium/Low with reasoning]

    ## Severity Framework
    [Define Critical/High/Medium/Low with examples]

    ## Findings

    ### [Category: Complexity/Duplication/etc.]
    **Observed:**
    - [File path:line numbers] [Concrete observation]
    - [Quote code or describe specifically]

    **Severity:** [Level] - [Reasoning based on framework]
    **Recommendation:** [Specific action]

    ### [Repeat for each finding]

    ## What Was NOT Analyzed
    - [Explicitly list gaps]
    - [No speculation about these areas]

    ## Recommendations
    [Prioritized by severity]

    ## Limitations
    - Sample size: [N files from M subsystems]
    - Methodology constraints: [No tools, time pressure, etc.]
    - Confidence assessment: [Why Medium/Low]
## Success Criteria
You succeeded when:
- Every finding has file:line evidence
- No weasel words ("likely", "may", "unclear if")
- Severity framework defined before use
- Observations distinguished from inferences
- Gaps acknowledged as "not analyzed"
- Output follows contract
You failed when:
- Speculation instead of evidence
- Vague findings ("functions may be long")
- Undefined severity ratings
- "Unclear if" everywhere
- Professional-looking guesswork
## Common Rationalizations (STOP SIGNALS)
| Excuse | Reality |
|---|---|
| "I'll provide value with educated guesses" | Speculation is not analysis. Be honest about gaps. |
| "I can infer from file names" | File names ≠ implementation. Acknowledge limitation. |
| "Stakeholders expect complete analysis" | They expect accurate analysis. Partial truth > full speculation. |
| "I reviewed it during cataloging" | Catalog focus ≠ quality focus. If you didn't examine for quality, say so. |
| "Professional reports don't say 'I don't know'" | Professional reports distinguish facts from limitations. |
## Integration with Workflow
Typically invoked after subsystem catalog completion, before or alongside diagram generation. Feeds findings into final report and architect handover.