---
name: quality-metrics-and-kpis
description: Use when setting up quality dashboards, defining test coverage targets, tracking quality trends, configuring CI/CD quality gates, or reporting quality metrics to stakeholders - provides metric selection, threshold strategies, and dashboard design patterns
---
# Quality Metrics & KPIs
## Overview
**Core principle:** Measure what matters. Track trends, not absolutes. Use metrics to drive action, not for vanity.
**Rule:** Every metric must have a defined threshold and action plan. If a metric doesn't change behavior, stop tracking it.
## Quality Metrics vs Vanity Metrics
| Type | Example | Problem | Better Metric |
|------|---------|---------|---------------|
| **Vanity** | "We have 10,000 tests!" | Doesn't indicate quality | Pass rate, flakiness rate |
| **Vanity** | "95% code coverage!" | Can be gamed, doesn't mean tests are good | Coverage delta (new code), mutation score |
| **Actionable** | "Test flakiness: 5% → 2%" | Drives action | Track trend, set target |
| **Actionable** | "P95 build time: 15 min" | Identifies bottleneck | Optimize slow tests |
**Actionable metrics answer:** "What should I fix next?"
---
## Core Quality Metrics
### 1. Test Pass Rate
**Definition:** % of tests that pass on first run
```
Pass Rate = (Passing Tests / Total Tests) × 100
```
**Thresholds:**
- **> 98%:** Healthy
- **95-98%:** Investigate failures
- **< 95%:** Critical (tests are unreliable)
**Why it matters:** Low pass rate means flaky tests or broken code
**Action:** If < 98%, run flaky-test-prevention skill
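**Measuring it (sketch):** A minimal way to compute pass rate in CI, assuming the suite is run with `pytest --junitxml=results.xml`; the file name and exit-code handling are illustrative.
```python
# pass_rate.py - compute first-run pass rate from a pytest JUnit XML report
# Assumes the suite was run with: pytest --junitxml=results.xml
import sys
import xml.etree.ElementTree as ET

def pass_rate(junit_xml_path: str) -> float:
    root = ET.parse(junit_xml_path).getroot()
    # pytest writes <testsuite>, wrapped in <testsuites> with the xunit2 family
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    total = int(suite.attrib["tests"])
    failed = int(suite.attrib.get("failures", 0)) + int(suite.attrib.get("errors", 0))
    return (total - failed) / total * 100 if total else 0.0

if __name__ == "__main__":
    rate = pass_rate("results.xml")
    print(f"Pass rate: {rate:.1f}%")
    sys.exit(0 if rate >= 98 else 1)  # fail the CI step below the 98% threshold
```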
---
### 2. Test Flakiness Rate
**Definition:** % of tests that fail intermittently
```
Flakiness Rate = (Flaky Tests / Total Tests) × 100
```
**How to measure:**
```bash
# Run each test 100 times (requires the pytest-repeat plugin)
pytest --count=100 test_checkout.py
# Flaky if passes 1-99 times (not 0 or 100)
```
**Thresholds:**
- **< 1%:** Healthy
- **1-5%:** Moderate (fix soon)
- **> 5%:** Critical (CI is unreliable)
**Action:** Fix flaky tests before adding new tests
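**Automating the check (sketch):** A rough way to classify a single test as flaky without the pytest-repeat plugin: rerun it N times via subprocess and flag it if it neither always passes nor always fails. The test ID and run count below are illustrative.
```python
# flakiness_check.py - rerun one test N times and classify it as flaky
import subprocess

def count_passes(test_id: str, runs: int = 100) -> int:
    passes = 0
    for _ in range(runs):
        # exit code 0 means the test passed on this run
        result = subprocess.run(["pytest", "-q", test_id], capture_output=True)
        passes += result.returncode == 0
    return passes

if __name__ == "__main__":
    runs = 100
    passes = count_passes("test_checkout.py::test_payment", runs)  # hypothetical test ID
    flaky = 0 < passes < runs
    print(f"{passes}/{runs} passes -> {'FLAKY' if flaky else 'stable'}")
```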
---
### 3. Code Coverage
**Definition:** % of code lines executed by tests
```
Coverage = (Executed Lines / Total Lines) × 100
```
**Thresholds (by test type):**
- **Unit tests:** 80-90% of business logic
- **Integration tests:** 60-70% of integration points
- **E2E tests:** 40-50% of critical paths
**Configuration (pytest):**
```ini
# .coveragerc
[run]
source = src
omit = */tests/*, */migrations/*
[report]
# Fail the run if total coverage < 80%
fail_under = 80
show_missing = True
```
**Anti-pattern:** 100% coverage as goal
**Why it's wrong:** Easy to game (tests that execute code without asserting anything)
**Better metric:** Coverage + mutation score (see mutation-testing skill)
---
### 4. Coverage Delta (New Code)
**Definition:** Coverage of newly added code
**Why it matters:** More actionable than absolute coverage
```bash
# Measure coverage on changed files only
pytest --cov=src --cov-report=term-missing \
  $(git diff --name-only origin/main...HEAD | grep '\.py$')
```
**Threshold:** 90% for new code (stricter than legacy)
**Action:** Block PR if new code coverage < 90%
---
### 5. Build Time (CI/CD)
**Definition:** Time from commit to merge-ready
**Track by stage:**
- **Lint/format:** < 30s
- **Unit tests:** < 2 min
- **Integration tests:** < 5 min
- **E2E tests:** < 15 min
- **Total PR pipeline:** < 20 min
**Why it matters:** Slow CI blocks developer productivity
**Action:** If build > 20 min, see test-automation-architecture for optimization patterns
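**Enforcing the budgets (sketch):** One way to turn the per-stage targets above into a gate, assuming stage durations (in seconds) have already been collected by the pipeline; stage names and measured values are illustrative.
```python
# check_stage_budgets.py - compare measured CI stage durations against budgets
import sys

# budgets in seconds, matching the per-stage targets above
BUDGETS = {"lint": 30, "unit": 120, "integration": 300, "e2e": 900, "total": 1200}

def violations(durations: dict[str, float]) -> list[str]:
    """Return human-readable messages for every stage over budget."""
    return [
        f"{stage}: {durations[stage]:.0f}s exceeds budget of {budget}s"
        for stage, budget in BUDGETS.items()
        if durations.get(stage, 0) > budget
    ]

if __name__ == "__main__":
    measured = {"lint": 25, "unit": 140, "integration": 280, "e2e": 700, "total": 1145}
    problems = violations(measured)
    print("\n".join(problems) or "All stages within budget")
    sys.exit(1 if problems else 0)  # non-zero exit fails the pipeline
```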
---
### 6. Test Execution Time Trend
**Definition:** How test suite duration changes over time
```python
# Track suite duration in CI and persist it for trend analysis
import json
import time

import pytest

start = time.time()
pytest.main()
duration = time.time() - start

metrics = {"test_duration_seconds": duration, "timestamp": time.time()}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
```
**Threshold:** < 5% growth per month
**Action:** If growth > 5%/month, parallelize tests or refactor slow tests
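**Checking the trend (sketch):** Comparing this month's average suite duration against last month's, using the `metrics.json` snapshots above; the input values are illustrative.
```python
# duration_trend.py - flag suite-duration growth above the 5%/month budget
def growth_percent(previous_seconds: float, current_seconds: float) -> float:
    """Relative month-over-month growth of suite duration, in percent."""
    return (current_seconds - previous_seconds) / previous_seconds * 100

if __name__ == "__main__":
    last_month, this_month = 540.0, 580.0  # illustrative monthly averages
    growth = growth_percent(last_month, this_month)
    print(f"Suite duration growth: {growth:+.1f}%")
    if growth > 5:
        print("Over the 5%/month budget - parallelize or refactor slow tests")
```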
---
### 7. Defect Escape Rate
**Definition:** Bugs found in production that should have been caught by tests
```
Defect Escape Rate = (Production Bugs / Total Releases) × 100
```
**Thresholds:**
- **< 2%:** Excellent
- **2-5%:** Acceptable
- **> 5%:** Tests are missing critical scenarios
**Action:** For each escape, write regression test to prevent recurrence
---
### 8. Mean Time to Detection (MTTD)
**Definition:** Time from bug introduction to discovery
```
MTTD = Bug Detection Time - Bug Introduction Time
```
**Thresholds:**
- **< 1 hour:** Excellent (caught in CI)
- **1-24 hours:** Good (caught in staging/canary)
- **> 24 hours:** Poor (caught in production)
**Action:** If MTTD > 24h, improve observability (see observability-and-monitoring skill)
---
### 9. Mean Time to Recovery (MTTR)
**Definition:** Time from bug detection to fix deployed
```
MTTR = Fix Deployment Time - Bug Detection Time
```
**Thresholds:**
- **< 1 hour:** Excellent
- **1-8 hours:** Acceptable
- **> 8 hours:** Poor
**Action:** If MTTR > 8h, improve rollback procedures (see testing-in-production skill)
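**Computing both (sketch):** MTTD and MTTR fall out of the same incident timestamps; a minimal sketch, assuming introduction/detection/fix times are exported from your incident tracker (the records below are illustrative).
```python
# incident_metrics.py - compute MTTD and MTTR from incident timestamps
from datetime import datetime, timedelta
from statistics import mean

# illustrative incident records
INCIDENTS = [
    {"introduced": datetime(2025, 1, 10, 9, 0),
     "detected":   datetime(2025, 1, 10, 9, 45),
     "fixed":      datetime(2025, 1, 10, 11, 30)},
    {"introduced": datetime(2025, 1, 12, 14, 0),
     "detected":   datetime(2025, 1, 13, 8, 0),
     "fixed":      datetime(2025, 1, 13, 10, 0)},
]

def mean_hours(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 3600

mttd = mean_hours([i["detected"] - i["introduced"] for i in INCIDENTS])
mttr = mean_hours([i["fixed"] - i["detected"] for i in INCIDENTS])
print(f"MTTD: {mttd:.1f}h, MTTR: {mttr:.1f}h")
```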
---
## Dashboard Design
### Grafana Dashboard Example
**Panel definitions (`grafana-dashboard.json`):**
```json
{
  "panels": [
    {
      "title": "Test Pass Rate (7 days)",
      "targets": [{ "expr": "sum(tests_passed) / sum(tests_total) * 100" }],
      "thresholds": [
        { "value": 95, "color": "red" },
        { "value": 98, "color": "yellow" },
        { "value": 100, "color": "green" }
      ]
    },
    {
      "title": "Build Time Trend (30 days)",
      "targets": [{ "expr": "avg_over_time(ci_build_duration_seconds[30d])" }]
    },
    {
      "title": "Coverage Delta (per PR)",
      "targets": [{ "expr": "coverage_new_code_percent" }],
      "thresholds": [
        { "value": 90, "color": "green" },
        { "value": 80, "color": "yellow" },
        { "value": 0, "color": "red" }
      ]
    }
  ]
}
```
---
### CI/CD Quality Gates
**GitHub Actions example:**
```yaml
# .github/workflows/quality-gates.yml
name: Quality Gates

on: [pull_request]

jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run tests with coverage
        run: pytest --cov=src --cov-report=json

      - name: Check coverage threshold
        run: |
          COVERAGE=$(jq '.totals.percent_covered' coverage.json)
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage ${COVERAGE}% below 80% threshold"
            exit 1
          fi

      - name: Check build time
        run: |
          DURATION=$(jq '.duration' test-results.json)
          if (( $(echo "$DURATION > 300" | bc -l) )); then
            echo "Build time ${DURATION}s exceeds 5-minute threshold"
            exit 1
          fi
```
---
## Reporting Patterns
### Weekly Quality Report
**Template:**
```markdown
# Quality Report - Week of 2025-01-13
## Summary
- **Test pass rate:** 98.5% (+0.5% from last week)
- **Flakiness rate:** 2.1% (-1.3% from last week) ✅
- **Coverage:** 85.2% (+2.1% from last week) ✅
- **Build time:** 18 min (-2 min from last week) ✅
## Actions Taken
- Fixed 8 flaky tests in checkout flow
- Added integration tests for payment service (+5% coverage)
- Parallelized slow E2E tests (reduced build time by 2 min)
## Action Items
- [ ] Fix remaining 3 flaky tests in user registration
- [ ] Increase coverage of order service (currently 72%)
- [ ] Investigate why staging MTTD increased to 4 hours
```
---
### Stakeholder Dashboard (Executive View)
**Metrics to show:**
1. **Quality trend (6 months):** Pass rate over time
2. **Velocity impact:** How long does CI take per PR?
3. **Production stability:** Defect escape rate
4. **Recovery time:** MTTR for incidents
**What NOT to show:**
- Absolute test count (vanity metric)
- Lines of code (meaningless)
- Individual developer metrics (creates wrong incentives)
---
## Anti-Patterns Catalog
### ❌ Coverage as the Only Metric
**Symptom:** "We need 100% coverage!"
**Why bad:** Easy to game with meaningless tests
```python
# ❌ BAD: 100% coverage, 0% value
def calculate_tax(amount):
    return amount * 0.08

def test_calculate_tax():
    calculate_tax(100)  # Executes code, asserts nothing!
```
**Fix:** Use coverage + mutation score
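**For contrast:** the same test with a real assertion; a mutation that changes the `0.08` rate would now be caught.
```python
# ✅ GOOD: same coverage, but the assertion fails if the logic changes
import pytest

def test_calculate_tax():
    # calculate_tax is the function from the example above
    assert calculate_tax(100) == pytest.approx(8.0)  # 8% tax on 100
```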
---
### ❌ Tracking Metrics Without Thresholds
**Symptom:** Dashboard shows metrics but no action taken
**Why bad:** Metrics become noise
**Fix:** Every metric needs:
- **Target threshold** (e.g., flakiness < 1%)
- **Alert level** (e.g., alert if flakiness > 5%)
- **Action plan** (e.g., "Fix flaky tests before adding new features")
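**Sketch:** one way to keep threshold, alert level, and action next to each metric definition (names and values are illustrative):
```python
# metric_registry.py - every tracked metric carries its threshold, alert, and action
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    target: float           # the desired threshold
    alert: float            # the value that should trigger an alert
    higher_is_better: bool
    action: str             # what to do when the alert fires

METRICS = [
    Metric("flakiness_rate_pct", target=1, alert=5, higher_is_better=False,
           action="Fix flaky tests before adding new features"),
    Metric("test_pass_rate_pct", target=98, alert=95, higher_is_better=True,
           action="Run the flaky-test-prevention skill"),
]

def breached(metric: Metric, value: float) -> bool:
    """True when the measured value crosses the alert level."""
    return value < metric.alert if metric.higher_is_better else value > metric.alert

current = {"flakiness_rate_pct": 6.2, "test_pass_rate_pct": 97.0}  # illustrative values
for m in METRICS:
    if breached(m, current[m.name]):
        print(f"ALERT {m.name}={current[m.name]}: {m.action}")
```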
---
### ❌ Optimizing for Metrics, Not Quality
**Symptom:** Gaming metrics to hit targets
**Example:** Removing tests to increase pass rate
**Fix:** Track multiple complementary metrics (pass rate + flakiness + coverage)
---
### ❌ Measuring Individual Developer Productivity
**Symptom:** "Developer A writes more tests than Developer B"
**Why bad:** Creates wrong incentives (quantity over quality)
**Fix:** Measure team metrics, not individual
---
## Tool Integration
### SonarQube Metrics
**Quality Gate:**
```properties
# sonar-project.properties
sonar.qualitygate.wait=true
# Metrics tracked:
# - Bugs (target: 0)
# - Vulnerabilities (target: 0)
# - Code smells (target: < 100)
# - Coverage (target: > 80%)
# - Duplications (target: < 3%)
```
---
### Codecov Integration
```yaml
# codecov.yml
coverage:
  status:
    project:
      default:
        target: 80%      # Overall coverage target
        threshold: 2%    # Allow a 2% drop
    patch:
      default:
        target: 90%      # New code must have 90% coverage
        threshold: 0%    # No drops allowed
```
---
## Bottom Line
**Track actionable metrics with defined thresholds. Use metrics to drive improvement, not for vanity.**
**Core dashboard:**
- Test pass rate (> 98%)
- Flakiness rate (< 1%)
- Coverage delta on new code (> 90%)
- Build time (< 20 min)
- Defect escape rate (< 2%)
**Weekly actions:**
- Review metrics against thresholds
- Identify trends (improving/degrading)
- Create action items for violations
- Track progress on improvements
**If you're tracking a metric but not acting on it, stop tracking it. Metrics exist to drive action, not to fill dashboards.**