---
name: quality-metrics-and-kpis
description: Use when setting up quality dashboards, defining test coverage targets, tracking quality trends, configuring CI/CD quality gates, or reporting quality metrics to stakeholders - provides metric selection, threshold strategies, and dashboard design patterns
---

# Quality Metrics & KPIs

## Overview

**Core principle:** Measure what matters. Track trends, not absolutes. Use metrics to drive action, not for vanity.

**Rule:** Every metric must have a defined threshold and action plan. If a metric doesn't change behavior, stop tracking it.

## Quality Metrics vs Vanity Metrics

| Type | Example | Problem | Better Metric |
|------|---------|---------|---------------|
| **Vanity** | "We have 10,000 tests!" | Doesn't indicate quality | Pass rate, flakiness rate |
| **Vanity** | "95% code coverage!" | Can be gamed, doesn't mean tests are good | Coverage delta (new code), mutation score |
| **Actionable** | "Test flakiness: 5% → 2%" | Drives action | Track trend, set target |
| **Actionable** | "P95 build time: 15 min" | Identifies bottleneck | Optimize slow tests |

**Actionable metrics answer:** "What should I fix next?"

---

## Core Quality Metrics

### 1. Test Pass Rate

**Definition:** % of tests that pass on first run

```
Pass Rate = (Passing Tests / Total Tests) × 100
```

**Thresholds:**

- **> 98%:** Healthy
- **95-98%:** Investigate failures
- **< 95%:** Critical (tests are unreliable)

**Why it matters:** Low pass rate means flaky tests or broken code

**Action:** If < 98%, run flaky-test-prevention skill

---

### 2. Test Flakiness Rate

**Definition:** % of tests that fail intermittently

```
Flakiness Rate = (Flaky Tests / Total Tests) × 100
```

**How to measure:**

```bash
# Run each test 100 times (--count requires the pytest-repeat plugin)
pytest --count=100 test_checkout.py

# Flaky if it passes 1-99 times (not 0 or 100)
```

**Thresholds:**

- **< 1%:** Healthy
- **1-5%:** Moderate (fix soon)
- **> 5%:** Critical (CI is unreliable)

**Action:** Fix flaky tests before adding new tests

---

### 3. Code Coverage

**Definition:** % of code lines executed by tests

```
Coverage = (Executed Lines / Total Lines) × 100
```

**Thresholds (by test type):**

- **Unit tests:** 80-90% of business logic
- **Integration tests:** 60-70% of integration points
- **E2E tests:** 40-50% of critical paths

**Configuration (pytest):**

```ini
# .coveragerc
[run]
source = src
omit =
    */tests/*
    */migrations/*

[report]
# Fail if coverage < 80%
fail_under = 80
show_missing = True
```

**Anti-pattern:** 100% coverage as goal

**Why it's wrong:** Easy to game (tests that execute code without asserting anything)

**Better metric:** Coverage + mutation score (see mutation-testing skill)

---

### 4. Coverage Delta (New Code)

**Definition:** Coverage of newly added code

**Why it matters:** More actionable than absolute coverage

```bash
# Measure coverage on changed files only
pytest --cov=src --cov-report=term-missing \
  $(git diff --name-only origin/main...HEAD | grep '\.py$')
```

**Threshold:** 90% for new code (stricter than legacy)

**Action:** Block PR if new code coverage < 90%

---

### 5. Build Time (CI/CD)

**Definition:** Time from commit to merge-ready

**Track by stage:**

- **Lint/format:** < 30s
- **Unit tests:** < 2 min
- **Integration tests:** < 5 min
- **E2E tests:** < 15 min
- **Total PR pipeline:** < 20 min

**Why it matters:** Slow CI blocks developer productivity

**Action:** If build > 20 min, see test-automation-architecture for optimization patterns
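The per-stage budgets above are easiest to enforce when each stage reports its own duration. Below is a minimal sketch of a pipeline driver that times each stage and writes the results to a JSON file; the stage commands and the `stage_durations.json` filename are illustrative assumptions, not tied to any particular CI system.

```python
# time_stages.py - sketch only: times each CI stage and records its duration.
# The commands below are placeholders; substitute your real lint/test commands.
import json
import subprocess
import time

STAGES = {
    "lint": ["ruff", "check", "src"],                      # assumed lint command
    "unit": ["pytest", "tests/unit", "-q"],                # assumed unit-test command
    "integration": ["pytest", "tests/integration", "-q"],  # assumed integration command
}

def run_stages() -> dict:
    """Run each stage in order, recording wall-clock duration in seconds."""
    durations = {}
    for name, cmd in STAGES.items():
        start = time.monotonic()
        result = subprocess.run(cmd)
        durations[name] = round(time.monotonic() - start, 2)
        if result.returncode != 0:
            break  # stop at the first failing stage
    return durations

if __name__ == "__main__":
    durations = run_stages()
    with open("stage_durations.json", "w") as f:
        json.dump(durations, f, indent=2)
    print(durations)
```

Comparing the recorded durations against the stage budgets above shows immediately which stage to optimize first.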
---

### 6. Test Execution Time Trend

**Definition:** How test suite duration changes over time

```python
# Track suite duration in CI
import json
import time

import pytest

start = time.time()
pytest.main()
duration = time.time() - start

metrics = {"test_duration_seconds": duration, "timestamp": time.time()}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
```

**Threshold:** < 5% growth per month

**Action:** If growth > 5%/month, parallelize tests or refactor slow tests

---

### 7. Defect Escape Rate

**Definition:** Bugs found in production that should have been caught by tests

```
Defect Escape Rate = (Production Bugs / Total Releases) × 100
```

**Thresholds:**

- **< 2%:** Excellent
- **2-5%:** Acceptable
- **> 5%:** Tests are missing critical scenarios

**Action:** For each escape, write a regression test to prevent recurrence

---

### 8. Mean Time to Detection (MTTD)

**Definition:** Time from bug introduction to discovery

```
MTTD = Bug Detection Time - Bug Introduction Time
```

**Thresholds:**

- **< 1 hour:** Excellent (caught in CI)
- **1-24 hours:** Good (caught in staging/canary)
- **> 24 hours:** Poor (caught in production)

**Action:** If MTTD > 24h, improve observability (see observability-and-monitoring skill)

---

### 9. Mean Time to Recovery (MTTR)

**Definition:** Time from bug detection to fix deployed

```
MTTR = Fix Deployment Time - Bug Detection Time
```

**Thresholds:**

- **< 1 hour:** Excellent
- **1-8 hours:** Acceptable
- **> 8 hours:** Poor

**Action:** If MTTR > 8h, improve rollback procedures (see testing-in-production skill)

---

## Dashboard Design

### Grafana Dashboard Example

```yaml
# grafana-dashboard.json
{
  "panels": [
    {
      "title": "Test Pass Rate (7 days)",
      "targets": [{"expr": "sum(tests_passed) / sum(tests_total) * 100"}],
      "thresholds": [
        {"value": 95, "color": "red"},
        {"value": 98, "color": "yellow"},
        {"value": 100, "color": "green"}
      ]
    },
    {
      "title": "Build Time Trend (30 days)",
      "targets": [{"expr": "avg_over_time(ci_build_duration_seconds[30d])"}]
    },
    {
      "title": "Coverage Delta (per PR)",
      "targets": [{"expr": "coverage_new_code_percent"}],
      "thresholds": [
        {"value": 90, "color": "green"},
        {"value": 80, "color": "yellow"},
        {"value": 0, "color": "red"}
      ]
    }
  ]
}
```

---

### CI/CD Quality Gates

**GitHub Actions example:**

```yaml
# .github/workflows/quality-gates.yml
name: Quality Gates
on: [pull_request]

jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run tests with coverage
        run: pytest --cov=src --cov-report=json

      - name: Check coverage threshold
        run: |
          COVERAGE=$(jq '.totals.percent_covered' coverage.json)
          if (( $(echo "$COVERAGE < 80" | bc -l) )); then
            echo "Coverage $COVERAGE% below 80% threshold"
            exit 1
          fi

      - name: Check build time
        run: |
          # Assumes an earlier step wrote test-results.json with a top-level
          # "duration" field (e.g. via a test-report plugin)
          DURATION=$(jq '.duration' test-results.json)
          if (( $(echo "$DURATION > 300" | bc -l) )); then
            echo "Build time ${DURATION}s exceeds 5-minute threshold"
            exit 1
          fi
```

---

## Reporting Patterns

### Weekly Quality Report

**Template:**

```markdown
# Quality Report - Week of 2025-01-13

## Summary
- **Test pass rate:** 98.5% (+0.5% from last week)
- **Flakiness rate:** 2.1% (-1.3% from last week) ✅
- **Coverage:** 85.2% (+2.1% from last week) ✅
- **Build time:** 18 min (-2 min from last week) ✅

## Actions Taken
- Fixed 8 flaky tests in checkout flow
- Added integration tests for payment service (+5% coverage)
- Parallelized slow E2E tests (reduced build time by 2 min)

## Action Items
- [ ] Fix remaining 3 flaky tests in user registration
- [ ] Increase coverage of order service (currently 72%)
- [ ] Investigate why staging MTTD increased to 4 hours
```
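The Summary section of this report does not need to be filled in by hand. Below is a minimal sketch of a generator, assuming a hypothetical `weekly_metrics.json` that stores this week's and last week's values; the file name and field names are illustrative assumptions.

```python
# report_summary.py - sketch only: prints the weekly report's Summary bullets
# from a metrics file. File name and field names are assumptions.
import json

def fmt_delta(current: float, previous: float) -> str:
    """Format the week-over-week change with an explicit sign."""
    return f"{current - previous:+.1f}"

def print_summary(path: str = "weekly_metrics.json") -> None:
    with open(path) as f:
        m = json.load(f)  # e.g. {"pass_rate": {"current": 98.5, "previous": 98.0}, ...}
    rows = [
        ("Test pass rate", "pass_rate", "%"),
        ("Flakiness rate", "flakiness_rate", "%"),
        ("Coverage", "coverage", "%"),
        ("Build time", "build_time_min", " min"),
    ]
    for label, key, unit in rows:
        cur, prev = m[key]["current"], m[key]["previous"]
        print(f"- **{label}:** {cur}{unit} ({fmt_delta(cur, prev)}{unit} from last week)")

if __name__ == "__main__":
    print_summary()
```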
---

### Stakeholder Dashboard (Executive View)

**Metrics to show:**

1. **Quality trend (6 months):** Pass rate over time
2. **Velocity impact:** How long does CI take per PR?
3. **Production stability:** Defect escape rate
4. **Recovery time:** MTTR for incidents

**What NOT to show:**

- Absolute test count (vanity metric)
- Lines of code (meaningless)
- Individual developer metrics (creates wrong incentives)

---

## Anti-Patterns Catalog

### ❌ Coverage as the Only Metric

**Symptom:** "We need 100% coverage!"

**Why bad:** Easy to game with meaningless tests

```python
# ❌ BAD: 100% coverage, 0% value
def calculate_tax(amount):
    return amount * 0.08

def test_calculate_tax():
    calculate_tax(100)  # Executes code, asserts nothing!
```

**Fix:** Use coverage + mutation score

---

### ❌ Tracking Metrics Without Thresholds

**Symptom:** Dashboard shows metrics but no action taken

**Why bad:** Metrics become noise

**Fix:** Every metric needs:

- **Target threshold** (e.g., flakiness < 1%)
- **Alert level** (e.g., alert if flakiness > 5%)
- **Action plan** (e.g., "Fix flaky tests before adding new features")

---

### ❌ Optimizing for Metrics, Not Quality

**Symptom:** Gaming metrics to hit targets

**Example:** Removing tests to increase pass rate

**Fix:** Track multiple complementary metrics (pass rate + flakiness + coverage)

---

### ❌ Measuring Individual Developer Productivity

**Symptom:** "Developer A writes more tests than Developer B"

**Why bad:** Creates wrong incentives (quantity over quality)

**Fix:** Measure team metrics, not individual

---

## Tool Integration

### SonarQube Metrics

**Quality Gate:**

```properties
# sonar-project.properties
sonar.qualitygate.wait=true

# Metrics tracked:
# - Bugs (target: 0)
# - Vulnerabilities (target: 0)
# - Code smells (target: < 100)
# - Coverage (target: > 80%)
# - Duplications (target: < 3%)
```

---

### Codecov Integration

```yaml
# codecov.yml
coverage:
  status:
    project:
      default:
        target: 80%      # Overall coverage target
        threshold: 2%    # Allow 2% drop
    patch:
      default:
        target: 90%      # New code must have 90% coverage
        threshold: 0%    # No drops allowed
```

---

## Bottom Line

**Track actionable metrics with defined thresholds. Use metrics to drive improvement, not for vanity.**

**Core dashboard:**

- Test pass rate (> 98%)
- Flakiness rate (< 1%)
- Coverage delta on new code (> 90%)
- Build time (< 20 min)
- Defect escape rate (< 2%)

**Weekly actions:**

- Review metrics against thresholds
- Identify trends (improving/degrading)
- Create action items for violations
- Track progress on improvements

**If you're tracking a metric but not acting on it, stop tracking it. Metrics exist to drive action, not to fill dashboards.**
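To make "review metrics against thresholds" a mechanical step rather than a judgment call, the core dashboard targets can be checked by a small script run weekly or as a CI gate. A minimal sketch follows; the metric names and the `metrics.json` layout are illustrative assumptions.

```python
# check_thresholds.py - sketch only: flags core dashboard metrics that violate
# their targets. Metric names and metrics.json layout are assumptions.
import json
import sys

# (metric key, target, direction): "min" means the value must be at least the
# target, "max" means it must be at most the target. Targets mirror the core dashboard.
THRESHOLDS = [
    ("pass_rate_percent", 98.0, "min"),
    ("flakiness_rate_percent", 1.0, "max"),
    ("new_code_coverage_percent", 90.0, "min"),
    ("build_time_minutes", 20.0, "max"),
    ("defect_escape_rate_percent", 2.0, "max"),
]

def violations(metrics: dict) -> list:
    problems = []
    for key, target, direction in THRESHOLDS:
        value = metrics[key]
        ok = value >= target if direction == "min" else value <= target
        if not ok:
            problems.append(f"{key} = {value} (target: {direction} {target})")
    return problems

if __name__ == "__main__":
    with open("metrics.json") as f:
        metrics = json.load(f)
    problems = violations(metrics)
    for p in problems:
        print(f"THRESHOLD VIOLATION: {p}")
    sys.exit(1 if problems else 0)  # non-zero exit lets CI fail the gate
```

Each reported violation maps directly to an action item in the weekly report.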