---
name: flaky-test-prevention
description: Use when debugging intermittent test failures, choosing between retries vs fixes, quarantining flaky tests, calculating flakiness rates, or preventing non-deterministic behavior - provides root cause diagnosis, anti-patterns, and systematic debugging
---

# Flaky Test Prevention

## Overview

**Core principle:** Fix root causes, don't mask symptoms.

**Rule:** Flaky tests indicate real problems - in test design, application code, or infrastructure.

## Flakiness Decision Tree

| Symptom | Root Cause Category | Diagnostic | Fix |
|---------|---------------------|------------|-----|
| Passes alone, fails in suite | Test Interdependence | Run tests in random order | Use test isolation (transactions, unique IDs) |
| Fails randomly ~10% | Timing/Race Condition | Add logging, run 100x | Replace sleeps with explicit waits |
| Fails only in CI, not locally | Environment Difference | Compare CI vs local env | Match environments, use containers |
| Fails at specific times | Time Dependency | Check for date/time usage | Mock system time |
| Fails under load | Resource Contention | Run in parallel locally | Add resource isolation, increase limits |
| Different results each run | Non-Deterministic Code | Check for randomness | Seed random generators, use fixtures |

**First step:** Identify symptom, trace to root cause category.

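The diagnostics in the table map to a handful of commands. A minimal sketch, assuming the plugins listed later in this skill (pytest-randomly, pytest-repeat, pytest-xdist) are installed; `test_suspect.py` is a placeholder name:

```bash
# Random order - exposes test interdependence (pytest-randomly shuffles by default)
pytest tests/

# Repeat one suspect test 100x - establishes a failure rate (pytest-repeat)
pytest --count=100 test_suspect.py

# Parallel run - surfaces race conditions and resource contention (pytest-xdist)
pytest -n 4 tests/
```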
## Anti-Patterns Catalog

### ❌ Sleepy Assertion
**Symptom:** Using fixed `sleep()` or `wait()` instead of condition-based waits

**Why bad:** Wastes time on fast runs, still fails on slow runs, brittle

**Fix:** Explicit waits for conditions

```python
# ❌ Bad
time.sleep(5)  # Hope 5 seconds is enough
assert element.text == "Loaded"

# ✅ Good
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, "status").text == "Loaded"  # Selenium 4 API
)
assert element.text == "Loaded"
```

---
### ❌ Test Interdependence
**Symptom:** Tests pass when run in specific order, fail when shuffled

**Why bad:** Hidden dependencies, can't run in parallel, breaks test isolation

**Fix:** Each test creates its own data, no shared state

```python
# ❌ Bad
def test_create_user():
    user = create_user("test_user")  # Shared ID

def test_update_user():
    update_user("test_user")  # Depends on test_create_user

# ✅ Good
def test_create_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)

def test_update_user():
    user_id = f"user_{uuid4()}"
    user = create_user(user_id)  # Independent
    update_user(user_id)
```

---

### ❌ Hidden Dependencies
**Symptom:** Tests fail due to external state (network, database, file system) beyond test control

**Why bad:** Unpredictable failures, environment-specific issues

**Fix:** Mock external dependencies

```python
# ❌ Bad
def test_weather_api():
    response = requests.get("https://api.weather.com/...")
    assert response.json()["temp"] > 0  # Fails if API is down

# ✅ Good
@mock.patch('requests.get')
def test_weather_api(mock_get):
    mock_get.return_value.json.return_value = {"temp": 75}
    response = get_weather("Seattle")
    assert response["temp"] == 75
```

---
### ❌ Time Bomb
**Symptom:** Tests that depend on current date/time and fail at specific moments (midnight, month boundaries, DST)

**Why bad:** Fails unpredictably based on when tests run

**Fix:** Mock system time

```python
# ❌ Bad
def test_expiration():
    created_at = datetime.now()
    assert is_expired(created_at) == False  # Fails at midnight

# ✅ Good
@freeze_time("2025-11-15 12:00:00")
def test_expiration():
    created_at = datetime(2025, 11, 15, 12, 0, 0)
    assert is_expired(created_at) == False
```

---

### ❌ Timeout Inflation
**Symptom:** Continuously increasing timeouts to "fix" flaky tests (5s → 10s → 30s)

**Why bad:** Masks root cause, slows test suite, doesn't guarantee reliability

**Fix:** Investigate why operation is slow, use explicit waits

```javascript
// ❌ Bad
await page.waitFor(30000)  // Increased from 5s hoping it helps

// ✅ Good
await page.waitForSelector('.data-loaded', {timeout: 10000})
await page.waitForNetworkIdle()
```

## Detection Strategies

### Proactive Identification

**Run tests multiple times (statistical detection):**

```bash
# pytest with repeat plugin
pip install pytest-repeat
pytest --count=50 test_flaky.py

# Track pass rate
# 50/50 passed = 100% reliable
# 45/50 passed = 90% pass rate, 10% flaky (investigate immediately)
# Pass rate <95% = quarantine
```

**CI Integration (automatic tracking):**

```yaml
# GitHub Actions example
- name: Run tests with flakiness detection
  run: |
    pytest --count=3 --junit-xml=results.xml
    python scripts/calculate_flakiness.py results.xml
```

**Flakiness metrics to track:**
- Pass rate per test (target: >99%)
- Mean Time Between Failures (MTBF)
- Failure clustering (same test failing together)

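The `scripts/calculate_flakiness.py` helper referenced in the CI step is not spelled out in this skill; below is a minimal sketch. It assumes JUnit XML produced by `pytest --junit-xml` with pytest-repeat, and groups repeated runs by stripping the `[i-N]` suffix from test IDs:

```python
# calculate_flakiness.py - illustrative sketch, adapt to your CI
import re
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict

def flaky_tests(junit_xml_path, min_pass_rate=99.0):
    runs = defaultdict(lambda: [0, 0])  # test name -> [total runs, failed runs]
    for case in ET.parse(junit_xml_path).getroot().iter("testcase"):
        # pytest-repeat appends suffixes like "[3-50]"; strip them to group runs
        name = re.sub(r"\[\d+-\d+\]$", "", f"{case.get('classname')}::{case.get('name')}")
        runs[name][0] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            runs[name][1] += 1
    flaky = []
    for name, (total, failed) in sorted(runs.items()):
        pass_rate = 100.0 * (total - failed) / total
        if pass_rate < min_pass_rate:
            flaky.append(name)
            print(f"FLAKY {name}: {pass_rate:.1f}% pass rate ({failed}/{total} runs failed)")
    return flaky

if __name__ == "__main__":
    sys.exit(1 if flaky_tests(sys.argv[1]) else 0)
```

The non-zero exit code lets the CI step fail whenever any test drops below the threshold.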
### Systematic Debugging

**When a test fails intermittently:**

1. **Reproduce consistently** - Run 100x to establish failure rate
2. **Isolate** - Run alone, with subset, with full suite (find interdependencies)
3. **Add logging** - Capture state before assertion, screenshot on failure (see the conftest.py sketch after this list)
4. **Bisect** - If fails in suite, binary search which other test causes it
5. **Environment audit** - Compare CI vs local (env vars, resources, timing)

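One way to get the extra logging from step 3 without touching every test is a pytest hook in `conftest.py`. A minimal sketch, assuming plain pytest; the screenshot call is a placeholder for whatever artifact your tests can produce:

```python
# conftest.py - capture extra context whenever a test's call phase fails
import pytest

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed:
        # Log enough state to diagnose the flake later (swap in your own details)
        print(f"\n[flaky-debug] {item.nodeid} failed: {call.excinfo.exconly()}")
        # e.g. driver.save_screenshot(f"screenshots/{item.name}.png")  # if using Selenium
```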
## Flakiness Metrics Guide

**Calculating flake rate:**

```python
# Flakiness formula
flake_rate = (failed_runs / total_runs) * 100

# Example
# Test run 100 times: 7 failures
# Flake rate = 7/100 = 7%
```

**Thresholds:**

| Flake Rate | Action | Priority |
|------------|--------|----------|
| 0% (100% pass) | Reliable | Monitor |
| 0.1-1% | Investigate | Low |
| 1-5% | Quarantine + Fix | Medium |
| 5-10% | Quarantine + Fix Urgently | High |
| >10% | Disable immediately | Critical |

**Target:** All tests should maintain >99% pass rate (<1% flake rate)

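If you automate tracking (for example in the flakiness script sketched earlier), the threshold table can be encoded directly. An illustrative mapping, not part of any tool:

```python
def triage(flake_rate: float) -> str:
    """Map a flake rate (%) to the action from the thresholds table above."""
    if flake_rate == 0:
        return "reliable - monitor"
    if flake_rate <= 1:
        return "investigate (low priority)"
    if flake_rate <= 5:
        return "quarantine + fix (medium priority)"
    if flake_rate <= 10:
        return "quarantine + fix urgently (high priority)"
    return "disable immediately (critical)"
```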
## Quarantine Workflow

**Purpose:** Keep CI green while fixing flaky tests systematically

**Process:**

1. **Detect** - Test fails >1% of runs
2. **Quarantine** - Mark with `@pytest.mark.quarantine`, exclude from CI
3. **Track** - Create issue with flake rate, failure logs, reproduction steps
4. **Fix** - Assign owner, set SLA (e.g., 2 weeks to fix or delete)
5. **Validate** - Run fixed test 100x, must achieve >99% pass rate
6. **Re-enable** - Remove quarantine mark, monitor for 1 week

**Marking quarantined tests:**

```python
@pytest.mark.quarantine(reason="Flaky due to timing issue #1234")
@pytest.mark.skip("Quarantined")
def test_flaky_feature():
    pass
```
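`quarantine` is a custom marker, so pytest will warn about it (and `--strict-markers` will reject it) unless it is registered. One way to register it, assuming a `conftest.py` at the repo root; a `markers` entry in `pytest.ini` works equally well:

```python
# conftest.py - register the custom marker so `pytest -m "not quarantine"` is reliable
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine(reason): test is quarantined pending a flakiness fix",
    )
```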
**CI configuration:**

```bash
# Run all tests except quarantined
pytest -m "not quarantine"
```

**SLA:** Quarantined tests must be fixed within 2 weeks or deleted. No test stays quarantined indefinitely.

## Tool Ecosystem Quick Reference

| Tool | Purpose | When to Use |
|------|---------|-------------|
| **pytest-repeat** | Run test N times | Statistical detection |
| **pytest-xdist** | Parallel execution | Expose race conditions |
| **pytest-rerunfailures** | Auto-retry on failure | Temporary mitigation during fix |
| **pytest-randomly** | Randomize test order | Detect test interdependence |
| **freezegun** | Mock system time | Fix time bombs |
| **pytest-timeout** | Prevent hanging tests | Catch infinite loops |

**Installation:**

```bash
pip install pytest-repeat pytest-xdist pytest-rerunfailures pytest-randomly freezegun pytest-timeout
```

**Usage examples:**

```bash
# Detect flakiness (run 50x)
pytest --count=50 test_suite.py

# Detect interdependence (random order)
pytest --randomly-seed=12345 test_suite.py

# Expose race conditions (parallel)
pytest -n 4 test_suite.py

# Temporary mitigation (reruns, not a fix!)
pytest --reruns 2 --reruns-delay 1 test_suite.py
```

## Prevention Checklist

**Use during test authoring to prevent flakiness** (a fixture sketch covering several items follows the list):

- [ ] No fixed `time.sleep()` - use explicit waits for conditions
- [ ] Each test creates its own data (UUID-based IDs)
- [ ] No shared global state between tests
- [ ] External dependencies mocked (APIs, network, databases)
- [ ] Time/date frozen with `@freeze_time` if time-dependent
- [ ] Random values seeded (`random.seed(42)`)
- [ ] Tests pass when run in any order (run with pytest-randomly installed)
- [ ] Tests pass when run in parallel (`pytest -n 4`)
- [ ] Tests pass 100/100 times (`pytest --count=100`)
- [ ] Teardown cleans up all resources (files, database, cache)

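A sketch of `conftest.py` fixtures that back several of the items above (seeded randomness, unique IDs, per-test scratch space); the names are illustrative and the autouse seed assumes tests draw their "random" data from Python's `random` module:

```python
# conftest.py - shared isolation fixtures (illustrative)
import random
import uuid

import pytest

@pytest.fixture(autouse=True)
def seeded_random():
    # Deterministic "random" data in every test, every run
    random.seed(42)

@pytest.fixture
def unique_id():
    # Each test gets its own ID, so tests never collide on shared records
    return f"test_{uuid.uuid4()}"

@pytest.fixture
def scratch_dir(tmp_path):
    # tmp_path is unique per test and cleaned up by pytest
    d = tmp_path / "scratch"
    d.mkdir()
    return d
```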
## Common Fixes Quick Reference

| Problem | Fix Pattern | Example |
|---------|-------------|---------|
| **Timing issues** | Explicit waits | `WebDriverWait(driver, 10).until(condition)` |
| **Test interdependence** | Unique IDs per test | `user_id = f"test_{uuid4()}"` |
| **External dependencies** | Mock/stub | `@mock.patch('requests.get')` |
| **Time dependency** | Freeze time | `@freeze_time("2025-11-15")` |
| **Random behavior** | Seed randomness | `random.seed(42)` |
| **Shared state** | Test isolation | Transactions, teardown fixtures |
| **Resource contention** | Unique resources | Separate temp dirs, DB namespaces |

## Your First Flaky Test Fix

**Systematic approach for first fix:**

**Step 1: Reproduce (Day 1)**

```bash
# Run test 100 times, capture failures
pytest --count=100 --verbose test_flaky.py | tee output.log
```

**Step 2: Categorize (Day 1)**

Check output.log (a few grep one-liners follow this list):
- Same failure message? → Likely timing/race condition
- Different failures? → Likely test interdependence
- Only fails in CI? → Environment difference

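A few shell one-liners for that triage; the patterns are illustrative and depend on your actual failure messages:

```bash
# How many of the 100 runs failed?
grep -c "FAILED" output.log

# Which tests failed (collapse pytest-repeat's [i-100] suffixes)?
grep "FAILED" output.log | sed 's/\[[^]]*\]//' | sort | uniq -c

# Do the failures share one error message or several?
grep -E "Error|assert" output.log | sort | uniq -c | sort -rn | head
```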
**Step 3: Fix Based on Category (Day 2)**

**If timing issue:**

```python
# Before
time.sleep(2)
assert element.text == "Loaded"

# After
WebDriverWait(driver, 10).until(lambda d: element.text == "Loaded")
assert element.text == "Loaded"
```

**If interdependence:**

```python
# Before
user = User.objects.get(id=1)  # Assumes user exists

# After
user = create_test_user(id=f"test_{uuid4()}")  # Creates own data
```

**Step 4: Validate (Day 2)**

```bash
# Must pass 100/100 times
pytest --count=100 test_flaky.py
# Expected: 100 passed
```

**Step 5: Monitor (Week 1)**

Track in CI - the test should maintain a >99% pass rate for 1 week before considering it fixed.

## CI-Only Flakiness (Can't Reproduce Locally)

**Symptom:** Test fails intermittently in CI but passes 100% locally

**Root cause:** Environment differences between CI and local (resources, parallelization, timing)

### Systematic CI Debugging

**Step 1: Environment Fingerprinting**

Capture exact environment in both CI and locally:

```python
# Add to conftest.py
import os, sys, platform, tempfile

def pytest_configure(config):
    print(f"Python: {sys.version}")
    print(f"Platform: {platform.platform()}")
    print(f"CPU count: {os.cpu_count()}")
    print(f"TZ: {os.environ.get('TZ', 'not set')}")
    print(f"Temp dir: {tempfile.gettempdir()}")
    print(f"Parallel: {os.environ.get('PYTEST_XDIST_WORKER', 'not parallel')}")
```

Run in both environments, compare all outputs.

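One way to do the comparison, assuming you save the fingerprint output to a file in each environment (file names are illustrative; `-s` disables output capture so the prints stay visible):

```bash
# Locally
pytest -s --collect-only -q > local_fingerprint.txt

# In CI (as an extra step), then download the file as an artifact
pytest -s --collect-only -q > ci_fingerprint.txt

# Compare
diff local_fingerprint.txt ci_fingerprint.txt
```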
**Step 2: Increase CI Observation Window**

For low-probability failures (<5%), run more iterations:

```yaml
# GitHub Actions example
- name: Run test 200x to catch a 1% flake
  run: pytest --count=200 --verbose --log-cli-level=DEBUG test.py

- name: Upload failure artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: failure-logs
    path: |
      *.log
      screenshots/
```

**Step 3: Check CI-Specific Factors**

| Factor | Diagnostic | Fix |
|--------|------------|-----|
| **Parallelization** | Run `pytest -n 4` locally | Add test isolation (unique IDs, transactions) |
| **Resource limits** | Compare CI RAM/CPU to local | Mock expensive operations, add retries |
| **Cold starts** | First run vs warm runs | Check caching assumptions |
| **Disk I/O speed** | CI may use slower disks | Mock file operations |
| **Network latency** | CI network may be slower/different | Mock external calls |

**Step 4: Replicate CI Environment Locally**

Use the exact CI container:

```bash
# Match the CI image (Ubuntu 22.04 in this example)
docker run -it ubuntu:22.04 bash

# Install dependencies
apt-get update && apt-get install -y python3 python3-pip
pip3 install pytest pytest-repeat

# Run test in container
pytest --count=500 test.py
```

**Step 5: Enable CI Debug Mode**

```yaml
# GitHub Actions - Interactive debugging
- name: Setup tmate session (on failure)
  if: failure()
  uses: mxschmitt/action-tmate@v3
```

### Quick CI Debugging Checklist

When a test fails only in CI:

- [ ] Capture environment fingerprint in both CI and local
- [ ] Run test with parallelization locally (`pytest -n auto`)
- [ ] Check for resource contention (CPU, memory, disk)
- [ ] Compare timezone settings (`TZ` env var)
- [ ] Upload CI artifacts (logs, screenshots) on failure
- [ ] Replicate CI environment with Docker
- [ ] Check for cold start issues (first vs subsequent runs)

## Common Mistakes

### ❌ Using Retries as Permanent Solution
**Fix:** Retries (`@pytest.mark.flaky` or `--reruns`) are temporary mitigation during investigation, not fixes

---

### ❌ No Flakiness Tracking
**Fix:** Track pass rates in CI, set up alerts for tests dropping below 99%

---

### ❌ Fixing Flaky Tests by Making Them Slower
**Fix:** Diagnose root cause - don't just add more wait time

---

### ❌ Ignoring Flaky Tests
**Fix:** Quarantine workflow - either fix or delete, never ignore indefinitely

## Quick Reference

**Flakiness Thresholds:**
- <1% flake rate: Monitor
- 1-5%: Quarantine + fix (medium priority)
- 5-10%: Quarantine + fix urgently (high priority)
- >10%: Disable immediately (critical)

**Root Cause Categories:**
1. Timing/race conditions → Explicit waits
2. Test interdependence → Unique IDs, test isolation
3. External dependencies → Mocking
4. Time bombs → Freeze time
5. Resource contention → Unique resources

**Detection Tools:**
- pytest-repeat (statistical detection)
- pytest-randomly (interdependence)
- pytest-xdist (race conditions)

**Quarantine Process:**
1. Detect (>1% flake rate)
2. Quarantine (mark, exclude from CI)
3. Track (create issue)
4. Fix (assign owner, 2-week SLA)
5. Validate (100/100 passes)
6. Re-enable (monitor 1 week)

## Bottom Line

**Flaky tests are fixable - find the root cause, don't mask with retries.**

Use detection tools to find flaky tests early. Categorize by symptom, diagnose the root cause, apply the pattern-based fix. Quarantine if needed, but always with an SLA to fix or delete.