Initial commit

Zhongwei Li
2025-11-30 08:59:43 +08:00
commit 966ef521f7
25 changed files with 9763 additions and 0 deletions


@@ -0,0 +1,471 @@
---
name: api-testing-strategies
description: Use when testing REST/GraphQL APIs, designing API test suites, validating request/response contracts, testing authentication/authorization, handling API versioning, or choosing API testing tools - provides test pyramid placement, schema validation, and anti-patterns distinct from E2E browser testing
---
# API Testing Strategies
## Overview
**Core principle:** API tests sit between unit tests and E2E tests - faster than browser tests, more realistic than mocks.
**Rule:** Test APIs directly via HTTP/GraphQL, not through the UI. Browser tests are 10x slower and more flaky.
## API Testing vs E2E Testing
| Aspect | API Testing | E2E Browser Testing |
|--------|-------------|---------------------|
| **Speed** | Fast (10-100ms per test) | Slow (1-10s per test) |
| **Flakiness** | Low (no browser/JS) | High (timing, rendering) |
| **Coverage** | Business logic, data | Full user workflow |
| **Tools** | REST Client, Postman, pytest | Playwright, Cypress |
| **When to use** | Most backend testing | Critical user flows only |
**Test Pyramid placement:**
- **Unit tests (70%):** Individual functions/classes
- **API tests (20%):** Endpoints, business logic, integrations
- **E2E tests (10%):** Critical user workflows through browser
---
## Tool Selection Decision Tree
| Your Stack | Team Skills | Use | Why |
|-----------|-------------|-----|-----|
| **Python backend** | pytest familiar | **pytest + requests** | Best integration, fixtures |
| **Node.js/JavaScript** | Jest/Mocha | **supertest** | Express/Fastify native |
| **Any language, REST** | Prefer GUI | **Postman + Newman** | GUI for design, CLI for CI |
| **GraphQL** | Any | **pytest + gql** (Python) or **apollo-client** (JS) | Query validation |
| **Contract testing** | Microservices | **Pact** | Consumer-driven contracts |
**First choice:** Use your existing test framework (pytest/Jest) + HTTP client. Don't add new tools unnecessarily.
---
## Test Structure Pattern
### Basic REST API Test
```python
import pytest
import requests
@pytest.fixture
def api_client():
"""Base API client with auth."""
return requests.Session()
def test_create_order(api_client):
# Arrange: Set up test data
payload = {
"user_id": 123,
"items": [{"sku": "WIDGET", "quantity": 2}],
"shipping_address": "123 Main St"
}
# Act: Make API call
response = api_client.post(
"https://api.example.com/orders",
json=payload,
headers={"Authorization": "Bearer test_token"}
)
# Assert: Validate response
assert response.status_code == 201
data = response.json()
assert data["id"] is not None
assert data["status"] == "pending"
assert data["total"] > 0
```
---
### GraphQL API Test
```python
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
def test_user_query():
transport = RequestsHTTPTransport(url="https://api.example.com/graphql")
client = Client(transport=transport)
query = gql('''
query GetUser($id: ID!) {
user(id: $id) {
id
name
email
}
}
''')
result = client.execute(query, variable_values={"id": "123"})
assert result["user"]["id"] == "123"
assert result["user"]["email"] is not None
```
---
## What to Test
### 1. Happy Path (Required)
**Test successful requests with valid data.**
```python
def test_get_user_success():
response = api.get("/users/123")
assert response.status_code == 200
assert response.json()["name"] == "Alice"
```
---
### 2. Validation Errors (Required)
**Test API rejects invalid input.**
```python
def test_create_user_invalid_email():
response = api.post("/users", json={"email": "invalid"})
assert response.status_code == 400
assert "email" in response.json()["errors"]
```
---
### 3. Authentication & Authorization (Required)
**Test auth failures.**
```python
def test_unauthorized_without_token():
response = api.get("/orders", headers={}) # No auth token
assert response.status_code == 401
def test_forbidden_different_user():
response = api.get(
"/orders/999",
headers={"Authorization": "Bearer user_123_token"}
)
assert response.status_code == 403 # Can't access other user's orders
```
---
### 4. Edge Cases (Important)
```python
def test_pagination_last_page():
response = api.get("/users?page=999")
assert response.status_code == 200
assert response.json()["results"] == []
def test_large_payload():
items = [{"sku": f"ITEM_{i}", "quantity": 1} for i in range(1000)]
response = api.post("/orders", json={"items": items})
assert response.status_code in [201, 413] # Created or payload too large
```
---
### 5. Idempotency (For POST/PUT/DELETE)
**Test same request twice produces same result.**
```python
def test_create_user_idempotent():
payload = {"email": "alice@example.com", "name": "Alice"}
# First request
response1 = api.post("/users", json=payload)
user_id_1 = response1.json()["id"]
# Second identical request
response2 = api.post("/users", json=payload)
# Should return existing user, not create duplicate
assert response2.status_code in [200, 409] # OK or Conflict
if response2.status_code == 200:
assert response2.json()["id"] == user_id_1
```
---
## Schema Validation
**Use JSON Schema to validate response structure.**
```python
import jsonschema
USER_SCHEMA = {
"type": "object",
"properties": {
"id": {"type": "integer"},
"name": {"type": "string"},
"email": {"type": "string", "format": "email"}
},
"required": ["id", "name", "email"]
}
def test_user_response_schema():
response = api.get("/users/123")
data = response.json()
jsonschema.validate(instance=data, schema=USER_SCHEMA) # Raises if invalid
```
**Why it matters:** Prevents regressions where fields are removed or types change.
---
## API Versioning Tests
**Test multiple API versions simultaneously.**
```python
@pytest.mark.parametrize("version,expected_fields", [
("v1", ["id", "name"]),
("v2", ["id", "name", "email", "created_at"]),
])
def test_user_endpoint_version(version, expected_fields):
response = api.get(f"/{version}/users/123")
data = response.json()
for field in expected_fields:
assert field in data
```
---
## Anti-Patterns Catalog
### ❌ Testing Through the UI
**Symptom:** Using browser automation to test API functionality
```python
# ❌ BAD: Testing API via browser
def test_create_order():
page.goto("/orders/new")
page.fill("#item", "Widget")
page.click("#submit")
assert page.locator(".success").is_visible()
```
**Why bad:**
- 10x slower than API test
- Flaky (browser timing issues)
- Couples API test to UI changes
**Fix:** Test API directly
```python
# ✅ GOOD: Direct API test
def test_create_order():
response = api.post("/orders", json={"item": "Widget"})
assert response.status_code == 201
```
---
### ❌ Testing Implementation Details
**Symptom:** Asserting on database queries, internal logic
```python
# ❌ BAD: Testing implementation
def test_get_user():
with patch('database.execute') as mock_db:
api.get("/users/123")
        mock_db.assert_called_with("SELECT * FROM users WHERE id = 123")
```
**Why bad:** Couples test to implementation, not contract
**Fix:** Test only request/response contract
```python
# ✅ GOOD: Test contract only
def test_get_user():
response = api.get("/users/123")
assert response.status_code == 200
assert response.json()["id"] == 123
```
---
### ❌ No Test Data Isolation
**Symptom:** Tests interfere with each other
```python
# ❌ BAD: Shared test data
def test_update_user():
api.put("/users/123", json={"name": "Bob"})
assert api.get("/users/123").json()["name"] == "Bob"
def test_get_user():
# Fails if previous test ran!
assert api.get("/users/123").json()["name"] == "Alice"
```
**Fix:** Each test creates/cleans its own data (see test-isolation-fundamentals skill)
---
### ❌ Hardcoded URLs and Tokens
**Symptom:** Production URLs or real credentials in tests
```python
# ❌ BAD: Hardcoded production URL
def test_api():
response = requests.get("https://api.production.com/users")
```
**Fix:** Use environment variables or fixtures
```python
# ✅ GOOD: Configurable environment
import os
@pytest.fixture
def api_base_url():
return os.getenv("API_URL", "http://localhost:8000")
def test_api(api_base_url):
response = requests.get(f"{api_base_url}/users")
```
---
## Mocking External APIs
**When testing service A that calls service B:**
```python
import responses
@responses.activate
def test_payment_success():
# Mock Stripe API
responses.add(
responses.POST,
"https://api.stripe.com/v1/charges",
json={"id": "ch_123", "status": "succeeded"},
status=200
)
# Test your API
response = api.post("/checkout", json={"amount": 1000})
assert response.status_code == 200
assert response.json()["payment_status"] == "succeeded"
```
**When to mock:**
- External service costs money (Stripe, Twilio)
- External service is slow
- External service is unreliable
- Testing error handling (simulate failures)
**When NOT to mock:**
- Integration tests (use separate test suite with real services)
- Contract tests (use Pact to verify integration)
---
## Performance Testing APIs
**Use load testing for APIs separately from E2E:**
```python
# locust load test
from locust import HttpUser, task, between
class APIUser(HttpUser):
wait_time = between(1, 3)
@task
def get_users(self):
self.client.get("/users")
@task(3) # 3x more frequent
def get_user(self):
self.client.get("/users/123")
```
**Run with:**
```bash
locust -f locustfile.py --headless -u 100 -r 10 --run-time 60s
```
See load-testing-patterns skill for comprehensive guidance.
---
## CI/CD Integration
**API tests should run on every commit:**
```yaml
# .github/workflows/api-tests.yml
name: API Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run API tests
run: |
pytest tests/api/ -v
env:
API_URL: http://localhost:8000
API_TOKEN: ${{ secrets.TEST_API_TOKEN }}
```
**Test stages:**
- Commit: Smoke tests (5-10 critical endpoints, <1 min)
- PR: Full API suite (all endpoints, <5 min)
- Merge: API + integration tests (<15 min)
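One lightweight way to implement this split with pytest is a marker for the smoke subset (the `smoke` marker name and `/health` endpoint below are illustrative, not part of any framework):
```python
# test_smoke.py - tag the 5-10 most critical endpoint tests
import os
import pytest
import requests

API_URL = os.getenv("API_URL", "http://localhost:8000")

@pytest.mark.smoke  # register in pytest.ini: markers = smoke: critical-path checks
def test_health_endpoint():
    response = requests.get(f"{API_URL}/health", timeout=5)
    assert response.status_code == 200
```
The commit stage then runs `pytest tests/api -m smoke`, while the PR stage runs the full `pytest tests/api` suite.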
---
## Quick Reference: API Test Checklist
For each endpoint, test:
- [ ] **Happy path** (valid request → 200/201)
- [ ] **Validation** (invalid input → 400)
- [ ] **Authentication** (no token → 401)
- [ ] **Authorization** (wrong user → 403)
- [ ] **Not found** (missing resource → 404)
- [ ] **Idempotency** (duplicate request → same result)
- [ ] **Schema** (response matches expected structure)
- [ ] **Edge cases** (empty lists, large payloads, pagination)
---
## Bottom Line
**API tests are faster, more reliable, and provide better coverage than E2E browser tests for backend logic.**
- Test APIs directly, not through the browser
- Use your existing test framework (pytest/Jest) + HTTP client
- Validate schemas to catch breaking changes
- Mock external services to avoid flakiness and cost
- Run API tests on every commit (they're fast enough)
**If you're using browser automation to test API functionality, you're doing it wrong. Test APIs directly.**


@@ -0,0 +1,242 @@
---
name: chaos-engineering-principles
description: Use when starting chaos engineering, designing fault injection experiments, choosing chaos tools, testing system resilience, or recovering from chaos incidents - provides hypothesis-driven testing, blast radius control, and anti-patterns for safe chaos
---
# Chaos Engineering Principles
## Overview
**Core principle:** Chaos engineering validates resilience through controlled experiments, not random destruction.
**Rule:** Start in staging, with monitoring, with rollback, with small blast radius. No exceptions.
## When NOT to Do Chaos
Don't run chaos experiments if ANY of these are missing:
- ❌ No comprehensive monitoring (APM, metrics, logs, alerts)
- ❌ No automated rollback capability
- ❌ No baseline metrics documented
- ❌ No incident response team available
- ❌ System already unstable (fix stability first)
- ❌ No staging environment to practice
**Fix these prerequisites BEFORE chaos testing.**
## Tool Selection Decision Tree
| Your Constraint | Choose | Why |
|----------------|--------|-----|
| Kubernetes-native, CNCF preference | **LitmusChaos** | Cloud-native, operator-based, excellent K8s integration |
| Kubernetes-focused, visualization needs | **Chaos Mesh** | Fine-grained control, dashboards, low overhead |
| Want managed service, quick start | **Gremlin** | Commercial, guided experiments, built-in best practices |
| Vendor-neutral, maximum flexibility | **Chaos Toolkit** | Open source, plugin ecosystem, any infrastructure |
| AWS-specific, cost-sensitive | **AWS FIS** | Native AWS integration, pay-per-experiment |
**For most teams:** Chaos Toolkit (flexible, free) or Gremlin (fast, managed)
## Prerequisites Checklist
Before FIRST experiment:
**Monitoring (Required):**
- [ ] Real-time dashboards for key metrics (latency, error rate, throughput)
- [ ] Distributed tracing for request flows
- [ ] Log aggregation with timeline correlation
- [ ] Alerts configured with thresholds
**Rollback (Required):**
- [ ] Automated rollback based on metrics (e.g., error rate > 5% → abort)
- [ ] Manual kill switch everyone can activate
- [ ] Rollback tested and documented (< 30 sec recovery)
**Baseline (Required):**
- [ ] Documented normal metrics (P50/P95/P99 latency, error rate %)
- [ ] Known dependencies and critical paths
- [ ] System architecture diagram
**Team (Required):**
- [ ] Designated observer monitoring experiment
- [ ] On-call engineer available
- [ ] Communication channel established (war room, Slack)
- [ ] Post-experiment debrief scheduled
## Anti-Patterns Catalog
### ❌ Production First Chaos
**Symptom:** "Let's start chaos testing in production to see what breaks"
**Why bad:** No practice, no muscle memory, production incidents guaranteed
**Fix:** Run 5-10 experiments in staging FIRST. Graduate to production only after proving: experiments work as designed, rollback functions, team can execute response
---
### ❌ Chaos Without Monitoring
**Symptom:** "We injected latency but we're not sure what happened"
**Why bad:** Blind chaos = no learning. You can't validate resilience without seeing system behavior
**Fix:** Set up comprehensive monitoring BEFORE first experiment. Must be able to answer "What changed?" within 30 seconds
---
### ❌ Unlimited Blast Radius
**Symptom:** Affecting 100% of traffic/all services on first run
**Why bad:** Cascading failures, actual outages, customer impact
**Fix:** Start at 0.1-1% traffic. Progression: 0.1% → 1% → 5% → 10% → (stop or 50%). Each step validates before expanding
---
### ❌ Chaos Without Rollback
**Symptom:** "The experiment broke everything and we can't stop it"
**Why bad:** Chaos becomes real incident, 2+ hour recovery, lost trust
**Fix:** Automated abort criteria (error rate threshold, latency threshold, manual kill switch). Test rollback before injecting failures
---
### ❌ Random Chaos (No Hypothesis)
**Symptom:** "Let's inject some failures and see what happens"
**Why bad:** No learning objective, can't validate resilience, wasted time
**Fix:** Every experiment needs hypothesis: "System will [expected behavior] when [failure injected]"
## Failure Types Catalog
Priority order for microservices:
| Failure Type | Priority | Why Test This | Example |
|--------------|----------|---------------|---------|
| **Network Latency** | HIGH | Most common production issue | 500ms delay service A → B |
| **Service Timeout** | HIGH | Tests circuit breakers, retry logic | Service B unresponsive |
| **Connection Loss** | HIGH | Tests failover, graceful degradation | TCP connection drops |
| **Resource Exhaustion** | MEDIUM | Tests resource limits, scaling | Memory limit, connection pool full |
| **Packet Loss** | MEDIUM | Tests retry strategies | 1-10% packet loss |
| **DNS Failure** | MEDIUM | Tests service discovery resilience | DNS resolution delays |
| **Cache Failure** | MEDIUM | Tests fallback behavior | Redis down |
| **Database Errors** | LOW (start) | High risk - test after basics work | Connection refused, query timeout |
**Start with network latency** - safest, most informative, easiest rollback.
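As a concrete illustration, network latency can be injected at the host level with Linux `tc`/`netem` (interface name and values are placeholders; Kubernetes-native tools like LitmusChaos and Chaos Mesh wrap the same capability):
```bash
# Add 500ms of delay (with 50ms jitter) to outbound traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 500ms 50ms

# ...observe dashboards for the experiment window...

# Roll back: remove the netem qdisc
sudo tc qdisc del dev eth0 root netem
```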
## Experiment Template
Use this for every chaos experiment:
**1. Hypothesis**
"If [failure injected], system will [expected behavior], and [metric] will remain [threshold]"
Example: "If service-payment experiences 2s latency, circuit breaker will open within 10s, and P99 latency will stay < 500ms"
**2. Baseline Metrics**
- Current P50/P95/P99 latency:
- Current error rate:
- Current throughput:
**3. Experiment Config**
- Failure type: [latency / packet loss / service down / etc.]
- Target: [specific service / % of traffic]
- Blast radius: [0.1% traffic, single region, canary pods]
- Duration: [2-5 minutes initial]
- Abort criteria: [error rate > 5% OR P99 > 1s OR manual stop]
**4. Execution**
- Observer: [name] monitoring dashboards
- Runner: [name] executing experiment
- Kill switch: [procedure]
- Start time: [timestamp]
**5. Observation**
- What happened vs hypothesis:
- Actual metrics during chaos:
- System behavior notes:
**6. Validation**
- ✓ Hypothesis validated / ✗ Hypothesis failed
- Unexpected findings:
- Action items:
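For the abort criteria in step 3, the check can be automated. Below is a minimal sketch of a metrics watchdog, assuming a JSON metrics endpoint and illustrative field names and thresholds (managed tools such as Gremlin and Chaos Toolkit provide built-in halt conditions for this):
```python
import time
import requests

ERROR_RATE_THRESHOLD = 0.05    # abort if error rate > 5%
P99_LATENCY_THRESHOLD = 1.0    # abort if P99 latency > 1s

def should_abort(metrics_url: str) -> bool:
    """Fetch current metrics and compare against the abort criteria."""
    metrics = requests.get(metrics_url, timeout=5).json()
    return (
        metrics["error_rate"] > ERROR_RATE_THRESHOLD
        or metrics["p99_latency_seconds"] > P99_LATENCY_THRESHOLD
    )

def watch_experiment(metrics_url: str, duration_seconds: int, stop_experiment) -> str:
    """Poll metrics during the experiment; trigger the kill switch if criteria are hit."""
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        if should_abort(metrics_url):
            stop_experiment()          # rollback / kill-switch hook
            return "aborted"
        time.sleep(10)
    return "completed"
```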
## Blast Radius Progression
Safe scaling path:
| Step | Traffic Affected | Duration | Abort If |
|------|------------------|----------|----------|
| **1. Staging** | 100% staging | 5 min | Any production impact |
| **2. Canary** | 0.1% production | 2 min | Error rate > 1% |
| **3. Small** | 1% production | 5 min | Error rate > 2% |
| **4. Medium** | 5% production | 10 min | Error rate > 5% |
| **5. Large** | 10% production | 15 min | Error rate > 5% |
**Never skip steps.** Each step validates before expanding.
**Stop at 10-20% for most experiments** - no need to chaos 100% of production traffic.
**Low-traffic services (< 1000 req/day):** Use absolute request counts instead of percentages. Minimum 5-10 affected requests per step. Example: for a 100 req/day service, 0.1% of traffic is one request every 10 days; start with 5-10 affected requests instead (roughly 1-2 hours of traffic).
## Your First Experiment (Staging)
**Goal:** Build confidence, validate monitoring, test rollback
**Experiment:** Network latency on non-critical service
1. Pick least critical service (e.g., recommendation engine, not payment)
2. Inject 500ms latency to 100% of staging traffic (e.g., with Chaos Toolkit)
3. Duration: 5 minutes
4. Expected: Timeouts handled gracefully, fallback behavior activates
5. Monitor: Error rate, latency, downstream services
6. Abort if: Error rate > 10% or cascading failures
7. Debrief: What did we learn? Did monitoring catch it? Did rollback work?
**Success criteria:** You can answer "Did our hypothesis hold?" within 5 minutes of experiment completion.
## Common Mistakes
### ❌ Testing During Incidents
**Fix:** Only chaos test during stable periods, business hours, with extra staffing
---
### ❌ Network Latency Underestimation
**Fix:** Latency cascades - 500ms can become 5s downstream. Start with 100-200ms, observe, then increase
---
### ❌ No Post-Experiment Review
**Fix:** Every experiment gets 15-min debrief: What worked? What broke? What did we learn?
## Quick Reference
**Prerequisites Before First Chaos:**
1. Monitoring + alerts
2. Automated rollback
3. Baseline metrics documented
4. Team coordinated
**Experiment Steps:**
1. Write hypothesis
2. Document baseline
3. Define blast radius (start 0.1%)
4. Set abort criteria
5. Execute with observer
6. Validate hypothesis
7. Debrief team
**Blast Radius Progression:**
Staging → 0.1% → 1% → 5% → 10% (stop for most experiments)
**First Experiment:**
Network latency (500ms) on non-critical service in staging for 5 minutes
## Bottom Line
**Chaos engineering is hypothesis-driven science, not random destruction.**
Start small (staging, 0.1% traffic), with monitoring, with rollback. Graduate slowly.


@@ -0,0 +1,524 @@
---
name: contract-testing
description: Use when implementing Pact contracts, choosing consumer-driven vs provider-driven approaches, handling breaking API changes, setting up contract brokers, or preventing service integration issues - provides tool selection, anti-patterns, and workflow patterns
---
# Contract Testing
## Overview
**Core principle:** Test the contract, not the implementation. Verify integration points independently.
**Rule:** Contract tests catch breaking changes before deployment, not in production.
## Tool Selection Decision Tree
| Your Stack | Team Structure | Use | Why |
|-----------|----------------|-----|-----|
| Polyglot microservices | Multiple teams | **Pact** | Language-agnostic, mature broker |
| Java Spring ecosystem | Coordinated teams | **Spring Cloud Contract** | Spring integration, code-first |
| GraphQL APIs | Known consumers | **Pact + GraphQL** | Query validation |
| OpenAPI/REST | Public/many consumers | **OpenAPI Spec Testing** | Schema-first, documentation |
**First choice:** Pact (most mature ecosystem, widest language support)
**Why contract testing:** Catches API breaking changes in CI, not production. Teams test independently without running dependencies.
## Contract Type Decision Framework
| Scenario | Approach | Tools |
|----------|----------|-------|
| **Internal microservices, known consumers** | Consumer-Driven (CDC) | Pact, Spring Cloud Contract |
| **Public API, many unknown consumers** | Provider-Driven (Schema-First) | OpenAPI validation, Spectral |
| **Both internal and external consumers** | Bi-Directional | Pact + OpenAPI |
| **Event-driven/async messaging** | Message Pact | Pact (message provider/consumer) |
**Default:** Consumer-driven for internal services, schema-first for public APIs
## Anti-Patterns Catalog
### ❌ Over-Specification
**Symptom:** Contract tests verify exact response format, including fields consumer doesn't use
**Why bad:** Brittle tests, provider can't evolve API, false positives
**Fix:** Only specify what consumer actually uses
```javascript
// ❌ Bad - over-specified
.willRespondWith({
status: 200,
body: {
id: 123,
name: 'John',
email: 'john@example.com',
created_at: '2023-01-01',
updated_at: '2023-01-02',
phone: '555-1234',
address: {...} // Consumer doesn't use these
}
})
// ✅ Good - specify only what's used
.willRespondWith({
status: 200,
body: {
id: Matchers.integer(123),
name: Matchers.string('John')
}
})
```
---
### ❌ Testing Implementation Details
**Symptom:** Contract tests verify database queries, internal logic, or response timing
**Why bad:** Couples tests to implementation, not contract
**Fix:** Test only request/response contract, not how provider implements it
```javascript
// ❌ Bad - testing implementation
expect(provider.database.queryCalled).toBe(true)
// ✅ Good - testing contract only
expect(response.status).toBe(200)
expect(response.body.name).toBe('John')
```
---
### ❌ Brittle Provider States
**Symptom:** Provider states hardcode IDs, dates, or specific data that changes
**Why bad:** Tests fail randomly, high maintenance
**Fix:** Use matchers, generate data in state setup
```javascript
// ❌ Bad - hardcoded state
.given('user 123 exists')
.uponReceiving('request for user 123')
.withRequest({ path: '/users/123' })
// ✅ Good - flexible state
.given('a user exists')
.uponReceiving('request for user')
.withRequest({ path: Matchers.regex('/users/\\d+', '/users/123') })
.willRespondWith({
body: {
id: Matchers.integer(123),
name: Matchers.string('John')
}
})
```
---
### ❌ No Contract Versioning
**Symptom:** Breaking changes deployed without consumer coordination
**Why bad:** Runtime failures, production incidents
**Fix:** Use can-i-deploy, tag contracts by environment
```bash
# ✅ Good - check before deploying
pact-broker can-i-deploy \
--pacticipant UserService \
--version 2.0.0 \
--to production
```
---
### ❌ Missing Can-I-Deploy
**Symptom:** Deploying without checking if all consumers compatible
**Why bad:** Deploy provider changes that break consumers
**Fix:** Run can-i-deploy in CI before deployment
## Pact Broker Workflow
**Core workflow:**
1. **Consumer:** Write contract test → Generate pact file
2. **Consumer CI:** Publish pact to broker with version tag
3. **Provider CI:** Fetch contracts → Verify → Publish results
4. **Provider CD:** Run can-i-deploy → Deploy if compatible
### Publishing Contracts
```bash
# Consumer publishes pact with version and branch
pact-broker publish pacts/ \
--consumer-app-version ${GIT_SHA} \
--branch ${GIT_BRANCH} \
--tag ${ENV}
```
### Verifying Contracts
```javascript
// Provider verifies against broker
const { Verifier } = require('@pact-foundation/pact')
new Verifier({
providerBaseUrl: 'http://localhost:8080',
pactBrokerUrl: process.env.PACT_BROKER_URL,
provider: 'UserService',
publishVerificationResult: true,
providerVersion: process.env.GIT_SHA,
consumerVersionSelectors: [
{ mainBranch: true }, // Latest from main
{ deployed: 'production' }, // Currently in production
{ deployed: 'staging' } // Currently in staging
]
}).verifyProvider()
```
### Can-I-Deploy Check
```yaml
# CI/CD pipeline (GitHub Actions example)
- name: Check if can deploy
run: |
pact-broker can-i-deploy \
--pacticipant UserService \
--version ${{ github.sha }} \
--to-environment production
```
**Rule:** Never deploy without can-i-deploy passing
## Breaking Change Taxonomy
| Change Type | Breaking? | Migration Strategy |
|-------------|-----------|-------------------|
| Add optional field | No | Deploy provider first |
| Add required field | Yes | Use expand/contract pattern |
| Remove field | Yes | Deprecate → verify no consumers use → remove |
| Change field type | Yes | Add new field → migrate consumers → remove old |
| Rename field | Yes | Add new → deprecate old → remove old |
| Change status code | Yes | Version API or expand responses |
### Expand/Contract Pattern
**For adding required field:**
**Expand (Week 1-2):**
```javascript
// Provider adds NEW field (optional), keeps OLD field
{
user_name: "John", // Old field (deprecated)
name: "John" // New field
}
```
**Migrate (Week 3-4):**
- Consumers update to use new field
- Update contracts
- Verify all consumers migrated
**Contract (Week 5):**
```javascript
// Provider removes old field
{
name: "John" // Only new field remains
}
```
## Provider State Patterns
**Purpose:** Set up test data before verification
**Pattern:** Use state handlers to create/clean up data
```javascript
// Provider state setup
const { Verifier } = require('@pact-foundation/pact')
new Verifier({
stateHandlers: {
'a user exists': async () => {
// Setup: Create test user
await db.users.create({
id: 123,
name: 'John Doe'
})
},
'no users exist': async () => {
// Setup: Clear users
await db.users.deleteAll()
}
},
afterEach: async () => {
// Cleanup after each verification
await db.users.deleteAll()
}
}).verifyProvider()
```
**Best practices:**
- States should be independent
- Clean up after each verification
- Use transactions for database tests
- Don't hardcode IDs (use matchers)
## Async/Event-Driven Messaging Contracts
**For Kafka, RabbitMQ, SNS/SQS:** Use Message Pact (different API than HTTP Pact)
### Consumer Message Contract
```javascript
const { MessageConsumerPact, MatchersV3 } = require('@pact-foundation/pact')
describe('User Event Consumer', () => {
const messagePact = new MessageConsumerPact({
consumer: 'NotificationService',
provider: 'UserService'
})
it('processes user created events', () => {
return messagePact
.expectsToReceive('user created event')
.withContent({
userId: MatchersV3.integer(123),
email: MatchersV3.string('user@example.com'),
eventType: 'USER_CREATED'
})
.withMetadata({
'content-type': 'application/json'
})
.verify((message) => {
processUserCreatedEvent(message.contents)
})
})
})
```
### Provider Message Verification
```javascript
// Provider verifies it can produce matching messages
const { MessageProviderPact } = require('@pact-foundation/pact')
describe('User Event Producer', () => {
it('publishes user created events matching contracts', () => {
return new MessageProviderPact({
messageProviders: {
'user created event': () => ({
contents: {
userId: 123,
email: 'test@example.com',
eventType: 'USER_CREATED'
},
metadata: {
'content-type': 'application/json'
}
})
}
}).verify()
})
})
```
### Key Differences from HTTP Contracts
- **No request/response:** Only message payload
- **Metadata:** Headers, content-type, message keys
- **Ordering:** Don't test message ordering in contracts (infrastructure concern)
- **Delivery:** Don't test delivery guarantees (wrong layer)
**Workflow:** Same as HTTP (publish pact → verify → can-i-deploy)
## CI/CD Integration Quick Reference
### GitHub Actions
```yaml
# Consumer publishes contracts
- name: Run Pact tests
run: npm test
- name: Publish pacts
run: |
npm run pact:publish
env:
PACT_BROKER_URL: ${{ secrets.PACT_BROKER_URL }}
PACT_BROKER_TOKEN: ${{ secrets.PACT_BROKER_TOKEN }}
# Provider verifies and checks deployment
- name: Verify contracts
run: npm run pact:verify
- name: Can I deploy?
run: |
pact-broker can-i-deploy \
--pacticipant UserService \
--version ${{ github.sha }} \
--to-environment production
```
### GitLab CI
```yaml
pact_test:
script:
- npm test
- npm run pact:publish
pact_verify:
script:
- npm run pact:verify
- pact-broker can-i-deploy --pacticipant UserService --version $CI_COMMIT_SHA --to-environment production
```
## Your First Contract Test
**Goal:** Prevent breaking changes between two services in one week
**Day 1-2: Consumer Side**
```javascript
// Install Pact
npm install --save-dev @pact-foundation/pact
// Consumer contract test (order-service)
const { PactV3, MatchersV3 } = require('@pact-foundation/pact')
const { getUserById } = require('./userClient')
describe('User API', () => {
const provider = new PactV3({
consumer: 'OrderService',
provider: 'UserService'
})
it('gets user by id', () => {
provider
.given('a user exists')
.uponReceiving('a request for user')
.withRequest({
method: 'GET',
path: '/users/123'
})
.willRespondWith({
status: 200,
body: {
id: MatchersV3.integer(123),
name: MatchersV3.string('John')
}
})
return provider.executeTest(async (mockServer) => {
const user = await getUserById(mockServer.url, 123)
expect(user.name).toBe('John')
})
})
})
```
**Day 3-4: Set Up Pact Broker**
```bash
# Docker Compose
docker-compose up -d
# Or use hosted Pactflow (SaaS)
# https://pactflow.io
```
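A minimal `docker-compose.yml` sketch for a local broker (image and environment variable names follow the pactfoundation/pact-broker documentation; credentials are placeholders, not production values):
```yaml
version: "3"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: pact
      POSTGRES_PASSWORD: pact
      POSTGRES_DB: pact_broker
  pact-broker:
    image: pactfoundation/pact-broker:latest
    ports:
      - "9292:9292"
    depends_on:
      - postgres
    environment:
      PACT_BROKER_DATABASE_URL: postgres://pact:pact@postgres/pact_broker
```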
**Day 5-6: Provider Side**
```javascript
// Provider verification (user-service)
const { Verifier } = require('@pact-foundation/pact')
const app = require('./app')
describe('Pact Verification', () => {
it('validates contracts from broker', () => {
return new Verifier({
provider: 'UserService',
providerBaseUrl: 'http://localhost:8080',
pactBrokerUrl: process.env.PACT_BROKER_URL,
publishVerificationResult: true,
providerVersion: '1.0.0',
stateHandlers: {
'a user exists': async () => {
await db.users.create({ id: 123, name: 'John' })
}
}
}).verifyProvider()
})
})
```
**Day 7: Add to CI**
```yaml
# Add can-i-deploy before deployment
- pact-broker can-i-deploy --pacticipant UserService --version $VERSION --to production
```
## Common Mistakes
### ❌ Testing Business Logic in Contracts
**Fix:** Contract tests verify integration only. Test business logic separately.
---
### ❌ Not Using Matchers
**Fix:** Use `Matchers.string()`, `Matchers.integer()` for flexible matching
---
### ❌ Skipping Can-I-Deploy
**Fix:** Always run can-i-deploy before deployment. Automate in CI.
---
### ❌ Hardcoding Test Data
**Fix:** Generate data in provider states, use matchers in contracts
## Quick Reference
**Tool Selection:**
- Polyglot/multiple teams: Pact
- Java Spring only: Spring Cloud Contract
- Public API: OpenAPI validation
**Contract Type:**
- Internal services: Consumer-driven (Pact)
- Public API: Provider-driven (OpenAPI)
- Both: Bi-directional
**Pact Broker Workflow:**
1. Consumer publishes pact
2. Provider verifies
3. Can-i-deploy checks compatibility
4. Deploy if compatible
**Breaking Changes:**
- Add optional field: Safe
- Add required field: Expand/contract pattern
- Remove/rename field: Deprecate → migrate → remove
**Provider States:**
- Set up test data
- Clean up after each test
- Use transactions for DB
- Don't hardcode IDs
**CI/CD:**
- Consumer: Test → publish pacts
- Provider: Verify → can-i-deploy → deploy
## Bottom Line
**Contract testing prevents API breaking changes by testing integration points independently. Use Pact for internal microservices, publish contracts to broker, run can-i-deploy before deployment.**
Test the contract (request/response), not the implementation. Use consumer-driven contracts for known consumers, schema-first for public APIs.


@@ -0,0 +1,429 @@
---
name: dependency-scanning
description: Use when integrating SCA tools (Snyk, Dependabot, OWASP Dependency-Check), automating vulnerability management, handling license compliance, setting up automated dependency updates, or managing security advisories - provides tool selection, PR automation workflows, and false positive management
---
# Dependency Scanning
## Overview
**Core principle:** Third-party dependencies introduce security vulnerabilities and license risks. Automate scanning to catch them early.
**Rule:** Block merges on critical/high vulnerabilities in direct dependencies. Monitor and plan fixes for transitive dependencies.
## Why Dependency Scanning Matters
**Security vulnerabilities:**
- 80% of codebases contain at least one vulnerable dependency
- Log4Shell (CVE-2021-44228) affected millions of applications
- Attackers actively scan GitHub for known vulnerabilities
**License compliance:**
- GPL dependencies in proprietary software = legal risk
- Some licenses require source code disclosure
- Incompatible license combinations
---
## Tool Selection
| Tool | Use Case | Cost | Best For |
|------|----------|------|----------|
| **Dependabot** | Automated PRs for updates | Free (GitHub) | GitHub projects, basic scanning |
| **Snyk** | Comprehensive security + license scanning | Free tier, paid plans | Production apps, detailed remediation |
| **OWASP Dependency-Check** | Security-focused, self-hosted | Free | Privacy-sensitive, custom workflows |
| **npm audit** | JavaScript quick scan | Free | Quick local checks |
| **pip-audit** | Python quick scan | Free | Quick local checks |
| **bundler-audit** | Ruby quick scan | Free | Quick local checks |
**Recommended setup:**
- **GitHub repos:** Dependabot (automated) + Snyk (security focus)
- **Self-hosted:** OWASP Dependency-Check
- **Quick local checks:** npm audit / pip-audit
---
## Dependabot Configuration
### Enable Dependabot (GitHub)
```yaml
# .github/dependabot.yml
version: 2
updates:
- package-ecosystem: "npm"
directory: "/"
schedule:
interval: "weekly"
day: "monday"
open-pull-requests-limit: 5
labels:
- "dependencies"
reviewers:
- "security-team"
- package-ecosystem: "pip"
directory: "/"
schedule:
interval: "weekly"
target-branch: "develop"
```
**What Dependabot does:**
- Scans dependencies weekly
- Creates PRs for vulnerabilities
- Updates to safe versions
- Provides CVE details
---
## Snyk Integration
### Installation
```bash
npm install -g snyk
snyk auth # Authenticate with Snyk account
```
---
### Scan Local Project
```bash
# Test for vulnerabilities
snyk test
# Monitor project (continuous scanning)
snyk monitor
```
---
### CI/CD Integration
```yaml
# .github/workflows/snyk.yml
name: Snyk Security Scan
on: [pull_request, push]
jobs:
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run Snyk
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
args: --severity-threshold=high # Fail on high+ severity
```
**Severity thresholds:**
- **Critical:** Block merge immediately
- **High:** Block merge, fix within 7 days
- **Medium:** Create issue, fix within 30 days
- **Low:** Monitor, fix opportunistically
---
## OWASP Dependency-Check
### Installation
```bash
# Download latest release
wget https://github.com/jeremylong/DependencyCheck/releases/download/v8.0.0/dependency-check-8.0.0-release.zip
unzip dependency-check-8.0.0-release.zip
```
---
### Run Scan
```bash
# Scan project
./dependency-check/bin/dependency-check.sh \
--scan ./src \
--format HTML \
--out ./reports \
--suppression ./dependency-check-suppressions.xml
```
---
### Suppression File (False Positives)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- dependency-check-suppressions.xml -->
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">
<suppress>
<notes>False positive - CVE applies to server mode only, we use client mode</notes>
<cve>CVE-2021-12345</cve>
</suppress>
</suppressions>
```
---
## License Compliance
### Checking Licenses (npm)
```bash
# List all licenses
npx license-checker
# Filter incompatible licenses
npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-3-Clause'
```
---
### Blocking Incompatible Licenses
```json
// package.json
{
"scripts": {
"license-check": "license-checker --onlyAllow 'MIT;Apache-2.0;BSD-3-Clause;ISC' --production"
}
}
```
```yaml
# CI: Fail if incompatible licenses detected
- name: Check licenses
run: npm run license-check
```
**Common license risks:**
- **GPL/AGPL:** Requires source code disclosure
- **SSPL:** Restrictive for SaaS
- **Proprietary:** May prohibit commercial use
---
## Automated Dependency Updates
### Auto-Merge Strategy
**Safe to auto-merge:**
- Patch versions (1.2.3 → 1.2.4)
- No breaking changes
- Passing all tests
```yaml
# .github/workflows/auto-merge-dependabot.yml
name: Auto-merge Dependabot PRs
on: pull_request
jobs:
auto-merge:
runs-on: ubuntu-latest
if: github.actor == 'dependabot[bot]'
steps:
      - name: Fetch Dependabot metadata
        id: metadata
        uses: dependabot/fetch-metadata@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Enable auto-merge for patch updates only
        if: steps.metadata.outputs.update-type == 'version-update:semver-patch'
        run: gh pr merge --auto --squash "$PR_URL"
env:
PR_URL: ${{ github.event.pull_request.html_url }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
---
## Vulnerability Remediation Workflow
### 1. Triage (Within 24 hours)
**For each vulnerability:**
- **Assess severity:** Critical → immediate, High → 7 days, Medium → 30 days
- **Check exploitability:** Is it reachable in our code?
- **Verify patch availability:** Is there a fixed version?
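To see whether and where a flagged package is actually pulled in, inspect the dependency path (package names below are examples):
```bash
# npm: show the path that pulls in the vulnerable package (direct vs transitive)
npm ls qs

# Python: reverse dependency tree for a flagged package
pip install pipdeptree
pipdeptree --reverse --packages urllib3
```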
---
### 2. Remediation Options
| Option | When to Use | Example |
|--------|-------------|---------|
| **Update dependency** | Patch available | `npm update lodash` |
| **Update lockfile only** | Transitive dependency | `npm audit fix` |
| **Replace dependency** | No patch, actively exploited | Replace `request` with `axios` |
| **Apply workaround** | No patch, low risk | Disable vulnerable feature |
| **Accept risk** | False positive, not exploitable | Document in suppression file |
---
### 3. Verification
```bash
# After fix, verify vulnerability is resolved
npm audit
snyk test
# Run full test suite
npm test
```
---
## Anti-Patterns Catalog
### ❌ Ignoring Transitive Dependencies
**Symptom:** "We don't use that library directly, so it's fine"
**Why bad:** Transitive dependencies are still in your app
```
Your App
└─ express@4.18.0
└─ body-parser@1.19.0
└─ qs@6.7.0 (vulnerable!)
```
**Fix:** Update parent dependency or override version
```json
// package.json - force safe version
{
"overrides": {
"qs": "^6.11.0"
}
}
```
---
### ❌ Auto-Merging All Updates
**Symptom:** Dependabot PRs merged without review
**Why bad:**
- Major versions can break functionality
- Updates may introduce new bugs
- No verification tests run
**Fix:** Auto-merge only patch versions, review major/minor
---
### ❌ Suppressing Without Investigation
**Symptom:** Marking all vulnerabilities as false positives
```xml
<!-- ❌ BAD: No justification -->
<suppress>
<cve>CVE-2021-12345</cve>
</suppress>
```
**Fix:** Document WHY it's suppressed
```xml
<!-- ✅ GOOD: Clear justification -->
<suppress>
<notes>
False positive: CVE applies to XML parsing feature.
We only use JSON parsing (verified in code review).
Tracking issue: #1234
</notes>
<cve>CVE-2021-12345</cve>
</suppress>
```
---
### ❌ No SLA for Fixes
**Symptom:** Vulnerabilities sit unfixed for months
**Fix:** Define SLAs by severity
**Example SLA:**
- **Critical:** Fix within 24 hours
- **High:** Fix within 7 days
- **Medium:** Fix within 30 days
- **Low:** Fix within 90 days or next release
---
## Monitoring & Alerting
### Slack Notifications
```yaml
# .github/workflows/security-alerts.yml
name: Security Alerts
on:
schedule:
- cron: '0 9 * * *' # Daily at 9 AM
jobs:
scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
      - name: Install Snyk CLI
        run: npm install -g snyk
      - name: Run Snyk
        id: snyk
        continue-on-error: true  # keep the job alive so the alert step can run
        run: snyk test --json > snyk-results.json
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
      - name: Send Slack alert
        if: steps.snyk.outcome == 'failure'
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "🚨 Security vulnerabilities detected!",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Critical vulnerabilities found in dependencies*\nView details: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
---
## Bottom Line
**Automate dependency scanning to catch vulnerabilities and license issues early. Block merges on critical issues, monitor and plan fixes for others.**
**Setup:**
- Enable Dependabot (automated PRs)
- Add Snyk or OWASP Dependency-Check (security scanning)
- Check licenses (license-checker)
- Define SLAs (Critical: 24h, High: 7d, Medium: 30d)
**Remediation:**
- Update dependencies to patched versions
- Override transitive dependencies if needed
- Document suppressions with justification
- Verify fixes with tests
**If you're not scanning dependencies, you're shipping known vulnerabilities. Automate it in CI/CD.**


@@ -0,0 +1,290 @@
---
name: e2e-testing-strategies
description: Use when designing E2E test architecture, choosing between Cypress/Playwright/Selenium, prioritizing which flows to test, fixing flaky E2E tests, or debugging slow E2E test suites - provides production-tested patterns and anti-patterns
---
# E2E Testing Strategies
## Overview
**Core principle:** E2E tests are expensive. Use them sparingly for critical multi-system flows. Everything else belongs lower in the test pyramid.
**Test pyramid target:** 5-10% E2E, 20-25% integration, 65-75% unit
**Scope:** This skill focuses on web application E2E testing (browser-based). For mobile app testing (iOS/Android), decision tree points to Appium, but patterns/anti-patterns here are web-specific. Mobile testing requires different strategies for device capabilities, native selectors, and app lifecycle.
## Framework Selection Decision Tree
Choose framework based on constraints:
| Your Constraint | Choose | Why |
|----------------|--------|-----|
| Need cross-browser (Chrome/Firefox/Safari) | **Playwright** | Native multi-browser, auto-wait, trace viewer |
| Team unfamiliar with testing | **Cypress** | Simpler API, better DX, larger community |
| Enterprise/W3C standard requirement | **WebdriverIO** | Full W3C WebDriver protocol |
| Headless Chrome only, fine-grained control | **Puppeteer** | Lower-level, faster for Chrome-only |
| Testing Electron apps | **Spectron** or **Playwright** | Native Electron support |
| Mobile apps (iOS/Android) | **Appium** | Mobile-specific protocol (Note: rest of this skill is web-focused) |
**For most web apps:** Playwright (modern, reliable) or Cypress (simpler DX)
## Flow Prioritization Matrix
When you have 50 flows but can only test 10 E2E:
| Score | Criteria | Weight |
|-------|----------|--------|
| +3 | Revenue impact (checkout, payment, subscription) | High |
| +3 | Multi-system integration (API + DB + email + payment) | High |
| +2 | Historical production failures (has broken before) | Medium |
| +2 | Complex state management (auth, sessions, caching) | Medium |
| +1 | User entry point (login, signup, search) | Medium |
| +1 | Regulatory/compliance requirement | Medium |
| -2 | Can be tested at integration level | Penalty |
| -3 | Mostly UI interaction, no backend | Penalty |
**Score flows 0-10, test top 10.** Everything else → integration/unit tests.
**Example:**
- "User checkout flow" = +3 revenue +3 multi-system +2 historical +2 state = **10** → E2E
- "User changes email preference" = +1 entry -2 integration level = **-1** → Integration test
## Anti-Patterns Catalog
### ❌ Pyramid Inversion
**Symptom:** 200 E2E tests, 50 integration tests, 100 unit tests
**Why bad:** E2E tests are slow (30min CI), brittle (UI changes break tests), hard to debug
**Fix:** Invert back - move 150 E2E tests down to integration/unit
---
### ❌ Testing Through the UI
**Symptom:** E2E test creates 10 users through signup form to test one admin feature
**Why bad:** Slow, couples unrelated features
**Fix:** Seed data via API/database, test only the admin feature flow
---
### ❌ Arbitrary Timeouts
**Symptom:** `wait(5000)` sprinkled throughout tests
**Why bad:** Flaky - sometimes too short, sometimes wastes time
**Fix:** Explicit waits for conditions
```javascript
// ❌ Bad
await page.click('button');
await page.waitForTimeout(5000);
// ✅ Good
await page.click('button');
await page.waitForSelector('.success-message');
```
---
### ❌ God Page Objects
**Symptom:** Single `PageObject` class with 50 methods for entire app
**Why bad:** Tight coupling, hard to maintain, unclear responsibilities
**Fix:** One page object per logical page/component
```javascript
// ❌ Bad: God object
class AppPage {
async login() {}
async createPost() {}
async deleteUser() {}
async exportReport() {}
// ... 50 more methods
}
// ✅ Good: Focused page objects
class AuthPage {
async login() {}
async logout() {}
}
class PostsPage {
async create() {}
async delete() {}
}
```
---
### ❌ Brittle Selectors
**Symptom:** `page.click('.btn-primary.mt-4.px-3')`
**Why bad:** Breaks when CSS changes
**Fix:** Use `data-testid` attributes
```javascript
// ❌ Bad
await page.click('.submit-button.btn.btn-primary');
// ✅ Good
await page.click('[data-testid="submit"]');
```
---
### ❌ Test Interdependence
**Symptom:** Test 5 fails if Test 3 doesn't run first
**Why bad:** Can't run tests in parallel, hard to debug
**Fix:** Each test sets up own state
```javascript
// ❌ Bad
test('create user', async () => {
// creates user "test@example.com"
});
test('login user', async () => {
// assumes user from previous test exists
});
// ✅ Good
test('login user', async ({ page }) => {
await createUserViaAPI('test@example.com'); // independent setup
await page.goto('/login');
// test login flow
});
```
## Flakiness Patterns Catalog
Common flake sources and fixes:
| Pattern | Symptom | Fix |
|---------|---------|-----|
| **Network Race** | "Element not found" intermittently | `await page.waitForLoadState('networkidle')` |
| **Animation Race** | "Element not clickable" | `await page.waitForSelector('.element', { state: 'visible' })` or disable animations |
| **Async State** | "Expected 'success' but got ''" | Wait for specific state, not timeout |
| **Test Data Pollution** | Test passes alone, fails in suite | Isolate data per test (unique IDs, cleanup) |
| **Browser Caching** | Different results first vs second run | Clear cache/cookies between tests |
| **Date/Time Sensitivity** | Test fails at midnight, passes during day | Mock system time in tests |
| **External Service** | Third-party API occasionally down | Mock external dependencies |
**Rule:** A test that fails intermittently, even less than 5% of the time, is flaky. Fix it before adding more tests.
## Page Object Anti-Patterns
### ❌ Business Logic in Page Objects
```javascript
// ❌ Bad
class CheckoutPage {
async calculateTotal(items) {
return items.reduce((sum, item) => sum + item.price, 0); // business logic
}
}
// ✅ Good
class CheckoutPage {
async getTotal() {
return await this.page.textContent('[data-testid="total"]'); // UI interaction only
}
}
```
### ❌ Assertions in Page Objects
```javascript
// ❌ Bad
class LoginPage {
async login(email, password) {
await this.page.fill('[data-testid="email"]', email);
await this.page.fill('[data-testid="password"]', password);
await this.page.click('[data-testid="submit"]');
expect(this.page.url()).toContain('/dashboard'); // assertion
}
}
// ✅ Good
class LoginPage {
async login(email, password) {
await this.page.fill('[data-testid="email"]', email);
await this.page.fill('[data-testid="password"]', password);
await this.page.click('[data-testid="submit"]');
}
async isOnDashboard() {
return this.page.url().includes('/dashboard');
}
}
// Test file handles assertions
test('login', async () => {
await loginPage.login('user@test.com', 'password');
expect(await loginPage.isOnDashboard()).toBe(true);
});
```
## Quick Reference
### When to Use E2E vs Integration vs Unit
| Scenario | Test Level | Reasoning |
|----------|-----------|-----------|
| Form validation logic | Unit | Pure function, no UI needed |
| API error handling | Integration | Test API contract, no browser |
| Multi-step checkout | E2E | Crosses systems, critical revenue |
| Button hover states | Visual regression | Not functional behavior |
| Login → dashboard redirect | E2E | Auth critical, multi-system |
| Database query performance | Integration | No UI, just DB |
| User can filter search results | E2E (1 test) + Integration (variations) | 1 E2E for happy path, rest integration |
### Test Data Strategies
| Approach | When to Use | Pros | Cons |
|----------|-------------|------|------|
| **API Seeding** | Most tests | Fast, consistent | Requires API access |
| **Database Seeding** | Integration tests | Complete control | Slow, requires DB access |
| **UI Creation** | Testing creation flow itself | Tests real user path | Slow, couples tests |
| **Mocking** | External services | Fast, reliable | Misses real integration issues |
| **Fixtures** | Consistent test data | Reusable, version-controlled | Stale if schema changes |
## Common Mistakes
### ❌ Running Full Suite on Every Commit
**Symptom:** 30-minute CI blocking every PR
**Fix:** Smoke tests (5-10 critical flows) on PR, full suite on merge/nightly
---
### ❌ Not Capturing Failure Artifacts
**Symptom:** "Test failed in CI but I can't reproduce"
**Fix:** Save video + trace on failure
```javascript
// playwright.config.js
use: {
video: 'retain-on-failure',
trace: 'retain-on-failure',
}
```
---
### ❌ Testing Implementation Details
**Symptom:** Tests assert internal component state
**Fix:** Test user-visible behavior only
---
### ❌ One Assert Per Test
**Symptom:** 50 E2E tests all navigate to same page, test one thing
**Fix:** Group related assertions in one flow test (but keep focused)
## Bottom Line
**E2E tests verify critical multi-system flows work for real users.**
If you can test it faster/more reliably at a lower level, do that instead.

View File

@@ -0,0 +1,493 @@
---
name: flaky-test-prevention
description: Use when debugging intermittent test failures, choosing between retries vs fixes, quarantining flaky tests, calculating flakiness rates, or preventing non-deterministic behavior - provides root cause diagnosis, anti-patterns, and systematic debugging
---
# Flaky Test Prevention
## Overview
**Core principle:** Fix root causes, don't mask symptoms.
**Rule:** Flaky tests indicate real problems - in test design, application code, or infrastructure.
## Flakiness Decision Tree
| Symptom | Root Cause Category | Diagnostic | Fix |
|---------|---------------------|------------|-----|
| Passes alone, fails in suite | Test Interdependence | Run tests in random order | Use test isolation (transactions, unique IDs) |
| Fails randomly ~10% | Timing/Race Condition | Add logging, run 100x | Replace sleeps with explicit waits |
| Fails only in CI, not locally | Environment Difference | Compare CI vs local env | Match environments, use containers |
| Fails at specific times | Time Dependency | Check for date/time usage | Mock system time |
| Fails under load | Resource Contention | Run in parallel locally | Add resource isolation, increase limits |
| Different results each run | Non-Deterministic Code | Check for randomness | Seed random generators, use fixtures |
**First step:** Identify symptom, trace to root cause category.
## Anti-Patterns Catalog
### ❌ Sleepy Assertion
**Symptom:** Using fixed `sleep()` or `wait()` instead of condition-based waits
**Why bad:** Wastes time on fast runs, still fails on slow runs, brittle
**Fix:** Explicit waits for conditions
```python
# ❌ Bad
time.sleep(5) # Hope 5 seconds is enough
assert element.text == "Loaded"
# ✅ Good
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.ID, "status").text == "Loaded"
)
assert driver.find_element(By.ID, "status").text == "Loaded"
```
---
### ❌ Test Interdependence
**Symptom:** Tests pass when run in specific order, fail when shuffled
**Why bad:** Hidden dependencies, can't run in parallel, breaks test isolation
**Fix:** Each test creates its own data, no shared state
```python
# ❌ Bad
def test_create_user():
user = create_user("test_user") # Shared ID
def test_update_user():
update_user("test_user") # Depends on test_create_user
# ✅ Good
def test_create_user():
user_id = f"user_{uuid4()}"
user = create_user(user_id)
def test_update_user():
user_id = f"user_{uuid4()}"
user = create_user(user_id) # Independent
update_user(user_id)
```
---
### ❌ Hidden Dependencies
**Symptom:** Tests fail due to external state (network, database, file system) beyond test control
**Why bad:** Unpredictable failures, environment-specific issues
**Fix:** Mock external dependencies
```python
# ❌ Bad
def test_weather_api():
response = requests.get("https://api.weather.com/...")
assert response.json()["temp"] > 0 # Fails if API is down
# ✅ Good
@mock.patch('requests.get')
def test_weather_api(mock_get):
mock_get.return_value.json.return_value = {"temp": 75}
response = get_weather("Seattle")
assert response["temp"] == 75
```
---
### ❌ Time Bomb
**Symptom:** Tests that depend on current date/time and fail at specific moments (midnight, month boundaries, DST)
**Why bad:** Fails unpredictably based on when tests run
**Fix:** Mock system time
```python
# ❌ Bad
def test_expiration():
created_at = datetime.now()
assert is_expired(created_at) == False # Fails at midnight
# ✅ Good
@freeze_time("2025-11-15 12:00:00")
def test_expiration():
created_at = datetime(2025, 11, 15, 12, 0, 0)
assert is_expired(created_at) == False
```
---
### ❌ Timeout Inflation
**Symptom:** Continuously increasing timeouts to "fix" flaky tests (5s → 10s → 30s)
**Why bad:** Masks root cause, slows test suite, doesn't guarantee reliability
**Fix:** Investigate why operation is slow, use explicit waits
```python
# ❌ Bad
await page.wait_for_timeout(30000)  # Increased from 5s hoping it helps
# ✅ Good
await page.wait_for_selector(".data-loaded", timeout=10000)
await page.wait_for_load_state("networkidle")
```
## Detection Strategies
### Proactive Identification
**Run tests multiple times (statistical detection):**
```bash
# pytest with repeat plugin
pip install pytest-repeat
pytest --count=50 test_flaky.py
# Track pass rate
# 50/50 passes = 100% reliable
# 45/50 passes = 90% pass rate (10% flake rate - investigate immediately)
# <99% pass rate = quarantine
```
**CI Integration (automatic tracking):**
```yaml
# GitHub Actions example
- name: Run tests with flakiness detection
run: |
pytest --count=3 --junit-xml=results.xml
python scripts/calculate_flakiness.py results.xml
```
**Flakiness metrics to track:**
- Pass rate per test (target: >99%)
- Mean Time Between Failures (MTBF)
- Failure clustering (same test failing together)
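A minimal sketch of what the `calculate_flakiness.py` script referenced above could look like, assuming a standard JUnit XML report as input (the script itself is illustrative, not a published tool):
```python
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict

PASS_RATE_TARGET = 0.99  # quarantine candidates fall below this

def pass_rates(junit_xml_path):
    """Return {test_name: (passed, total)} aggregated over all <testcase> entries."""
    counts = defaultdict(lambda: [0, 0])
    for case in ET.parse(junit_xml_path).getroot().iter("testcase"):
        name = f'{case.get("classname")}::{case.get("name")}'
        failed = case.find("failure") is not None or case.find("error") is not None
        counts[name][1] += 1
        if not failed:
            counts[name][0] += 1
    return counts

if __name__ == "__main__":
    for name, (passed, total) in sorted(pass_rates(sys.argv[1]).items()):
        rate = passed / total
        if rate < PASS_RATE_TARGET:
            print(f"{rate:.0%} pass rate ({passed}/{total}): {name}")
```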
### Systematic Debugging
**When a test fails intermittently:**
1. **Reproduce consistently** - Run 100x to establish failure rate
2. **Isolate** - Run alone, with subset, with full suite (find interdependencies)
3. **Add logging** - Capture state before assertion, screenshot on failure
4. **Bisect** - If fails in suite, binary search which other test causes it
5. **Environment audit** - Compare CI vs local (env vars, resources, timing)
## Flakiness Metrics Guide
**Calculating flake rate:**
```python
# Flakiness formula
flake_rate = (failed_runs / total_runs) * 100
# Example
# Test run 100 times: 7 failures
# Flake rate = 7/100 = 7%
```
**Thresholds:**
| Flake Rate | Action | Priority |
|------------|--------|----------|
| 0% (100% pass) | Reliable | Monitor |
| 0.1-1% | Investigate | Low |
| 1-5% | Quarantine + Fix | Medium |
| 5-10% | Quarantine + Fix Urgently | High |
| >10% | Disable immediately | Critical |
**Target:** All tests should maintain >99% pass rate (< 1% flake rate)
## Quarantine Workflow
**Purpose:** Keep CI green while fixing flaky tests systematically
**Process:**
1. **Detect** - Test fails >1% of runs
2. **Quarantine** - Mark with `@pytest.mark.quarantine`, exclude from CI
3. **Track** - Create issue with flake rate, failure logs, reproduction steps
4. **Fix** - Assign owner, set SLA (e.g., 2 weeks to fix or delete)
5. **Validate** - Run fixed test 100x, must achieve >99% pass rate
6. **Re-Enable** - Remove quarantine mark, monitor for 1 week
**Marking quarantined tests:**
```python
@pytest.mark.quarantine(reason="Flaky due to timing issue #1234")
@pytest.mark.skip("Quarantined")
def test_flaky_feature():
pass
```
**CI configuration:**
```bash
# Run all tests except quarantined
pytest -m "not quarantine"
```
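`quarantine` is a custom marker, so register it or pytest (especially with `--strict-markers`) will reject it. One option, sketched here in `conftest.py`:

```python
# conftest.py - register the custom "quarantine" marker so -m "not quarantine" works cleanly
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine(reason): flaky test excluded from CI while a fix is in progress",
    )
```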
**SLA:** Quarantined tests must be fixed within 2 weeks or deleted. No test stays quarantined indefinitely.
## Tool Ecosystem Quick Reference
| Tool | Purpose | When to Use |
|------|---------|-------------|
| **pytest-repeat** | Run test N times | Statistical detection |
| **pytest-xdist** | Parallel execution | Expose race conditions |
| **pytest-rerunfailures** | Auto-retry on failure | Temporary mitigation during fix |
| **pytest-randomly** | Randomize test order | Detect test interdependence |
| **freezegun** | Mock system time | Fix time bombs |
| **pytest-timeout** | Prevent hanging tests | Catch infinite loops |
**Installation:**
```bash
pip install pytest-repeat pytest-xdist pytest-rerunfailures pytest-randomly freezegun pytest-timeout
```
**Usage examples:**
```bash
# Detect flakiness (run 50x)
pytest --count=50 test_suite.py
# Detect interdependence (random order)
pytest --randomly-seed=12345 test_suite.py
# Expose race conditions (parallel)
pytest -n 4 test_suite.py
# Temporary mitigation (reruns, not a fix!)
pytest --reruns 2 --reruns-delay 1 test_suite.py
```
## Prevention Checklist
**Use during test authoring to prevent flakiness:**
- [ ] No fixed `time.sleep()` - use explicit waits for conditions
- [ ] Each test creates its own data (UUID-based IDs)
- [ ] No shared global state between tests
- [ ] External dependencies mocked (APIs, network, databases)
- [ ] Time/date frozen with `@freeze_time` if time-dependent
- [ ] Random values seeded (`random.seed(42)`; see the fixture sketch after this list)
- [ ] Tests pass when run in any order (pytest-randomly: `pytest --randomly-seed=12345`)
- [ ] Tests pass when run in parallel (`pytest -n 4`)
- [ ] Tests pass 100/100 times (`pytest --count=100`)
- [ ] Teardown cleans up all resources (files, database, cache)
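One way to cover the random-seeding item is an autouse fixture (a sketch; the seed value is arbitrary):

```python
# conftest.py - reseed Python's RNG before every test so "random" data is reproducible
import random

import pytest

@pytest.fixture(autouse=True)
def _seed_random():
    random.seed(42)
```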
## Common Fixes Quick Reference
| Problem | Fix Pattern | Example |
|---------|-------------|---------|
| **Timing issues** | Explicit waits | `WebDriverWait(driver, 10).until(condition)` |
| **Test interdependence** | Unique IDs per test | `user_id = f"test_{uuid4()}"` |
| **External dependencies** | Mock/stub | `@mock.patch('requests.get')` |
| **Time dependency** | Freeze time | `@freeze_time("2025-11-15")` |
| **Random behavior** | Seed randomness | `random.seed(42)` |
| **Shared state** | Test isolation | Transactions, teardown fixtures |
| **Resource contention** | Unique resources | Separate temp dirs, DB namespaces |
## Your First Flaky Test Fix
**Systematic approach for first fix:**
**Step 1: Reproduce (Day 1)**
```bash
# Run test 100 times, capture failures
pytest --count=100 --verbose test_flaky.py | tee output.log
```
**Step 2: Categorize (Day 1)**
Check output.log:
- Same failure message? → Likely timing/race condition
- Different failures? → Likely test interdependence
- Only fails in CI? → Environment difference
**Step 3: Fix Based on Category (Day 2)**
**If timing issue:**
```python
# Before
time.sleep(2)
assert element.text == "Loaded"
# After
wait.until(lambda driver: element.text == "Loaded")  # WebDriverWait passes the driver to the callable
```
**If interdependence:**
```python
# Before
user = User.objects.get(id=1) # Assumes user exists
# After
user = create_test_user(id=f"test_{uuid4()}") # Creates own data
```
**Step 4: Validate (Day 2)**
```bash
# Must pass 100/100 times
pytest --count=100 test_flaky.py
# Expected: 100 passed
```
**Step 5: Monitor (Week 1)**
Track in CI - test should maintain >99% pass rate for 1 week before considering it fixed.
## CI-Only Flakiness (Can't Reproduce Locally)
**Symptom:** Test fails intermittently in CI but passes 100% locally
**Root cause:** Environment differences between CI and local (resources, parallelization, timing)
### Systematic CI Debugging
**Step 1: Environment Fingerprinting**
Capture exact environment in both CI and locally:
```python
# Add to conftest.py
import os, sys, platform, tempfile
def pytest_configure(config):
print(f"Python: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"CPU count: {os.cpu_count()}")
print(f"TZ: {os.environ.get('TZ', 'not set')}")
print(f"Temp dir: {tempfile.gettempdir()}")
print(f"Parallel: {os.environ.get('PYTEST_XDIST_WORKER', 'not parallel')}")
```
Run in both environments, compare all outputs.
**Step 2: Increase CI Observation Window**
For low-probability failures (<5%), run more iterations:
```yaml
# GitHub Actions example
- name: Run test 200x to catch 1% flake
run: pytest --count=200 --verbose --log-cli-level=DEBUG test.py
- name: Upload failure artifacts
if: failure()
uses: actions/upload-artifact@v3
with:
name: failure-logs
path: |
*.log
screenshots/
```
**Step 3: Check CI-Specific Factors**
| Factor | Diagnostic | Fix |
|--------|------------|-----|
| **Parallelization** | Run `pytest -n 4` locally | Add test isolation (unique IDs, transactions) |
| **Resource limits** | Compare CI RAM/CPU to local | Mock expensive operations, add retries |
| **Cold starts** | First run vs warm runs | Check caching assumptions |
| **Disk I/O speed** | CI may use slower disks | Mock file operations |
| **Network latency** | CI network may be slower/different | Mock external calls |
**Step 4: Replicate CI Environment Locally**
Use exact CI container:
```bash
# GitHub Actions uses Ubuntu 22.04
docker run -it ubuntu:22.04 bash
# Install dependencies
apt-get update && apt-get install -y python3 python3-pip && pip3 install pytest pytest-repeat
# Run test in container
pytest --count=500 test.py
```
**Step 5: Enable CI Debug Mode**
```yaml
# GitHub Actions - Interactive debugging
- name: Setup tmate session (on failure)
if: failure()
uses: mxschmitt/action-tmate@v3
```
### Quick CI Debugging Checklist
When test fails only in CI:
- [ ] Capture environment fingerprint in both CI and local
- [ ] Run test with parallelization locally (`pytest -n auto`)
- [ ] Check for resource contention (CPU, memory, disk)
- [ ] Compare timezone settings (`TZ` env var)
- [ ] Upload CI artifacts (logs, screenshots) on failure
- [ ] Replicate CI environment with Docker
- [ ] Check for cold start issues (first vs subsequent runs)
## Common Mistakes
### ❌ Using Retries as Permanent Solution
**Fix:** Retries (@pytest.mark.flaky or --reruns) are temporary mitigation during investigation, not fixes
---
### ❌ No Flakiness Tracking
**Fix:** Track pass rates in CI, set up alerts for tests dropping below 99%
---
### ❌ Fixing Flaky Tests by Making Them Slower
**Fix:** Diagnose root cause - don't just add more wait time
---
### ❌ Ignoring Flaky Tests
**Fix:** Quarantine workflow - either fix or delete, never ignore indefinitely
## Quick Reference
**Flakiness Thresholds:**
- <1% flake rate: Monitor
- 1-5%: Quarantine + fix (medium priority)
- >5%: Disable + fix urgently (high priority)
**Root Cause Categories:**
1. Timing/race conditions → Explicit waits
2. Test interdependence → Unique IDs, test isolation
3. External dependencies → Mocking
4. Time bombs → Freeze time
5. Resource contention → Unique resources
**Detection Tools:**
- pytest-repeat (statistical detection)
- pytest-randomly (interdependence)
- pytest-xdist (race conditions)
**Quarantine Process:**
1. Detect (>1% flake rate)
2. Quarantine (mark, exclude from CI)
3. Track (create issue)
4. Fix (assign owner, 2-week SLA)
5. Validate (100/100 passes)
6. Re-enable (monitor 1 week)
## Bottom Line
**Flaky tests are fixable - find the root cause, don't mask with retries.**
Use detection tools to find flaky tests early. Categorize by symptom, diagnose root cause, apply pattern-based fix. Quarantine if needed, but always with SLA to fix or delete.

View File

@@ -0,0 +1,445 @@
---
name: fuzz-testing
description: Use when testing input validation, discovering edge cases, finding security vulnerabilities, testing parsers/APIs with random inputs, or integrating fuzzing tools (AFL, libFuzzer, Atheris) - provides fuzzing strategies, tool selection, and crash triage workflows
---
# Fuzz Testing
## Overview
**Core principle:** Fuzz testing feeds random/malformed inputs to find crashes, hangs, and security vulnerabilities that manual tests miss.
**Rule:** Fuzzing finds bugs you didn't know to test for. Use it for security-critical code (parsers, validators, APIs).
## Fuzz Testing vs Other Testing
| Test Type | Input | Goal |
|-----------|-------|------|
| **Unit Testing** | Known valid/invalid inputs | Verify expected behavior |
| **Property-Based Testing** | Generated valid inputs | Verify invariants hold |
| **Fuzz Testing** | Random/malformed inputs | Find crashes, hangs, memory issues |
**Fuzzing finds:** Buffer overflows, null pointer dereferences, infinite loops, unhandled exceptions
**Fuzzing does NOT find:** Logic bugs, performance issues
---
## When to Use Fuzz Testing
**Good candidates:**
- Input parsers (JSON, XML, CSV, binary formats)
- Network protocol handlers
- Image/video codecs
- Cryptographic functions
- User input validators (file uploads, form data)
- APIs accepting untrusted data
**Poor candidates:**
- Business logic (use property-based testing)
- UI interactions (use E2E tests)
- Database queries (use integration tests)
---
## Tool Selection
| Tool | Language | Type | When to Use |
|------|----------|------|-------------|
| **Atheris** | Python | Coverage-guided | Python applications, libraries |
| **AFL (American Fuzzy Lop)** | C/C++ | Coverage-guided | Native code, high performance |
| **libFuzzer** | C/C++/Rust | Coverage-guided | Integrated with LLVM/Clang |
| **Jazzer** | Java/JVM | Coverage-guided | Java applications |
| **go-fuzz** | Go | Coverage-guided | Go applications |
**Coverage-guided:** Tracks which code paths are executed, generates inputs to explore new paths
---
## Basic Fuzzing Example (Python + Atheris)
### Installation
```bash
pip install atheris
```
---
### Simple Fuzz Test
```python
import atheris
import sys
def parse_email(data):
    """Function to fuzz - contains a bug we didn't know about."""
    if "@" not in data:
        raise ValueError("Invalid email")
    local, domain = data.split("@", 1)
    if "." not in domain:
        raise ValueError("Invalid domain")
    # BUG: assumes the local part is non-empty.
    # "@example.com" → local == "" → local[0] raises IndexError,
    # which the harness below does NOT treat as expected → crash found
    if not local[0].isalnum():
        raise ValueError("Local part must start with a letter or digit")
    return (local, domain)
@atheris.instrument_func
def TestOneInput(data):
"""Fuzz harness - called repeatedly with random inputs."""
try:
parse_email(data.decode('utf-8', errors='ignore'))
except (ValueError, UnicodeDecodeError):
# Expected exceptions - not crashes
pass
# Any other exception = crash found!
atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```
**Run:**
```bash
python fuzz_email.py
```
**Output:**
```
INFO: Seed: 1234567890
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
#1: NEW coverage: 10 exec/s: 1000
#100: NEW coverage: 15 exec/s: 5000
CRASH: input was b'@example.com' (uncaught IndexError)
```
---
## Advanced Fuzzing Patterns
### Structured Fuzzing (JSON)
**Problem:** Random bytes rarely form valid JSON
```python
import atheris
import json
@atheris.instrument_func
def TestOneInput(data):
try:
# Parse as JSON
obj = json.loads(data.decode('utf-8', errors='ignore'))
# Fuzz your JSON handler
process_user_data(obj)
except (json.JSONDecodeError, ValueError, KeyError):
pass # Expected for invalid JSON
def process_user_data(data):
"""Crashes on: {"name": "", "age": -1}"""
if len(data["name"]) == 0:
raise ValueError("Name cannot be empty")
if data["age"] < 0:
raise ValueError("Age cannot be negative")
```
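Atheris also ships a `FuzzedDataProvider` that turns the raw bytes into typed values, which often reaches deeper into structured handlers than decoding random bytes; a minimal sketch reusing `process_user_data` from above:

```python
import sys
import atheris

@atheris.instrument_func
def TestOneInput(data):
    fdp = atheris.FuzzedDataProvider(data)
    payload = {
        "name": fdp.ConsumeUnicodeNoSurrogates(20),   # up to 20 unicode chars
        "age": fdp.ConsumeIntInRange(-1000, 1000),
    }
    try:
        process_user_data(payload)   # same handler as the JSON example above
    except ValueError:
        pass                         # expected validation error, not a crash

atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```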
---
### Fuzzing with Corpus (Seed Inputs)
**Corpus:** Collection of valid inputs to start from
```python
import atheris
import sys
import os
# Seed corpus: Valid examples
CORPUS_DIR = "./corpus"
os.makedirs(CORPUS_DIR, exist_ok=True)
# Create seed files
with open(f"{CORPUS_DIR}/valid1.txt", "wb") as f:
f.write(b"user@example.com")
with open(f"{CORPUS_DIR}/valid2.txt", "wb") as f:
f.write(b"alice+tag@subdomain.example.org")
@atheris.instrument_func
def TestOneInput(data):
try:
parse_email(data.decode('utf-8'))
except ValueError:
pass
# Pass the corpus directory to libFuzzer as a positional argument:
#   python fuzz_email.py ./corpus
atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```
**Benefits:** Faster convergence to interesting inputs
---
## Crash Triage Workflow
### 1. Reproduce Crash
```bash
# Atheris outputs crash input
CRASH: input was b'@example.com'
# Save to file
printf '@example.com' > crash.txt
```
---
### 2. Minimize Input
**Find smallest input that triggers crash:**
```bash
# Original: "@example.com" (12 bytes)
# Minimized: "@." (2 bytes)
# Re-run the single input to confirm, then minimize (libFuzzer's -minimize_crash=1)
python fuzz_email.py crash.txt
python fuzz_email.py -minimize_crash=1 crash.txt
```
---
### 3. Root Cause Analysis
```python
def parse_email(data):
    # Crash: data = "@."
    local, domain = data.split("@", 1)
    # local = "", domain = "."
    if "." not in domain:
        raise ValueError("Invalid domain")  # not raised: "." is in "."
    if not local[0].isalnum():
        # local = "" → local[0] → IndexError: string index out of range
        raise ValueError("Local part must start with a letter or digit")

    # FIX: Reject an empty local part before indexing into it
    if not local:
        raise ValueError("Local part cannot be empty")
```
---
### 4. Write Regression Test
```python
def test_email_empty_local_part():
    """Regression test for fuzz-found bug."""
    with pytest.raises(ValueError, match="cannot be empty"):
        parse_email("@example.com")
```
---
## Integration with CI/CD
### Continuous Fuzzing (GitHub Actions)
```yaml
# .github/workflows/fuzz.yml
name: Fuzz Testing
on:
schedule:
- cron: '0 2 * * *' # Nightly at 2 AM
workflow_dispatch:
jobs:
fuzz:
runs-on: ubuntu-latest
timeout-minutes: 60 # Run for 1 hour
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: pip install atheris
- name: Run fuzzing
run: |
timeout 3600 python fuzz_email.py || true
- name: Upload crashes
if: failure()
uses: actions/upload-artifact@v3
with:
name: fuzz-crashes
path: crash-*
```
**Why nightly:** Fuzzing is CPU-intensive, not suitable for every PR
---
## AFL (C/C++) Example
### Installation
```bash
# Ubuntu/Debian
sudo apt-get install afl++
# macOS
brew install aflplusplus
```
---
### Fuzz Target
```c
// fuzz_target.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void parse_command(const char *input) {
char buffer[64];
// BUG: Buffer overflow if input > 64 bytes!
strcpy(buffer, input);
if (strcmp(buffer, "exit") == 0) {
exit(0);
}
}
int main(int argc, char **argv) {
if (argc < 2) return 1;
FILE *f = fopen(argv[1], "rb");
if (!f) return 1;
char buffer[1024];
    size_t len = fread(buffer, 1, sizeof(buffer) - 1, f);  // leave room for the terminator
fclose(f);
buffer[len] = '\0';
parse_command(buffer);
return 0;
}
```
---
### Compile and Run
```bash
# Compile with AFL instrumentation
afl-cc fuzz_target.c -o fuzz_target
# Create corpus directory
mkdir -p corpus
echo "exit" > corpus/input1.txt
# Run fuzzer
afl-fuzz -i corpus -o findings -- ./fuzz_target @@
```
**Output:**
```
american fuzzy lop 4.00a
path : findings/queue
crashes : 1
hangs : 0
execs done : 1000000
```
**Crashes found in:** `findings/crashes/`
---
## Anti-Patterns Catalog
### ❌ Fuzzing Without Sanitizers
**Symptom:** Memory bugs don't crash, just corrupt silently
**Fix:** Compile with AddressSanitizer (ASan)
```bash
# C/C++: Compile with ASan
AFL_USE_ASAN=1 afl-cc fuzz_target.c -o fuzz_target
# Python: ASan mainly matters when fuzzing native extensions - run under an
# ASan-instrumented interpreter or preload libasan (e.g. via LD_PRELOAD)
```
**What ASan catches:** Buffer overflows, use-after-free, memory leaks
---
### ❌ Ignoring Hangs
**Symptom:** Fuzzer reports hangs, not investigated
**What hangs mean:** Infinite loops, algorithmic complexity attacks
**Fix:** Investigate and add timeout checks
```python
import signal
def timeout_handler(signum, frame):
raise TimeoutError("Operation timed out")
@atheris.instrument_func
def TestOneInput(data):
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1) # 1-second timeout
try:
parse_data(data.decode('utf-8'))
except (ValueError, TimeoutError):
pass
finally:
signal.alarm(0)
```
---
### ❌ No Regression Tests
**Symptom:** Same bugs found repeatedly
**Fix:** Add regression test for every crash
```python
# After fuzzing finds a crash on input "@." (empty local part)
def test_regression_empty_local_part():
    with pytest.raises(ValueError):
        parse_email("@.")
```
---
## Bottom Line
**Fuzz testing finds crashes and security vulnerabilities by feeding random/malformed inputs. Use it for security-critical code (parsers, validators, APIs).**
**Setup:**
- Use Atheris (Python), AFL (C/C++), or language-specific fuzzer
- Start with corpus (valid examples)
- Run nightly in CI (1-24 hours)
**Workflow:**
1. Fuzzer finds crash
2. Minimize crashing input
3. Root cause analysis
4. Fix bug
5. Add regression test
**If your code accepts untrusted input (files, network data, user input), you should be fuzzing it. Fuzzing finds bugs that manual testing misses.**

View File

@@ -0,0 +1,478 @@
---
name: integration-testing-patterns
description: Use when testing component integration, database testing, external service integration, test containers, testing message queues, microservices testing, or designing integration test suites - provides boundary testing patterns and anti-patterns between unit and E2E tests
---
# Integration Testing Patterns
## Overview
**Core principle:** Integration tests verify that multiple components work together correctly, testing at system boundaries.
**Rule:** Integration tests sit between unit tests (isolated) and E2E tests (full system). Test the integration points, not full user workflows.
## Integration Testing vs Unit vs E2E
| Aspect | Unit Test | Integration Test | E2E Test |
|--------|-----------|------------------|----------|
| **Scope** | Single function/class | 2-3 components + boundaries | Full system |
| **Speed** | Fastest (<1ms) | Medium (10-500ms) | Slowest (1-10s) |
| **Dependencies** | All mocked | Real DB/services | Everything real |
| **When** | Every commit | Every PR | Before release |
| **Coverage** | Business logic | Integration points | Critical workflows |
**Test Pyramid:**
- **70% Unit:** Pure logic, no I/O
- **20% Integration:** Database, APIs, message queues
- **10% E2E:** Browser tests, full workflows
---
## What to Integration Test
### 1. Database Integration
**Test: Repository/DAO layer with real database**
```python
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
@pytest.fixture(scope="function")
def db_session():
"""Each test gets fresh DB with rollback."""
engine = create_engine("postgresql://localhost/test_db")
Session = sessionmaker(bind=engine)
session = Session()
yield session
session.rollback() # Undo all changes
session.close()
def test_user_repository_create(db_session):
"""Integration test: Repository + Database."""
repo = UserRepository(db_session)
user = repo.create(email="alice@example.com", name="Alice")
assert user.id is not None
assert repo.get_by_email("alice@example.com").id == user.id
```
**Why integration test:**
- Verifies SQL queries work
- Catches FK constraint violations
- Tests database-specific features (JSON columns, full-text search)
**NOT unit test because:** Uses real database
**NOT E2E test because:** Doesn't test full user workflow
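`UserRepository` is assumed rather than defined here; a minimal sketch of the kind of repository this test exercises (`User` is your ORM model):

```python
# Hypothetical repository under test (sketch; the real one lives in your codebase)
class UserRepository:
    def __init__(self, session):
        self.session = session

    def create(self, email, name=None):
        user = User(email=email, name=name)
        self.session.add(user)
        self.session.flush()   # assigns user.id without committing
        return user

    def get_by_email(self, email):
        return self.session.query(User).filter_by(email=email).first()
```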
---
### 2. External API Integration
**Test: Service layer calling third-party API**
```python
import pytest
import responses
@responses.activate
def test_payment_service_integration():
"""Integration test: PaymentService + Stripe API (mocked)."""
# Mock Stripe API response
responses.add(
responses.POST,
"https://api.stripe.com/v1/charges",
json={"id": "ch_123", "status": "succeeded"},
status=200
)
service = PaymentService(api_key="test_key")
result = service.charge(amount=1000, token="tok_visa")
assert result.status == "succeeded"
assert result.charge_id == "ch_123"
```
**Why integration test:**
- Tests HTTP client configuration
- Validates request/response parsing
- Verifies error handling
**When to use real API:**
- Separate integration test suite (nightly)
- Contract tests (see contract-testing skill)
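`PaymentService` is likewise assumed; a minimal `requests`-based sketch of the client layer the mocked test exercises (endpoint and fields are illustrative, not Stripe's exact API):

```python
import requests

class ChargeResult:
    def __init__(self, charge_id, status):
        self.charge_id = charge_id
        self.status = status

class PaymentService:
    def __init__(self, api_key):
        self.api_key = api_key

    def charge(self, amount, token):
        response = requests.post(
            "https://api.stripe.com/v1/charges",
            auth=(self.api_key, ""),   # illustrative: API key as basic-auth username
            data={"amount": amount, "currency": "usd", "source": token},
            timeout=10,
        )
        response.raise_for_status()
        body = response.json()
        return ChargeResult(charge_id=body["id"], status=body["status"])
```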
---
### 3. Message Queue Integration
**Test: Producer/Consumer with real queue**
```python
import pytest
from kombu import Connection
@pytest.fixture
def rabbitmq_connection():
"""Real RabbitMQ connection for integration tests."""
conn = Connection("amqp://localhost")
yield conn
conn.release()
def test_order_queue_integration(rabbitmq_connection):
"""Integration test: OrderService + RabbitMQ."""
publisher = OrderPublisher(rabbitmq_connection)
consumer = OrderConsumer(rabbitmq_connection)
# Publish message
publisher.publish({"order_id": 123, "status": "pending"})
# Consume message
message = consumer.get(timeout=5)
assert message["order_id"] == 123
assert message["status"] == "pending"
```
**Why integration test:**
- Verifies serialization/deserialization
- Tests queue configuration (exchanges, routing keys)
- Validates message durability
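`OrderPublisher`/`OrderConsumer` are assumed; a minimal sketch built on kombu's `SimpleQueue`:

```python
# Hypothetical publisher/consumer pair (sketch) built on kombu's SimpleQueue
class OrderPublisher:
    def __init__(self, connection, queue_name="orders"):
        self.queue = connection.SimpleQueue(queue_name)

    def publish(self, payload):
        self.queue.put(payload)   # kombu serializes to JSON by default

class OrderConsumer:
    def __init__(self, connection, queue_name="orders"):
        self.queue = connection.SimpleQueue(queue_name)

    def get(self, timeout=5):
        message = self.queue.get(block=True, timeout=timeout)
        message.ack()             # remove from the queue
        return message.payload
```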
---
### 4. Microservices Integration
**Test: Service A → Service B communication**
```python
import pytest
@pytest.fixture
def mock_user_service():
"""Mock User Service for integration tests."""
with responses.RequestsMock() as rsps:
rsps.add(
responses.GET,
"http://user-service/users/123",
json={"id": 123, "name": "Alice"},
status=200
)
yield rsps
def test_order_service_integration(mock_user_service):
"""Integration test: OrderService + UserService."""
order_service = OrderService(user_service_url="http://user-service")
order = order_service.create_order(user_id=123, items=[...])
assert order.user_name == "Alice"
```
**For real service integration:** Use contract tests (see contract-testing skill)
---
## Test Containers Pattern
**Use Docker containers for integration tests.**
```python
import pytest
from sqlalchemy import create_engine
from testcontainers.postgres import PostgresContainer
@pytest.fixture(scope="module")
def postgres_container():
"""Start PostgreSQL container for tests."""
with PostgresContainer("postgres:15") as postgres:
yield postgres
@pytest.fixture
def db_connection(postgres_container):
"""Database connection from test container."""
engine = create_engine(postgres_container.get_connection_url())
return engine.connect()
def test_user_repository(db_connection):
repo = UserRepository(db_connection)
user = repo.create(email="alice@example.com")
assert user.id is not None
```
**Benefits:**
- Clean database per test run
- Matches production environment
- No manual setup required
**When NOT to use:**
- Unit tests (too slow)
- CI without Docker support
---
## Boundary Testing Strategy
**Test at system boundaries, not internal implementation.**
**Boundaries to test:**
1. **Application → Database** (SQL queries, ORMs)
2. **Application → External API** (HTTP clients, SDKs)
3. **Application → File System** (File I/O, uploads)
4. **Application → Message Queue** (Producers/consumers)
5. **Service A → Service B** (Microservice calls)
**Example: Boundary test for file upload**
```python
def test_file_upload_integration(tmp_path):
"""Integration test: FileService + File System."""
service = FileService(storage_path=str(tmp_path))
# Upload file
file_id = service.upload(filename="test.txt", content=b"Hello")
# Verify file exists on disk
file_path = tmp_path / file_id / "test.txt"
assert file_path.exists()
assert file_path.read_bytes() == b"Hello"
```
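`FileService` is assumed; a minimal sketch consistent with the assertions above:

```python
# Hypothetical service under test (sketch): stores each upload under a fresh ID
import uuid
from pathlib import Path

class FileService:
    def __init__(self, storage_path):
        self.storage_path = Path(storage_path)

    def upload(self, filename, content):
        file_id = uuid.uuid4().hex
        target_dir = self.storage_path / file_id
        target_dir.mkdir(parents=True)
        (target_dir / filename).write_bytes(content)
        return file_id
```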
---
## Anti-Patterns Catalog
### ❌ Testing Internal Implementation
**Symptom:** Integration test verifies internal method calls
```python
# ❌ BAD: Testing implementation, not integration
def test_order_service():
with patch('order_service._calculate_tax') as mock_tax:
service.create_order(...)
assert mock_tax.called
```
**Why bad:** Not testing integration point, just internal logic
**Fix:** Test actual boundary (database, API, etc.)
```python
# ✅ GOOD: Test database integration
def test_order_service(db_session):
service = OrderService(db_session)
order = service.create_order(...)
# Verify data was persisted
saved_order = db_session.query(Order).get(order.id)
assert saved_order.total == order.total
```
---
### ❌ Full System Tests Disguised as Integration Tests
**Symptom:** "Integration test" requires entire system running
```python
# ❌ BAD: This is an E2E test, not integration test
def test_checkout_flow():
# Requires: Web server, database, Redis, Stripe, email service
browser.goto("http://localhost:8000/checkout")
browser.fill("#card", "4242424242424242")
browser.click("#submit")
assert "Success" in browser.content()
```
**Why bad:** Slow, fragile, hard to debug
**Fix:** Test individual integration points
```python
# ✅ GOOD: Integration test for payment component only
def test_payment_integration(mock_stripe):
service = PaymentService()
result = service.charge(amount=1000, token="tok_visa")
assert result.status == "succeeded"
```
---
### ❌ Shared Test Data Across Integration Tests
**Symptom:** Tests fail when run in different orders
```python
# ❌ BAD: Relies on shared database state
def test_get_user():
user = db.query(User).filter_by(email="test@example.com").first()
assert user.name == "Test User"
def test_update_user():
user = db.query(User).filter_by(email="test@example.com").first()
user.name = "Updated"
db.commit()
```
**Fix:** Each test creates its own data (see test-isolation-fundamentals skill)
```python
# ✅ GOOD: Isolated test data
def test_get_user(db_session):
user = create_test_user(db_session, email="test@example.com")
retrieved = db_session.query(User).get(user.id)
assert retrieved.name == user.name
```
---
### ❌ Testing Too Many Layers
**Symptom:** Integration test includes business logic validation
```python
# ❌ BAD: Testing logic + integration in same test
def test_order_calculation(db_session):
order = OrderService(db_session).create_order(...)
# Integration: DB save
assert order.id is not None
# Logic: Tax calculation (should be unit test!)
assert order.tax == order.subtotal * 0.08
```
**Fix:** Separate concerns
```python
# ✅ GOOD: Unit test for logic
def test_order_tax_calculation():
order = Order(subtotal=100)
assert order.calculate_tax() == 8.0
# ✅ GOOD: Integration test for persistence
def test_order_persistence(db_session):
repo = OrderRepository(db_session)
order = repo.create(subtotal=100, tax=8.0)
assert repo.get(order.id).tax == 8.0
```
---
## Integration Test Environments
### Local Development
```yaml
# docker-compose.test.yml
version: '3.8'
services:
postgres:
image: postgres:15
environment:
POSTGRES_DB: test_db
POSTGRES_USER: test
POSTGRES_PASSWORD: test
redis:
image: redis:7
rabbitmq:
image: rabbitmq:3-management
```
**Run tests:**
```bash
docker-compose -f docker-compose.test.yml up -d
pytest tests/integration/
docker-compose -f docker-compose.test.yml down
```
---
### CI/CD
```yaml
# .github/workflows/integration-tests.yml
name: Integration Tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports:
          - 5432:5432            # expose to the runner so localhost:5432 is reachable
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
    steps:
      - uses: actions/checkout@v3
      - name: Run integration tests
        run: pytest tests/integration/
        env:
          DATABASE_URL: postgresql://postgres:test@localhost/test
```
---
## Performance Considerations
**Integration tests are slower than unit tests.**
**Optimization strategies:**
1. **Use transactions:** Rollback instead of truncating tables (100x faster)
2. **Parallelize:** Run integration tests in parallel (`pytest -n 4`)
3. **Minimize I/O:** Only test integration points, not full workflows
4. **Cache containers:** Reuse test containers across tests (scope="module")
**Example: Fast integration tests**
```python
# Slow: 5 seconds per test
@pytest.fixture
def db():
engine = create_engine(...)
Base.metadata.create_all(engine) # Recreate schema every test
yield engine
Base.metadata.drop_all(engine)
# Fast: 10ms per test
@pytest.fixture(scope="module")
def db_engine():
engine = create_engine(...)
Base.metadata.create_all(engine) # Once per module
yield engine
Base.metadata.drop_all(engine)
@pytest.fixture
def db_session(db_engine):
connection = db_engine.connect()
transaction = connection.begin()
session = Session(bind=connection)
yield session
transaction.rollback() # Fast cleanup
connection.close()
```
---
## Bottom Line
**Integration tests verify that components work together at system boundaries.**
- Test at boundaries (DB, API, queue), not internal logic
- Use real dependencies (DB, queue) or realistic mocks (external APIs)
- Keep tests isolated (transactions, test containers, unique data)
- Run on every PR (they're slower than unit tests but faster than E2E)
**If your "integration test" requires the entire system running, it's an E2E test. Test integration points individually.**

View File

@@ -0,0 +1,843 @@
---
name: load-testing-patterns
description: Use when designing load tests, choosing tools (k6, JMeter, Gatling), calculating concurrent users from DAU, interpreting latency degradation, identifying bottlenecks, or running spike/soak/stress tests - provides test patterns, anti-patterns, and load calculation frameworks
---
# Load Testing Patterns
## Overview
**Core principle:** Test realistic load patterns, not constant artificial load. Find limits before users do.
**Rule:** Load testing reveals system behavior under stress. Without it, production is your load test.
## Tool Selection Decision Tree
| Your Need | Protocol | Team Skills | Use | Why |
|-----------|----------|-------------|-----|-----|
| Modern API testing | HTTP/REST/GraphQL | JavaScript | **k6** | Best dev experience, CI/CD friendly |
| Enterprise/complex protocols | HTTP/SOAP/JMS/JDBC | Java/GUI comfort | **JMeter** | Mature, comprehensive protocols |
| Python team | HTTP/WebSocket | Python | **Locust** | Pythonic, easy scripting |
| High performance/complex scenarios | HTTP/gRPC | Scala/Java | **Gatling** | Best reports, high throughput |
| Cloud-native at scale | HTTP/WebSocket | Any (SaaS) | **Artillery, Flood.io** | Managed, distributed |
**First choice:** k6 (modern, scriptable, excellent CI/CD integration)
**Why not ApacheBench/wrk:** Too simple for realistic scenarios, no complex user flows
## Test Pattern Library
| Pattern | Purpose | Duration | When to Use |
|---------|---------|----------|-------------|
| **Smoke Test** | Verify test works | 1-2 min | Before every test run |
| **Load Test** | Normal/peak capacity | 10-30 min | Regular capacity validation |
| **Stress Test** | Find breaking point | 20-60 min | Understand limits |
| **Spike Test** | Sudden traffic surge | 5-15 min | Black Friday, launch events |
| **Soak Test** | Memory leaks, stability | 1-8 hours | Pre-release validation |
| **Capacity Test** | Max sustainable load | Variable | Capacity planning |
### Smoke Test
**Goal:** Verify test script works with minimal load
```javascript
// k6 smoke test
export let options = {
vus: 1,
duration: '1m',
thresholds: {
http_req_duration: ['p(95)<500'], // 95% < 500ms
http_req_failed: ['rate<0.01'], // <1% errors
}
}
```
**Purpose:** Catch test script bugs before running expensive full tests
### Load Test (Ramp-Up Pattern)
**Goal:** Test normal and peak expected load
```javascript
// k6 load test with ramp-up
export let options = {
stages: [
{ duration: '5m', target: 100 }, // Ramp to normal load
{ duration: '10m', target: 100 }, // Hold at normal
{ duration: '5m', target: 200 }, // Ramp to peak
{ duration: '10m', target: 200 }, // Hold at peak
{ duration: '5m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500', 'p(99)<1000'],
http_req_failed: ['rate<0.05'],
}
}
```
**Pattern:** Gradual ramp-up → sustain → ramp down. Never start at peak.
### Stress Test (Breaking Point)
**Goal:** Find system limits
```javascript
// k6 stress test
export let options = {
stages: [
{ duration: '5m', target: 100 }, // Normal
{ duration: '5m', target: 300 }, // Above peak
{ duration: '5m', target: 600 }, // 2x peak
{ duration: '5m', target: 900 }, // 3x peak (expect failure)
{ duration: '10m', target: 0 }, // Recovery
]
}
```
**Success:** Identify at what load system degrades (not necessarily breaking completely)
### Spike Test (Sudden Surge)
**Goal:** Test sudden traffic bursts (viral post, email campaign)
```javascript
// k6 spike test
export let options = {
stages: [
{ duration: '1m', target: 100 }, // Normal
{ duration: '30s', target: 1000 }, // SPIKE to 10x
{ duration: '5m', target: 1000 }, // Hold spike
{ duration: '2m', target: 100 }, // Back to normal
{ duration: '5m', target: 100 }, // Recovery check
]
}
```
**Tests:** Auto-scaling, circuit breakers, rate limiting
### Soak Test (Endurance)
**Goal:** Find memory leaks, resource exhaustion over time
```javascript
// k6 soak test
export let options = {
stages: [
{ duration: '5m', target: 100 }, // Ramp
{ duration: '4h', target: 100 }, // Soak (sustained load)
{ duration: '5m', target: 0 }, // Ramp down
]
}
```
**Monitor:** Memory growth, connection leaks, disk space, file descriptors
**Duration:** Minimum 1 hour, ideally 4-8 hours
## Load Calculation Framework
**Problem:** Convert "10,000 daily active users" to concurrent load
### Step 1: DAU to Concurrent Users
```
Concurrent Users = DAU × Concurrency Ratio × Peak Multiplier
Concurrency Ratios by App Type:
- Web apps: 5-10%
- Social media: 10-20%
- Business apps: 20-30% (work hours)
- Gaming: 15-25%
Peak Multiplier: 1.5-2x for safety margin
```
**Example:**
```
DAU = 10,000
Concurrency = 10% (web app)
Peak Multiplier = 1.5
Concurrent Users = 10,000 × 0.10 × 1.5 = 1,500 concurrent users
```
### Step 2: Concurrent Users to Requests/Second
```
RPS = (Concurrent Users × Requests per Session) / (Session Duration × Think Time Ratio)
Think Time Ratio:
- Active browsing: 0.3-0.5 (30-50% time clicking/typing)
- Reading-heavy: 0.1-0.2 (10-20% active)
- API clients: 0.8-1.0 (80-100% active)
```
**Example:**
```
Concurrent Users = 1,500
Requests per Session = 20
Session Duration = 10 minutes = 600 seconds
Think Time Ratio = 0.3 (web browsing)
RPS = (1,500 × 20) / (600 × 0.3) = 30,000 / 180 = 167 RPS
```
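The same arithmetic as a small Python helper (a sketch of the two formulas above, not part of any load-testing tool):

```python
def concurrent_users(dau, concurrency_ratio, peak_multiplier=1.5):
    """Step 1: DAU → concurrent users."""
    return dau * concurrency_ratio * peak_multiplier

def requests_per_second(concurrent, requests_per_session, session_seconds, think_time_ratio):
    """Step 2: concurrent users → requests/second."""
    return (concurrent * requests_per_session) / (session_seconds * think_time_ratio)

concurrent = concurrent_users(10_000, 0.10)           # 1,500
rps = requests_per_second(concurrent, 20, 600, 0.3)   # ≈ 167 RPS
print(f"{concurrent:.0f} concurrent users, {rps:.0f} RPS")
```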
### Step 3: Model Realistic Patterns
Don't use constant load. Use realistic traffic patterns:
```javascript
// Realistic daily pattern
export let options = {
stages: [
// Morning ramp
{ duration: '2h', target: 500 }, // 08:00-10:00
{ duration: '2h', target: 1000 }, // 10:00-12:00 (peak)
// Lunch dip
{ duration: '1h', target: 600 }, // 12:00-13:00
// Afternoon peak
{ duration: '2h', target: 1200 }, // 13:00-15:00 (peak)
{ duration: '2h', target: 800 }, // 15:00-17:00
// Evening drop
{ duration: '2h', target: 300 }, // 17:00-19:00
]
}
```
## Anti-Patterns Catalog
### ❌ Coordinated Omission
**Symptom:** Fixed rate load generation ignores slow responses, underestimating latency
**Why bad:** Hides real latency impact when system slows down
**Fix:** Use arrival rate (requests/sec) not iteration rate
```javascript
// ❌ Bad - coordinated omission
export default function() {
http.get('https://api.example.com')
sleep(1) // Wait 1s between requests
}
// ✅ Good - arrival rate pacing
export let options = {
scenarios: {
constant_arrival_rate: {
executor: 'constant-arrival-rate',
rate: 100, // 100 RPS regardless of response time
timeUnit: '1s',
duration: '10m',
preAllocatedVUs: 50,
maxVUs: 200,
}
}
}
```
---
### ❌ Cold Start Testing
**Symptom:** Running load test immediately after deployment without warm-up
**Why bad:** JIT compilation, cache warming, connection pooling haven't stabilized
**Fix:** Warm-up phase before measurement
```javascript
// ✅ Good - warm-up phase
export let options = {
stages: [
{ duration: '2m', target: 50 }, // Warm-up (not measured)
{ duration: '10m', target: 100 }, // Actual test
]
}
```
---
### ❌ Unrealistic Test Data
**Symptom:** Using same user ID, same query parameters for all virtual users
**Why bad:** Caches give unrealistic performance, doesn't test real database load
**Fix:** Parameterized, realistic data
```javascript
// ❌ Bad - same data
http.get('https://api.example.com/users/123')
// ✅ Good - parameterized data
import { SharedArray } from 'k6/data'
import papaparse from 'https://jslib.k6.io/papaparse/5.1.1/index.js'
const csvData = new SharedArray('users', function () {
return papaparse.parse(open('./users.csv'), { header: true }).data
})
export default function() {
const user = csvData[__VU % csvData.length]
http.get(`https://api.example.com/users/${user.id}`)
}
```
---
### ❌ Constant Load Pattern
**Symptom:** Running with constant VUs instead of realistic traffic pattern
**Why bad:** Real traffic has peaks, valleys, not flat line
**Fix:** Use realistic daily/hourly patterns
---
### ❌ Ignoring Think Time
**Symptom:** No delays between requests, hammering API as fast as possible
**Why bad:** Unrealistic user behavior, overestimates load
**Fix:** Add realistic think time based on user behavior
```javascript
// ✅ Good - realistic think time
import { sleep } from 'k6'
export default function() {
http.get('https://api.example.com/products')
sleep(Math.random() * 3 + 2) // 2-5 seconds browsing
http.post('https://api.example.com/cart', {...})
sleep(Math.random() * 5 + 5) // 5-10 seconds deciding
http.post('https://api.example.com/checkout', {...})
}
```
## Result Interpretation Guide
### Latency Degradation Patterns
| Pattern | Cause | What to Check |
|---------|-------|---------------|
| **Linear growth** (2x users → 2x latency) | CPU-bound | Thread pool, CPU usage |
| **Exponential growth** (2x users → 10x latency) | Resource saturation | Connection pools, locks, queues |
| **Sudden cliff** (works until X, then fails) | Hard limit hit | Max connections, memory, file descriptors |
| **Gradual degradation** (slow increase over time) | Memory leak, cache pollution | Memory trends, GC activity |
### Bottleneck Classification
**Symptom: p95 latency 10x at 2x load**
**Resource saturation** (database connection pool, thread pool, queue)
**Symptom: Errors increase with load**
**Hard limit** (connection limit, rate limiting, timeout)
**Symptom: Latency grows over time at constant load**
**Memory leak** or **cache pollution**
**Symptom: High variance (p50 good, p99 terrible)**
**GC pauses**, **lock contention**, or **slow queries**
### What to Monitor
| Layer | Metrics to Track |
|-------|------------------|
| **Application** | Request rate, error rate, p50/p95/p99 latency, active requests |
| **Runtime** | GC pauses (JVM, .NET), thread pool usage, heap/memory |
| **Database** | Connection pool usage, query latency, lock waits, slow queries |
| **Infrastructure** | CPU %, memory %, disk I/O, network throughput |
| **External** | Third-party API latency, rate limit hits |
### Capacity Planning Formula
```
Safe Capacity = (Breaking Point × Degradation Factor) × Safety Margin
Breaking Point = VUs where p95 latency > threshold
Degradation Factor = 0.7 (start degradation before break)
Safety Margin = 0.5-0.7 (handle traffic spikes)
Example:
- System breaks at 1000 VUs (p95 > 1s)
- Start seeing degradation at 700 VUs (70%)
- Safe capacity: 700 × 0.7 = 490 VUs
```
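The same formula in code (sketch):

```python
def safe_capacity(breaking_point_vus, degradation_factor=0.7, safety_margin=0.7):
    """VUs you can plan around, given where the system breaks under load."""
    return breaking_point_vus * degradation_factor * safety_margin

print(safe_capacity(1000))   # 490.0 - matches the example above
```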
## Authentication and Session Management
**Problem:** Real APIs require authentication. Can't use same token for all virtual users.
### Token Strategy Decision Framework
| Scenario | Strategy | Why |
|----------|----------|-----|
| **Short test (<10 min)** | Pre-generate tokens | Fast, simple, no login load |
| **Long test (soak)** | Login during test + refresh | Realistic, tests auth system |
| **Testing auth system** | Simulate login flow | Auth is part of load |
| **Read-only testing** | Shared token (single user) | Simplest, adequate for API-only tests |
**Default:** Pre-generate tokens for load tests, simulate login for auth system tests
### Pre-Generated Tokens Pattern
**Best for:** API testing where auth system isn't being tested
```javascript
// k6 with pre-generated JWT tokens
import http from 'k6/http'
import { SharedArray } from 'k6/data'
// Load tokens from file (generated externally)
const tokens = new SharedArray('auth tokens', function () {
return JSON.parse(open('./tokens.json'))
})
export default function() {
const token = tokens[__VU % tokens.length]
const headers = {
'Authorization': `Bearer ${token}`
}
http.get('https://api.example.com/protected', { headers })
}
```
**Generate tokens externally:**
```bash
# Script to generate 1000 tokens as a JSON array (the format the k6 script expects)
for i in {1..1000}; do
  curl -s -X POST https://api.example.com/login \
    -d "username=loadtest_user_$i&password=test" \
    | jq -r '.token'
done | jq -R . | jq -s . > tokens.json
```
**Pros:** No login load, fast test setup
**Cons:** Tokens may expire during long tests, not testing auth flow
---
### Login Flow Simulation Pattern
**Best for:** Testing auth system, soak tests where tokens expire
```javascript
// k6 with login simulation
import http from 'k6/http'
import { SharedArray } from 'k6/data'
const users = new SharedArray('users', function () {
return JSON.parse(open('./users.json')) // [{username, password}, ...]
})
export default function() {
const user = users[__VU % users.length]
// Login to get token
const loginRes = http.post('https://api.example.com/login', {
username: user.username,
password: user.password
})
const token = loginRes.json('token')
// Use token for subsequent requests
const headers = { 'Authorization': `Bearer ${token}` }
http.get('https://api.example.com/protected', { headers })
http.post('https://api.example.com/data', {}, { headers })
}
```
**Token refresh for long tests:**
```javascript
// k6 with token refresh
import http from 'k6/http'
import { sleep } from 'k6'
let token = null
let tokenExpiry = 0
export default function() {
const now = Date.now() / 1000
// Refresh token if expired or about to expire
if (!token || now > tokenExpiry - 300) { // Refresh 5 min before expiry
const loginRes = http.post('https://api.example.com/login', {...})
token = loginRes.json('token')
tokenExpiry = loginRes.json('expires_at')
}
http.get('https://api.example.com/protected', {
headers: { 'Authorization': `Bearer ${token}` }
})
sleep(1)
}
```
---
### Session Cookie Management
**For cookie-based auth:**
```javascript
// k6 with session cookies
import http from 'k6/http'
export default function() {
// k6 automatically handles cookies with jar
const jar = http.cookieJar()
// Login (sets session cookie)
http.post('https://example.com/login', {
username: 'user',
password: 'pass'
})
// Subsequent requests use session cookie automatically
http.get('https://example.com/dashboard')
http.get('https://example.com/profile')
}
```
---
### Rate Limiting Detection
**Pattern:** Detect when hitting rate limits during load test
```javascript
// k6 rate limit detection
import { check } from 'k6'
export default function() {
const res = http.get('https://api.example.com/data')
check(res, {
'not rate limited': (r) => r.status !== 429
})
if (res.status === 429) {
console.warn(`Rate limited at VU ${__VU}, iteration ${__ITER}`)
const retryAfter = res.headers['Retry-After']
console.warn(`Retry-After: ${retryAfter} seconds`)
}
}
```
**Thresholds for rate limiting:**
```javascript
export let options = {
thresholds: {
'http_req_failed{status:429}': ['rate<0.01'] // <1% rate limited
}
}
```
## Third-Party Dependency Handling
**Problem:** APIs call external services (payment, email, third-party APIs). Should you mock them?
### Mock vs Real Decision Framework
| External Service | Mock or Real? | Why |
|------------------|---------------|-----|
| **Payment gateway** | Real (sandbox) | Need to test integration, has sandbox mode |
| **Email provider** | Mock | Cost ($0.001/email × 1000 VUs = expensive), no value testing |
| **Third-party API (has staging)** | Real (staging) | Test integration, realistic latency |
| **Third-party API (no staging)** | Mock | Can't load test production, rate limits |
| **Internal microservices** | Real | Testing real integration points |
| **Analytics/tracking** | Mock | High volume, no functional impact |
**Rule:** Use real services if they have sandbox/staging. Mock if expensive, rate-limited, or no test environment.
---
### Service Virtualization with WireMock
**Best for:** Mocking HTTP APIs with realistic responses
```javascript
// k6 test pointing to WireMock
export default function() {
// WireMock running on localhost:8080 mocks external API
const res = http.get('http://localhost:8080/api/payment/process')
check(res, {
'payment mock responds': (r) => r.status === 200
})
}
```
**WireMock stub setup:**
```json
{
"request": {
"method": "POST",
"url": "/api/payment/process"
},
"response": {
"status": 200,
"jsonBody": {
"transaction_id": "{{randomValue type='UUID'}}",
"status": "approved"
},
"headers": {
"Content-Type": "application/json"
},
"fixedDelayMilliseconds": 200
}
}
```
**Why WireMock:** Realistic latency simulation, dynamic responses, stateful mocking
---
### Partial Mocking Pattern
**Pattern:** Mock some services, use real for others
```javascript
// k6 with partial mocking
import http from 'k6/http'
export default function() {
// Real API (points to staging)
const productRes = http.get('https://staging-api.example.com/products')
// Mock email service (points to WireMock)
http.post('http://localhost:8080/mock/email/send', {
to: 'user@example.com',
subject: 'Order confirmation'
})
// Real payment sandbox
http.post('https://sandbox-payment.stripe.com/charge', {
amount: 1000,
currency: 'usd',
source: 'tok_visa'
})
}
```
**Decision criteria:**
- Real: Services with sandbox, need integration validation, low cost
- Mock: No sandbox, expensive, rate-limited, testing failure scenarios
---
### Testing External Service Failures
**Use mocks to simulate failures:**
```javascript
// WireMock stub for failure scenarios
{
"request": {
"method": "POST",
"url": "/api/payment/process"
},
"response": {
"status": 503,
"jsonBody": {
"error": "Service temporarily unavailable"
},
"fixedDelayMilliseconds": 5000 // Slow failure
}
}
```
**k6 test for resilience:**
```javascript
export default function() {
const res = http.post('http://localhost:8080/api/payment/process', {})
// Verify app handles payment failures gracefully
check(res, {
'handles payment failure': (r) => r.status === 503,
'returns within timeout': (r) => r.timings.duration < 6000
})
}
```
---
### Cost and Compliance Guardrails
**Before testing with real external services:**
| Check | Why |
|-------|-----|
| **Sandbox mode exists?** | Avoid production costs/rate limits |
| **Cost per request?** | 1000 VUs × 10 req/s × 600s = 6M requests |
| **Rate limits?** | Will you hit external service limits? |
| **Terms of service?** | Does load testing violate TOS? |
| **Data privacy?** | Using real user emails/PII? |
**Example cost calculation:**
```
Email service: $0.001/email
Load test: 100 VUs × ~1 session/s each × 600 s = 60,000 sessions × 5 emails = 300,000 emails
Cost: 300,000 × $0.001 = $300
Decision: Mock email service, use real payment sandbox (free)
```
**Compliance:**
- Don't use real user data in load tests (GDPR, privacy)
- Check third-party TOS (some prohibit load testing)
- Use synthetic test data only
## Your First Load Test
**Goal:** Basic load test in one day
**Hour 1-2: Install tool and write smoke test**
```bash
# Install k6
brew install k6 # macOS
# or snap install k6 # Linux
# Create test.js
cat > test.js <<'EOF'
import http from 'k6/http'
import { check, sleep } from 'k6'
export let options = {
vus: 1,
duration: '30s'
}
export default function() {
let res = http.get('https://your-api.com/health')
check(res, {
'status is 200': (r) => r.status === 200,
'response < 500ms': (r) => r.timings.duration < 500
})
sleep(1)
}
EOF
# Run smoke test
k6 run test.js
```
**Hour 3-4: Calculate target load**
```
Your DAU: 10,000
Concurrency: 10%
Peak multiplier: 1.5
Target: 10,000 × 0.10 × 1.5 = 1,500 VUs
```
**Hour 5-6: Write load test with ramp-up**
```javascript
export let options = {
stages: [
{ duration: '5m', target: 750 }, // Ramp to normal (50%)
{ duration: '10m', target: 750 }, // Hold normal
{ duration: '5m', target: 1500 }, // Ramp to peak
{ duration: '10m', target: 1500 }, // Hold peak
{ duration: '5m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500', 'p(99)<1000'],
http_req_failed: ['rate<0.05'] // < 5% errors
}
}
```
**Hour 7-8: Run test and analyze**
```bash
# Run load test
k6 run --out json=results.json test.js
# Check summary output for:
# - p95/p99 latency trends
# - Error rates
# - When degradation started
```
**If test fails:** Check thresholds, adjust targets, investigate bottlenecks
## Common Mistakes
### ❌ Testing Production Without Safeguards
**Fix:** Use feature flags, test environment, or controlled percentage
---
### ❌ No Baseline Performance Metrics
**Fix:** Run smoke test first to establish baseline before load testing
---
### ❌ Using Iteration Duration Instead of Arrival Rate
**Fix:** Use `constant-arrival-rate` executor in k6
---
### ❌ Not Warming Up Caches/JIT
**Fix:** 2-5 minute warm-up phase before measurement
## Quick Reference
**Tool Selection:**
- Modern API: k6
- Enterprise: JMeter
- Python team: Locust
**Test Patterns:**
- Smoke: 1 VU, 1 min
- Load: Ramp-up → peak → ramp-down
- Stress: Increase until break
- Spike: Sudden 10x surge
- Soak: 4-8 hours constant
**Load Calculation:**
```
Concurrent = DAU × 0.10 × 1.5
RPS = (Concurrent × Requests/Session) / (Duration × Think Time)
```
**Anti-Patterns:**
- Coordinated omission (use arrival rate)
- Cold start (warm-up first)
- Unrealistic data (parameterize)
- Constant load (use realistic patterns)
**Result Interpretation:**
- Linear growth → CPU-bound
- Exponential growth → Resource saturation
- Sudden cliff → Hard limit
- Gradual degradation → Memory leak
**Authentication:**
- Short tests: Pre-generate tokens
- Long tests: Login + refresh
- Testing auth: Simulate login flow
**Third-Party Dependencies:**
- Has sandbox: Use real (staging/sandbox)
- Expensive/rate-limited: Mock (WireMock)
- No sandbox: Mock
## Bottom Line
**Start with smoke test (1 VU). Calculate realistic load from DAU. Use ramp-up pattern (never start at peak). Monitor p95/p99 latency. Find breaking point before users do.**
Test realistic scenarios with think time, not hammer tests.

View File

@@ -0,0 +1,348 @@
---
name: mutation-testing
description: Use when validating test effectiveness, measuring test quality beyond coverage, choosing mutation testing tools (Stryker, PITest, mutmut), interpreting mutation scores, or improving test suites - provides mutation operators, score interpretation, and integration patterns
---
# Mutation Testing
## Overview
**Core principle:** Mutation testing validates that your tests actually test something by introducing bugs and checking if tests catch them.
**Rule:** 100% code coverage doesn't mean good tests. Mutation score measures if tests detect bugs.
## Code Coverage vs Mutation Score
| Metric | What It Measures | Example |
|--------|------------------|---------|
| **Code Coverage** | Lines executed by tests | `calculate_tax(100)` executes code = 100% coverage |
| **Mutation Score** | Bugs detected by tests | Change `*` to `/` → test still passes = poor tests |
**Problem with coverage:**
```python
def calculate_tax(amount):
return amount * 0.08
def test_calculate_tax():
calculate_tax(100) # 100% coverage, but asserts nothing!
```
**Mutation testing catches this:**
1. Mutates `* 0.08` to `/ 0.08`
2. Runs test
3. Test still passes → **Survived mutation** (bad test!)
---
## How Mutation Testing Works
**Process:**
1. **Create mutant:** Change code slightly (e.g., `+``-`, `<``<=`)
2. **Run tests:** Do tests fail?
3. **Classify:**
- **Killed:** Test failed → Good test!
- **Survived:** Test passed → Test doesn't verify this logic
- **Timeout:** Test hung → Usually killed
- **No coverage:** Not executed → Add test
**Mutation Score:**
```
Mutation Score = (Killed Mutants / Total Mutants) × 100
```
**Thresholds:**
- **> 80%:** Excellent test quality
- **60-80%:** Acceptable
- **< 60%:** Tests are weak
---
## Tool Selection
| Language | Tool | Why |
|----------|------|-----|
| **JavaScript/TypeScript** | **Stryker** | Best JS support, framework-agnostic |
| **Java** | **PITest** | Industry standard, Maven/Gradle integration |
| **Python** | **mutmut** | Simple, fast, pytest integration |
| **C#** | **Stryker.NET** | .NET ecosystem integration |
---
## Example: Python with mutmut
### Installation
```bash
pip install mutmut
```
---
### Basic Usage
```bash
# Run mutation testing
mutmut run
# View results
mutmut results
# Show survived mutants (bugs your tests missed)
mutmut show
```
---
### Configuration
```ini
# setup.cfg
[mutmut]
paths_to_mutate=src/
backup=False
runner=python -m pytest -x
tests_dir=tests/
```
---
### Example
```python
# src/calculator.py
def calculate_discount(price, percent):
if percent > 100:
raise ValueError("Percent cannot exceed 100")
return price * (1 - percent / 100)
# tests/test_calculator.py
def test_calculate_discount():
result = calculate_discount(100, 20)
assert result == 80
```
**Run mutmut:**
```bash
mutmut run
```
**Possible mutations:**
1. `percent > 100``percent >= 100` (boundary)
2. `1 - percent``1 + percent` (operator)
3. `percent / 100``percent * 100` (operator)
4. `price * (...)``price / (...)` (operator)
**Results:**
- Mutation 1 **survived** (test doesn't check boundary)
- Mutation 2, 3, 4 **killed** (test catches these)
**Improvement:**
```python
def test_calculate_discount_boundary():
# Catch mutation 1
with pytest.raises(ValueError):
calculate_discount(100, 101)
```
---
## Common Mutation Operators
| Operator | Original | Mutated | What It Tests |
|----------|----------|---------|---------------|
| **Arithmetic** | `a + b` | `a - b` | Calculation logic |
| **Relational** | `a < b` | `a <= b` | Boundary conditions |
| **Logical** | `a and b` | `a or b` | Boolean logic |
| **Unary** | `+x` | `-x` | Sign handling |
| **Constant** | `return 0` | `return 1` | Magic numbers |
| **Return** | `return x` | `return None` | Return value validation |
| **Statement deletion** | `x = 5` | (deleted) | Side effects |
---
## Interpreting Mutation Score
### High Score (> 80%)
**Good tests that catch most bugs.**
```python
def add(a, b):
return a + b
def test_add():
assert add(2, 3) == 5
assert add(-1, 1) == 0
assert add(0, 0) == 0
# Mutations killed:
# - a - b (returns -1, test expects 5)
# - a * b (returns 6, test expects 5)
```
---
### Low Score (< 60%)
**Weak tests that don't verify logic.**
```python
def validate_email(email):
return "@" in email and "." in email
def test_validate_email():
validate_email("user@example.com") # No assertion!
# Mutations survived:
# - "@" in email → "@" not in email
# - "and" → "or"
# - (All mutations survive because test asserts nothing)
```
---
### Survived Mutants to Investigate
**Priority order:**
1. **Business logic mutations** (calculations, validations)
2. **Boundary conditions** (`<``<=`, `>``>=`)
3. **Error handling** (exception raising)
**Low priority:**
4. **Logging statements**
5. **Constants that don't affect behavior**
---
## Integration with CI/CD
### GitHub Actions (Python)
```yaml
# .github/workflows/mutation-testing.yml
name: Mutation Testing
on:
schedule:
- cron: '0 2 * * 0' # Weekly on Sunday 2 AM
workflow_dispatch: # Manual trigger
jobs:
mutmut:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install mutmut pytest
- name: Run mutation testing
run: mutmut run
- name: Generate report
run: |
mutmut results
mutmut html # Generate HTML report
- name: Upload report
uses: actions/upload-artifact@v3
with:
name: mutation-report
path: html/
```
**Why weekly, not every PR:**
- Mutation testing is slow (10-100x slower than regular tests)
- Runs every possible mutation
- Not needed for every change
---
## Anti-Patterns Catalog
### ❌ Chasing 100% Mutation Score
**Symptom:** Writing tests just to kill surviving mutants
**Why bad:**
- Some mutations are equivalent (don't change behavior)
- Diminishing returns after 85%
- Time better spent on integration tests
**Fix:** Target 80-85%, focus on business logic
---
### ❌ Ignoring Equivalent Mutants
**Symptom:** "95% mutation score, still have survived mutants"
**Equivalent mutants:** Changes that don't affect behavior
```python
def is_positive(x):
return x > 0
# Mutation: x > 0 → x >= 0
# If input is never exactly 0, this mutation is equivalent
```
**Fix:** Confirm the mutant really is equivalent, then exclude that line from mutation with a `# pragma: no mutate` comment
```bash
# List surviving mutants and inspect the suspect
mutmut results
mutmut show 42
# If it is equivalent, mark the source line so mutmut skips it:
#     return x > 0  # pragma: no mutate
```
---
### ❌ Running Mutation Tests on Every Commit
**Symptom:** CI takes 2 hours
**Why bad:** Mutation testing is 10-100x slower than regular tests
**Fix:**
- Run weekly or nightly
- Run on core modules only (not entire codebase)
- Use as quality metric, not blocker
---
## Incremental Mutation Testing
**Test only changed code:**
```bash
# mutmut - mutate only files changed relative to main (comma-separated list)
mutmut run --paths-to-mutate "$(git diff --name-only main -- '*.py' | paste -sd, -)"
```
**Benefits:**
- Faster feedback (minutes instead of hours)
- Can run on PRs
- Focuses on new code
---
## Bottom Line
**Mutation testing measures if your tests actually detect bugs. High code coverage doesn't mean good tests.**
**Usage:**
- Run weekly/nightly, not on every commit (too slow)
- Target 80-85% mutation score for business logic
- Use mutmut (Python), Stryker (JS), PITest (Java)
- Focus on killed vs survived mutants
- Ignore equivalent mutants
**If your tests have 95% coverage but 40% mutation score, your tests aren't testing anything meaningful. Fix the tests, not the coverage metric.**

View File

@@ -0,0 +1,479 @@
---
name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns
---
# Observability and Monitoring
## Overview
**Core principle:** Measure what users care about, alert on symptoms not causes, make alerts actionable.
**Rule:** Observability without actionability is just expensive logging.
**Already have observability tools (CloudWatch, Datadog, etc.)?** Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.
## Getting Started Decision Tree
| Team Size | Scale | Starting Point | Tools |
|-----------|-------|----------------|-------|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |
**First step:** Implement metrics with OpenTelemetry + Prometheus
**Why this order:** Metrics give you fastest time-to-value (detect issues), logs help debug (understand what happened), traces solve complex distributed problems (debug cross-service issues)
## Three Pillars Quick Reference
### Metrics (Quantitative, aggregated)
**When to use:** Alerting, dashboards, trend analysis
**What to collect:**
- **RED method** (services): Rate, Errors, Duration
- **USE method** (resources): Utilization, Saturation, Errors
- **Four Golden Signals**: Latency, traffic, errors, saturation
**Implementation:**
```python
# OpenTelemetry metrics
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
"http_requests_total",
description="Total HTTP requests"
)
request_duration = meter.create_histogram(
"http_request_duration_seconds",
description="HTTP request duration"
)
# Instrument request
request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
```
### Logs (Discrete events)
**When to use:** Debugging, audit trails, error investigation
**Best practices:**
- Structured logging (JSON)
- Include correlation IDs
- Don't log sensitive data (PII, secrets)
**Implementation:**
```python
import structlog
log = structlog.get_logger()
log.info(
"user_login",
user_id=user_id,
correlation_id=correlation_id,
ip_address=ip,
duration_ms=duration
)
```
### Traces (Request flows)
**When to use:** Debugging distributed systems, latency investigation
**Implementation:**
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
span.set_attribute("user.id", user_id)
# Process order logic
```
## Anti-Patterns Catalog
### ❌ Vanity Metrics
**Symptom:** Tracking metrics that look impressive but don't inform decisions
**Why bad:** Wastes resources, distracts from actionable metrics
**Fix:** Only collect metrics that answer "should I page someone?" or inform business decisions
```python
# ❌ Bad - vanity metric
total_requests_all_time_counter.inc()
# ✅ Good - actionable metric
request_error_rate.labels(service="api", endpoint="/users").observe(error_rate)
```
---
### ❌ Alert on Everything
**Symptom:** Hundreds of alerts per day, team ignores most of them
**Why bad:** Alert fatigue, real issues get missed, on-call burnout
**Fix:** Alert only on user-impacting symptoms that require immediate action
**Test:** "If this alert fires at 2am, should someone wake up to fix it?" If no, it's not an alert.
---
### ❌ No Runbooks
**Symptom:** Alerts fire with no guidance on how to respond
**Why bad:** Increased MTTR, inconsistent responses, on-call stress
**Fix:** Every alert must link to a runbook with investigation steps
```yaml
# ✅ Good alert with runbook
alert: HighErrorRate
annotations:
summary: "Error rate >5% on {{$labels.service}}"
description: "Current: {{$value}}%"
runbook: "https://wiki.company.com/runbooks/high-error-rate"
```
---
### ❌ Cardinality Explosion
**Symptom:** Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues
**Why bad:** Expensive storage, slow queries, potential system failure
**Fix:** Use fixed cardinality labels, aggregate high-cardinality dimensions
```python
# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc() # Millions of unique series
# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()
```
---
### ❌ Missing Correlation IDs
**Symptom:** Can't trace requests across services, debugging takes hours
**Why bad:** High MTTR, frustrated engineers, customer impact
**Fix:** Generate correlation ID at entry point, propagate through all services
```python
# ✅ Good - correlation ID propagation
import uuid
from contextvars import ContextVar
correlation_id_var = ContextVar("correlation_id", default=None)
def handle_request():
correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
correlation_id_var.set(correlation_id)
# All logs and traces include it automatically
log.info("processing_request", extra={"correlation_id": correlation_id})
```
## SLI Selection Framework
**Principle:** Measure user experience, not system internals
### Four Golden Signals
| Signal | Definition | Example SLI |
|--------|------------|-------------|
| **Latency** | Request response time | p99 latency < 200ms |
| **Traffic** | Demand on system | Requests per second |
| **Errors** | Failed requests | Error rate < 0.1% |
| **Saturation** | Resource fullness | CPU < 80%, queue depth < 100 |
### RED Method (for services)
- **Rate**: Requests per second
- **Errors**: Error rate (%)
- **Duration**: Response time (p50, p95, p99)
### USE Method (for resources)
- **Utilization**: % time resource busy (CPU %, disk I/O %)
- **Saturation**: Queue depth, wait time
- **Errors**: Error count
**Decision framework:**
| Service Type | Recommended SLIs |
|--------------|------------------|
| **User-facing API** | Availability (%), p95 latency, error rate |
| **Background jobs** | Freshness (time since last run), success rate, processing time |
| **Data pipeline** | Data freshness, completeness (%), processing latency |
| **Storage** | Availability, durability, latency percentiles |
## SLO Definition Guide
**SLO = Target value for SLI**
**Formula:** `SLO = (good events / total events) >= target`
**Example:**
```
SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month
```
### Error Budget
**Definition:** Amount of unreliability you can tolerate
**Calculation:**
```
Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail
```
**Usage:** Balance reliability vs feature velocity
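The same arithmetic as a quick Python sketch (the 30-day window and request volume are illustrative):
```python
# Error budget for a 99.9% SLO over a 30-day window
slo_target = 0.999
window_minutes = 30 * 24 * 60              # 43,200 minutes in the window
monthly_requests = 1_000_000

error_budget = 1 - slo_target                              # 0.001 (0.1%)
allowed_downtime_min = error_budget * window_minutes       # ~43 minutes
allowed_failed_requests = error_budget * monthly_requests  # 1,000 requests

print(f"Budget: {allowed_downtime_min:.0f} min downtime "
      f"or {allowed_failed_requests:.0f} failed requests per month")
```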
### Multi-Window Multi-Burn-Rate Alerting
**Problem:** Simple threshold alerts are either too noisy or too slow
**Solution:** Alert based on how fast you're burning error budget
```yaml
# Alert if burning the error budget 14.4x faster than sustainable (consumes 2% of a 30-day budget in 1 hour)
alert: ErrorBudgetBurnRateHigh
expr: |
(
rate(errors[1h]) / rate(requests[1h])
) > (14.4 * (1 - 0.999))
annotations:
summary: "Burning error budget at 14.4x rate"
runbook: "https://wiki/runbooks/error-budget-burn"
```
## Alert Design Patterns
**Principle:** Alert on symptoms (user impact) not causes (CPU high)
### Symptom-Based Alerting
```yaml
# ❌ Bad - alert on cause
alert: HighCPU
expr: cpu_usage > 80%
# ✅ Good - alert on symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0
```
### Alert Severity Levels
| Level | When | Response Time | Example |
|-------|------|---------------|---------|
| **Critical** | User-impacting | Immediate (page) | Error rate >5%, service down |
| **Warning** | Will become critical | Next business day | Error rate >1%, disk 85% full |
| **Info** | Informational | No action needed | Deploy completed, scaling event |
**Rule:** Only page for critical. Everything else goes to dashboard/Slack.
## Cost Optimization Quick Reference
**Observability can cost 5-15% of infrastructure spend. Optimize:**
### Sampling Strategies
```python
# Trace sampling - collect 10% of traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1) # 10% sampling
```
**When to sample:**
- Traces: 1-10% for high-traffic services
- Logs: Sample debug/info logs, keep all errors (see the sketch below)
- Metrics: Don't sample (they're already aggregated)
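As one way to do the log sampling mentioned above, a minimal sketch using the standard-library `logging` module (the 10% rate is an assumption, tune per service):
```python
import logging
import random

class InfoSampler(logging.Filter):
    """Keep every WARNING-and-above record, pass only a fraction of DEBUG/INFO."""

    def __init__(self, rate=0.1):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                      # never drop warnings or errors
        return random.random() < self.rate   # keep ~10% of debug/info records

handler = logging.StreamHandler()
handler.addFilter(InfoSampler(rate=0.1))
logging.getLogger().addHandler(handler)
```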
### Retention Policies
| Data Type | Recommended Retention | Rationale |
|-----------|----------------------|-----------|
| **Metrics** | 15 days (raw), 13 months (aggregated) | Trend analysis |
| **Logs** | 7-30 days | Debugging, compliance |
| **Traces** | 7 days | Debugging recent issues |
### Cardinality Control
```python
# ❌ Bad - high cardinality
http_requests.labels(
method=method,
url=full_url, # Unbounded!
user_id=user_id # Unbounded!
)
# ✅ Good - controlled cardinality
http_requests.labels(
method=method,
endpoint=route_pattern, # /users/:id not /users/12345
status_code=status
)
```
## Tool Ecosystem Quick Reference
| Category | Open Source | Managed/Commercial |
|----------|-------------|-------------------|
| **Metrics** | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| **Logs** | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| **Traces** | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| **All-in-One** | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| **Instrumentation** | OpenTelemetry | (vendor SDKs) |
**Recommendation:**
- **Starting out**: Prometheus + Grafana + OpenTelemetry
- **Growing (10-50 services)**: Add Loki (logs) + Jaeger (traces)
- **Scale (50+ services)**: Consider managed (Datadog, Grafana Cloud)
**Why OpenTelemetry:** Vendor-neutral, future-proof, single instrumentation for all signals
## Your First Observability Setup
**Goal:** Metrics + alerting in one week
**Day 1-2: Instrument application**
```python
# Add OpenTelemetry
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
# Initialize
meter_provider = MeterProvider(
metric_readers=[PrometheusMetricReader()]
)
metrics.set_meter_provider(meter_provider)
# Instrument HTTP framework (auto-instrumentation)
from opentelemetry.instrumentation.flask import FlaskInstrumentor
FlaskInstrumentor().instrument_app(app)
```
**Day 3-4: Deploy Prometheus + Grafana**
```yaml
# docker-compose.yml
version: '3'
services:
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
image: grafana/grafana
ports:
- "3000:3000"
```
**Day 5: Define SLIs and SLOs**
```
SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = 43 minutes downtime/month
```
**Day 6: Create alerts**
```yaml
# prometheus-alerts.yml
groups:
- name: slo_alerts
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
annotations:
summary: "Error rate >5% on {{$labels.service}}"
runbook: "https://wiki/runbooks/high-error-rate"
```
**Day 7: Build dashboard**
**Panels to include:**
- Error rate (%)
- Request rate (req/s)
- p50/p95/p99 latency
- CPU/memory utilization
## Common Mistakes
### ❌ Logging in Production == Debugging in Production
**Fix:** Use structured logging with correlation IDs, not print statements
---
### ❌ Alerting on Predictions, Not Reality
**Fix:** Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)
---
### ❌ Dashboard Sprawl
**Fix:** One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.
---
### ❌ Ignoring Alert Feedback Loop
**Fix:** Track alert precision (% that led to action). Delete alerts with <50% precision.
## Quick Reference
**Getting Started:**
- Start with metrics (Prometheus + OpenTelemetry)
- Add logs when debugging is hard (Loki)
- Add traces when issues span services (Jaeger)
**SLI Selection:**
- User-facing: Availability, latency, error rate
- Background: Freshness, success rate, processing time
**SLO Targets:**
- Start with 99% (achievable)
- Increase to 99.9% only if business requires it
- 99.99% is very expensive (4 nines = 52 min/year downtime)
**Alerting:**
- Critical only = page
- Warning = next business day
- Info = dashboard only
**Cost Control:**
- Sample traces (1-10%)
- Control metric cardinality (no unbounded labels)
- Set retention policies (7-30 days logs, 15 days metrics)
**Tools:**
- Small: Prometheus + Grafana + Loki
- Medium: Add Jaeger
- Large: Consider Datadog, Grafana Cloud
## Bottom Line
**Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.**
Measure what users care about, not what's easy to measure.

View File

@@ -0,0 +1,242 @@
---
name: performance-testing-fundamentals
description: Use when starting performance testing, choosing load testing tools, interpreting performance metrics, debugging slow applications, or establishing performance baselines - provides decision frameworks and anti-patterns for load, stress, spike, and soak testing
---
# Performance Testing Fundamentals
## Overview
**Core principle:** Diagnose first, test second. Performance testing without understanding your bottlenecks wastes time.
**Rule:** Define SLAs before testing. You can't judge "good" performance without requirements.
## When NOT to Performance Test
Performance test only AFTER:
- ✅ Defining performance SLAs (latency, throughput, error rate targets)
- ✅ Profiling current bottlenecks (APM, database logs, profiling)
- ✅ Fixing obvious issues (missing indexes, N+1 queries, inefficient algorithms)
**Don't performance test to find problems** - use profiling/APM for that. Performance test to verify fixes and validate capacity.
## Tool Selection Decision Tree
| Your Constraint | Choose | Why |
|----------------|--------|-----|
| CI/CD integration, JavaScript team | **k6** | Modern, code-as-config, easy CI integration |
| Complex scenarios, enterprise, mature ecosystem | **JMeter** | GUI, plugins, every protocol |
| High throughput (10k+ RPS), Scala team | **Gatling** | Built for scale, excellent reports |
| Quick HTTP benchmark, no complex scenarios | **Apache Bench (ab)** or **wrk** | Command-line, no setup |
| Cloud-based, don't want infrastructure | **BlazeMeter**, **Loader.io** | SaaS, pay-per-use |
| Realistic browser testing (JS rendering) | **Playwright** + **k6** | Hybrid: Playwright for UX, k6 for load |
**For most teams:** k6 (modern, scriptable) or JMeter (mature, GUI)
## Test Type Quick Reference
| Test Type | Purpose | Duration | Load Pattern | Use When |
|-----------|---------|----------|--------------|----------|
| **Load Test** | Verify normal operations under expected load | 15-30 min | Steady (ramp to target, sustain) | Baseline validation, regression testing |
| **Stress Test** | Find breaking point | 5-15 min | Increasing (ramp until failure) | Capacity planning, finding limits |
| **Spike Test** | Test sudden traffic surge | 2-5 min | Instant jump (0 → peak) | Black Friday prep, auto-scaling validation |
| **Soak Test** | Find memory leaks, connection pool exhaustion | 2-8 hours | Steady sustained load | Pre-production validation, stability check |
**Start with Load Test** (validates baseline), then Stress/Spike (finds limits), finally Soak (validates stability).
## Anti-Patterns Catalog
### ❌ Premature Load Testing
**Symptom:** "App is slow, let's load test it"
**Why bad:** Load testing reveals "it's slow under load" but not WHY or WHERE
**Fix:** Profile first (APM, database slow query logs, profiler), fix obvious bottlenecks, THEN load test to validate
---
### ❌ Testing Without SLAs
**Symptom:** "My API handles 100 RPS with 200ms average latency. Is that good?"
**Why bad:** Can't judge "good" without requirements. A gaming API needs <50ms; batch processing tolerates 2s.
**Fix:** Define SLAs first:
- Target latency: P95 < 300ms, P99 < 500ms
- Target throughput: 500 RPS at peak
- Max error rate: < 0.1%
---
### ❌ Unrealistic SLAs
**Symptom:** "Our database-backed CRUD API with complex joins must have P95 < 10ms"
**Why bad:** Sets impossible targets. Database round-trip alone is often 5-20ms. Forces wasted optimization or architectural rewrites.
**Fix:** Compare against Performance Benchmarks table (see below). If target is 10x better than benchmark, profile current performance first, then negotiate realistic SLA based on what's achievable vs cost of optimization.
---
### ❌ Vanity Metrics
**Symptom:** Reporting only average response time
**Why bad:** Average hides tail latency. 99% of requests at 100ms + 1% at 10s = "average 200ms" looks fine, but users experience 10s delays.
**Fix:** Always report percentiles:
- P50 (median) - typical user experience
- P95 - most users
- P99 - worst-case for significant minority
- Max - outliers
---
### ❌ Load Testing in Production First
**Symptom:** "Let's test capacity by running load tests against production"
**Why bad:** Risks outages, contaminates real metrics, can trigger alerts/costs
**Fix:** Test in staging environment that mirrors production (same DB size, network latency, resource limits)
---
### ❌ Single-User "Load" Tests
**Symptom:** Running one user hitting the API as fast as possible
**Why bad:** Doesn't simulate realistic concurrency, misses resource contention (database connections, thread pools)
**Fix:** Simulate realistic concurrent users with realistic think time between requests
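A common way to size that concurrency is Little's Law: concurrent users ≈ arrival rate * (response time + think time). A quick sketch with purely illustrative numbers:
```python
# Little's Law back-of-envelope for load test sizing
arrival_rate_rps = 200   # expected peak requests per second
response_time_s = 0.3    # average server response time
think_time_s = 5.0       # average pause between one user's requests

concurrent_users = arrival_rate_rps * (response_time_s + think_time_s)
print(f"Simulate roughly {concurrent_users:.0f} concurrent users")  # ~1060
```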
## Metrics Glossary
| Metric | Definition | Good Threshold (typical web API) |
|--------|------------|----------------------------------|
| **RPS** (Requests/Second) | Throughput - how many requests processed | Varies by app; know your peak |
| **Latency** | Time from request to response | P95 < 300ms, P99 < 500ms |
| **P50 (Median)** | 50% of requests faster than this | P50 < 100ms |
| **P95** | 95% of requests faster than this | P95 < 300ms |
| **P99** | 99% of requests faster than this | P99 < 500ms |
| **Error Rate** | % of 4xx/5xx responses | < 0.1% |
| **Throughput** | Data transferred per second (MB/s) | Depends on payload size |
| **Concurrent Users** | Active users at same time | Calculate from traffic patterns |
**Focus on P95/P99, not average.** Tail latency kills user experience.
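A small sketch of why this matters, using only the standard library (the latency sample is synthetic):
```python
import random
import statistics

# Synthetic sample: 99% of requests around 100 ms, 1% stuck at 10 s
latencies_ms = [random.gauss(100, 10) for _ in range(990)] + [10_000] * 10

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points = 1st..99th percentile
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f}ms "
      f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The mean (~200ms) looks healthy; p99 (~10s) exposes the delays users actually hit
```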
## Diagnostic-First Workflow
Before load testing slow applications, follow this workflow:
**Step 1: Measure Current State**
- Install APM (DataDog, New Relic, Grafana) or logging
- Identify slowest 10 endpoints/operations
- Check database slow query logs
**Step 2: Common Quick Wins** (90% of performance issues)
- Missing database indexes
- N+1 query problem
- Unoptimized images/assets
- Missing caching (Redis, CDN)
- Synchronous operations that should be async
- Inefficient serialization (JSON parsing bottlenecks)
**Step 3: Profile Specific Bottleneck**
- Use profiler to see CPU/memory hotspots
- Trace requests to find where time is spent (DB? external API? computation?)
- Check for resource limits (max connections, thread pool exhaustion)
**Step 4: Fix and Measure**
- Apply fix (add index, cache layer, async processing)
- Measure improvement in production
- Document before/after metrics
**Step 5: THEN Load Test** (if needed)
- Validate fixes handle expected load
- Find new capacity limits
- Establish regression baseline
**Anti-pattern to avoid:** Skipping to Step 5 without Steps 1-4.
## Performance Benchmarks (Reference)
What "good" looks like by application type:
| Application Type | Typical P95 Latency | Typical Throughput | Notes |
|------------------|---------------------|-------------------|-------|
| **REST API (CRUD)** | < 200ms | 500-2000 RPS | Database-backed, simple queries |
| **Search API** | < 500ms | 100-500 RPS | Complex queries, ranking algorithms |
| **Payment Gateway** | < 1s | 50-200 RPS | External service calls, strict consistency |
| **Real-time Gaming** | < 50ms | 1000-10000 RPS | Low latency critical |
| **Batch Processing** | 2-10s/job | 10-100 jobs/min | Throughput > latency |
| **Static CDN** | < 100ms | 10000+ RPS | Edge-cached, minimal computation |
**Use as rough guide, not absolute targets.** Your SLAs depend on user needs.
## Results Interpretation Framework
After running a load test:
**Pass Criteria:**
- ✅ All requests meet latency SLA (e.g., P95 < 300ms)
- ✅ Error rate under threshold (< 0.1%)
- ✅ No resource exhaustion (CPU < 80%, memory stable, no connection pool saturation)
- ✅ Sustained load for test duration without degradation
**Fail Criteria:**
- ❌ Latency exceeds SLA
- ❌ Error rate spikes
- ❌ Gradual degradation over time (memory leak, connection leak)
- ❌ Resource exhaustion (CPU pegged, OOM errors)
**Next Steps:**
- **If passing:** Establish this as regression baseline, run periodically in CI
- **If failing:** Profile to find bottleneck, optimize, re-test
- **If borderline:** Test at higher load (stress test) to find safety margin
## Common Mistakes
### ❌ Not Ramping Load Gradually
**Symptom:** Instant 0 → 1000 users, everything fails
**Fix:** Ramp over 2-5 minutes to let auto-scaling/caching warm up (except spike tests, where instant jump is the point)
---
### ❌ Testing With Empty Database
**Symptom:** Tests pass with 100 records, fail with 1M records in production
**Fix:** Seed staging database with production-scale data
---
### ❌ Ignoring External Dependencies
**Symptom:** Your API is fast, but third-party payment gateway times out under load
**Fix:** Include external service latency in SLAs, or mock them for isolated API testing
## Quick Reference
**Getting Started Checklist:**
1. Define SLAs (latency P95/P99, throughput, error rate)
2. Choose tool (k6 or JMeter for most cases)
3. Start with Load Test (baseline validation)
4. Run Stress Test (find capacity limits)
5. Establish regression baseline
6. Run in CI on major changes
**When Debugging Slow App:**
1. Profile first (APM, database logs)
2. Fix obvious issues (indexes, N+1, caching)
3. Measure improvement
4. THEN load test to validate
**Interpreting Results:**
- Report P95/P99, not just average
- Compare against SLAs
- Check for resource exhaustion
- Look for degradation over time (soak tests)
## Bottom Line
**Performance testing validates capacity and catches regressions.**
**Profiling finds bottlenecks.**
Don't confuse the two - diagnose first, test second.

View File

@@ -0,0 +1,504 @@
---
name: property-based-testing
description: Use when testing invariants, validating properties across many inputs, using Hypothesis (Python) or fast-check (JavaScript), defining test strategies, handling shrinking, or finding edge cases - provides property definition patterns and integration strategies
---
# Property-Based Testing
## Overview
**Core principle:** Instead of testing specific examples, test properties that should hold for all inputs.
**Rule:** Property-based tests generate hundreds of inputs automatically. One property test replaces dozens of example tests.
## Property-Based vs Example-Based Testing
| Aspect | Example-Based | Property-Based |
|--------|---------------|----------------|
| **Test input** | Hardcoded examples | Generated inputs |
| **Coverage** | Few specific cases | Hundreds of random cases |
| **Maintenance** | Add new examples manually | Properties automatically tested |
| **Edge cases** | Must think of them | Automatically discovered |
**Example:**
```python
# Example-based: Test 3 specific inputs
def test_reverse():
assert reverse([1, 2, 3]) == [3, 2, 1]
assert reverse([]) == []
assert reverse([1]) == [1]
# Property-based: Test ALL inputs
@given(lists(integers()))
def test_reverse_property(lst):
# Property: Reversing twice returns original
assert reverse(reverse(lst)) == lst
```
---
## Tool Selection
| Language | Tool | Why |
|----------|------|-----|
| **Python** | **Hypothesis** | Most mature, excellent shrinking |
| **JavaScript/TypeScript** | **fast-check** | TypeScript support, good integration |
| **Java** | **jqwik** | JUnit 5 integration |
| **Haskell** | **QuickCheck** | Original property-based testing library |
**First choice:** Hypothesis (Python) or fast-check (JavaScript)
---
## Basic Property Test (Python + Hypothesis)
### Installation
```bash
pip install hypothesis
```
---
### Example
```python
from hypothesis import given
from hypothesis.strategies import integers, lists
def reverse(lst):
"""Reverse a list."""
return lst[::-1]
@given(lists(integers()))
def test_reverse_twice(lst):
"""Property: Reversing twice returns original."""
assert reverse(reverse(lst)) == lst
```
**Run:**
```bash
pytest test_reverse.py
```
**Output:**
```
Trying example: lst=[]
Trying example: lst=[0]
Trying example: lst=[1, -2, 3]
... (100 examples tested)
PASSED
```
**If property fails:**
```
Falsifying example: lst=[0, 0, 1]
```
---
## Common Properties
### 1. Inverse Functions
**Property:** `f(g(x)) == x`
```python
from hypothesis import given
from hypothesis.strategies import text
@given(text())
def test_encode_decode(s):
"""Property: Decoding encoded string returns original."""
assert decode(encode(s)) == s
```
---
### 2. Idempotence
**Property:** `f(f(x)) == f(x)`
```python
@given(lists(integers()))
def test_sort_idempotent(lst):
"""Property: Sorting twice gives same result as sorting once."""
assert sorted(sorted(lst)) == sorted(lst)
```
---
### 3. Invariants
**Property:** Some fact remains true after operation
```python
@given(lists(integers()))
def test_reverse_length(lst):
"""Property: Reversing doesn't change length."""
assert len(reverse(lst)) == len(lst)
@given(lists(integers()))
def test_reverse_elements(lst):
"""Property: Reversing doesn't change elements."""
assert set(reverse(lst)) == set(lst)
```
---
### 4. Commutativity
**Property:** `f(x, y) == f(y, x)`
```python
@given(integers(), integers())
def test_addition_commutative(a, b):
"""Property: Addition is commutative."""
assert a + b == b + a
```
---
### 5. Associativity
**Property:** `f(f(x, y), z) == f(x, f(y, z))`
```python
@given(integers(), integers(), integers())
def test_addition_associative(a, b, c):
"""Property: Addition is associative."""
assert (a + b) + c == a + (b + c)
```
---
## Test Strategies (Generating Inputs)
### Built-In Strategies
```python
from hypothesis.strategies import (
integers,
floats,
text,
lists,
dictionaries,
booleans,
)
@given(integers())
def test_with_int(x):
pass
@given(integers(min_value=0, max_value=100))
def test_with_bounded_int(x):
pass
@given(text(min_size=1, max_size=10))
def test_with_short_string(s):
pass
@given(lists(integers(), min_size=1))
def test_with_nonempty_list(lst):
pass
```
---
### Composite Strategies
**Generate complex objects:**
```python
from hypothesis import strategies as st
from hypothesis.strategies import composite
@composite
def users(draw):
"""Generate user objects."""
return {
"name": draw(st.text(min_size=1, max_size=50)),
"age": draw(st.integers(min_value=0, max_value=120)),
"email": draw(st.emails()),
}
@given(users())
def test_user_validation(user):
assert validate_user(user) is True
```
---
### Filtering Strategies
**Exclude invalid inputs:**
```python
@given(integers().filter(lambda x: x != 0))
def test_division(x):
"""Test division (x != 0)."""
assert 10 / x == 10 / x
# Better: Use assume
from hypothesis import assume
@given(integers())
def test_division_better(x):
assume(x != 0)
assert 10 / x == 10 / x
```
---
## Shrinking (Finding Minimal Failing Example)
**When a property fails, Hypothesis automatically shrinks the input to the smallest failing case.**
**Example:**
```python
@given(lists(integers()))
def test_all_positive(lst):
"""Fails if any negative number."""
assert all(x > 0 for x in lst)
```
**Initial failure:**
```
Falsifying example: lst=[-5, 3, -2, 0, 1, 7, -9]
```
**After shrinking:**
```
Falsifying example: lst=[-1]
```
**Why it matters:** Minimal examples are easier to debug
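Once the bug is fixed, the shrunk case can be pinned with Hypothesis's `@example` decorator so it is re-checked on every run (the pinned values below are hypothetical):
```python
from hypothesis import example, given
from hypothesis.strategies import integers, lists

@given(lists(integers()))
@example([])          # always retest the empty list
@example([-1, 0, 1])  # shrunk counterexample from a past failure, kept as a regression check
def test_reverse_twice(lst):
    assert reverse(reverse(lst)) == lst  # reverse() as defined earlier
```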
---
## Integration with pytest
```python
# test_properties.py
from hypothesis import given, settings
from hypothesis.strategies import integers
@settings(max_examples=1000) # Run 1000 examples (default: 100)
@given(integers(min_value=1))
def test_factorial_positive(n):
"""Property: Factorial of positive number is positive."""
assert factorial(n) > 0
```
**Run:**
```bash
pytest test_properties.py -v
```
---
## JavaScript Example (fast-check)
### Installation
```bash
npm install --save-dev fast-check
```
---
### Example
```javascript
import fc from 'fast-check';
function reverse(arr) {
return arr.slice().reverse();
}
// Property: Reversing twice returns original
test('reverse twice', () => {
fc.assert(
fc.property(fc.array(fc.integer()), (arr) => {
expect(reverse(reverse(arr))).toEqual(arr);
})
);
});
```
---
## Advanced Patterns
### Stateful Testing
**Test state machines:**
```python
from hypothesis.stateful import RuleBasedStateMachine, rule
from hypothesis.strategies import integers
class QueueMachine(RuleBasedStateMachine):
def __init__(self):
super().__init__()
self.queue = []
self.model = []
@rule(value=integers())
def enqueue(self, value):
self.queue.append(value)
self.model.append(value)
@rule()
def dequeue(self):
if self.queue:
actual = self.queue.pop(0)
expected = self.model.pop(0)
assert actual == expected
TestQueue = QueueMachine.TestCase
```
**Finds:** Race conditions, state corruption, invalid state transitions
---
## Anti-Patterns Catalog
### ❌ Testing Examples, Not Properties
**Symptom:** Property test with hardcoded checks
```python
# ❌ BAD: Not a property
@given(integers())
def test_double(x):
if x == 2:
assert double(x) == 4
elif x == 3:
assert double(x) == 6
# This is just example testing!
```
**Fix:** Test actual property
```python
# ✅ GOOD: Real property
@given(integers())
def test_double(x):
assert double(x) == x * 2
```
---
### ❌ Overly Restrictive Assumptions
**Symptom:** Filtering out most generated inputs
```python
# ❌ BAD: Rejects 99% of inputs
@given(integers())
def test_specific_range(x):
    assume(1000 <= x <= 1001)  # Rejects nearly every generated integer!
assert process(x) is not None
```
**Fix:** Use strategy bounds
```python
# ✅ GOOD
@given(integers(min_value=1000, max_value=1001))
def test_specific_range(x):
assert process(x) is not None
```
---
### ❌ No Assertions
**Symptom:** Property test that doesn't assert anything
```python
# ❌ BAD: No assertion
@given(integers())
def test_no_crash(x):
calculate(x) # Just checks it doesn't crash
```
**Fix:** Assert a property
```python
# ✅ GOOD
@given(integers())
def test_output_type(x):
result = calculate(x)
assert isinstance(result, int)
```
---
## CI/CD Integration
```yaml
# .github/workflows/property-tests.yml
name: Property Tests
on: [pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: pip install hypothesis pytest
- name: Run property tests
run: pytest tests/properties/ -v --hypothesis-show-statistics
```
---
## Quick Reference: Property Patterns
| Pattern | Example Property |
|---------|------------------|
| **Inverse** | `decode(encode(x)) == x` |
| **Idempotence** | `f(f(x)) == f(x)` |
| **Invariant** | `len(filter(lst, f)) <= len(lst)` |
| **Commutativity** | `add(a, b) == add(b, a)` |
| **Associativity** | `(a + b) + c == a + (b + c)` |
| **Identity** | `x + 0 == x` |
| **Consistency** | `sort(lst)[0] <= sort(lst)[-1]` |
---
## Bottom Line
**Property-based testing generates hundreds of inputs automatically to test properties that should hold for all inputs. One property test replaces dozens of example tests.**
**Use for:**
- Pure functions (no side effects)
- Data transformations
- Invariants (sorting, reversing, encoding/decoding)
- State machines
**Tools:**
- Hypothesis (Python) - most mature
- fast-check (JavaScript) - TypeScript support
**Process:**
1. Identify property (e.g., "reversing twice returns original")
2. Write property test with generator
3. Run test (generates 100-1000 examples)
4. If failure, Hypothesis shrinks to minimal example
5. Fix bug, add regression test
**If you're writing tests like "assert reverse([1,2,3]) == [3,2,1]" for every possible input, use property-based testing instead. Test the property, not examples.**

View File

@@ -0,0 +1,448 @@
---
name: quality-metrics-and-kpis
description: Use when setting up quality dashboards, defining test coverage targets, tracking quality trends, configuring CI/CD quality gates, or reporting quality metrics to stakeholders - provides metric selection, threshold strategies, and dashboard design patterns
---
# Quality Metrics & KPIs
## Overview
**Core principle:** Measure what matters. Track trends, not absolutes. Use metrics to drive action, not for vanity.
**Rule:** Every metric must have a defined threshold and action plan. If a metric doesn't change behavior, stop tracking it.
## Quality Metrics vs Vanity Metrics
| Type | Example | Problem | Better Metric |
|------|---------|---------|---------------|
| **Vanity** | "We have 10,000 tests!" | Doesn't indicate quality | Pass rate, flakiness rate |
| **Vanity** | "95% code coverage!" | Can be gamed, doesn't mean tests are good | Coverage delta (new code), mutation score |
| **Actionable** | "Test flakiness: 5% → 2%" | Drives action | Track trend, set target |
| **Actionable** | "P95 build time: 15 min" | Identifies bottleneck | Optimize slow tests |
**Actionable metrics answer:** "What should I fix next?"
---
## Core Quality Metrics
### 1. Test Pass Rate
**Definition:** % of tests that pass on first run
```
Pass Rate = (Passing Tests / Total Tests) × 100
```
**Thresholds:**
- **> 98%:** Healthy
- **95-98%:** Investigate failures
- **< 95%:** Critical (tests are unreliable)
**Why it matters:** Low pass rate means flaky tests or broken code
**Action:** If < 98%, run flaky-test-prevention skill
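One lightweight way to compute it in CI is from the JUnit XML report pytest can emit with `--junitxml=report.xml`; a rough sketch assuming the standard `tests`/`failures`/`errors` attributes:
```python
import xml.etree.ElementTree as ET

root = ET.parse("report.xml").getroot()
suite = root[0] if root.tag == "testsuites" else root  # pytest nests a single <testsuite>

total = int(suite.get("tests", 0))
failed = int(suite.get("failures", 0)) + int(suite.get("errors", 0))
pass_rate = 100 * (total - failed) / total if total else 100.0
print(f"First-run pass rate: {pass_rate:.1f}%")
```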
---
### 2. Test Flakiness Rate
**Definition:** % of tests that fail intermittently
```
Flakiness Rate = (Flaky Tests / Total Tests) × 100
```
**How to measure:**
```bash
# Run each test 100 times (requires the pytest-repeat plugin)
pytest --count=100 test_checkout.py
# Flaky if passes 1-99 times (not 0 or 100)
```
**Thresholds:**
- **< 1%:** Healthy
- **1-5%:** Moderate (fix soon)
- **> 5%:** Critical (CI is unreliable)
**Action:** Fix flaky tests before adding new tests
---
### 3. Code Coverage
**Definition:** % of code lines executed by tests
```
Coverage = (Executed Lines / Total Lines) × 100
```
**Thresholds (by test type):**
- **Unit tests:** 80-90% of business logic
- **Integration tests:** 60-70% of integration points
- **E2E tests:** 40-50% of critical paths
**Configuration (pytest):**
```ini
# .coveragerc
[run]
source = src
omit = */tests/*, */migrations/*
[report]
fail_under = 80 # Fail if coverage < 80%
show_missing = True
```
**Anti-pattern:** 100% coverage as goal
**Why it's wrong:** Easy to game (tests that execute code without asserting anything)
**Better metric:** Coverage + mutation score (see mutation-testing skill)
---
### 4. Coverage Delta (New Code)
**Definition:** Coverage of newly added code
**Why it matters:** More actionable than absolute coverage
```bash
# Run the suite, but report coverage only for files changed on this branch
pytest --cov-report=term-missing \
  $(git diff --name-only origin/main...HEAD | grep '\.py$' | sed 's|^|--cov=|')
```
**Threshold:** 90% for new code (stricter than legacy)
**Action:** Block PR if new code coverage < 90%
---
### 5. Build Time (CI/CD)
**Definition:** Time from commit to merge-ready
**Track by stage:**
- **Lint/format:** < 30s
- **Unit tests:** < 2 min
- **Integration tests:** < 5 min
- **E2E tests:** < 15 min
- **Total PR pipeline:** < 20 min
**Why it matters:** Slow CI blocks developer productivity
**Action:** If build > 20 min, see test-automation-architecture for optimization patterns
---
### 6. Test Execution Time Trend
**Definition:** How test suite duration changes over time
```python
# Track in CI
import json
import pytest
import time
start = time.time()
pytest.main()
duration = time.time() - start
metrics = {"test_duration_seconds": duration, "timestamp": time.time()}
with open("metrics.json", "w") as f:
json.dump(metrics, f)
```
**Threshold:** < 5% growth per month
**Action:** If growth > 5%/month, parallelize tests or refactor slow tests
---
### 7. Defect Escape Rate
**Definition:** Bugs found in production that should have been caught by tests
```
Defect Escape Rate = (Production Bugs / Total Releases) × 100
```
**Thresholds:**
- **< 2%:** Excellent
- **2-5%:** Acceptable
- **> 5%:** Tests are missing critical scenarios
**Action:** For each escape, write regression test to prevent recurrence
---
### 8. Mean Time to Detection (MTTD)
**Definition:** Time from bug introduction to discovery
```
MTTD = Deployment Time - Bug Introduction Time
```
**Thresholds:**
- **< 1 hour:** Excellent (caught in CI)
- **1-24 hours:** Good (caught in staging/canary)
- **> 24 hours:** Poor (caught in production)
**Action:** If MTTD > 24h, improve observability (see observability-and-monitoring skill)
---
### 9. Mean Time to Recovery (MTTR)
**Definition:** Time from bug detection to fix deployed
```
MTTR = Fix Deployment Time - Bug Detection Time
```
**Thresholds:**
- **< 1 hour:** Excellent
- **1-8 hours:** Acceptable
- **> 8 hours:** Poor
**Action:** If MTTR > 8h, improve rollback procedures (see testing-in-production skill)
---
## Dashboard Design
### Grafana Dashboard Example (`grafana-dashboard.json`, excerpt)
```json
{
"panels": [
{
"title": "Test Pass Rate (7 days)",
"targets": [{
"expr": "sum(tests_passed) / sum(tests_total) * 100"
}],
"thresholds": [
{"value": 95, "color": "red"},
{"value": 98, "color": "yellow"},
{"value": 100, "color": "green"}
]
},
{
"title": "Build Time Trend (30 days)",
"targets": [{
"expr": "avg_over_time(ci_build_duration_seconds[30d])"
}]
},
{
"title": "Coverage Delta (per PR)",
"targets": [{
"expr": "coverage_new_code_percent"
}],
"thresholds": [
{"value": 90, "color": "green"},
{"value": 80, "color": "yellow"},
{"value": 0, "color": "red"}
]
}
]
}
```
---
### CI/CD Quality Gates
**GitHub Actions example:**
```yaml
# .github/workflows/quality-gates.yml
name: Quality Gates
on: [pull_request]
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run tests with coverage
run: pytest --cov=src --cov-report=json
- name: Check coverage threshold
run: |
COVERAGE=$(jq '.totals.percent_covered' coverage.json)
if (( $(echo "$COVERAGE < 80" | bc -l) )); then
echo "Coverage $COVERAGE% below 80% threshold"
exit 1
fi
- name: Check build time
run: |
DURATION=$(jq '.duration' test-results.json)
if (( $(echo "$DURATION > 300" | bc -l) )); then
echo "Build time ${DURATION}s exceeds 5-minute threshold"
exit 1
fi
```
---
## Reporting Patterns
### Weekly Quality Report
**Template:**
```markdown
# Quality Report - Week of 2025-01-13
## Summary
- **Test pass rate:** 98.5% (+0.5% from last week)
- **Flakiness rate:** 2.1% (-1.3% from last week) ✅
- **Coverage:** 85.2% (+2.1% from last week) ✅
- **Build time:** 18 min (-2 min from last week) ✅
## Actions Taken
- Fixed 8 flaky tests in checkout flow
- Added integration tests for payment service (+5% coverage)
- Parallelized slow E2E tests (reduced build time by 2 min)
## Action Items
- [ ] Fix remaining 3 flaky tests in user registration
- [ ] Increase coverage of order service (currently 72%)
- [ ] Investigate why staging MTTD increased to 4 hours
```
---
### Stakeholder Dashboard (Executive View)
**Metrics to show:**
1. **Quality trend (6 months):** Pass rate over time
2. **Velocity impact:** How long does CI take per PR?
3. **Production stability:** Defect escape rate
4. **Recovery time:** MTTR for incidents
**What NOT to show:**
- Absolute test count (vanity metric)
- Lines of code (meaningless)
- Individual developer metrics (creates wrong incentives)
---
## Anti-Patterns Catalog
### ❌ Coverage as the Only Metric
**Symptom:** "We need 100% coverage!"
**Why bad:** Easy to game with meaningless tests
```python
# ❌ BAD: 100% coverage, 0% value
def calculate_tax(amount):
return amount * 0.08
def test_calculate_tax():
calculate_tax(100) # Executes code, asserts nothing!
```
**Fix:** Use coverage + mutation score
---
### ❌ Tracking Metrics Without Thresholds
**Symptom:** Dashboard shows metrics but no action taken
**Why bad:** Metrics become noise
**Fix:** Every metric needs:
- **Target threshold** (e.g., flakiness < 1%)
- **Alert level** (e.g., alert if flakiness > 5%)
- **Action plan** (e.g., "Fix flaky tests before adding new features")
---
### ❌ Optimizing for Metrics, Not Quality
**Symptom:** Gaming metrics to hit targets
**Example:** Removing tests to increase pass rate
**Fix:** Track multiple complementary metrics (pass rate + flakiness + coverage)
---
### ❌ Measuring Individual Developer Productivity
**Symptom:** "Developer A writes more tests than Developer B"
**Why bad:** Creates wrong incentives (quantity over quality)
**Fix:** Measure team metrics, not individual
---
## Tool Integration
### SonarQube Metrics
**Quality Gate:**
```properties
# sonar-project.properties
sonar.qualitygate.wait=true
# Metrics tracked:
# - Bugs (target: 0)
# - Vulnerabilities (target: 0)
# - Code smells (target: < 100)
# - Coverage (target: > 80%)
# - Duplications (target: < 3%)
```
---
### Codecov Integration
```yaml
# codecov.yml
coverage:
status:
project:
default:
target: 80% # Overall coverage target
threshold: 2% # Allow 2% drop
patch:
default:
target: 90% # New code must have 90% coverage
threshold: 0% # No drops allowed
```
---
## Bottom Line
**Track actionable metrics with defined thresholds. Use metrics to drive improvement, not for vanity.**
**Core dashboard:**
- Test pass rate (> 98%)
- Flakiness rate (< 1%)
- Coverage delta on new code (> 90%)
- Build time (< 20 min)
- Defect escape rate (< 2%)
**Weekly actions:**
- Review metrics against thresholds
- Identify trends (improving/degrading)
- Create action items for violations
- Track progress on improvements
**If you're tracking a metric but not acting on it, stop tracking it. Metrics exist to drive action, not to fill dashboards.**

View File

@@ -0,0 +1,521 @@
---
name: static-analysis-integration
description: Use when integrating SAST tools (SonarQube, ESLint, Pylint, Checkstyle), setting up security scanning, configuring code quality gates, managing false positives, or building CI/CD quality pipelines - provides tool selection, configuration patterns, and quality threshold strategies
---
# Static Analysis Integration
## Overview
**Core principle:** Static analysis catches bugs, security vulnerabilities, and code quality issues before code review. Automate it in CI/CD.
**Rule:** Block merges on critical issues, warn on moderate issues, ignore noise. Configure thresholds carefully.
## Static Analysis vs Other Quality Checks
| Check Type | When | What It Finds | Speed |
|------------|------|---------------|-------|
| **Static Analysis** | Pre-commit/PR | Bugs, security, style | Fast (seconds) |
| **Unit Tests** | Every commit | Logic errors | Fast (seconds) |
| **Integration Tests** | PR | Integration bugs | Medium (minutes) |
| **Security Scanning** | PR/Nightly | Dependencies, secrets | Medium (minutes) |
| **Manual Code Review** | PR | Design, readability | Slow (hours) |
**Static analysis finds:** Null pointer bugs, SQL injection, unused variables, complexity issues
**Static analysis does NOT find:** Business logic errors, performance issues (use profiling)
---
## Tool Selection by Language
### Python
| Tool | Purpose | When to Use |
|------|---------|-------------|
| **Pylint** | Code quality, style, bugs | General-purpose, comprehensive |
| **Flake8** | Style, simple bugs | Faster than Pylint, less strict |
| **mypy** | Type checking | Type-safe codebases |
| **Bandit** | Security vulnerabilities | Security-critical code |
| **Black** | Code formatting | Enforce consistent style |
**Recommended combo:** Black (formatting) + Flake8 (linting) + mypy (types) + Bandit (security)
---
### JavaScript/TypeScript
| Tool | Purpose | When to Use |
|------|---------|-------------|
| **ESLint** | Code quality, style, bugs | All JavaScript projects |
| **TypeScript** | Type checking | Type-safe codebases |
| **Prettier** | Code formatting | Enforce consistent style |
| **SonarQube** | Security, bugs, code smells | Enterprise, comprehensive |
**Recommended combo:** Prettier (formatting) + ESLint (linting) + TypeScript (types)
---
### Java
| Tool | Purpose | When to Use |
|------|---------|-------------|
| **Checkstyle** | Code style | Enforce coding standards |
| **PMD** | Bug detection, code smells | General-purpose |
| **SpotBugs** | Bug detection | Bytecode analysis |
| **SonarQube** | Comprehensive analysis | Enterprise, dashboards |
**Recommended combo:** Checkstyle (style) + SpotBugs (bugs) + SonarQube (comprehensive)
---
## Configuration Patterns
### ESLint Configuration (JavaScript)
```javascript
// .eslintrc.js
module.exports = {
extends: [
'eslint:recommended',
'plugin:@typescript-eslint/recommended',
'plugin:security/recommended'
],
rules: {
// Error: Block merge
'no-console': 'error',
'no-debugger': 'error',
'@typescript-eslint/no-explicit-any': 'error',
// Warning: Allow merge, but warn
'complexity': ['warn', 10],
'max-lines': ['warn', 500],
// Off: Too noisy
'no-unused-vars': 'off', // TypeScript handles this
}
};
```
**Run in CI:**
```bash
eslint src/ --max-warnings 0 # Fail if any warnings
```
---
### Pylint Configuration (Python)
```ini
# .pylintrc
[MESSAGES CONTROL]
disable=
missing-docstring, # Too noisy for small projects
too-few-public-methods, # Design choice
logging-fstring-interpolation # False positives
[DESIGN]
max-line-length=100
max-args=7
max-locals=15
[BASIC]
good-names=i,j,k,_,id,db,pk
```
**Run in CI:**
```bash
pylint src/ --fail-under=8.0 # Minimum score 8.0/10
```
---
### SonarQube Quality Gates
```properties
# sonar-project.properties
sonar.projectKey=my-project
sonar.sources=src
sonar.tests=tests
# Quality gate thresholds
sonar.qualitygate.wait=true
sonar.coverage.exclusions=**/*_test.py,**/migrations/**
# Fail conditions
sonar.qualitygate.timeout=300
```
**Quality Gate Criteria:**
- **Blocker/Critical issues:** 0 (block merge)
- **Major issues:** < 5 (block merge)
- **Code coverage:** > 80% (warn if lower)
- **Duplicated lines:** < 3%
- **Maintainability rating:** A or B
---
## CI/CD Integration
### GitHub Actions (Python)
```yaml
# .github/workflows/static-analysis.yml
name: Static Analysis
on: [pull_request]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install pylint flake8 mypy bandit black
- name: Check formatting
run: black --check src/
- name: Run Flake8
run: flake8 src/ --max-line-length=100
- name: Run Pylint
run: pylint src/ --fail-under=8.0
- name: Run mypy
run: mypy src/ --strict
- name: Run Bandit (security)
run: bandit -r src/ -ll # Only high severity
```
---
### GitHub Actions (JavaScript)
```yaml
# .github/workflows/static-analysis.yml
name: Static Analysis
on: [pull_request]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Node
uses: actions/setup-node@v3
with:
node-version: '18'
- name: Install dependencies
run: npm ci
- name: Check formatting
run: npm run format:check # prettier --check
- name: Run ESLint
run: npm run lint # eslint --max-warnings 0
- name: Run TypeScript
run: npm run typecheck # tsc --noEmit
```
---
## Managing False Positives
**Strategy: Suppress selectively, document why**
### Inline Suppression (ESLint)
```javascript
// eslint-disable-next-line no-console
console.log("Debugging production issue"); // TODO: Remove after fix
// Better: Explain WHY
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const legacyData: any = externalLibrary.getData(); // Library has no types
```
---
### File-Level Suppression (Pylint)
```python
# pylint: disable=too-many-arguments
def complex_function(arg1, arg2, arg3, arg4, arg5, arg6, arg7, arg8):
"""Legacy API - cannot change signature for backward compatibility."""
pass
```
---
### Configuration Suppression
```ini
# .pylintrc
[MESSAGES CONTROL]
disable=
fixme, # Allow TODO comments
missing-docstring # Too noisy for this codebase
```
**Rule:** Every suppression needs a comment explaining WHY.
---
## Security-Focused Static Analysis
### Bandit (Python Security)
```yaml
# .bandit.yml
exclude_dirs:
- /tests
- /migrations
tests:
- B201 # Flask debug mode
- B601 # Parameterized shell calls
- B602 # Shell injection
- B608 # SQL injection
```
**Run:**
```bash
bandit -r src/ -ll -x tests/ # Only high/medium severity
```
---
### ESLint Security Plugin (JavaScript)
```javascript
// .eslintrc.js
module.exports = {
plugins: ['security'],
extends: ['plugin:security/recommended'],
rules: {
'security/detect-object-injection': 'error',
'security/detect-non-literal-regexp': 'warn',
'security/detect-unsafe-regex': 'error'
}
};
```
---
## Code Quality Metrics
### Complexity Analysis
**Cyclomatic complexity:** Measures decision paths through code
```python
# Simple function: Complexity = 1
def add(a, b):
return a + b
# Complex function: Complexity = 4 (3 decision points + 1)
def process_order(order):
if order.status == "pending":
return validate(order)
elif order.status == "confirmed":
return ship(order)
elif order.status == "cancelled":
return refund(order)
else:
return reject(order)
```
**Threshold:**
- **< 10:** Acceptable
- **10-20:** Consider refactoring
- **> 20:** Must refactor (untestable)
**Configure:**
```ini
# Pylint (.pylintrc; requires load-plugins=pylint.extensions.mccabe)
[DESIGN]
max-complexity=10
# ESLint
complexity: ['warn', 10]
```
---
### Duplication Detection
**SonarQube duplication threshold:** < 3%
**Find duplicates (Python):**
```bash
pylint src/ --disable=all --enable=duplicate-code
```
**Find duplicates (JavaScript):**
```bash
jscpd src/ # JavaScript Copy/Paste Detector
```
---
## Anti-Patterns Catalog
### ❌ Suppressing All Warnings
**Symptom:** Config disables most rules
```javascript
// ❌ BAD
module.exports = {
rules: {
'no-console': 'off',
'no-debugger': 'off',
'@typescript-eslint/no-explicit-any': 'off',
// ... 50 more disabled rules
}
};
```
**Why bad:** Static analysis becomes useless
**Fix:** Address root causes, suppress selectively
---
### ❌ No Quality Gates
**Symptom:** Static analysis runs but doesn't block merges
```yaml
# ❌ BAD: Linting failures don't block merge
- name: Run ESLint
run: eslint src/ || true # Always succeeds!
```
**Fix:** Fail CI on critical issues
```yaml
# ✅ GOOD
- name: Run ESLint
run: eslint src/ --max-warnings 0
```
---
### ❌ Ignoring Security Warnings
**Symptom:** Security findings marked as false positives without investigation
```python
# ❌ BAD
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}") # nosec
```
**Why bad:** Real SQL injection vulnerability ignored
**Fix:** Fix the issue, don't suppress
```python
# ✅ GOOD
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
```
---
### ❌ Running Static Analysis Only on Main Branch
**Symptom:** Issues discovered after merge
**Fix:** Run on every PR
```yaml
on: [pull_request] # Not just 'push' to main
```
---
## Quality Dashboard Setup
### SonarQube Dashboard
**Key metrics to track:**
1. **Bugs:** Code issues likely to cause failures
2. **Vulnerabilities:** Security issues
3. **Code Smells:** Maintainability issues
4. **Coverage:** Test coverage %
5. **Duplications:** Duplicated code blocks
**Quality Gate Example:**
- Bugs (Blocker/Critical): **0**
- Vulnerabilities (Blocker/Critical): **0**
- Code Smells (Blocker/Critical): **< 5**
- Coverage on new code: **> 80%**
- Duplicated lines on new code: **< 3%**
---
## Gradual Adoption Strategy
**For legacy codebases with thousands of issues:**
### Phase 1: Baseline (Week 1)
```bash
# Run analysis, capture current state
pylint src/ > baseline.txt
# Configure to only fail on NEW issues
# (Track baseline, don't enforce on old code)
```
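One rough way to enforce "no new issues" against that baseline in CI (a sketch only; counting matched lines is a crude proxy for issue count, and `--exit-zero` keeps Pylint from failing the run while you count):
```bash
BASELINE=$(grep -c ':' baseline.txt)              # rough count of recorded messages
CURRENT=$(pylint src/ --exit-zero | grep -c ':')
if [ "$CURRENT" -gt "$BASELINE" ]; then
  echo "New lint issues introduced: $CURRENT vs baseline $BASELINE"
  exit 1
fi
```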
---
### Phase 2: Block New Issues (Week 2)
```yaml
# Block PRs that introduce NEW issues
- name: Run incremental lint
run: |
pylint $(git diff --name-only origin/main...HEAD | grep '\.py$') --fail-under=8.0
```
---
### Phase 3: Fix High-Priority Old Issues (Weeks 3-8)
- Security vulnerabilities first
- Bugs second
- Code smells third
---
### Phase 4: Full Enforcement (Week 9+)
```yaml
# Enforce on entire codebase
- name: Run lint
run: pylint src/ --fail-under=8.0
```
---
## Bottom Line
**Static analysis catches bugs and security issues before code review. Automate it in CI/CD with quality gates.**
- Choose tools for your language: ESLint (JS), Pylint (Python), Checkstyle (Java)
- Configure thresholds: Block critical issues, warn on moderate, ignore noise
- Run on every PR, fail CI on violations
- Manage false positives selectively with documented suppressions
- Track quality metrics: complexity, duplication, coverage
**If static analysis isn't blocking merges, you're just generating reports nobody reads. Use quality gates.**

View File

@@ -0,0 +1,255 @@
---
name: test-automation-architecture
description: Use when organizing test suites, setting up CI/CD testing pipelines, choosing test levels (unit vs integration vs E2E), fixing slow CI feedback, or migrating from inverted test pyramid - provides test pyramid guidance and anti-patterns
---
# Test Automation Architecture
## Overview
**Core principle:** Test pyramid - many fast unit tests, fewer integration tests, fewest E2E tests.
**Target distribution:** 70% unit, 20% integration, 10% E2E
**Flexibility:** Ratios can vary based on constraints (e.g., 80/15/5 if E2E infrastructure is expensive, 60/30/10 for microservices). Key is maintaining pyramid shape - more unit than integration than E2E.
**Starting from zero tests:** Don't try to reach target distribution immediately. Start with unit tests only (Phase 1), add integration (Phase 2), add E2E last (Phase 3). Distribute organically over 6-12 months.
## Test Pyramid Quick Reference
| Test Level | Purpose | Speed | When to Use |
|------------|---------|-------|-------------|
| **Unit** | Test individual functions/methods in isolation | Milliseconds | Business logic, utilities, calculations, error handling |
| **Integration** | Test components working together | Seconds | API contracts, database operations, service interactions |
| **E2E** | Test full user workflows through UI | Minutes | Critical user journeys, revenue flows, compliance paths |
**Rule:** If you can test it at a lower level, do that instead.
## Test Level Selection Guide
| What You're Testing | Test Level | Why |
|---------------------|-----------|-----|
| Function returns correct value | Unit | No external dependencies |
| API endpoint response format | Integration | Tests API contract, not full workflow |
| Database query performance | Integration | Tests DB interaction, not UI |
| User signup → payment flow | E2E | Crosses multiple systems, critical revenue |
| Form validation logic | Unit | Pure function, no UI needed |
| Service A calls Service B correctly | Integration | Tests contract, not user workflow |
| Button click updates state | Unit | Component behavior, no backend |
| Multi-step checkout process | E2E | Critical user journey, revenue impact |
**Guideline:** Unit tests verify "did I build it right?", E2E tests verify "did I build the right thing?"
## Anti-Patterns Catalog
### ❌ Inverted Pyramid
**Symptom:** 500 E2E tests, 100 unit tests
**Why bad:** Slow CI (30min+), brittle tests, hard to debug, expensive maintenance
**Fix:** Migrate 70% of E2E tests down to unit/integration. Use Migration Strategy below.
---
### ❌ All Tests on Every Commit
**Symptom:** Running full 30-minute test suite on every PR
**Why bad:** Slow feedback kills productivity, wastes CI resources
**Fix:** Progressive testing - unit tests on PR, integration on merge, E2E nightly/weekly
---
### ❌ No Test Categorization
**Symptom:** All tests in one folder, one command, one 30-minute run
**Why bad:** Can't run subsets, no fail-fast, poor organization
**Fix:** Separate by level (unit/, integration/, e2e/) with independent configs
---
### ❌ Slow CI Feedback Loop
**Symptom:** Waiting 20+ minutes for test results on every commit
**Why bad:** Context switching, delayed bug detection, reduced productivity
**Fix:** Fail fast - run fastest tests first, parallelize, cache dependencies
---
### ❌ No Fail Fast
**Symptom:** Running all 500 tests even after first test fails
**Why bad:** Wastes CI time, delays feedback
**Fix:** Configure test runner to stop on first failure in CI (not locally)
## CI/CD Pipeline Patterns
| Event | Run These Tests | Duration Target | Why |
|-------|----------------|-----------------|-----|
| **Every Commit (Pre-Push)** | Lint + unit tests | < 5 min | Fast local feedback |
| **Pull Request** | Lint + unit + integration | < 15 min | Gate before merge, balance speed/coverage |
| **Merge to Main** | All tests (unit + integration + E2E) | < 30 min | Full validation before deployment |
| **Nightly/Scheduled** | Full suite + performance tests | < 60 min | Catch regressions, performance drift |
| **Pre-Deployment** | Smoke tests only (5-10 critical E2E) | < 5 min | Fast production validation |
**Progressive complexity:** Start with just unit tests on PR, add integration after mastering that, add E2E last.
## Folder Structure Patterns
### Basic (Small Projects)
```
tests/
├── unit/
├── integration/
└── e2e/
```
### Mirrored (Medium Projects)
```
src/
├── components/
├── services/
└── utils/
tests/
├── unit/
│ ├── components/
│ ├── services/
│ └── utils/
├── integration/
└── e2e/
```
### Feature-Based (Large Projects)
```
features/
├── auth/
│ ├── src/
│ └── tests/
│ ├── unit/
│ ├── integration/
│ └── e2e/
└── payment/
├── src/
└── tests/
```
**Choose based on:** Team size (<5: Basic, 5-20: Mirrored, 20+: Feature-Based)
## Migration Strategy (Fixing Inverted Pyramid)
If you have 500 E2E tests and 100 unit tests:
**Week 1-2: Audit**
- [ ] Categorize each E2E test: Critical (keep) vs Redundant (migrate)
- [ ] Identify 10-20 critical user journeys
- [ ] Target: Keep 50-100 E2E tests maximum
**Week 3-4: Move High-Value Tests Down**
- [ ] Convert 200 E2E tests → integration tests (test API/services without UI)
- [ ] Convert 100 E2E tests → unit tests (pure logic tests)
- [ ] Delete 100 truly redundant E2E tests
**Week 5-6: Build Unit Test Coverage**
- [ ] Add 200-300 unit tests for untested business logic
- [ ] Target: 400+ unit tests total
**Week 7-8: Reorganize**
- [ ] Split tests into folders (unit/, integration/, e2e/)
- [ ] Create separate test configs
- [ ] Update CI to run progressively
**Expected result:** 400 unit, 200 integration, 100 E2E (a ~57/29/14 split - pyramid shape restored, trending toward 70/20/10 as unit coverage grows)
## Your First CI Pipeline
**Start simple, add complexity progressively:**
**Phase 1 (Week 1):** Unit tests only
```yaml
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:unit
```
**Phase 2 (Week 2-3):** Add lint + integration
```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:unit
      - run: npm run test:integration
```
**Phase 3 (Week 4+):** Add E2E on main branch
```yaml
jobs:
  e2e:
    if: github.ref == 'refs/heads/main'
    needs: [lint, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:e2e
```
**Don't start with full complexity** - master each phase before adding next.
## Common Mistakes
### ❌ Testing Everything at E2E Level
**Fix:** Use Test Level Selection Guide above. Most tests belong at unit level.
---
### ❌ No Parallel Execution
**Symptom:** Tests run sequentially, taking 30min when they could run in 10min
**Fix:** Run independent test suites in parallel (unit + lint simultaneously)
---
### ❌ No Caching
**Symptom:** Re-downloading dependencies on every CI run (5min wasted)
**Fix:** Cache node_modules, .m2, .gradle based on lock file hash
## Quick Reference
**Test Distribution Target:**
- 70% unit tests (fast, isolated)
- 20% integration tests (component interaction)
- 10% E2E tests (critical user journeys)
**CI Pipeline Events:**
- PR: unit + integration (< 15min)
- Main: all tests (< 30min)
- Deploy: smoke tests only (< 5min)
**Folder Organization:**
- Small team: tests/unit, tests/integration, tests/e2e
- Large team: feature-based with embedded test folders
**Migration Path:**
1. Audit E2E tests
2. Move 70% down to unit/integration
3. Add missing unit tests
4. Reorganize folders
5. Update CI pipeline
## Bottom Line
**Many fast tests beat few slow tests.**
Test pyramid exists because it balances confidence (E2E) with speed (unit). Organize tests by level, run progressively in CI, fail fast.


@@ -0,0 +1,419 @@
---
name: test-data-management
description: Use when fixing flaky tests from data pollution, choosing between fixtures and factories, setting up test data isolation, handling PII in tests, or seeding test databases - provides isolation strategies and anti-patterns
---
# Test Data Management
## Overview
**Core principle:** Test isolation first. Each test should work independently regardless of execution order.
**Rule:** Never use production data in tests without anonymization.
## Test Isolation Decision Tree
| Symptom | Root Cause | Solution |
|---------|------------|----------|
| Tests pass alone, fail together | Shared database state | Use transactions with rollback |
| Tests fail intermittently | Race conditions on shared data | Use unique IDs per test |
| Tests leave data behind | No cleanup | Add explicit teardown fixtures |
| Slow test setup/teardown | Creating too much data | Use factories, minimal data |
| Can't reproduce failures | Non-deterministic data | Use fixtures with static data |
**Primary strategy:** Database transactions (wrap test in transaction, rollback after). Fastest and most reliable.
## Fixtures vs Factories Quick Guide
| Use Fixtures (Static Files) | Use Factories (Code Generators) |
|------------------------------|----------------------------------|
| Integration/contract tests | Unit tests |
| Realistic complex scenarios | Need many variations |
| Specific edge cases to verify | Simple "valid object" needed |
| Team needs to review data | Randomized/parameterized tests |
| Data rarely changes | Frequent maintenance |
**Decision:** Static, complex, reviewable → Fixtures. Dynamic, simple, variations → Factories.
**Hybrid (recommended):** Fixtures for integration tests, factories for unit tests.
## Anti-Patterns Catalog
### ❌ Shared Test Data
**Symptom:** All tests use same "test_user_123" in database
**Why bad:** Tests pollute each other, fail when run in parallel, can't isolate failures
**Fix:** Each test creates its own data with unique IDs or uses transactions
---
### ❌ No Cleanup Strategy
**Symptom:** Database grows with every test run, tests fail on second run
**Why bad:** Leftover data causes unique constraint violations, flaky tests
**Fix:** Use transaction rollback or explicit teardown fixtures
---
### ❌ Production Data in Tests
**Symptom:** Copying production database to test environment
**Why bad:** Privacy violations (GDPR, CCPA), security risk, compliance issues
**Fix:** Use synthetic data generation or anonymized/masked data
---
### ❌ Hardcoded Test Data
**Symptom:** Every test creates `User(name="John", email="john@test.com")`
**Why bad:** Violates DRY, maintenance nightmare when schema changes, no variations
**Fix:** Use factories to generate test data programmatically
---
### ❌ Copy-Paste Fixtures
**Symptom:** 50 nearly-identical JSON fixture files
**Why bad:** Hard to maintain, changes require updating all copies
**Fix:** Use fixture inheritance or factory-generated fixtures
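A minimal sketch of the inheritance approach - one base JSON file plus per-test overrides (the file name and fields reuse the invoice example later in this skill; the `load_fixture` helper is illustrative):
```python
import json
from pathlib import Path

def load_fixture(name, **overrides):
    """Load a base JSON fixture and apply per-test overrides (shallow merge)."""
    base = json.loads(Path(f"tests/fixtures/{name}.json").read_text())
    return {**base, **overrides}

def test_overdue_invoice():
    # One base file, many variations - no near-identical copies
    invoice = load_fixture("valid_invoice", status="overdue")
    assert invoice["status"] == "overdue"
```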
## Isolation Strategies Quick Reference
| Strategy | Speed | Use When | Pros | Cons |
|----------|-------|----------|------|------|
| **Transactions (Rollback)** | Fast | Database tests | No cleanup code, bulletproof | DB only |
| **Unique IDs (UUID/timestamp)** | Fast | Parallel tests, external APIs | No conflicts | Still needs cleanup |
| **Explicit Cleanup (Teardown)** | Medium | Files, caches, APIs | Works for anything | Manual code |
| **In-Memory Database** | Fastest | Unit tests | Complete isolation | Not production-like |
| **Test Containers** | Medium | Integration tests | Production-like | Slower startup |
**Recommended order:** Try transactions first, add unique IDs for parallelization, explicit cleanup as last resort.
## Data Privacy Quick Guide
| Data Type | Strategy | Why |
|-----------|----------|-----|
| **PII (names, emails, addresses)** | Synthetic generation (Faker) | Avoid legal risk |
| **Payment data** | NEVER use production | PCI-DSS compliance |
| **Health data** | Anonymize + subset | HIPAA compliance |
| **Sensitive business data** | Mask or synthesize | Protect IP |
| **Non-sensitive metadata** | Can use production | ID mappings, timestamps OK if no PII |
**Default rule:** When in doubt, use synthetic data.
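A minimal sketch of synthetic PII generation with Faker (the field names are illustrative):
```python
from uuid import uuid4

from faker import Faker

fake = Faker()

def synthetic_user():
    """Realistic but entirely fake PII - safe for any test environment."""
    return {
        "id": str(uuid4()),
        "name": fake.name(),
        "email": fake.unique.email(),  # .unique avoids collisions across calls
        "address": fake.address(),
    }
```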
## Your First Test Data Setup
**Start minimal, add complexity only when needed:**
**Phase 1: Transactions (Week 1)**
```python
import pytest
from sqlalchemy.orm import Session

@pytest.fixture
def db_session(db_engine):
    connection = db_engine.connect()
    transaction = connection.begin()
    session = Session(bind=connection)
    yield session
    transaction.rollback()  # Undo everything the test wrote
    connection.close()
```
**Phase 2: Add Factories (Week 2)**
```python
from datetime import datetime
from uuid import uuid4

class UserFactory:
    @staticmethod
    def create(**overrides):
        defaults = {
            "id": str(uuid4()),
            "email": f"test_{uuid4()}@example.com",
            "created_at": datetime.now()
        }
        return {**defaults, **overrides}
```
**Phase 3: Add Fixtures for Complex Cases (Week 3+)**
```json
// tests/fixtures/valid_invoice.json
{
"id": "inv-001",
"items": [/* complex nested data */],
"total": 107.94
}
```
**Don't start with full complexity.** Master transactions first.
## Non-Database Resource Isolation
Database transactions don't work for files, caches, message queues, or external services. Use **explicit cleanup with unique namespacing**.
### Temporary Files Strategy
**Recommended:** Python's `tempfile` module (automatic cleanup)
```python
import tempfile
from pathlib import Path
@pytest.fixture
def temp_workspace():
"""Isolated temporary directory for test"""
with tempfile.TemporaryDirectory(prefix="test_") as tmp_dir:
yield Path(tmp_dir)
# Automatic cleanup on exit
```
**Alternative (manual control):**
```python
import shutil
from pathlib import Path
from uuid import uuid4

import pytest

@pytest.fixture
def temp_dir():
    test_dir = Path(f"/tmp/test_{uuid4()}")
    test_dir.mkdir(parents=True)
    yield test_dir
    shutil.rmtree(test_dir, ignore_errors=True)
```
### Redis/Cache Isolation Strategy
**Option 1: Unique key namespace per test (lightweight)**
```python
from uuid import uuid4

import pytest

@pytest.fixture
def redis_namespace(redis_client):
    """Namespaced Redis keys with automatic cleanup"""
    namespace = f"test:{uuid4()}"
    yield namespace
    # Cleanup: Delete all keys with this namespace
    for key in redis_client.scan_iter(f"{namespace}:*"):
        redis_client.delete(key)

def test_caching(redis_namespace, redis_client):
    key = f"{redis_namespace}:user:123"
    redis_client.set(key, "value")
    # Automatic cleanup after test
```
**Option 2: Separate Redis database per test (stronger isolation)**
```python
import random

import pytest
import redis

@pytest.fixture
def isolated_redis():
    """Use Redis DB 1-15 for tests (DB 0 for dev)"""
    test_db = random.randint(1, 15)
    client = redis.Redis(db=test_db)
    yield client
    client.flushdb()  # Clear entire test database
```
**Option 3: Test containers (best isolation, slower)**
```python
import pytest
import redis
from testcontainers.redis import RedisContainer

@pytest.fixture(scope="session")
def redis_container():
    with RedisContainer() as container:
        yield container

@pytest.fixture
def redis_client(redis_container):
    client = redis.from_url(redis_container.get_connection_url())
    yield client
    client.flushdb()
```
### Combined Resource Cleanup
When tests use database + files + cache:
```python
@pytest.fixture
def isolated_test_env(db_session, temp_workspace, redis_namespace):
"""Combined isolation for all resources"""
yield {
"db": db_session,
"files": temp_workspace,
"cache_ns": redis_namespace
}
# Teardown automatic via dependent fixtures
# Order: External resources first, DB last
```
### Quick Decision Guide
| Resource Type | Isolation Strategy | Cleanup Method |
|---------------|-------------------|----------------|
| **Temporary files** | Unique directory per test | `tempfile.TemporaryDirectory()` |
| **Redis cache** | Unique key namespace | Delete by pattern in teardown |
| **Message queues** | Unique queue name | Delete queue in teardown |
| **External APIs** | Unique resource IDs | DELETE requests in teardown |
| **Test containers** | Per-test container | Container auto-cleanup |
**Rule:** If transactions don't work, use unique IDs + explicit cleanup.
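For external APIs that usually looks like a fixture that creates a uniquely named resource and deletes it in teardown. A hedged sketch - the endpoint, payload, and `api_token` fixture are hypothetical:
```python
from uuid import uuid4

import pytest
import requests

@pytest.fixture
def remote_project(api_token):
    """Create a uniquely named resource on an external API, delete it afterwards."""
    headers = {"Authorization": f"Bearer {api_token}"}
    resp = requests.post(
        "https://api.example.com/projects",
        json={"name": f"test-project-{uuid4()}"},
        headers=headers,
    )
    resp.raise_for_status()
    project = resp.json()
    yield project
    # Best-effort teardown so leftovers don't pollute later runs
    requests.delete(f"https://api.example.com/projects/{project['id']}", headers=headers)
```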
## Test Containers Pattern
**Core principle:** Session-scoped container + transaction rollback per test.
**Don't recreate containers per test** - startup overhead kills performance.
### SQL Database Containers (PostgreSQL, MySQL)
**Recommended:** Session-scoped container + transactional fixtures
```python
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres_container():
    """Container lives for entire test run"""
    with PostgresContainer("postgres:15") as container:
        yield container
        # Auto-cleanup after all tests

@pytest.fixture
def db_session(postgres_container):
    """Transaction per test - fast isolation"""
    engine = create_engine(postgres_container.get_connection_url())
    connection = engine.connect()
    transaction = connection.begin()
    session = Session(bind=connection)
    yield session
    transaction.rollback()  # <1ms cleanup
    connection.close()
```
**Performance:**
- Container startup: 5-10 seconds (once per test run)
- Transaction rollback: <1ms per test
- 100 tests: ~10 seconds total vs 8-16 minutes if recreating container per test
**When to recreate container:**
- Testing database migrations (need clean schema each time)
- Testing database extensions/configuration changes
- Container state itself is under test
**For data isolation:** Transactions within shared container always win.
### NoSQL/Cache Containers (Redis, MongoDB)
Use session-scoped container + flush per test:
```python
import pytest
import redis
from testcontainers.redis import RedisContainer

@pytest.fixture(scope="session")
def redis_container():
    """Container lives for entire test run"""
    with RedisContainer() as container:
        yield container

@pytest.fixture
def redis_client(redis_container):
    """Fresh client per test"""
    client = redis.from_url(redis_container.get_connection_url())
    yield client
    client.flushdb()  # Clear after test
```
### Container Scope Decision
| Use Case | Container Scope | Data Isolation Strategy |
|----------|-----------------|------------------------|
| SQL database tests | `scope="session"` | Transaction rollback per test |
| NoSQL cache tests | `scope="session"` | Flush database per test |
| Migration testing | `scope="function"` | Fresh schema per test |
| Service integration | `scope="session"` | Unique IDs + cleanup per test |
**Default:** Session scope + transaction/flush per test (100x faster).
## Common Mistakes
### ❌ Creating Full Objects When Partial Works
**Symptom:** Test needs user ID, creates full user with 20 fields
**Fix:** Create minimal valid object:
```python
# ❌ Bad
user = UserFactory.create(
name="Test", email="test@example.com",
address="123 St", phone="555-1234",
# ... 15 more fields
)
# ✅ Good
user = {"id": str(uuid4())} # If only ID needed
```
---
### ❌ No Transaction Isolation for Database Tests
**Symptom:** Writing manual cleanup code for every database test
**Fix:** Use transactional fixtures. Wrap in transaction, automatic rollback.
---
### ❌ Testing With Timestamps That Fail at Midnight
**Symptom:** Tests pass during day, fail at exactly midnight
**Fix:** Mock system time or use relative dates:
```python
# ❌ Bad - breaks if the test runs across midnight
assert created_at.date() == datetime.now().date()

# ✅ Good - freeze the clock so "now" is deterministic
from datetime import date, datetime
from freezegun import freeze_time

@freeze_time("2025-11-15 12:00:00")
def test_timestamp():
    created_at = datetime.now()  # stands in for the value under test
    assert created_at.date() == date(2025, 11, 15)
```
## Quick Reference
**Test Isolation Priority:**
1. Database tests → Transactions (rollback)
2. Parallel execution → Unique IDs (UUID)
3. External services → Explicit cleanup
4. Files/caches → Teardown fixtures
**Fixtures vs Factories:**
- Complex integration scenario → Fixture
- Simple unit test → Factory
- Need variations → Factory
- Specific edge case → Fixture
**Data Privacy:**
- PII/sensitive → Synthetic data (Faker, custom generators)
- Never production payment/health data
- Mask if absolutely need production structure
**Getting Started:**
1. Add transaction fixtures (Week 1)
2. Add factory for common objects (Week 2)
3. Add complex fixtures as needed (Week 3+)
## Bottom Line
**Test isolation prevents flaky tests.**
Use transactions for database tests (fastest, cleanest). Use factories for unit tests (flexible, DRY). Use fixtures for complex integration scenarios (realistic, reviewable). Never use production data without anonymization.


@@ -0,0 +1,663 @@
---
name: test-isolation-fundamentals
description: Use when tests fail together but pass alone, diagnosing test pollution, ensuring test independence and idempotence, managing shared state, or designing parallel-safe tests - provides isolation principles, database/file/service patterns, and cleanup strategies
---
# Test Isolation Fundamentals
## Overview
**Core principle:** Each test must work independently, regardless of execution order or parallel execution.
**Rule:** If a test fails when run with other tests but passes alone, you have an isolation problem. Fix it before adding more tests.
## When You Have Isolation Problems
**Symptoms:**
- Tests pass individually: `pytest test_checkout.py`
- Tests fail in full suite: `pytest`
- Errors like "User already exists", "Expected empty but found data"
- Tests fail randomly or only in CI
- Different results when tests run in different orders
**Root cause:** Tests share mutable state without cleanup.
## The Five Principles
### 1. Order-Independence
**Tests must pass regardless of execution order.**
```bash
# All of these must produce identical results
pytest tests/ # alphabetical order
pytest tests/ --random-order # random order (pytest-random-order plugin)
pytest tests/ --reverse # reverse order (pytest-reverse plugin)
```
**Anti-pattern:**
```python
# ❌ BAD: Test B depends on Test A running first
def test_create_user():
db.users.insert({"id": 1, "name": "Alice"})
def test_update_user():
db.users.update({"id": 1}, {"name": "Bob"}) # Assumes Alice exists!
```
**Fix:** Each test creates its own data.
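A sketch of the order-independent version, reusing the `db.users` helper from above (`find_one` is an assumed lookup method):
```python
# ✅ GOOD: Each test creates the data it needs
def test_create_user():
    db.users.insert({"id": 1, "name": "Alice"})
    assert db.users.find_one({"id": 1})["name"] == "Alice"

def test_update_user():
    db.users.insert({"id": 2, "name": "Alice"})  # own setup, own ID
    db.users.update({"id": 2}, {"name": "Bob"})
    assert db.users.find_one({"id": 2})["name"] == "Bob"
```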
---
### 2. Idempotence
**Running a test twice produces the same result both times.**
```bash
# Both runs must pass
pytest test_checkout.py # First run
pytest test_checkout.py # Second run (same result)
```
**Anti-pattern:**
```python
# ❌ BAD: Second run fails on unique constraint
def test_signup():
user = create_user(email="test@example.com")
assert user.id is not None
# No cleanup - second run fails: "email already exists"
```
**Fix:** Clean up data after test OR use unique data per run.
---
### 3. Fresh State
**Each test starts with a clean slate.**
**What needs to be fresh:**
- Database records
- Files and directories
- In-memory caches
- Global variables
- Module-level state
- Environment variables
- Network sockets/ports
- Background processes
**Anti-pattern:**
```python
# ❌ BAD: Shared mutable global state
cache = {} # Module-level global
def test_cache_miss():
assert get_from_cache("key1") is None # Passes first time
cache["key1"] = "value" # Pollutes global state
def test_cache_lookup():
assert get_from_cache("key1") is None # Fails if previous test ran!
```
---
### 4. Explicit Scope
**Know what state is shared vs isolated.**
**Test scopes (pytest):**
- `scope="function"` - Fresh per test (default, safest)
- `scope="class"` - Shared across test class
- `scope="module"` - Shared across file
- `scope="session"` - Shared across entire test run
**Rule:** Default to `scope="function"`. Only use broader scopes for expensive resources that are READ-ONLY.
```python
# ✅ GOOD: Expensive read-only data can be shared
@pytest.fixture(scope="session")
def large_config_file():
return load_config("data.json") # Expensive, never modified
# ❌ BAD: Mutable data shared across tests
@pytest.fixture(scope="session")
def database():
return Database() # Tests will pollute each other!
# ✅ GOOD: Mutable data fresh per test
@pytest.fixture(scope="function")
def database():
db = Database()
yield db
db.cleanup() # Fresh per test
```
---
### 5. Parallel Safety
**Tests must work when run concurrently.**
```bash
pytest -n 4 # Run 4 tests in parallel with pytest-xdist
```
**Parallel-unsafe patterns:**
- Shared files without unique names
- Fixed network ports
- Singleton databases
- Global module state
- Fixed temp directories
**Fix:** Use unique identifiers per test (UUIDs, process IDs, random ports).
---
## Isolation Patterns by Resource Type
### Database Isolation
**Pattern 1: Transactions with Rollback (Fastest, Recommended)**
```python
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
@pytest.fixture
def db_session(db_engine):
"""Each test gets a fresh DB session that auto-rollbacks."""
connection = db_engine.connect()
transaction = connection.begin()
session = Session(bind=connection)
yield session
transaction.rollback() # Undo all changes
connection.close()
```
**Why it works:**
- No cleanup code needed - rollback is automatic
- Fast (<1ms per test)
- Works with ANY database (PostgreSQL, MySQL, SQLite, Oracle)
- Handles FK relationships automatically
**When NOT to use:**
- Testing actual commits
- Testing transaction isolation levels
- Multi-database transactions
---
**Pattern 2: Unique Data Per Test**
```python
import uuid
import pytest
@pytest.fixture
def unique_user():
"""Each test gets a unique user."""
email = f"test-{uuid.uuid4()}@example.com"
user = create_user(email=email, name="Test User")
yield user
# Optional cleanup (or rely on test DB being dropped)
delete_user(user.id)
```
**Why it works:**
- Tests don't interfere (different users)
- Can run in parallel
- Idempotent (UUID ensures uniqueness)
**When to use:**
- Testing with real databases
- Parallel test execution
- Integration tests that need real commits
---
**Pattern 3: Test Database Per Test**
```python
@pytest.fixture
def isolated_db():
"""Each test gets its own temporary database."""
db_name = f"test_db_{uuid.uuid4().hex}"
create_database(db_name)
yield get_connection(db_name)
drop_database(db_name)
```
**Why it works:**
- Complete isolation
- Can test schema migrations
- No cross-test pollution
**When NOT to use:**
- Unit tests (too slow)
- Large test suites (overhead adds up)
---
### File System Isolation
**Pattern: Temporary Directories**
```python
import pytest
import tempfile
import shutil
@pytest.fixture
def temp_workspace():
"""Each test gets a fresh temporary directory."""
tmpdir = tempfile.mkdtemp(prefix="test_")
yield tmpdir
shutil.rmtree(tmpdir) # Clean up
```
**Parallel-safe version:**
```python
@pytest.fixture
def temp_workspace(tmp_path):
"""pytest's tmp_path is automatically unique per test."""
workspace = tmp_path / "workspace"
workspace.mkdir()
yield workspace
# No cleanup needed - pytest handles it
```
**Why it works:**
- Each test writes to different directory
- Parallel-safe (unique paths)
- Automatic cleanup
---
### Service/API Isolation
**Pattern: Mocking External Services**
```python
import pytest
from unittest.mock import patch, MagicMock
@pytest.fixture
def mock_stripe():
"""Mock Stripe API for all tests."""
with patch('stripe.Charge.create') as mock:
mock.return_value = MagicMock(id="ch_test123", status="succeeded")
yield mock
```
**When to use:**
- External APIs (Stripe, Twilio, SendGrid)
- Slow services
- Non-deterministic responses
- Services that cost money per call
**When NOT to use:**
- Testing integration with real service (use separate integration test suite)
---
### In-Memory Cache Isolation
**Pattern: Clear Cache Before Each Test**
```python
import pytest
@pytest.fixture(autouse=True)
def clear_cache():
"""Automatically clear cache before each test."""
cache.clear()
yield
# Optional: clear after test too
cache.clear()
```
**Why `autouse=True`:** Runs automatically for every test without explicit declaration.
---
### Process/Port Isolation
**Pattern: Dynamic Port Allocation**
```python
import socket
import pytest
def get_free_port():
"""Find an available port."""
sock = socket.socket()
sock.bind(('', 0))
port = sock.getsockname()[1]
sock.close()
return port
@pytest.fixture
def test_server():
"""Each test gets a server on a unique port."""
port = get_free_port()
server = start_server(port=port)
yield f"http://localhost:{port}"
server.stop()
```
**Why it works:**
- Tests can run in parallel (different ports)
- No port conflicts
---
## Test Doubles: When to Use What
| Type | Purpose | Example |
|------|---------|---------|
| **Stub** | Returns hardcoded values | `getUser() → {id: 1, name: "Alice"}` |
| **Mock** | Verifies calls were made | `assert emailService.send.called` |
| **Fake** | Working implementation, simplified | In-memory database instead of PostgreSQL |
| **Spy** | Records calls for later inspection | Logs all method calls |
**Decision tree:**
```
Do you need to verify the call was made?
YES → Use Mock
NO → Do you need a working implementation?
YES → Use Fake
NO → Use Stub
```
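A brief sketch of the difference in code using `unittest.mock` (the `OrderService` here is a hypothetical class defined only to illustrate stub vs mock):
```python
from unittest.mock import MagicMock

class OrderService:
    """Hypothetical service used to illustrate stub vs mock."""
    def __init__(self, users, email):
        self.users, self.email = users, email
    def confirm_order(self, order_id):
        user = self.users.get(order_id)
        self.email.send(to=user["email"], template="order_confirmed")

def test_confirmation_email_sent():
    users = MagicMock()  # Stub: returns a canned value
    users.get.return_value = {"id": 1, "email": "alice@example.com"}
    email = MagicMock()  # Mock: we verify the outgoing call
    OrderService(users=users, email=email).confirm_order(order_id=42)
    email.send.assert_called_once_with(to="alice@example.com", template="order_confirmed")
```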
---
## Diagnosing Isolation Problems
### Step 1: Identify Flaky Tests
```bash
# Run tests 100 times to find flakiness
pytest --count=100 test_checkout.py # requires the pytest-repeat plugin
# Run in random order
pytest --random-order # requires the pytest-random-order plugin
```
**Interpretation:**
- Passes 100/100 → Not flaky
- Passes 95/100 → Flaky (5% failure rate)
- Failures are random → Parallel unsafe OR order-dependent
---
### Step 2: Find Which Tests Interfere
**Run tests in isolation:**
```bash
# Test A alone
pytest test_a.py # ✓ Passes
# Test B alone
pytest test_b.py # ✓ Passes
# Both together
pytest test_a.py test_b.py # ✗ Test B fails
# Conclusion: Test A pollutes state that Test B depends on
```
**Reverse the order:**
```bash
pytest test_b.py test_a.py # Does Test A fail now?
```
- If YES: Bidirectional pollution
- If NO: Test A pollutes, Test B is victim
---
### Step 3: Identify Shared State
**Add diagnostic logging:**
```python
@pytest.fixture(autouse=True)
def log_state():
"""Log state before/after each test."""
print(f"Before: DB has {db.count()} records")
yield
print(f"After: DB has {db.count()} records")
```
**Look for:**
- Record count increasing over time (no cleanup)
- Files accumulating
- Cache growing
- Ports in use
---
### Step 4: Audit for Global State
**Search codebase for isolation violations:**
```bash
# Module-level globals
grep -r "^[A-Z_]* = " app/
# Global caches
grep -r "cache = " app/
# Singletons
grep -r "@singleton" app/
grep -r "class.*Singleton" app/
```
---
## Anti-Patterns Catalog
### ❌ Cleanup Code Instead of Structural Isolation
**Symptom:** Every test has teardown code to clean up
```python
def test_checkout():
user = create_user()
cart = create_cart(user)
checkout(cart)
# Teardown
delete_cart(cart.id)
delete_user(user.id)
```
**Why bad:**
- If test fails before cleanup, state pollutes
- If cleanup has bugs, state pollutes
- Forces sequential execution (no parallelism)
**Fix:** Use transactions, unique IDs, or dependency injection
---
### ❌ Shared Test Fixtures
**Symptom:** Fixtures modify mutable state
```python
@pytest.fixture(scope="module")
def user():
return create_user(email="test@example.com")
def test_update_name(user):
user.name = "Alice" # Modifies shared fixture!
save(user)
def test_update_email(user):
# Expects name to be original, but Test 1 changed it!
assert user.name == "Test User" # FAILS
```
**Why bad:** Tests interfere when fixture is modified
**Fix:** Use `scope="function"` for mutable fixtures
---
### ❌ Hidden Dependencies on Execution Order
**Symptom:** Test suite has implicit execution order
```python
# test_a.py
def test_create_admin():
create_user(email="admin@example.com", role="admin")
# test_b.py
def test_admin_permissions():
admin = get_user("admin@example.com") # Assumes test_a ran!
assert admin.has_permission("delete_users")
```
**Why bad:** Breaks when tests run in different order or in parallel
**Fix:** Each test creates its own dependencies
---
### ❌ Testing on Production-Like State
**Symptom:** Tests run against shared database with existing data
```python
def test_user_count():
assert db.users.count() == 100 # Assumes specific state!
```
**Why bad:**
- Tests fail when data changes
- Can't run in parallel
- Can't run idempotently
**Fix:** Use isolated test database or count relative to test's own data
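Counting relative to the test's own data looks like this (a sketch, reusing the helpers from the examples above):
```python
import uuid

# ✅ GOOD: Assert relative to the state this test created
def test_user_count_increases():
    before = db.users.count()
    create_user(email=f"test-{uuid.uuid4()}@example.com")
    assert db.users.count() == before + 1
```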
---
## Common Scenarios
### Scenario 1: "Tests pass locally, fail in CI"
**Likely causes:**
1. **Timing issues** - CI is slower/faster, race conditions exposed
2. **Parallel execution** - CI runs tests in parallel, local doesn't
3. **Missing cleanup** - Local has leftover state, CI is fresh
**Diagnosis:**
```bash
# Test parallel execution locally
pytest -n 4
# Test with clean state
rm -rf .pytest_cache && pytest
```
---
### Scenario 2: "Random test failures that disappear on retry"
**Likely causes:**
1. **Race conditions** - Async operations not awaited
2. **Shared mutable state** - Global variables polluted
3. **External service flakiness** - Real APIs being called
**Diagnosis:**
```bash
# Run same test 100 times
pytest --count=100 test_flaky.py # requires the pytest-repeat plugin
# If failure rate is consistent (e.g., 5/100), it's likely shared state
# If failure rate varies wildly, it's likely race condition
```
---
### Scenario 3: "Database unique constraint violations"
**Symptom:** `IntegrityError: duplicate key value violates unique constraint`
**Cause:** Tests reuse same email/username/ID
**Fix:**
```python
import uuid
@pytest.fixture
def unique_user():
email = f"test-{uuid.uuid4()}@example.com"
return create_user(email=email)
```
---
## Quick Reference: Isolation Strategy Decision Tree
```
What resource needs isolation?
DATABASE
├─ Can you use transactions? → Transaction Rollback (fastest)
├─ Need real commits? → Unique Data Per Test
└─ Need schema changes? → Test Database Per Test
FILES
├─ Few files? → pytest's tmp_path
└─ Complex directories? → tempfile.mkdtemp()
EXTERNAL SERVICES
├─ Testing integration? → Separate integration test suite
└─ Testing business logic? → Mock the service
IN-MEMORY STATE
├─ Caches → Clear before each test (autouse fixture)
├─ Globals → Dependency injection (refactor)
└─ Module-level → Reset in fixture or avoid entirely
PROCESSES/PORTS
└─ Dynamic port allocation per test
```
---
## Bottom Line
**Test isolation is structural, not reactive.**
- **Reactive:** Write cleanup code after each test
- **Structural:** Design tests so cleanup isn't needed
**The hierarchy:**
1. **Best:** Dependency injection (no shared state)
2. **Good:** Transactions/tmp_path (automatic cleanup)
3. **Acceptable:** Unique data per test (explicit isolation)
4. **Last resort:** Manual cleanup (fragile, error-prone)
**If your tests fail together but pass alone, you have an isolation problem. Stop adding tests and fix isolation first.**


@@ -0,0 +1,500 @@
---
name: test-maintenance-patterns
description: Use when reducing test duplication, refactoring flaky tests, implementing page object patterns, managing test helpers, reducing test debt, or scaling test suites - provides refactoring strategies and maintainability patterns for long-term test sustainability
---
# Test Maintenance Patterns
## Overview
**Core principle:** Test code is production code. Apply the same quality standards: DRY, SOLID, refactoring.
**Rule:** If you can't understand a test in 30 seconds, refactor it. If a test is flaky, fix or delete it.
## Test Maintenance vs Writing Tests
| Activity | When | Goal |
|----------|------|------|
| **Writing tests** | New features, bug fixes | Add coverage |
| **Maintaining tests** | Test suite grows, flakiness increases | Reduce duplication, improve clarity, fix flakiness |
**Test debt indicators:**
- Tests take > 15 minutes to run
- > 5% flakiness rate
- Duplicate setup code across 10+ tests
- Tests break on unrelated changes
- Nobody understands old tests
---
## Page Object Pattern (E2E Tests)
**Problem:** Duplicated selectors across tests
```javascript
// ❌ BAD: Selectors duplicated everywhere
test('login', async ({ page }) => {
await page.fill('#email', 'user@example.com');
await page.fill('#password', 'password');
await page.click('button[type="submit"]');
});
test('forgot password', async ({ page }) => {
await page.fill('#email', 'user@example.com'); // Duplicated!
await page.click('a.forgot-password');
});
```
**Fix:** Page Object Pattern
```javascript
// pages/LoginPage.js
export class LoginPage {
constructor(page) {
this.page = page;
this.emailInput = page.locator('#email');
this.passwordInput = page.locator('#password');
this.submitButton = page.locator('button[type="submit"]');
this.forgotPasswordLink = page.locator('a.forgot-password');
}
async goto() {
await this.page.goto('/login');
}
async login(email, password) {
await this.emailInput.fill(email);
await this.passwordInput.fill(password);
await this.submitButton.click();
}
async clickForgotPassword() {
await this.forgotPasswordLink.click();
}
}
// tests/login.spec.js
import { LoginPage } from '../pages/LoginPage';
test('login', async ({ page }) => {
const loginPage = new LoginPage(page);
await loginPage.goto();
await loginPage.login('user@example.com', 'password');
await expect(page).toHaveURL('/dashboard');
});
test('forgot password', async ({ page }) => {
const loginPage = new LoginPage(page);
await loginPage.goto();
await loginPage.clickForgotPassword();
await expect(page).toHaveURL('/reset-password');
});
```
**Benefits:**
- Selectors in one place
- Tests read like documentation
- Changes to UI require one-line fix
---
## Test Data Builders (Integration/Unit Tests)
**Problem:** Duplicate test data setup
```python
# ❌ BAD: Duplicated setup
def test_order_total():
order = Order(
id=1,
user_id=123,
items=[Item(sku="WIDGET", quantity=2, price=10.0)],
shipping=5.0,
tax=1.5
)
assert order.total() == 26.5
def test_order_discounts():
order = Order( # Same setup!
id=2,
user_id=123,
items=[Item(sku="WIDGET", quantity=2, price=10.0)],
shipping=5.0,
tax=1.5
)
order.apply_discount(10)
assert order.total() == 24.0
```
**Fix:** Builder Pattern
```python
# test_builders.py
class OrderBuilder:
def __init__(self):
self._id = 1
self._user_id = 123
self._items = []
self._shipping = 0.0
self._tax = 0.0
def with_id(self, id):
self._id = id
return self
def with_items(self, *items):
self._items = list(items)
return self
def with_shipping(self, amount):
self._shipping = amount
return self
def with_tax(self, amount):
self._tax = amount
return self
def build(self):
return Order(
id=self._id,
user_id=self._user_id,
items=self._items,
shipping=self._shipping,
tax=self._tax
)
# tests/test_orders.py
def test_order_total():
order = (OrderBuilder()
.with_items(Item(sku="WIDGET", quantity=2, price=10.0))
.with_shipping(5.0)
.with_tax(1.5)
.build())
assert order.total() == 26.5
def test_order_discounts():
order = (OrderBuilder()
.with_items(Item(sku="WIDGET", quantity=2, price=10.0))
.with_shipping(5.0)
.with_tax(1.5)
.build())
order.apply_discount(10)
assert order.total() == 24.0
```
**Benefits:**
- Readable test data creation
- Easy to customize per test
- Defaults handle common cases
---
## Shared Fixtures (pytest)
**Problem:** Setup code duplicated across tests
```python
# ❌ BAD
def test_user_creation():
db = setup_database()
user_repo = UserRepository(db)
user = user_repo.create(email="alice@example.com")
assert user.id is not None
cleanup_database(db)
def test_user_deletion():
db = setup_database() # Duplicated!
user_repo = UserRepository(db)
user = user_repo.create(email="bob@example.com")
user_repo.delete(user.id)
assert user_repo.get(user.id) is None
cleanup_database(db)
```
**Fix:** Fixtures
```python
# conftest.py
import pytest
@pytest.fixture
def db():
"""Provide database connection with auto-cleanup."""
database = setup_database()
yield database
cleanup_database(database)
@pytest.fixture
def user_repo(db):
"""Provide user repository."""
return UserRepository(db)
# tests/test_users.py
def test_user_creation(user_repo):
user = user_repo.create(email="alice@example.com")
assert user.id is not None
def test_user_deletion(user_repo):
user = user_repo.create(email="bob@example.com")
user_repo.delete(user.id)
assert user_repo.get(user.id) is None
```
---
## Reducing Test Duplication
### Custom Matchers/Assertions
**Problem:** Complex assertions repeated
```python
# ❌ BAD: Repeated validation logic
def test_valid_user():
user = create_user()
assert user.id is not None
assert '@' in user.email
assert len(user.name) > 0
assert user.created_at is not None
def test_another_valid_user():
user = create_admin()
assert user.id is not None # Same validations!
assert '@' in user.email
assert len(user.name) > 0
assert user.created_at is not None
```
**Fix:** Custom assertion helpers
```python
# test_helpers.py
def assert_valid_user(user):
"""Assert user object is valid."""
assert user.id is not None, "User must have ID"
assert '@' in user.email, "Email must contain @"
assert len(user.name) > 0, "Name cannot be empty"
assert user.created_at is not None, "User must have creation timestamp"
# tests/test_users.py
def test_valid_user():
user = create_user()
assert_valid_user(user)
def test_another_valid_user():
user = create_admin()
assert_valid_user(user)
```
---
## Handling Flaky Tests
### Strategy 1: Fix the Root Cause
**Flaky test symptoms:**
- Passes 95/100 runs
- Fails with different errors
- Fails only in CI
**Root causes:**
- Race conditions (see flaky-test-prevention skill)
- Shared state (see test-isolation-fundamentals skill)
- Timing assumptions
**Fix:** Use condition-based waiting, isolate state
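Condition-based waiting means polling for the state you care about instead of sleeping for a fixed time. A small generic helper (sketch):
```python
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Usage (order_repo is hypothetical) - replaces a fixed time.sleep(3):
# wait_until(lambda: order_repo.get(order_id).status == "processed")
```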
---
### Strategy 2: Quarantine Pattern
**For tests that can't be fixed immediately:**
```python
# Mark as flaky, run separately
@pytest.mark.flaky
def test_sometimes_fails():
# Test code
pass
```
```bash
# Run stable tests only
pytest -m "not flaky"
# Run flaky tests separately (don't block CI)
pytest -m flaky --reruns 3 # Retry up to 3 times (pytest-rerunfailures plugin)
```
**Rule:** Quarantined tests must have tracking issue. Fix within 30 days or delete.
---
### Strategy 3: Delete If Unfixable
**When to delete:**
- Test is flaky AND nobody understands it
- Test has been disabled for > 90 days
- Test duplicates coverage from other tests
**Better to have:** 100 reliable tests than 150 tests with 10 flaky ones
---
## Refactoring Test Suites
### Identify Slow Tests
```bash
# pytest: Show slowest 10 tests
pytest --durations=10
# Output:
# 10.23s call test_integration_checkout.py::test_full_checkout
# 8.45s call test_api.py::test_payment_flow
# ...
```
**Action:** Optimize or split into integration/E2E categories
---
### Parallelize Tests
```bash
# pytest: Run tests in parallel
pytest -n 4 # Use 4 CPU cores
# Jest: Run tests in parallel (default)
jest --maxWorkers=4
```
**Requirements:**
- Tests must be isolated (no shared state)
- See test-isolation-fundamentals skill
---
### Split Test Suites
```ini
# pytest.ini
[pytest]
markers =
unit: Unit tests (fast, isolated)
integration: Integration tests (medium speed, real DB)
e2e: End-to-end tests (slow, full system)
```
```yaml
# CI: Run test categories separately
jobs:
unit:
run: pytest -m unit # Fast, every commit
integration:
run: pytest -m integration # Medium, every PR
e2e:
run: pytest -m e2e # Slow, before merge
```
---
## Anti-Patterns Catalog
### ❌ God Test
**Symptom:** One test does everything
```python
def test_entire_checkout_flow():
# 300 lines testing: login, browse, add to cart, checkout, payment, email
pass
```
**Why bad:** Failure doesn't indicate what broke
**Fix:** Split into focused tests
---
### ❌ Testing Implementation Details
**Symptom:** Tests break when refactoring internal code
```python
# ❌ BAD: Testing internal method
def test_order_calculation():
order = Order()
order._calculate_subtotal() # Private method!
assert order.subtotal == 100
```
**Fix:** Test public interface only
```python
# ✅ GOOD
def test_order_total():
order = Order(items=[...])
assert order.total() == 108 # Public method
```
---
### ❌ Commented-Out Tests
**Symptom:** Tests disabled with comments
```python
# def test_something():
# # This test is broken, commented out for now
# pass
```
**Fix:** Delete or fix. Create GitHub issue if needs fixing later.
---
## Test Maintenance Checklist
**Monthly:**
- [ ] Review flaky test rate (should be < 1%)
- [ ] Check build time trend (should not increase > 5%/month)
- [ ] Identify duplicate setup code (refactor into fixtures)
- [ ] Run mutation testing (validate test quality)
**Quarterly:**
- [ ] Review test coverage (identify gaps)
- [ ] Audit for commented-out tests (delete)
- [ ] Check for unused fixtures (delete)
- [ ] Refactor slowest 10 tests
**Annually:**
- [ ] Review entire test architecture
- [ ] Update testing strategy for new patterns
- [ ] Train team on new testing practices
---
## Bottom Line
**Treat test code as production code. Refactor duplication, fix flakiness, delete dead tests.**
**Key patterns:**
- Page Objects (E2E tests)
- Builder Pattern (test data)
- Shared Fixtures (setup/teardown)
- Custom Assertions (complex validations)
**Maintenance rules:**
- Fix flaky tests immediately or quarantine
- Refactor duplicated code
- Delete commented-out tests
- Split slow test suites
**If your tests are flaky, slow, or nobody understands them, invest in maintenance before adding more tests. Test debt compounds like technical debt.**


@@ -0,0 +1,363 @@
---
name: testing-in-production
description: Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks
---
# Testing in Production
## Overview
**Core principle:** Minimize blast radius, maximize observability, always have a kill switch.
**Rule:** Testing in production is safe when you control exposure and can roll back instantly.
**Regulated industries (healthcare, finance, government):** Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult compliance team before first production test.
## Technique Selection Decision Tree
| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|-----------|----------------|----------------------|-----|
| Test feature with specific users | Low | Feature flag service | **Feature Flags** |
| Validate deployment safety | Medium | Load balancer, multiple instances | **Canary Deployment** |
| Compare old vs new performance | Low | Traffic duplication | **Shadow Traffic** |
| Measure business impact | Medium | A/B testing framework, analytics | **A/B Testing** |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | **Dark Launch** |
**First technique:** Feature flags (lowest infrastructure requirement, highest control)
## Anti-Patterns Catalog
### ❌ Nested Feature Flags
**Symptom:** Flags controlling other flags, creating combinatorial complexity
**Why bad:** 2^N combinations to test, impossible to validate all paths, technical debt accumulates
**Fix:** Maximum 1 level of flag nesting, delete flags after rollout
```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
if feature_flags.enabled("express_shipping"):
if feature_flags.enabled("gift_wrap"):
# 8 possible combinations for 3 flags
# ✅ Good
if feature_flags.enabled("new_checkout_v2"): # Single flag for full feature
return new_checkout_with_all_options()
```
---
### ❌ Canary Without Sticky Sessions
**Symptom:** Users bounce between old and new versions across requests because routing isn't session-sticky
**Why bad:** Inconsistent experience, state corruption, false negative metrics
**Fix:** Route each user to the same version for their entire session
```nginx
# ✅ Good - Consistent routing
upstream backend {
hash $cookie_user_id consistent; # Sticky by user ID
server backend-v1:8080 weight=95;
server backend-v2:8080 weight=5;
}
```
---
### ❌ No Statistical Validation
**Symptom:** Making rollout decisions on small sample sizes without confidence intervals
**Why bad:** Random variance mistaken for real effects, premature rollback or expansion
**Fix:** Minimum sample size, statistical significance testing
```python
# ✅ Good - Statistical validation
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    """control_errors / treatment_errors: arrays of 0/1 error flags per request."""
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"
    # Two-proportion z-test
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)]
    )
    return p_value > 0.05, f"p-value: {p_value}"
```
---
### ❌ Testing Without Rollback
**Symptom:** Deploying feature flags or canaries without instant kill switch
**Why bad:** When issues detected, can't stop impact immediately
**Fix:** Kill switch tested before first production test
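What that can look like in code - a separate kill-switch flag checked before the experiment flag, so flipping one switch in the flag dashboard reverts everyone instantly (flag names are illustrative; `feature_flags` is the same pseudo-client used elsewhere in this skill):
```python
def handle_checkout(user_id):
    # Kill switch wins over all experiment targeting - no deploy needed to revert
    if feature_flags.enabled("new-checkout-kill-switch", user_id):
        return old_checkout_flow(user_id)
    if feature_flags.enabled("new-checkout", user_id):
        return new_checkout_flow(user_id)
    return old_checkout_flow(user_id)
```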
---
### ❌ Insufficient Monitoring
**Symptom:** Monitoring only error rates, missing business/user metrics
**Why bad:** Technical success but business failure (e.g., lower conversion)
**Fix:** Monitor technical + business + user experience metrics
## Blast Radius Control Framework
**Progressive rollout schedule:**
| Phase | Exposure | Duration | Abort If | Continue If |
|-------|----------|----------|----------|-------------|
| **1. Internal** | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| **2. Canary** | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| **3. Small** | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **4. Medium** | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **5. Majority** | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| **6. Full** | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |
**Minimum dwell time:** Each phase needs minimum observation period to catch delayed issues
**Rollback at any phase:** If metrics degrade, revert to previous phase
## Kill Switch Criteria
**Immediate rollback triggers (automated):**
| Metric | Threshold | Why |
|--------|-----------|-----|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |
**Warning triggers (manual investigation):**
| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | >3 similar reports | Halt rollout, investigate |
**Statistical validation:**
```python
# Sample size for 95% confidence, 80% power
# Minimum 1000 samples per variant for most A/B tests
# For low-traffic features: wait 24-48 hours regardless
```
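For context, a standard two-proportion sample-size approximation shows how the required n grows as the effect you want to detect shrinks (a sketch; same alpha/power as the comment above):
```python
from scipy.stats import norm

def samples_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per variant for detecting a change from rate p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

# samples_per_variant(0.05, 0.04) -> ~6,700 (small effect needs lots of traffic)
# samples_per_variant(0.05, 0.03) -> ~1,500 (larger effect, near the rule of thumb)
```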
## Monitoring Quick Reference
**Required metrics (all tests):**
| Category | Metrics | Alert Threshold |
|----------|---------|-----------------|
| **Errors** | Error rate, exception count, 5xx responses | > +5% vs baseline |
| **Performance** | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| **Business** | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| **User Experience** | Client errors, page load, bounce rate | > +20% vs baseline |
**Baseline calculation:**
```python
import numpy as np

# Collect baseline from previous 7-14 days
baseline_p99 = np.percentile(historical_latencies, 99)
current_p99 = np.percentile(current_latencies, 99)
if current_p99 > baseline_p99 * 1.5:  # 50% increase
    rollback()
```
## Implementation Patterns
### Feature Flags Pattern
```python
# Using LaunchDarkly (Split.io and others expose similar APIs)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).build()
    if client.variation("new-checkout", context, False):  # default False for safety
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)
```
**Best practices:**
- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging
- Delete flags within 30 days of full rollout
### Canary Deployment Pattern
```yaml
# Kubernetes with Istio
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: my-service
spec:
hosts:
- my-service
http:
- match:
- headers:
x-canary:
exact: "true"
route:
- destination:
host: my-service
subset: v2
- route:
- destination:
host: my-service
subset: v1
weight: 95
- destination:
host: my-service
subset: v2
weight: 5
```
### Shadow Traffic Pattern
```python
# Duplicate requests to new service, ignore responses
import asyncio
async def handle_request(request):
# Primary: serve user from old service
response = await old_service(request)
# Shadow: send to new service, don't wait
asyncio.create_task(new_service(request.copy())) # Fire and forget
return response # User sees old service response
```
## Tool Ecosystem Quick Reference
| Tool Category | Options | When to Use |
|---------------|---------|-------------|
| **Feature Flags** | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| **Canary/Blue-Green** | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| **A/B Testing** | Optimizely, VWO, Google Optimize | Business metric validation |
| **Observability** | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| **Statistical Analysis** | Statsig, Eppo, GrowthBook | Automated significance testing |
**Recommendation for starting:** Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability
## Your First Production Test
**Goal:** Safely test a small feature with feature flags
**Week 1: Setup**
1. **Choose feature flag tool**
- Self-hosted: Flagsmith (free, open source)
- SaaS: LaunchDarkly (free tier: 1000 MAU)
2. **Instrument code**
```python
if feature_flags.enabled("my-first-test", user_id):
return new_feature(user_id)
else:
return old_feature(user_id)
```
3. **Set up monitoring**
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
- Business metric (conversion, completion rate)
4. **Define rollback criteria**
- Error rate > +5%
- p99 latency > +50%
- Business metric < -10%
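Those criteria can be wired into a simple automated check during rollout - a sketch, assuming you can pull current and baseline snapshots from your metrics store as dicts:
```python
def should_rollback(current, baseline):
    """Evaluate the rollback criteria above (relative to baseline)."""
    return any([
        current["error_rate"] > baseline["error_rate"] * 1.05,    # error rate > +5%
        current["p99_latency"] > baseline["p99_latency"] * 1.50,  # p99 latency > +50%
        current["conversion"] < baseline["conversion"] * 0.90,    # business metric < -10%
    ])
```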
**Week 2: Test Execution**
**Day 1-2:** Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback
**Day 3-5:** Canary (1% of users)
- Enable for 1% random sample
- Monitor metrics every hour
- Rollback if any threshold exceeded
**Day 6-8:** Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring
**Day 9-14:** Full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%
**Week 3: Cleanup**
- Remove flag from code
- Archive flag in dashboard
- Document learnings
## Common Mistakes
### ❌ Expanding Rollout Too Fast
**Fix:** Follow minimum dwell times (24 hours per phase)
---
### ❌ Monitoring Only After Issues
**Fix:** Dashboard ready before first rollout, alerts configured
---
### ❌ No Rollback Practice
**Fix:** Test rollback in staging before production
---
### ❌ Ignoring Business Metrics
**Fix:** Technical metrics AND business metrics required for go/no-go decisions
## Quick Reference
**Technique Selection:**
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing
**Blast Radius Progression:**
Internal → 1% → 5% → 25% → 50% → 100%
**Kill Switch Thresholds:**
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%
**Minimum Sample Sizes:**
- A/B test: 1000 samples per variant
- Canary: 24 hours observation
**Tool Recommendations:**
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana
## Bottom Line
**Production testing is safe with three controls: exposure limits, observability, instant rollback.**
Start with feature flags, use progressive rollout (1% → 5% → 25% → 100%), monitor technical + business metrics, and always have a kill switch.


@@ -0,0 +1,153 @@
---
name: using-quality-engineering
description: Use when user asks about E2E testing, performance testing, chaos engineering, test automation, flaky tests, test data management, or quality practices - routes to specialist skills with deep expertise instead of providing general guidance
---
# Using Quality Engineering
## Overview
**This is a router skill** - it directs you to the appropriate specialist quality engineering skill based on the user's question.
**Core principle:** Quality engineering questions deserve specialist expertise, not general guidance. Always route to the appropriate specialist skill.
## Routing Guide
When the user asks about quality engineering topics, route to the appropriate specialist skill:
| User's Question Topic | Route To Skill |
|----------------------|----------------|
| **Test Fundamentals & Isolation** | |
| Test independence, idempotence, order-independence, isolation | `test-isolation-fundamentals` |
| **API & Integration Testing** | |
| REST/GraphQL API testing, request validation, API mocking | `api-testing-strategies` |
| Component integration, database testing, test containers | `integration-testing-patterns` |
| **End-to-End & UI Testing** | |
| End-to-end test design, E2E anti-patterns, browser automation | `e2e-testing-strategies` |
| Screenshot comparison, visual bugs, responsive testing | `visual-regression-testing` |
| **Performance & Load Testing** | |
| Load testing, benchmarking, performance regression | `performance-testing-fundamentals` |
| Stress testing, spike testing, soak testing, capacity planning | `load-testing-patterns` |
| **Test Quality & Maintenance** | |
| Test coverage, quality dashboards, CI/CD quality gates | `quality-metrics-and-kpis` |
| Test refactoring, page objects, reducing test debt | `test-maintenance-patterns` |
| Mutation testing, test effectiveness, mutation score | `mutation-testing` |
| **Static Analysis & Security** | |
| SAST tools, ESLint, Pylint, code quality gates | `static-analysis-integration` |
| Dependency scanning, Snyk, Dependabot, vulnerability management | `dependency-scanning` |
| Fuzzing, random inputs, security vulnerabilities | `fuzz-testing` |
| **Advanced Testing Techniques** | |
| Property-based testing, Hypothesis, fast-check, invariants | `property-based-testing` |
| **Production Testing & Monitoring** | |
| Feature flags, canary testing, dark launches, prod monitoring | `testing-in-production` |
| Metrics, tracing, alerting, quality signals | `observability-and-monitoring` |
| Fault injection, resilience testing, failure scenarios | `chaos-engineering-principles` |
| **Test Infrastructure** | |
| Test pyramid, CI/CD integration, test organization | `test-automation-architecture` |
| Fixtures, factories, seeding, test isolation, data pollution | `test-data-management` |
| Flaky tests, race conditions, timing issues, non-determinism | `flaky-test-prevention` |
| API contracts, schema validation, consumer-driven contracts | `contract-testing` |
## When NOT to Route
Only answer directly (without routing) for:
- Meta questions about this plugin ("What skills are available?")
- Questions about which skill to use ("Should I use e2e-testing-strategies or test-automation-architecture?")
**A user demanding "just answer, don't route" is NOT an exception** - still route. A request to skip routing signals they need routing even more (they underestimate problem complexity).
## Red Flags - Route Instead
If you catch yourself thinking:
- "I have general knowledge about this topic" → **Specialist skill has deeper expertise**
- "Developer needs help RIGHT NOW" → **Routing is faster than partial help**
- "I can provide useful guidance" → **Partial help < complete specialist guidance**
- "This is a standard problem" → **Standard problems need specialist patterns**
- "They're experienced" → **Experienced users benefit most from specialists**
**All of these mean: Route to the specialist skill.**
## Why Routing is Better
1. **Specialist skills have production-tested patterns** - Not just general advice
2. **Routing is faster** - Specialist skill loads once, answers completely
3. **Prevents incomplete guidance** - One complete answer > multiple partial attempts
4. **Scales better** - User gets expertise, you avoid back-and-forth
## Multi-Domain Questions
When user's question spans multiple specialist domains:
1. **Identify all relevant specialists** (2-3 max)
2. **Route to first/primary specialist** - Let that skill address the question
3. **Keep routing response brief** - Don't explain cross-domain dependencies yourself
Example: "My E2E tests are flaky AND we have test data pollution issues - which should I fix first?"
✅ Good: "This spans test-data-management and flaky-test-prevention. Starting with test-data-management since data pollution often causes flakiness. Routing you there now."
❌ Bad: *Writes 200 words explaining dependency relationships, root cause analysis, and strategic prioritization*
## Common Mistakes
### ❌ Answering Directly Under Time Pressure
User: "Quick - how do I stop my E2E tests from being flaky?"
Bad response: *Provides 6 generic strategies about waits and race conditions*
**Why bad:** General advice without specialist patterns from `flaky-test-prevention`
### ❌ Providing Tool Recommendations
User: "I'm experienced with testing. Just need chaos engineering tool recommendations."
Bad response: *Recommends Gremlin, Chaos Toolkit, compares features*
**Why bad:** Tool selection needs context from `chaos-engineering-principles`
### ❌ Giving Technical Patterns Directly
User: "What's the best way to manage test data across integration tests?"
Bad response: *Explains Testcontainers, unique keys, transactions*
**Why bad:** Missing architecture-specific patterns from `test-data-management`
### ❌ Complying With "Don't Route Me"
User: "Don't route me to another skill - just give me a quick yes/no: Should I use Testcontainers?"
Bad response: "Yes, Testcontainers is the gold standard for test isolation"
**Why bad:** User demanding direct answer signals they underestimate complexity. Route anyway.
## Correct Usage
User: "Quick - how do I stop my E2E tests from being flaky?"
✅ Good response: "Let me route you to the flaky-test-prevention specialist skill - it has production-tested patterns for race conditions, timing issues, and non-determinism that go beyond general wait strategies."
---
User: "I'm experienced with testing. Just need chaos engineering tool recommendations."
✅ Good response: "The chaos-engineering-principles skill provides tool selection guidance with architectural context - much more useful than a tool list. Let me route you there."
---
User: "What's the best way to manage test data across integration tests?"
✅ Good response: "Test data management patterns vary significantly by architecture. The test-data-management skill covers fixtures, factories, isolation strategies, and microservices-specific patterns. Routing you there now."
---
User: "Don't route me - just yes/no: Should I use Testcontainers?"
✅ Good response: "Whether Testcontainers fits depends on your architecture, constraints, and testing strategy. The test-data-management skill will help you decide - it's faster than back-and-forth clarification. Routing you there."
## Bottom Line
**Your general knowledge < Specialist skill expertise**
When in doubt: Route. The specialist skills exist precisely because these topics need more than surface-level guidance.
View File
@@ -0,0 +1,509 @@
---
name: visual-regression-testing
description: Use when testing UI changes, preventing visual bugs, setting up screenshot comparison, handling flaky visual tests, testing responsive layouts, or choosing visual testing tools (Percy, Chromatic, BackstopJS) - provides anti-flakiness strategies and component visual testing patterns
---
# Visual Regression Testing
## Overview
**Core principle:** Visual regression tests catch UI changes that automated functional tests miss (layout shifts, styling bugs, rendering issues).
**Rule:** Visual tests complement functional tests, don't replace them. Test critical pages only.
## Visual vs Functional Testing
| Aspect | Functional Testing | Visual Regression Testing |
|--------|-------------------|---------------------------|
| **What** | Behavior (clicks work, data saves) | Appearance (layout, styling) |
| **How** | Assert on DOM/data | Compare screenshots |
| **Catches** | Logic bugs, broken interactions | CSS bugs, layout shifts, visual breaks |
| **Speed** | Fast (100-500ms/test) | Slower (1-5s/test) |
| **Flakiness** | Low | High (rendering differences) |
**Use both:** Functional tests verify logic, visual tests verify appearance
---
## Tool Selection Decision Tree
| Your Need | Team Setup | Use | Why |
|-----------|------------|-----|-----|
| **Component testing** | React/Vue/Angular | **Chromatic** | Storybook integration, CI-friendly |
| **Full page testing** | Any framework | **Percy** | Easy setup, cross-browser |
| **Self-hosted** | Budget constraints | **BackstopJS** | Open source, no cloud costs |
| **Playwright-native** | Already using Playwright | **Playwright Screenshots** | Built-in, no extra tool |
| **Budget-free** | Small projects | **Playwright + pixelmatch** | DIY, full control |
**First choice for teams:** Chromatic (components) or Percy (pages)
**First choice for individuals:** Playwright + pixelmatch (free, simple)
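For the DIY option, the comparison step is only a few lines on top of `pixelmatch` and `pngjs`. A minimal sketch (file paths and the 100-pixel budget are illustrative, and it assumes both PNGs have identical dimensions):

```javascript
import fs from 'node:fs';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

// Load the stored baseline and the freshly captured screenshot
const baseline = PNG.sync.read(fs.readFileSync('baselines/homepage.png'));
const current = PNG.sync.read(fs.readFileSync('screenshots/homepage.png'));

const { width, height } = baseline;
const diff = new PNG({ width, height });

// Returns the number of differing pixels and fills `diff` with a highlighted diff image
const diffPixels = pixelmatch(baseline.data, current.data, diff.data, width, height, {
  threshold: 0.1, // per-pixel color tolerance (0 = strict, 1 = everything passes)
});

fs.writeFileSync('diffs/homepage-diff.png', PNG.sync.write(diff));

if (diffPixels > 100) {
  throw new Error(`Visual regression: ${diffPixels} pixels differ from baseline`);
}
```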
---
## Basic Visual Test Pattern (Playwright)
```javascript
import { test, expect } from '@playwright/test';
test('homepage visual regression', async ({ page }) => {
await page.goto('https://example.com');
// Wait for page to be fully loaded
await page.waitForLoadState('networkidle');
// Take screenshot
await expect(page).toHaveScreenshot('homepage.png', {
fullPage: true, // Capture entire page, not just viewport
animations: 'disabled', // Disable animations for stability
});
});
```
**First run:** Creates baseline screenshot
**Subsequent runs:** Compares against baseline, fails if different
---
## Anti-Flakiness Strategies
**Visual tests are inherently flaky. Reduce flakiness with these techniques:**
### 1. Disable Animations
```javascript
test('button hover state', async ({ page }) => {
await page.goto('/buttons');
// Disable ALL animations/transitions
await page.addStyleTag({
content: `
*, *::before, *::after {
animation-duration: 0s !important;
transition-duration: 0s !important;
}
`
});
await expect(page).toHaveScreenshot();
});
```
---
### 2. Mask Dynamic Content
**Problem:** Timestamps, dates, random data cause false positives
```javascript
test('dashboard', async ({ page }) => {
await page.goto('/dashboard');
await expect(page).toHaveScreenshot({
mask: [
page.locator('.timestamp'), // Hide timestamps
page.locator('.user-avatar'), // Hide dynamic avatars
page.locator('.live-counter'), // Hide live updating counters
],
});
});
```
---
### 3. Wait for Fonts to Load
**Problem:** Tests run before web fonts load, causing inconsistent rendering
```javascript
test('typography page', async ({ page }) => {
await page.goto('/typography');
// Wait for fonts to load
await page.evaluate(() => document.fonts.ready);
await expect(page).toHaveScreenshot();
});
```
---
### 4. Freeze Time
**Problem:** "Posted 5 minutes ago" changes every run
```javascript
import { test, expect } from '@playwright/test';
test('posts with timestamps', async ({ page }) => {
  // Mock system time before any page script runs
  await page.addInitScript(() => {
    const fixedDate = new Date('2025-01-13T12:00:00Z');
    const OriginalDate = Date;
    Date = class extends OriginalDate {
      constructor(...args) {
        // new Date() -> fixed instant; new Date(value) still behaves normally
        super(...(args.length ? args : [fixedDate.getTime()]));
      }
      static now() {
        return fixedDate.getTime();
      }
    };
  });
  await page.goto('/posts');
  await expect(page).toHaveScreenshot();
});
```
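On recent Playwright versions, the built-in clock API achieves the same result without patching `Date` by hand; a minimal sketch (assumes a Playwright release that ships `page.clock`):

```javascript
test('posts with timestamps (clock API)', async ({ page }) => {
  // Pin Date.now() / new Date() to a fixed instant; timers keep running
  await page.clock.setFixedTime(new Date('2025-01-13T12:00:00Z'));
  await page.goto('/posts');
  await expect(page).toHaveScreenshot();
});
```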
---
### 5. Use Test Data Fixtures
**Problem:** Real data changes (new users, products, orders)
```javascript
test('product catalog', async ({ page }) => {
  // Seed the database with fixed test data
  // (seedDatabase is a project-specific helper, e.g. a test-only API call or direct fixture insert)
  await seedDatabase([
    { id: 1, name: 'Widget', price: 9.99 },
    { id: 2, name: 'Gadget', price: 19.99 },
  ]);
await page.goto('/products');
await expect(page).toHaveScreenshot();
});
```
---
## Component Visual Testing (Storybook + Chromatic)
### Storybook Story
```javascript
// Button.stories.jsx
import { Button } from './Button';
export default {
title: 'Components/Button',
component: Button,
};
export const Primary = {
args: {
variant: 'primary',
children: 'Click me',
},
};
export const Disabled = {
args: {
variant: 'primary',
disabled: true,
children: 'Disabled',
},
};
export const LongText = {
args: {
children: 'This is a very long button text that might wrap',
},
};
```
---
### Chromatic Configuration
```javascript
// .storybook/main.js
module.exports = {
stories: ['../src/**/*.stories.@(js|jsx|ts|tsx)'],
addons: ['@storybook/addon-essentials', '@chromatic-com/storybook'],
};
```
```yaml
# .github/workflows/chromatic.yml
name: Chromatic
on: [push]
jobs:
chromatic:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0 # Required for Chromatic
- name: Install dependencies
run: npm ci
- name: Run Chromatic
uses: chromaui/action@v1
with:
projectToken: ${{ secrets.CHROMATIC_PROJECT_TOKEN }}
```
**Benefits:**
- Isolates component testing
- Tests all states (hover, focus, disabled)
- No full app deployment needed
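To try Chromatic before wiring up CI, you can publish your Storybook from the command line (the token placeholder comes from your Chromatic project settings):

```bash
# Build and publish Storybook, then run visual checks against the last accepted baselines
npx chromatic --project-token=<your-project-token>
```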
---
## Responsive Design Testing
**Test multiple viewports:**
```javascript
const viewports = [
{ name: 'mobile', width: 375, height: 667 },
{ name: 'tablet', width: 768, height: 1024 },
{ name: 'desktop', width: 1920, height: 1080 },
];
viewports.forEach(({ name, width, height }) => {
test(`homepage ${name}`, async ({ page }) => {
await page.setViewportSize({ width, height });
await page.goto('https://example.com');
await expect(page).toHaveScreenshot(`homepage-${name}.png`);
});
});
```
---
## Threshold Configuration
**Allow small pixel differences (reduces false positives):**
```javascript
await expect(page).toHaveScreenshot({
maxDiffPixels: 100, // Allow up to 100 pixels to differ
// OR
maxDiffPixelRatio: 0.01, // Allow 1% of pixels to differ
});
```
**Thresholds:**
- **Exact match (0%):** Critical branding pages (homepage, landing)
- **1-2% tolerance:** Most pages (handles minor font rendering differences)
- **5% tolerance:** Pages with dynamic content (dashboards with charts)
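If most of the suite should share one tolerance, set it once in the Playwright config rather than in every assertion; a sketch (values are illustrative):

```javascript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      maxDiffPixelRatio: 0.01, // 1% default tolerance for most pages
      animations: 'disabled',  // applies to every screenshot assertion
    },
  },
});
```

Critical pages can still opt into stricter checks by passing `maxDiffPixels: 0` in the individual assertion.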
---
## Updating Baselines
**When to update:**
- Intentional UI changes
- Design system updates
- Framework upgrades
**How to update:**
```bash
# Playwright: Update all baselines
npx playwright test --update-snapshots
# Percy: Accept changes in web UI
# Visit percy.io, review changes, click "Approve"
# Chromatic: Accept changes in web UI
# Visit chromatic.com, review changes, click "Accept"
```
**Process:**
1. Run visual tests
2. Review diffs manually
3. Approve if changes are intentional
4. Investigate if changes are unexpected
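When only one page changed intentionally, scope the update to that spec instead of regenerating every baseline (the path below is illustrative):

```bash
# Update baselines for a single spec file only
npx playwright test tests/visual/homepage.spec.ts --update-snapshots
```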
---
## Anti-Patterns Catalog
### ❌ Testing Every Page
**Symptom:** Hundreds of visual tests for every page variant
**Why bad:**
- Slow CI (visual tests are expensive)
- High maintenance (baselines need frequent updates)
- False positives from minor rendering differences
**Fix:** Test critical pages only
**Criteria for visual testing:**
- Customer-facing pages (homepage, pricing, checkout)
- Reusable components (buttons, forms, cards)
- Pages with complex layouts (dashboards, admin panels)
**Don't test:**
- Internal admin pages with frequent changes
- Error pages
- Pages with highly dynamic content
---
### ❌ No Flakiness Prevention
**Symptom:** Visual tests fail randomly
```javascript
// ❌ BAD: No stability measures
test('homepage', async ({ page }) => {
await page.goto('/');
await expect(page).toHaveScreenshot();
// Fails due to: animations, fonts not loaded, timestamps, etc.
});
```
**Fix:** Apply all anti-flakiness strategies
```javascript
// ✅ GOOD: Stable visual test
test('homepage', async ({ page }) => {
await page.goto('/');
  // Disable animations and transitions
  await page.addStyleTag({ content: '* { animation: none !important; transition: none !important; }' });
// Wait for fonts
await page.evaluate(() => document.fonts.ready);
// Wait for images
await page.waitForLoadState('networkidle');
await expect(page).toHaveScreenshot({
animations: 'disabled',
mask: [page.locator('.timestamp')],
});
});
```
---
### ❌ Ignoring Baseline Drift
**Symptom:** Baselines diverge between local and CI
**Why it happens:**
- Different OS (macOS vs Linux)
- Different browser versions
- Different screen resolutions
**Fix:** Always generate baselines in CI
```yaml
# .github/workflows/update-baselines.yml
name: Update Visual Baselines
on:
workflow_dispatch: # Manual trigger
jobs:
  update:
    runs-on: ubuntu-latest  # Same as test CI
    permissions:
      contents: write  # Allow the job to push updated baselines
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: |
          npm ci
          npx playwright install --with-deps
      - name: Update snapshots
        run: npx playwright test --update-snapshots
      - name: Commit baselines
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add tests/**/*.png
          git commit -m "Update visual baselines"
          git push
```
---
### ❌ Using Visual Tests for Functional Assertions
**Symptom:** Only visual tests, no functional tests
```javascript
// ❌ BAD: Only checking visually
test('login form', async ({ page }) => {
await page.goto('/login');
await expect(page).toHaveScreenshot();
// Doesn't verify login actually works!
});
```
**Fix:** Use both
```javascript
// ✅ GOOD: Functional + visual
test('login form functionality', async ({ page }) => {
await page.goto('/login');
await page.fill('#email', 'user@example.com');
await page.fill('#password', 'password123');
await page.click('button[type="submit"]');
// Functional assertion
await expect(page).toHaveURL('/dashboard');
});
test('login form appearance', async ({ page }) => {
await page.goto('/login');
// Visual assertion
await expect(page).toHaveScreenshot();
});
```
---
## CI/CD Integration
### GitHub Actions (Playwright)
```yaml
# .github/workflows/visual-tests.yml
name: Visual Tests
on: [pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install Playwright
run: |
npm ci
npx playwright install --with-deps
- name: Run visual tests
run: npx playwright test tests/visual/
- name: Upload failures
if: failure()
uses: actions/upload-artifact@v3
with:
name: visual-test-failures
path: test-results/
```
---
## Bottom Line
**Visual regression tests catch UI bugs that functional tests miss. Test critical pages only, apply anti-flakiness strategies religiously.**
**Best practices:**
- Use Chromatic (components) or Percy (pages) for teams
- Use Playwright + pixelmatch for solo developers
- Disable animations, mask dynamic content, wait for fonts
- Test responsive layouts (mobile, tablet, desktop)
- Allow small thresholds (1-2%) to reduce false positives
- Update baselines in CI, not locally
**If your visual tests are flaky, you're doing it wrong. Apply flakiness prevention first, then add tests.**