# Production Debugging Techniques Skill
## When to Use This Skill
Use this skill when:
- Investigating production incidents or outages
- Debugging performance bottlenecks or latency spikes
- Analyzing model quality issues (wrong predictions, hallucinations)
- Investigating A/B test anomalies or statistical issues
- Performing post-incident analysis and root cause investigation
- Debugging edge cases or unexpected behavior
- Analyzing production logs, traces, and metrics
**When NOT to use:** Development debugging (use IDE debugger), unit test failures (use TDD), or pre-production validation.
## Core Principle
**Production debugging is forensic investigation, not random guessing.**
Without systematic debugging:
- You make random changes hoping to fix issues (doesn't address root cause)
- You guess bottlenecks without data (optimize the wrong things)
- You can't diagnose issues from logs (missing critical information)
- You panic and rollback without learning (incidents repeat)
- You skip post-mortems (no prevention, just reaction)
**Formula:** Reproduce → Profile → Diagnose → Fix → Verify → Document = Systematic resolution.
## Production Debugging Framework
```
┌─────────────────────────────────┐
│    Incident Detection/Report    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Systematic Reproduction     │
│ Minimal repro, not speculation  │
└───────────────┬─────────────────┘
                │
     ┌──────────┼───────────┐
     │          │           │
┌────▼──────┐ ┌─▼────────┐ ┌▼───────────┐
│Performance│ │  Error   │ │    Model   │
│ Profiling │ │ Analysis │ │  Debugging │
└────┬──────┘ └─┬────────┘ └┬───────────┘
     │          │           │
     └──────────┼───────────┘
                │
┌───────────────▼─────────────────┐
│    Root Cause Identification    │
│   Not symptoms, actual cause    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│       Fix Implementation        │
│     Targeted, verified fix      │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│          Verification           │
│         Prove fix works         │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│    Post-Mortem & Prevention     │
│      Blameless, actionable      │
└─────────────────────────────────┘
```
## RED Phase: Common Debugging Anti-Patterns
### Anti-Pattern 1: Random Changes (No Systematic Debugging)
**Symptom:** "Let me try changing this parameter and see if it helps."
**Why it fails:**
- No reproduction of the issue (can't verify fix)
- No understanding of root cause (might fix symptom, not cause)
- No measurement of impact (did it actually help?)
- Creates more problems (unintended side effects)
**Example:**
```python
# WRONG: Random parameter changes without investigation
def fix_slow_inference():
    # User reported slow inference, let's just try stuff
    model.batch_size = 32   # Maybe this helps?
    model.num_threads = 8   # Or this?
    model.use_cache = True  # Definitely cache!
    # Did any of this help? Who knows!
```
**Consequences:**
- Issue not actually fixed (root cause still present)
- New issues introduced (different batch size breaks memory)
- Can't explain what fixed it (no learning)
- Incident repeats (no prevention)
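For contrast, a minimal sketch of the disciplined alternative: change one parameter at a time and measure before and after. The workload callables and timings below are placeholders, not code from any real service.

```python
import statistics
import time
from typing import Callable, Dict

def measure_latency(run_workload: Callable[[], None], runs: int = 50) -> Dict[str, float]:
    """Run the workload repeatedly and report latency percentiles in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_workload()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95_index = min(len(samples) - 1, int(len(samples) * 0.95))
    return {"p50_ms": statistics.median(samples), "p95_ms": samples[p95_index]}

# Measure a baseline, apply exactly ONE change, then measure again.
baseline = measure_latency(lambda: time.sleep(0.010))   # current behaviour (stand-in)
candidate = measure_latency(lambda: time.sleep(0.008))  # behaviour after one change (stand-in)
print(f"p95 before: {baseline['p95_ms']:.1f} ms, after: {candidate['p95_ms']:.1f} ms")
```

If the numbers do not move, revert the change before trying the next one; measuring one change at a time is what keeps cause and effect attributable.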
### Anti-Pattern 2: No Profiling (Guess Bottlenecks)
**Symptom:** "The database is probably slow, let's add caching everywhere."
**Why it fails:**
- Optimize based on intuition, not data
- Miss actual bottleneck (CPU, not DB)
- Waste time on irrelevant optimizations
- No measurable improvement
**Example:**
```python
# WRONG: Adding caching without profiling
def optimize_without_profiling():
    # Guess: Database is slow
    @cache  # Add caching everywhere
    def get_user_data(user_id):
        return db.query(user_id)
    # Actual bottleneck: JSON serialization (not DB)
    # Caching doesn't help!
```
**Consequences:**
- Latency still high (actual bottleneck not addressed)
- Increased complexity (caching layer adds bugs)
- Wasted optimization effort (wrong target)
- No improvement in metrics
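For contrast, a profile-first sketch using the standard library's cProfile: measure where the time actually goes before touching anything. The `handle_request` stand-in is an assumption for illustration and deliberately spends its time in serialization rather than the "database".

```python
import cProfile
import io
import json
import pstats
import time

def handle_request(user_id: int) -> str:
    """Stand-in handler: a fast 'database call' and a heavier serialization step."""
    time.sleep(0.001)                                     # pretend DB query (fast)
    payload = {"user": user_id, "items": list(range(20000))}
    return json.dumps(payload)                            # serialization dominates

profiler = cProfile.Profile()
profiler.enable()
for uid in range(100):
    handle_request(uid)
profiler.disable()

# Rank functions by cumulative time: optimize what the data says is slow.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In this toy example, caching the "database" would change nothing; the profile points at serialization instead.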
### Anti-Pattern 3: Bad Logging (Can't Diagnose Issues)
**Symptom:** "An error occurred but I can't figure out what caused it."
**Why it fails:**
- Missing context (no user ID, request ID, timestamp)
- No structured logging (can't query or aggregate)
- Too much noise (logs everything, signal buried)
- No trace IDs (can't follow request across services)
**Example:**
```python
# WRONG: Useless logging
def process_request(request):
    print("Processing request")  # What request? When? By whom?
    try:
        result = model.predict(request.data)
    except Exception as e:
        print(f"Error: {e}")  # No context, can't debug
    print("Done")  # Success or failure?
```
**Consequences:**
- Can't reproduce issues (missing critical context)
- Can't trace distributed requests (no correlation)
- Can't analyze patterns (unstructured data)
- Slow investigation (manual log digging)
### Anti-Pattern 4: Panic Rollback (Don't Learn from Incidents)
**Symptom:** "There's an error! Rollback immediately! Now!"
**Why it fails:**
- No evidence collection (can't do post-mortem)
- No root cause analysis (will happen again)
- Lose opportunity to learn (panic mode)
- No distinction between minor and critical issues
**Example:**
```python
# WRONG: Immediate rollback without investigation
def handle_incident():
    if error_rate > 0.1:  # Any errors = panic
        # Rollback immediately!
        deploy_previous_version()
        # Wait, what was the error? We'll never know now...
```
**Consequences:**
- Issue repeats (root cause not fixed)
- Lost learning opportunity (no forensics)
- Unnecessary rollbacks (minor issues treated as critical)
- Team doesn't improve (no post-mortem)
### Anti-Pattern 5: No Post-Mortems
**Symptom:** "Incident resolved, let's move on to the next task."
**Why it fails:**
- No prevention (same incident repeats)
- No learning (team doesn't improve)
- No action items (nothing changes)
- Culture of blame (fear of investigation)
**Example:**
```python
# WRONG: No post-mortem process
def resolve_incident(incident):
    fix_issue(incident)
    close_ticket(incident)
    # Done! What incident? Already forgot...
    # No documentation, no prevention, no learning
```
**Consequences:**
- Incidents repeat (no prevention mechanisms)
- No improvement (same mistakes over and over)
- Low bus factor (knowledge not shared)
- Reactive culture (firefighting, not prevention)
## GREEN Phase: Systematic Debugging Methodology
### Part 1: Systematic Debugging Framework
**Core principle:** Reproduce → Diagnose → Fix → Verify
**Step-by-step process:**
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from datetime import datetime
import logging
@dataclass
class DebuggingSession:
"""
Structured debugging session with systematic methodology.
"""
incident_id: str
reported_by: str
reported_at: datetime
description: str
severity: str # CRITICAL, HIGH, MEDIUM, LOW
# Reproduction
reproduction_steps: List[str] = None
minimal_repro: str = None
reproduction_rate: float = 0.0 # 0.0 to 1.0
# Diagnosis
hypothesis: str = None
evidence: Dict[str, Any] = None
root_cause: str = None
# Fix
fix_description: str = None
fix_verification: str = None
# Prevention
prevention_measures: List[str] = None
def __post_init__(self):
self.reproduction_steps = []
self.evidence = {}
self.prevention_measures = []
class SystematicDebugger:
"""
Systematic debugging methodology for production issues.
"""
def __init__(self):
self.logger = logging.getLogger(__name__)
self.sessions: Dict[str, DebuggingSession] = {}
def start_session(
self,
incident_id: str,
reported_by: str,
description: str,
severity: str
) -> DebuggingSession:
"""
Start a new debugging session.
Args:
incident_id: Unique incident identifier
reported_by: Who reported the issue
description: What is the problem
severity: CRITICAL, HIGH, MEDIUM, LOW
Returns:
DebuggingSession object
"""
session = DebuggingSession(
incident_id=incident_id,
reported_by=reported_by,
reported_at=datetime.now(),
description=description,
severity=severity
)
self.sessions[incident_id] = session
self.logger.info(
f"Started debugging session",
extra={
"incident_id": incident_id,
"severity": severity,
"description": description
}
)
return session
def reproduce_issue(
self,
session: DebuggingSession,
reproduction_steps: List[str]
) -> bool:
"""
Step 1: Reproduce the issue with minimal test case.
Goal: Create minimal, deterministic reproduction.
Args:
session: Debugging session
reproduction_steps: Steps to reproduce
Returns:
True if successfully reproduced
"""
session.reproduction_steps = reproduction_steps
# Try to reproduce
for attempt in range(10):
if self._attempt_reproduction(reproduction_steps):
session.reproduction_rate += 0.1
session.reproduction_rate = session.reproduction_rate
reproduced = session.reproduction_rate > 0.5
self.logger.info(
f"Reproduction attempt",
extra={
"incident_id": session.incident_id,
"reproduced": reproduced,
"reproduction_rate": session.reproduction_rate
}
)
return reproduced
def _attempt_reproduction(self, steps: List[str]) -> bool:
"""
Attempt to reproduce issue.
Implementation depends on issue type.
"""
# Override in subclass
return False
def collect_evidence(
self,
session: DebuggingSession,
evidence_type: str,
evidence_data: Any
):
"""
Step 2: Collect evidence from multiple sources.
Evidence types:
- logs: Application logs
- traces: Distributed traces
- metrics: Performance metrics
- profiles: CPU/memory profiles
- requests: Failed request data
"""
if evidence_type not in session.evidence:
session.evidence[evidence_type] = []
session.evidence[evidence_type].append({
"timestamp": datetime.now(),
"data": evidence_data
})
self.logger.info(
f"Collected evidence",
extra={
"incident_id": session.incident_id,
"evidence_type": evidence_type
}
)
def form_hypothesis(
self,
session: DebuggingSession,
hypothesis: str
):
"""
Step 3: Form hypothesis based on evidence.
Good hypothesis:
- Specific and testable
- Based on evidence, not intuition
- Explains all symptoms
"""
session.hypothesis = hypothesis
self.logger.info(
f"Formed hypothesis",
extra={
"incident_id": session.incident_id,
"hypothesis": hypothesis
}
)
def verify_hypothesis(
self,
session: DebuggingSession,
verification_test: str,
result: bool
) -> bool:
"""
Step 4: Verify hypothesis with targeted test.
Args:
session: Debugging session
verification_test: What test was run
result: Did hypothesis hold?
Returns:
True if hypothesis verified
"""
self.collect_evidence(
session,
"hypothesis_verification",
{
"test": verification_test,
"result": result,
"hypothesis": session.hypothesis
}
)
return result
def identify_root_cause(
self,
session: DebuggingSession,
root_cause: str
):
"""
Step 5: Identify root cause (not just symptoms).
Root cause vs symptom:
- Symptom: "API returns 500 errors"
- Root cause: "Connection pool exhausted due to connection leak"
"""
session.root_cause = root_cause
self.logger.info(
f"Identified root cause",
extra={
"incident_id": session.incident_id,
"root_cause": root_cause
}
)
def implement_fix(
self,
session: DebuggingSession,
fix_description: str,
fix_code: str = None
):
"""
Step 6: Implement targeted fix.
Good fix:
- Addresses root cause, not symptom
- Minimal changes (surgical fix)
- Includes verification test
"""
session.fix_description = fix_description
self.logger.info(
f"Implemented fix",
extra={
"incident_id": session.incident_id,
"fix_description": fix_description
}
)
def verify_fix(
self,
session: DebuggingSession,
verification_method: str,
verified: bool
) -> bool:
"""
Step 7: Verify fix resolves the issue.
Verification methods:
- Reproduction test no longer fails
- Metrics return to normal
- No new errors in logs
- A/B test shows improvement
"""
session.fix_verification = verification_method
self.logger.info(
f"Verified fix",
extra={
"incident_id": session.incident_id,
"verified": verified,
"verification_method": verification_method
}
)
return verified
def add_prevention_measure(
self,
session: DebuggingSession,
measure: str
):
"""
Step 8: Add prevention measures.
Prevention types:
- Monitoring: Alert on similar patterns
- Testing: Add regression test
- Validation: Input validation to prevent
- Documentation: Runbook for similar issues
"""
session.prevention_measures.append(measure)
self.logger.info(
f"Added prevention measure",
extra={
"incident_id": session.incident_id,
"measure": measure
}
)
# Example usage
debugger = SystematicDebugger()
# Start debugging session
session = debugger.start_session(
incident_id="INC-2025-001",
reported_by="oncall-engineer",
description="API latency spike from 200ms to 2000ms",
severity="HIGH"
)
# Step 1: Reproduce
reproduced = debugger.reproduce_issue(
session,
reproduction_steps=[
"Send 100 concurrent requests to /api/predict",
"Observe latency increase after 50 requests",
"Check connection pool metrics"
]
)
if reproduced:
# Step 2: Collect evidence
debugger.collect_evidence(session, "metrics", {
"latency_p50": 2000,
"latency_p95": 5000,
"connection_pool_size": 10,
"active_connections": 10,
"waiting_requests": 90
})
# Step 3: Form hypothesis
debugger.form_hypothesis(
session,
"Connection pool exhausted. Pool size (10) too small for load (100 concurrent)."
)
# Step 4: Verify hypothesis
verified = debugger.verify_hypothesis(
session,
"Increased pool size to 50, latency returned to normal",
True
)
if verified:
# Step 5: Root cause
debugger.identify_root_cause(
session,
"Connection pool size not scaled with traffic increase"
)
# Step 6: Implement fix
debugger.implement_fix(
session,
"Increase connection pool size to 50 and add auto-scaling"
)
# Step 7: Verify fix
debugger.verify_fix(
session,
"A/B test: latency p95 < 300ms for 1 hour",
True
)
# Step 8: Prevention
debugger.add_prevention_measure(
session,
"Alert when connection pool utilization > 80%"
)
debugger.add_prevention_measure(
session,
"Load test before deploying to production"
)
```
**Key principles:**
1. **Reproduce first:** Can't debug what you can't reproduce
2. **Evidence-based:** Collect data before forming hypothesis
3. **Root cause, not symptom:** Fix the actual cause
4. **Verify fix:** Prove it works before closing
5. **Prevent recurrence:** Add monitoring and tests
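As a sketch of principles 1 and 5 together, the reproduction steps from the example session above can be frozen into an automated regression test. The `call_predict_api` stub and the 300 ms budget are assumptions for illustration; wire in the real client and the SLO you actually own.

```python
import concurrent.futures
import time

def call_predict_api() -> float:
    """Hypothetical client stub for POST /api/predict; returns latency in ms."""
    start = time.perf_counter()
    time.sleep(0.02)  # stands in for the real HTTP round trip
    return (time.perf_counter() - start) * 1000

def test_concurrent_predict_latency_regression():
    """Regression test derived from the INC-2025-001 reproduction steps:
    100 concurrent requests must keep p95 latency under 300 ms."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        latencies = sorted(pool.map(lambda _: call_predict_api(), range(100)))
    p95 = latencies[94]
    assert p95 < 300, f"p95 latency regressed to {p95:.0f} ms"

if __name__ == "__main__":
    test_concurrent_predict_latency_regression()
    print("Regression test passed")
```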
### Part 2: Performance Profiling
**When to profile:**
- Latency spikes or slow responses
- High CPU or memory usage
- Resource exhaustion (connections, threads)
- Optimization opportunities
#### CPU Profiling with py-spy
```python
import subprocess
import signal
import time
from pathlib import Path
class ProductionProfiler:
"""
Non-intrusive profiling for production systems.
"""
def __init__(self, output_dir: str = "./profiles"):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
def profile_cpu(
self,
pid: int,
duration: int = 60,
rate: int = 100
) -> str:
"""
Profile CPU usage with py-spy (no code changes needed).
Args:
pid: Process ID to profile
duration: How long to profile (seconds)
rate: Sampling rate (samples/second)
Returns:
Path to flamegraph SVG
Usage:
# Install: pip install py-spy
# Run: sudo py-spy record -o profile.svg --pid 12345 --duration 60
"""
output_file = self.output_dir / f"cpu_profile_{pid}_{int(time.time())}.svg"
cmd = [
"py-spy", "record",
"-o", str(output_file),
"--pid", str(pid),
"--duration", str(duration),
"--rate", str(rate),
"--format", "flamegraph"
]
print(f"Profiling PID {pid} for {duration} seconds...")
subprocess.run(cmd, check=True)
print(f"Profile saved to: {output_file}")
return str(output_file)
def profile_memory(
self,
pid: int,
duration: int = 60
) -> str:
"""
Profile memory usage with memory_profiler.
Returns:
Path to memory profile
"""
output_file = self.output_dir / f"memory_profile_{pid}_{int(time.time())}.txt"
# Use memory_profiler for line-by-line analysis
cmd = [
"python", "-m", "memory_profiler",
"--backend", "psutil",
str(pid)
]
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=duration
)
output_file.write_text(result.stdout)
print(f"Memory profile saved to: {output_file}")
return str(output_file)
# Example: Profile production inference
profiler = ProductionProfiler()
# Get PID of running process
import os
pid = os.getpid()
# Profile for 60 seconds
flamegraph = profiler.profile_cpu(pid, duration=60)
print(f"View flamegraph: {flamegraph}")
# Analyze flamegraph:
# - Wide bars = most time spent (bottleneck)
# - Look for unexpected functions
# - Check for excessive I/O waits
```
#### PyTorch Model Profiling
```python
import torch
import torch.profiler as profiler
from typing import Dict, List
import json
class ModelProfiler:
"""
Profile PyTorch model performance.
"""
def profile_model(
self,
model: torch.nn.Module,
sample_input: torch.Tensor,
num_steps: int = 100
) -> Dict[str, any]:
"""
Profile model inference with PyTorch profiler.
Args:
model: PyTorch model
sample_input: Sample input tensor
num_steps: Number of profiling steps
Returns:
Profiling results
"""
model.eval()
with profiler.profile(
activities=[
profiler.ProfilerActivity.CPU,
profiler.ProfilerActivity.CUDA,
],
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
with profiler.record_function("model_inference"):
for _ in range(num_steps):
with torch.no_grad():
_ = model(sample_input)
# Print report
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=10
))
# Save trace
prof.export_chrome_trace("model_trace.json")
# Analyze results
results = self._analyze_profile(prof)
return results
def _analyze_profile(self, prof) -> Dict[str, any]:
"""
Analyze profiling results.
"""
events = prof.key_averages()
# Find bottlenecks
cpu_events = sorted(
[e for e in events if e.device_type == profiler.DeviceType.CPU],
key=lambda e: e.self_cpu_time_total,
reverse=True
)
cuda_events = sorted(
[e for e in events if e.device_type == profiler.DeviceType.CUDA],
key=lambda e: e.self_cuda_time_total,
reverse=True
)
results = {
"top_cpu_ops": [
{
"name": e.key,
"cpu_time_ms": e.self_cpu_time_total / 1000,
"calls": e.count
}
for e in cpu_events[:10]
],
"top_cuda_ops": [
{
"name": e.key,
"cuda_time_ms": e.self_cuda_time_total / 1000,
"calls": e.count
}
for e in cuda_events[:10]
],
"total_cpu_time_ms": sum(e.self_cpu_time_total for e in events) / 1000,
"total_cuda_time_ms": sum(e.self_cuda_time_total for e in events) / 1000,
}
return results
# Example usage
import torch.nn as nn
model = nn.Transformer(
d_model=512,
nhead=8,
num_encoder_layers=6,
num_decoder_layers=6
)
sample_input = torch.randn(10, 32, 512) # (seq_len, batch, d_model)
profiler = ModelProfiler()
results = profiler.profile_model(model, sample_input)
print(json.dumps(results, indent=2))
# Identify bottlenecks:
# - Which operations take most time?
# - CPU vs GPU time (data transfer overhead?)
# - Memory usage patterns
```
#### Database Query Profiling
```python
import time
from contextlib import contextmanager
from typing import Dict, List
import logging
class QueryProfiler:
"""
Profile database query performance.
"""
def __init__(self):
self.logger = logging.getLogger(__name__)
self.query_stats: List[Dict] = []
@contextmanager
def profile_query(self, query_name: str):
"""
Context manager to profile a query.
Usage:
with profiler.profile_query("get_user"):
user = db.query(User).filter_by(id=user_id).first()
"""
start = time.perf_counter()
try:
yield
finally:
duration = (time.perf_counter() - start) * 1000 # ms
self.query_stats.append({
"query": query_name,
"duration_ms": duration,
"timestamp": time.time()
})
if duration > 100: # Slow query threshold
self.logger.warning(
f"Slow query detected",
extra={
"query": query_name,
"duration_ms": duration
}
)
def get_slow_queries(self, threshold_ms: float = 100) -> List[Dict]:
"""
Get queries slower than threshold.
"""
return [
q for q in self.query_stats
if q["duration_ms"] > threshold_ms
]
def get_query_stats(self) -> Dict[str, Dict]:
"""
Get aggregate statistics per query.
"""
from collections import defaultdict
import statistics
stats_by_query = defaultdict(list)
for q in self.query_stats:
stats_by_query[q["query"]].append(q["duration_ms"])
result = {}
for query, durations in stats_by_query.items():
result[query] = {
"count": len(durations),
"mean_ms": statistics.mean(durations),
"median_ms": statistics.median(durations),
"p95_ms": sorted(durations)[int(len(durations) * 0.95)]
if len(durations) > 0 else 0,
"max_ms": max(durations)
}
return result
# Example usage
profiler = QueryProfiler()
# Profile queries
for user_id in range(100):
with profiler.profile_query("get_user"):
# user = db.query(User).filter_by(id=user_id).first()
time.sleep(0.05) # Simulate query
with profiler.profile_query("get_posts"):
# posts = db.query(Post).filter_by(user_id=user_id).all()
time.sleep(0.15) # Simulate slow query
# Analyze
slow_queries = profiler.get_slow_queries(threshold_ms=100)
print(f"Found {len(slow_queries)} slow queries")
stats = profiler.get_query_stats()
for query, metrics in stats.items():
print(f"{query}: {metrics}")
```
**Key profiling insights:**
| Profile Type | Tool | What to Look For |
|--------------|------|------------------|
| CPU | py-spy | Wide bars in flamegraph (bottlenecks) |
| Memory | memory_profiler | Memory leaks, large allocations |
| Model | torch.profiler | Slow operations, CPU-GPU transfer |
| Database | Query profiler | Slow queries, N+1 queries (see the sketch below) |
| Network | distributed tracing | High latency services, cascading failures |
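One pattern the table mentions but the `QueryProfiler` above does not catch is the N+1 query. A small sketch, assuming you can group recorded query names by request ID (the request data shown is hypothetical):

```python
from collections import Counter
from typing import Dict, List

def detect_n_plus_one(
    queries_per_request: Dict[str, List[str]],
    repeat_threshold: int = 10,
) -> List[Dict[str, object]]:
    """Flag requests where the same query repeats many times: a likely N+1 pattern."""
    findings = []
    for request_id, query_names in queries_per_request.items():
        for name, count in Counter(query_names).items():
            if count >= repeat_threshold:
                findings.append(
                    {"request_id": request_id, "query": name, "repetitions": count}
                )
    return findings

# Example: one request runs get_post once per comment instead of a single batched query.
observed = {
    "req-1": ["get_user"] + ["get_post"] * 25,
    "req-2": ["get_user", "get_posts"],
}
for finding in detect_n_plus_one(observed):
    print(f"Likely N+1 in {finding['request_id']}: "
          f"{finding['query']} ran {finding['repetitions']} times")
```

The usual fix is a joined or batched query; the profiler confirms it when the repetition count drops to one per request.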
### Part 3: Error Analysis and Root Cause Investigation
**Goal:** Categorize errors, find patterns, identify root cause (not symptoms).
```python
from dataclasses import dataclass
from typing import List, Dict, Optional
from collections import Counter, defaultdict
from datetime import datetime, timedelta
import re
@dataclass
class ErrorEvent:
"""
Structured error event.
"""
timestamp: datetime
error_type: str
error_message: str
stack_trace: str
user_id: Optional[str] = None
request_id: Optional[str] = None
endpoint: Optional[str] = None
severity: str = "ERROR" # DEBUG, INFO, WARNING, ERROR, CRITICAL
# Context
input_data: Optional[Dict] = None
system_state: Optional[Dict] = None
class ErrorAnalyzer:
"""
Analyze error patterns and identify root causes.
"""
def __init__(self):
self.errors: List[ErrorEvent] = []
def add_error(self, error: ErrorEvent):
"""Add error to analysis."""
self.errors.append(error)
def categorize_errors(self) -> Dict[str, List[ErrorEvent]]:
"""
Categorize errors by type.
Categories:
- Input validation errors
- Model inference errors
- Infrastructure errors (DB, network)
- Third-party API errors
- Resource exhaustion errors
"""
categories = defaultdict(list)
for error in self.errors:
category = self._categorize_single_error(error)
categories[category].append(error)
return dict(categories)
def _categorize_single_error(self, error: ErrorEvent) -> str:
"""
Categorize single error based on error message and type.
"""
msg = error.error_message.lower()
# Input validation
if any(keyword in msg for keyword in ["invalid", "validation", "schema"]):
return "input_validation"
# Model errors
if any(keyword in msg for keyword in ["model", "inference", "prediction"]):
return "model_inference"
# Infrastructure
if any(keyword in msg for keyword in ["connection", "timeout", "database"]):
return "infrastructure"
# Resource exhaustion
if any(keyword in msg for keyword in ["memory", "cpu", "quota", "limit"]):
return "resource_exhaustion"
# Third-party
if any(keyword in msg for keyword in ["api", "external", "service"]):
return "third_party"
return "unknown"
def find_error_patterns(self) -> List[Dict]:
"""
Find patterns in errors (temporal, user, endpoint).
"""
patterns = []
# Temporal clustering (errors spike at certain times?)
temporal = self._analyze_temporal_patterns()
if temporal:
patterns.append({
"type": "temporal",
"description": f"Error spike detected",
"details": temporal
})
# User clustering (errors for specific users?)
user_errors = defaultdict(int)
for error in self.errors:
if error.user_id:
user_errors[error.user_id] += 1
# Top 5 users with most errors
top_users = sorted(
user_errors.items(),
key=lambda x: x[1],
reverse=True
)[:5]
if top_users and top_users[0][1] > 10:
patterns.append({
"type": "user_specific",
"description": f"High error rate for specific users",
"details": {"top_users": top_users}
})
# Endpoint clustering
endpoint_errors = defaultdict(int)
for error in self.errors:
if error.endpoint:
endpoint_errors[error.endpoint] += 1
top_endpoints = sorted(
endpoint_errors.items(),
key=lambda x: x[1],
reverse=True
)[:5]
if top_endpoints:
patterns.append({
"type": "endpoint_specific",
"description": f"Errors concentrated in specific endpoints",
"details": {"top_endpoints": top_endpoints}
})
return patterns
def _analyze_temporal_patterns(self) -> Optional[Dict]:
"""
Detect temporal error patterns (spikes, periodicity).
"""
if len(self.errors) < 10:
return None
# Group by hour
errors_by_hour = defaultdict(int)
for error in self.errors:
hour_key = error.timestamp.replace(minute=0, second=0, microsecond=0)
errors_by_hour[hour_key] += 1
# Calculate average and detect spikes
error_counts = list(errors_by_hour.values())
avg_errors = sum(error_counts) / len(error_counts)
max_errors = max(error_counts)
if max_errors > avg_errors * 3: # 3x spike
spike_hour = max(errors_by_hour, key=errors_by_hour.get)
return {
"avg_errors_per_hour": avg_errors,
"max_errors_per_hour": max_errors,
"spike_time": spike_hour.isoformat(),
"spike_magnitude": max_errors / avg_errors
}
return None
def identify_root_cause(
self,
error_category: str,
errors: List[ErrorEvent]
) -> Dict:
"""
Identify root cause for category of errors.
Analysis steps:
1. Find common patterns in error messages
2. Analyze system state at error time
3. Check for external factors (deployment, traffic spike)
4. Identify root cause vs symptoms
"""
analysis = {
"category": error_category,
"total_errors": len(errors),
"time_range": {
"start": min(e.timestamp for e in errors).isoformat(),
"end": max(e.timestamp for e in errors).isoformat()
}
}
# Common error messages
error_messages = [e.error_message for e in errors]
message_counts = Counter(error_messages)
analysis["most_common_errors"] = message_counts.most_common(5)
# Stack trace analysis (find common frames)
common_frames = self._find_common_stack_frames(errors)
analysis["common_stack_frames"] = common_frames
# Hypothesis based on category
if error_category == "input_validation":
analysis["hypothesis"] = "Client sending invalid data. Check API contract."
analysis["action_items"] = [
"Add input validation at API layer",
"Return clear error messages to client",
"Add monitoring for validation failures"
]
elif error_category == "model_inference":
analysis["hypothesis"] = "Model failing on specific inputs. Check edge cases."
analysis["action_items"] = [
"Analyze failed inputs for patterns",
"Add input sanitization before inference",
"Add fallback for model failures",
"Retrain model with failed examples"
]
elif error_category == "infrastructure":
analysis["hypothesis"] = "Infrastructure issue (DB, network). Check external dependencies."
analysis["action_items"] = [
"Check database connection pool size",
"Check network connectivity to services",
"Add retry logic with exponential backoff",
"Add circuit breaker for failing services"
]
elif error_category == "resource_exhaustion":
analysis["hypothesis"] = "Resource limits exceeded. Scale up or optimize."
analysis["action_items"] = [
"Profile memory/CPU usage",
"Increase resource limits",
"Optimize hot paths",
"Add auto-scaling"
]
return analysis
def _find_common_stack_frames(
self,
errors: List[ErrorEvent],
min_frequency: float = 0.5
) -> List[str]:
"""
Find stack frames common to most errors.
"""
frame_counts = Counter()
for error in errors:
# Extract function names from stack trace
frames = re.findall(r'File ".*", line \d+, in (\w+)', error.stack_trace)
frame_counts.update(frames)
# Find frames in at least 50% of errors
threshold = len(errors) * min_frequency
common_frames = [
frame for frame, count in frame_counts.items()
if count >= threshold
]
return common_frames
# Example usage
analyzer = ErrorAnalyzer()
# Simulate errors
for i in range(100):
if i % 10 == 0: # Pattern: every 10th request fails
analyzer.add_error(ErrorEvent(
timestamp=datetime.now() + timedelta(seconds=i),
error_type="ValueError",
error_message="Invalid input shape: expected (batch, 512), got (batch, 256)",
stack_trace='File "model.py", line 42, in predict\n result = self.model(input_tensor)',
user_id=f"user_{i % 5}", # Pattern: 5 users with issues
endpoint="/api/predict"
))
# Categorize errors
categories = analyzer.categorize_errors()
print(f"Error categories: {list(categories.keys())}")
# Find patterns
patterns = analyzer.find_error_patterns()
for pattern in patterns:
print(f"\nPattern: {pattern['type']}")
print(f" {pattern['description']}")
print(f" Details: {pattern['details']}")
# Root cause analysis
for category, errors in categories.items():
print(f"\n{'='*60}")
print(f"Root cause analysis: {category}")
print(f"{'='*60}")
analysis = analyzer.identify_root_cause(category, errors)
print(f"\nHypothesis: {analysis['hypothesis']}")
print(f"\nAction items:")
for item in analysis['action_items']:
print(f" - {item}")
```
**Root cause analysis checklist:**
- [ ] Reproduce error consistently
- [ ] Categorize error type (input, model, infrastructure, resource)
- [ ] Find error patterns (temporal, user, endpoint)
- [ ] Analyze system state at error time
- [ ] Check for external factors (deployment, traffic, dependencies; see the sketch after this checklist)
- [ ] Distinguish root cause from symptoms
- [ ] Verify fix resolves root cause
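For the external-factors item, a minimal sketch that cross-references an error spike (for example, the one surfaced by `_analyze_temporal_patterns`) against a deployment log. The deployment records and 30-minute window are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def deployments_near_spike(
    spike_time: datetime,
    deployments: List[Dict[str, str]],
    window: timedelta = timedelta(minutes=30),
) -> List[Dict[str, str]]:
    """Return deployments that landed within `window` before the error spike."""
    suspects = []
    for deploy in deployments:
        deployed_at = datetime.fromisoformat(deploy["deployed_at"])
        if timedelta(0) <= spike_time - deployed_at <= window:
            suspects.append(deploy)
    return suspects

# Hypothetical inputs: the spike time from the temporal analysis plus a deployment log.
spike = datetime(2025, 1, 15, 14, 0)
deploys = [
    {"service": "ml-api", "version": "v1.3.0", "deployed_at": "2025-01-15T13:45:00"},
    {"service": "frontend", "version": "v7.2.1", "deployed_at": "2025-01-14T09:00:00"},
]
for suspect in deployments_near_spike(spike, deploys):
    print(f"Suspect: {suspect['service']} {suspect['version']} deployed at {suspect['deployed_at']}")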
### Part 4: A/B Test Debugging
**Common A/B test issues:**
- No statistical significance (insufficient sample size)
- Confounding factors (unbalanced segments)
- Simpson's paradox (aggregate vs segment differences)
- Selection bias (non-random assignment; see the SRM check sketched after this list)
- Novelty effect (temporary impact)
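Selection bias often shows up as a sample ratio mismatch (SRM): the observed split between variants drifts from the split you configured. A small chi-square sketch, separate from the `ABTestDebugger` below; the counts and the 0.001 threshold are illustrative.

```python
from typing import Dict
from scipy import stats

def check_sample_ratio_mismatch(
    control_count: int,
    treatment_count: int,
    expected_split: float = 0.5,
    alpha: float = 0.001,
) -> Dict[str, object]:
    """Chi-square test for sample ratio mismatch (SRM): are the observed assignment
    counts consistent with the intended split? A tiny p-value points at broken
    randomization or selective logging rather than a real treatment effect."""
    total = control_count + treatment_count
    expected = [total * expected_split, total * (1 - expected_split)]
    chi2, p_value = stats.chisquare([control_count, treatment_count], f_exp=expected)
    return {"srm_detected": p_value < alpha, "p_value": p_value, "chi2": chi2}

# Example: a nominal 50/50 test that logged 5200 control vs 4700 treatment users.
result = check_sample_ratio_mismatch(5200, 4700)
print(f"SRM detected: {result['srm_detected']} (p={result['p_value']:.2g})")
```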
```python
from dataclasses import dataclass
from typing import List, Dict, Optional
import numpy as np
from scipy import stats
@dataclass
class ABTestResult:
"""
A/B test variant result.
"""
variant: str
sample_size: int
success_count: int
metric_values: List[float]
@property
def success_rate(self) -> float:
return self.success_count / self.sample_size if self.sample_size > 0 else 0.0
@property
def mean_metric(self) -> float:
return np.mean(self.metric_values) if self.metric_values else 0.0
class ABTestDebugger:
"""
Debug A/B test issues and validate statistical significance.
"""
def validate_test_design(
self,
control: ABTestResult,
treatment: ABTestResult,
min_sample_size: int = 200
) -> Dict:
"""
Validate A/B test design and detect issues.
Returns:
Validation results with warnings
"""
issues = []
# Check 1: Sufficient sample size
if control.sample_size < min_sample_size:
issues.append({
"type": "insufficient_sample_size",
"severity": "CRITICAL",
"message": f"Control sample size ({control.sample_size}) < minimum ({min_sample_size})"
})
if treatment.sample_size < min_sample_size:
issues.append({
"type": "insufficient_sample_size",
"severity": "CRITICAL",
"message": f"Treatment sample size ({treatment.sample_size}) < minimum ({min_sample_size})"
})
# Check 2: Balanced sample sizes
ratio = control.sample_size / treatment.sample_size
if ratio < 0.8 or ratio > 1.25: # More than 20% imbalance
issues.append({
"type": "imbalanced_samples",
"severity": "WARNING",
"message": f"Sample size ratio {ratio:.2f} indicates imbalanced assignment"
})
# Check 3: Variance analysis
control_std = np.std(control.metric_values)
treatment_std = np.std(treatment.metric_values)
if control_std == 0 or treatment_std == 0:
issues.append({
"type": "no_variance",
"severity": "CRITICAL",
"message": "One variant has zero variance. Check data collection."
})
return {
"valid": len([i for i in issues if i["severity"] == "CRITICAL"]) == 0,
"issues": issues
}
def test_statistical_significance(
self,
control: ABTestResult,
treatment: ABTestResult,
alpha: float = 0.05
) -> Dict:
"""
Test statistical significance between variants.
Args:
control: Control variant results
treatment: Treatment variant results
alpha: Significance level (default 0.05)
Returns:
Statistical test results
"""
# Two-proportion z-test for success rates
n1, n2 = control.sample_size, treatment.sample_size
p1, p2 = control.success_rate, treatment.success_rate
# Pooled proportion
p_pool = (control.success_count + treatment.success_count) / (n1 + n2)
# Standard error
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
# Z-score
z_score = (p2 - p1) / se if se > 0 else 0
# P-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
# Effect size (relative lift)
relative_lift = ((p2 - p1) / p1 * 100) if p1 > 0 else 0
# Confidence interval
ci_margin = stats.norm.ppf(1 - alpha/2) * se
ci_lower = (p2 - p1) - ci_margin
ci_upper = (p2 - p1) + ci_margin
return {
"statistically_significant": p_value < alpha,
"p_value": p_value,
"z_score": z_score,
"alpha": alpha,
"control_rate": p1,
"treatment_rate": p2,
"absolute_lift": p2 - p1,
"relative_lift_percent": relative_lift,
"confidence_interval": (ci_lower, ci_upper),
"interpretation": self._interpret_results(p_value, alpha, relative_lift)
}
def _interpret_results(
self,
p_value: float,
alpha: float,
relative_lift: float
) -> str:
"""
Interpret statistical test results.
"""
if p_value < alpha:
direction = "better" if relative_lift > 0 else "worse"
return f"Treatment is statistically significantly {direction} than control ({relative_lift:+.1f}% lift)"
else:
return f"No statistical significance detected (p={p_value:.3f} > {alpha}). Need more data or larger effect size."
def detect_simpsons_paradox(
self,
control_segments: Dict[str, ABTestResult],
treatment_segments: Dict[str, ABTestResult]
) -> Dict:
"""
Detect Simpson's Paradox in segmented data.
Simpson's Paradox: Treatment better in each segment but worse overall,
or vice versa. Caused by confounding variables.
Args:
control_segments: Control results per segment (e.g., by country, device)
treatment_segments: Treatment results per segment
Returns:
Detection results
"""
# Overall results
total_control = ABTestResult(
variant="control_total",
sample_size=sum(s.sample_size for s in control_segments.values()),
success_count=sum(s.success_count for s in control_segments.values()),
metric_values=[]
)
total_treatment = ABTestResult(
variant="treatment_total",
sample_size=sum(s.sample_size for s in treatment_segments.values()),
success_count=sum(s.success_count for s in treatment_segments.values()),
metric_values=[]
)
overall_direction = "treatment_better" if total_treatment.success_rate > total_control.success_rate else "control_better"
# Check each segment
segment_directions = {}
for segment in control_segments.keys():
ctrl = control_segments[segment]
treat = treatment_segments[segment]
segment_directions[segment] = "treatment_better" if treat.success_rate > ctrl.success_rate else "control_better"
# Detect paradox: overall direction differs from all segments
all_segments_agree = all(d == overall_direction for d in segment_directions.values())
paradox_detected = not all_segments_agree
return {
"paradox_detected": paradox_detected,
"overall_direction": overall_direction,
"segment_directions": segment_directions,
"explanation": self._explain_simpsons_paradox(
paradox_detected,
overall_direction,
segment_directions
)
}
def _explain_simpsons_paradox(
self,
detected: bool,
overall: str,
segments: Dict[str, str]
) -> str:
"""
Explain Simpson's Paradox if detected.
"""
if not detected:
return "No Simpson's Paradox detected. Segment and overall results agree."
return f"Simpson's Paradox detected! Overall: {overall}, but segments show: {segments}. This indicates a confounding variable. Review segment sizes and assignment."
def calculate_required_sample_size(
self,
baseline_rate: float,
minimum_detectable_effect: float,
alpha: float = 0.05,
power: float = 0.80
) -> int:
"""
Calculate required sample size per variant.
Args:
baseline_rate: Current conversion rate (e.g., 0.10 for 10%)
minimum_detectable_effect: Minimum relative change to detect (e.g., 0.10 for 10% improvement)
alpha: Significance level (default 0.05)
power: Statistical power (default 0.80)
Returns:
Required sample size per variant
"""
treatment_rate = baseline_rate * (1 + minimum_detectable_effect)
# Effect size (Cohen's h)
effect_size = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))
# Z-scores
z_alpha = stats.norm.ppf(1 - alpha/2)
z_beta = stats.norm.ppf(power)
# Sample size calculation
n = ((z_alpha + z_beta) / effect_size) ** 2
return int(np.ceil(n))
# Example: Debug A/B test
debugger = ABTestDebugger()
# Simulate test results
control = ABTestResult(
variant="control",
sample_size=500,
success_count=50, # 10% conversion
metric_values=np.random.normal(100, 20, 500).tolist()
)
treatment = ABTestResult(
variant="treatment",
sample_size=520,
success_count=62, # 11.9% conversion
metric_values=np.random.normal(105, 20, 520).tolist()
)
# Validate design
validation = debugger.validate_test_design(control, treatment)
print(f"Test valid: {validation['valid']}")
if validation['issues']:
for issue in validation['issues']:
print(f" [{issue['severity']}] {issue['message']}")
# Test significance
results = debugger.test_statistical_significance(control, treatment)
print(f"\nStatistical significance: {results['statistically_significant']}")
print(f"P-value: {results['p_value']:.4f}")
print(f"Relative lift: {results['relative_lift_percent']:.2f}%")
print(f"Interpretation: {results['interpretation']}")
# Check for Simpson's Paradox
control_segments = {
"US": ABTestResult("control_US", 300, 40, []),
"UK": ABTestResult("control_UK", 200, 10, [])
}
treatment_segments = {
"US": ABTestResult("treatment_US", 400, 48, []), # Better
"UK": ABTestResult("treatment_UK", 120, 14, []) # Better
}
paradox = debugger.detect_simpsons_paradox(control_segments, treatment_segments)
print(f"\nSimpson's Paradox: {paradox['paradox_detected']}")
print(f"Explanation: {paradox['explanation']}")
# Calculate required sample size
required_n = debugger.calculate_required_sample_size(
baseline_rate=0.10,
minimum_detectable_effect=0.10 # Detect 10% relative improvement
)
print(f"\nRequired sample size per variant: {required_n}")
```
**A/B test debugging checklist:**
- [ ] Sufficient sample size (use power analysis)
- [ ] Balanced allocation (actual split matches the planned ratio, e.g., 50/50)
- [ ] Random assignment (no selection bias)
- [ ] Statistical significance (p < 0.05)
- [ ] Practical significance (meaningful effect size)
- [ ] Check for Simpson's Paradox (segment analysis)
- [ ] Monitor for novelty effect (long-term trends; see the sketch after this checklist)
- [ ] Validate metrics (correct calculation, no bugs)
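For the novelty-effect item, one approach is to compare the lift in the first week against the rest of the test; a lift that decays sharply suggests the effect is temporary. The daily rates below are synthetic, and the 7-day split is an assumption.

```python
from typing import Dict, List
import numpy as np
from scipy import stats

def novelty_effect_check(
    treatment_daily_rates: List[float],
    control_daily_rates: List[float],
    split_day: int = 7,
) -> Dict[str, object]:
    """Compare treatment-vs-control lift in the first `split_day` days against the
    remainder of the test. A lift that shrinks significantly hints at a novelty effect."""
    lift = np.array(treatment_daily_rates) - np.array(control_daily_rates)
    early, late = lift[:split_day], lift[split_day:]
    _, p_value = stats.ttest_ind(early, late, equal_var=False)
    return {
        "early_mean_lift": float(early.mean()),
        "late_mean_lift": float(late.mean()),
        "lift_decayed": bool(early.mean() > late.mean() and p_value < 0.05),
        "p_value": float(p_value),
    }

# Synthetic daily conversion rates for a 21-day test with a fading treatment effect.
rng = np.random.default_rng(0)
control = 0.10 + rng.normal(0, 0.004, 21)
treatment = np.concatenate([0.13 + rng.normal(0, 0.004, 7), 0.105 + rng.normal(0, 0.004, 14)])
print(novelty_effect_check(list(treatment), list(control)))
```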
### Part 5: Model Debugging (Wrong Predictions, Edge Cases)
**Common model issues:**
- Wrong predictions on edge cases
- High confidence wrong predictions
- Inconsistent behavior (same input, different output; see the determinism check after this list)
- Bias or fairness issues
- Input validation failures
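For the inconsistent-behavior failure mode, a quick determinism check (separate from the `ModelDebugger` below) is to call the model repeatedly on one input and count distinct outputs. The `flaky_predict` stub is illustrative; swap in the real inference call.

```python
import random
from collections import Counter
from typing import Any, Callable, Dict

def check_prediction_determinism(
    predict: Callable[[Any], Any],
    input_data: Any,
    runs: int = 20,
) -> Dict[str, Any]:
    """Call the model repeatedly on the same input and report output stability.
    Instability usually means unseeded sampling, dropout left in training mode,
    or nondeterministic batching."""
    outputs = Counter(str(predict(input_data)) for _ in range(runs))
    return {
        "deterministic": len(outputs) == 1,
        "distinct_outputs": len(outputs),
        "most_common": outputs.most_common(3),
    }

# Hypothetical flaky predictor standing in for the real model.
def flaky_predict(text: str) -> str:
    return random.choice(["positive", "positive", "negative"])

print(check_prediction_determinism(flaky_predict, "This is a test"))
```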
```python
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
import numpy as np
import torch
@dataclass
class PredictionError:
"""
Failed prediction for analysis.
"""
input_data: Any
true_label: Any
predicted_label: Any
confidence: float
error_type: str # wrong_class, low_confidence, edge_case, etc.
class ModelDebugger:
"""
Debug model prediction errors and edge cases.
"""
def __init__(self, model, tokenizer=None):
self.model = model
self.tokenizer = tokenizer
self.errors: List[PredictionError] = []
def add_error(self, error: PredictionError):
"""Add prediction error for analysis."""
self.errors.append(error)
def find_error_patterns(self) -> Dict[str, List[PredictionError]]:
"""
Find patterns in prediction errors.
Patterns:
- Errors on specific input types (long text, numbers, special chars)
- Errors on specific classes (class imbalance?)
- High-confidence errors (model overconfident)
- Consistent errors (model learned wrong pattern)
"""
patterns = {
"high_confidence_errors": [],
"low_confidence_errors": [],
"edge_cases": [],
"class_specific": {}
}
for error in self.errors:
# High confidence but wrong
if error.confidence > 0.9:
patterns["high_confidence_errors"].append(error)
# Low confidence (uncertain)
elif error.confidence < 0.6:
patterns["low_confidence_errors"].append(error)
# Edge cases
if error.error_type == "edge_case":
patterns["edge_cases"].append(error)
# Group by predicted class
pred_class = str(error.predicted_label)
if pred_class not in patterns["class_specific"]:
patterns["class_specific"][pred_class] = []
patterns["class_specific"][pred_class].append(error)
return patterns
def analyze_edge_cases(self) -> List[Dict]:
"""
Analyze edge cases to understand failure modes.
Edge case types:
- Out-of-distribution inputs
- Extreme values (very long, very short)
- Special characters or formatting
- Ambiguous inputs
"""
edge_cases = [e for e in self.errors if e.error_type == "edge_case"]
analyses = []
for error in edge_cases:
analysis = {
"input": error.input_data,
"true_label": error.true_label,
"predicted_label": error.predicted_label,
"confidence": error.confidence,
"characteristics": self._characterize_input(error.input_data)
}
analyses.append(analysis)
return analyses
def _characterize_input(self, input_data: Any) -> Dict:
"""
Characterize input to identify unusual features.
"""
if isinstance(input_data, str):
return {
"type": "text",
"length": len(input_data),
"has_numbers": any(c.isdigit() for c in input_data),
"has_special_chars": any(not c.isalnum() and not c.isspace() for c in input_data),
"all_caps": input_data.isupper(),
"all_lowercase": input_data.islower()
}
elif isinstance(input_data, (list, np.ndarray)):
return {
"type": "array",
"shape": np.array(input_data).shape,
"min": np.min(input_data),
"max": np.max(input_data),
"mean": np.mean(input_data)
}
else:
return {"type": str(type(input_data))}
def test_input_variations(
self,
input_data: Any,
variations: List[str]
) -> Dict[str, Any]:
"""
Test model on variations of input to check robustness.
Variations:
- case_change: Change case (upper/lower)
- whitespace: Add/remove whitespace
- typos: Introduce typos
- paraphrase: Rephrase input
Args:
input_data: Original input
variations: List of variation types to test
Returns:
Results for each variation
"""
results = {}
# Original prediction
original_pred = self._predict(input_data)
results["original"] = {
"input": input_data,
"prediction": original_pred
}
# Generate and test variations
for var_type in variations:
varied_input = self._generate_variation(input_data, var_type)
varied_pred = self._predict(varied_input)
results[var_type] = {
"input": varied_input,
"prediction": varied_pred,
"consistent": varied_pred["label"] == original_pred["label"]
}
# Check consistency
all_consistent = all(r.get("consistent", True) for r in results.values() if r != results["original"])
return {
"consistent": all_consistent,
"results": results
}
def _generate_variation(self, input_data: str, variation_type: str) -> str:
"""
Generate input variation.
"""
if variation_type == "case_change":
return input_data.upper() if input_data.islower() else input_data.lower()
elif variation_type == "whitespace":
return " ".join(input_data.split())
elif variation_type == "typos":
# Simple typo: swap two adjacent characters
if len(input_data) > 2:
idx = len(input_data) // 2
return input_data[:idx] + input_data[idx+1] + input_data[idx] + input_data[idx+2:]
return input_data
return input_data
def _predict(self, input_data: Any) -> Dict:
"""
Run model prediction.
"""
# Simplified prediction (adapt to your model)
# Example for text classification
if self.tokenizer:
inputs = self.tokenizer(input_data, return_tensors="pt")
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
pred_class = torch.argmax(probs, dim=-1).item()
confidence = probs[0, pred_class].item()
return {
"label": pred_class,
"confidence": confidence
}
return {"label": None, "confidence": 0.0}
def validate_inputs(self, inputs: List[Any]) -> List[Dict]:
"""
Validate inputs before inference.
Validation checks:
- Type correctness
- Value ranges
- Format compliance
- Size limits
"""
validation_results = []
for i, input_data in enumerate(inputs):
issues = []
if isinstance(input_data, str):
# Text validation
if len(input_data) == 0:
issues.append("Empty input")
elif len(input_data) > 10000:
issues.append("Input too long (>10k chars)")
if not input_data.strip():
issues.append("Only whitespace")
validation_results.append({
"index": i,
"valid": len(issues) == 0,
"issues": issues
})
return validation_results
# Example usage
class DummyModel:
def __call__(self, input_ids, attention_mask):
# Dummy model for demonstration
return type('obj', (object,), {
'logits': torch.randn(1, 3)
})
class DummyTokenizer:
def __call__(self, text, return_tensors=None):
return {
"input_ids": torch.randint(0, 1000, (1, 10)),
"attention_mask": torch.ones(1, 10)
}
model = DummyModel()
tokenizer = DummyTokenizer()
debugger = ModelDebugger(model, tokenizer)
# Add prediction errors
debugger.add_error(PredictionError(
input_data="This is a test",
true_label=1,
predicted_label=2,
confidence=0.95,
error_type="high_confidence"
))
debugger.add_error(PredictionError(
input_data="AAAAAAAAAAA", # Edge case: all same character
true_label=0,
predicted_label=1,
confidence=0.85,
error_type="edge_case"
))
# Find error patterns
patterns = debugger.find_error_patterns()
print(f"High confidence errors: {len(patterns['high_confidence_errors'])}")
print(f"Edge cases: {len(patterns['edge_cases'])}")
# Analyze edge cases
edge_analyses = debugger.analyze_edge_cases()
for analysis in edge_analyses:
print(f"\nEdge case: {analysis['input']}")
print(f"Characteristics: {analysis['characteristics']}")
# Test input variations
variations_result = debugger.test_input_variations(
"This is a test",
["case_change", "whitespace", "typos"]
)
print(f"\nInput variation consistency: {variations_result['consistent']}")
```
**Model debugging checklist:**
- [ ] Collect failed predictions with context
- [ ] Categorize errors (high confidence, edge cases, class-specific)
- [ ] Analyze input characteristics (what makes them fail?)
- [ ] Test input variations (robustness check)
- [ ] Validate inputs before inference (prevent bad inputs)
- [ ] Check for bias (fairness across groups; see the sketch after this checklist)
- [ ] Add error cases to training data (improve model)
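For the bias item, a minimal sketch that slices accuracy by group so large gaps become visible. The group labels and evaluation records are hypothetical; in practice they come from a labelled evaluation set tagged with the attribute you care about.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def accuracy_by_group(
    records: List[Tuple[str, int, int]],  # (group, true_label, predicted_label)
) -> Dict[str, float]:
    """Compute accuracy per group so gaps between groups become visible."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, true_label, predicted in records:
        totals[group] += 1
        correct[group] += int(true_label == predicted)
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical evaluation records tagged with a segment attribute (here, language).
records = [
    ("en", 1, 1), ("en", 0, 0), ("en", 1, 1), ("en", 0, 1),
    ("es", 1, 0), ("es", 0, 0), ("es", 1, 0), ("es", 1, 1),
]
per_group = accuracy_by_group(records)
for group, acc in per_group.items():
    print(f"{group}: accuracy={acc:.2f}")
gap = max(per_group.values()) - min(per_group.values())
print(f"Largest accuracy gap between groups: {gap:.2f}")
```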
### Part 6: Logging Best Practices
**Good logging enables debugging. Bad logging creates noise.**
```python
import logging
import json
import sys
from datetime import datetime
from contextvars import ContextVar
from typing import Dict, Any, Optional
import traceback
# Context variable for request/trace ID
request_id_var: ContextVar[Optional[str]] = ContextVar('request_id', default=None)
class StructuredLogger:
"""
Structured logging for production systems.
Best practices:
- JSON format (machine-readable)
- Include context (request_id, user_id, etc.)
- Log at appropriate levels
- Include timing information
- Don't log sensitive data
"""
def __init__(self, name: str):
self.logger = logging.getLogger(name)
self.logger.setLevel(logging.INFO)
# JSON formatter
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(self.JSONFormatter())
self.logger.addHandler(handler)
class JSONFormatter(logging.Formatter):
"""
Format logs as JSON.
"""
def format(self, record: logging.LogRecord) -> str:
log_data = {
"timestamp": datetime.utcnow().isoformat() + "Z",
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
}
# Add request ID from context
request_id = request_id_var.get()
if request_id:
log_data["request_id"] = request_id
# Add extra fields
if hasattr(record, "extra"):
log_data.update(record.extra)
# Add exception info
if record.exc_info:
log_data["exception"] = {
"type": record.exc_info[0].__name__,
"message": str(record.exc_info[1]),
"traceback": traceback.format_exception(*record.exc_info)
}
return json.dumps(log_data)
def log(
self,
level: str,
message: str,
**kwargs
):
"""
Log with structured context.
Args:
level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
message: Log message
**kwargs: Additional context fields
"""
log_method = getattr(self.logger, level.lower())
# Create LogRecord with extra fields
extra = {"extra": kwargs}
log_method(message, extra=extra)
def debug(self, message: str, **kwargs):
self.log("DEBUG", message, **kwargs)
def info(self, message: str, **kwargs):
self.log("INFO", message, **kwargs)
def warning(self, message: str, **kwargs):
self.log("WARNING", message, **kwargs)
def error(self, message: str, **kwargs):
self.log("ERROR", message, **kwargs)
def critical(self, message: str, **kwargs):
self.log("CRITICAL", message, **kwargs)
class RequestLogger:
"""
Log HTTP requests with full context.
"""
def __init__(self):
self.logger = StructuredLogger("api")
def log_request(
self,
request_id: str,
method: str,
path: str,
user_id: Optional[str] = None,
**kwargs
):
"""
Log incoming request.
"""
# Set request ID in context
request_id_var.set(request_id)
self.logger.info(
"Request started",
request_id=request_id,
method=method,
path=path,
user_id=user_id,
**kwargs
)
def log_response(
self,
request_id: str,
status_code: int,
duration_ms: float,
**kwargs
):
"""
Log response with timing.
"""
level = "INFO" if status_code < 400 else "ERROR"
self.logger.log(
level,
"Request completed",
request_id=request_id,
status_code=status_code,
duration_ms=duration_ms,
**kwargs
)
def log_error(
self,
request_id: str,
error: Exception,
**kwargs
):
"""
Log request error with full context.
"""
self.logger.error(
"Request failed",
request_id=request_id,
error_type=type(error).__name__,
error_message=str(error),
**kwargs,
exc_info=True
)
class ModelInferenceLogger:
"""
Log model inference with input/output context.
"""
def __init__(self):
self.logger = StructuredLogger("model")
def log_inference(
self,
model_name: str,
model_version: str,
input_shape: tuple,
output_shape: tuple,
duration_ms: float,
request_id: Optional[str] = None,
**kwargs
):
"""
Log model inference.
"""
self.logger.info(
"Model inference",
model_name=model_name,
model_version=model_version,
input_shape=input_shape,
output_shape=output_shape,
duration_ms=duration_ms,
request_id=request_id,
**kwargs
)
def log_prediction_error(
self,
model_name: str,
error: Exception,
input_sample: Any,
request_id: Optional[str] = None,
**kwargs
):
"""
Log prediction error with input context.
Note: Be careful not to log sensitive data!
"""
# Sanitize input (don't log full input if sensitive)
input_summary = self._summarize_input(input_sample)
self.logger.error(
"Prediction failed",
model_name=model_name,
error_type=type(error).__name__,
error_message=str(error),
input_summary=input_summary,
request_id=request_id,
**kwargs,
exc_info=True
)
def _summarize_input(self, input_sample: Any) -> Dict:
"""
Summarize input without logging sensitive data.
"""
if isinstance(input_sample, str):
return {
"type": "text",
"length": len(input_sample),
"preview": input_sample[:50] + "..." if len(input_sample) > 50 else input_sample
}
elif isinstance(input_sample, (list, tuple)):
return {
"type": "array",
"length": len(input_sample)
}
else:
return {
"type": str(type(input_sample))
}
# Example usage
request_logger = RequestLogger()
model_logger = ModelInferenceLogger()
# Log request
import uuid
import time
request_id = str(uuid.uuid4())
start_time = time.time()
request_logger.log_request(
request_id=request_id,
method="POST",
path="/api/predict",
user_id="user_123",
client_ip="192.168.1.100"
)
# Log model inference
model_logger.log_inference(
model_name="sentiment-classifier",
model_version="v2.1",
input_shape=(1, 512),
output_shape=(1, 3),
duration_ms=45.2,
request_id=request_id,
batch_size=1
)
# Log response
duration_ms = (time.time() - start_time) * 1000
request_logger.log_response(
request_id=request_id,
status_code=200,
duration_ms=duration_ms
)
# Log error (example)
try:
raise ValueError("Invalid input shape")
except Exception as e:
request_logger.log_error(request_id, e, endpoint="/api/predict")
```
**What to log:**
| Level | What to Log | Example |
|-------|-------------|---------|
| DEBUG | Detailed diagnostic info | Variable values, function entry/exit |
| INFO | Normal operations | Request started, prediction completed |
| WARNING | Unexpected but handled | Retry attempt, fallback used |
| ERROR | Error conditions | API error, prediction failed |
| CRITICAL | System failure | Database down, out of memory |
**What NOT to log:**
- Passwords, API keys, tokens
- Credit card numbers, SSNs
- Full user data (GDPR violation)
- Large payloads (log summary instead)
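A small redaction sketch that scrubs likely secrets and PII before text reaches the logger. The regex patterns are illustrative starting points, not a complete PII policy; tune them to the data your service actually handles.

```python
import re
from typing import Dict

# Illustrative patterns only; extend and test against your own data.
REDACTION_PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_]{16,}\b"),
}

def redact(text: str) -> str:
    """Replace likely secrets/PII with placeholders before the text reaches logs."""
    for name, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"<{name}-redacted>", text)
    return text

print(redact("user jane.doe@example.com paid with 4111 1111 1111 1111 using key sk_live_ABCDEF1234567890"))
```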
**Logging checklist:**
- [ ] Use structured logging (JSON format)
- [ ] Include trace/request IDs (correlation)
- [ ] Log at appropriate levels
- [ ] Include timing information
- [ ] Don't log sensitive data
- [ ] Make logs queryable (structured fields)
- [ ] Include sufficient context for debugging
- [ ] Log errors with full stack traces
### Part 7: Rollback Procedures
**When to rollback:**
- Critical error rate spike (>5% errors)
- Significant metric regression (>10% drop)
- Security vulnerability discovered
- Cascading failures affecting downstream
**When NOT to rollback:**
- Minor errors (<1% error rate)
- Single user complaints (investigate first)
- Performance slightly worse (measure first)
- New feature not perfect (iterate instead)
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from datetime import datetime
import subprocess
@dataclass
class DeploymentMetrics:
"""
Metrics to monitor during deployment.
"""
error_rate: float
latency_p95_ms: float
success_rate: float
throughput_qps: float
cpu_usage_percent: float
memory_usage_percent: float
class RollbackDecider:
"""
Decide whether to rollback based on metrics.
"""
def __init__(
self,
baseline_metrics: DeploymentMetrics,
thresholds: Dict[str, float]
):
"""
Args:
baseline_metrics: Metrics from previous stable version
thresholds: Rollback thresholds (e.g., {"error_rate": 0.05})
"""
self.baseline = baseline_metrics
self.thresholds = thresholds
def should_rollback(
self,
current_metrics: DeploymentMetrics
) -> Dict:
"""
Decide if rollback is needed.
Returns:
Decision with reasoning
"""
violations = []
# Check error rate
if current_metrics.error_rate > self.thresholds.get("error_rate", 0.05):
violations.append({
"metric": "error_rate",
"baseline": self.baseline.error_rate,
"current": current_metrics.error_rate,
"threshold": self.thresholds["error_rate"],
"severity": "CRITICAL"
})
# Check latency
latency_increase = (current_metrics.latency_p95_ms - self.baseline.latency_p95_ms) / self.baseline.latency_p95_ms
if latency_increase > self.thresholds.get("latency_increase", 0.25): # 25% increase
violations.append({
"metric": "latency_p95_ms",
"baseline": self.baseline.latency_p95_ms,
"current": current_metrics.latency_p95_ms,
"increase_percent": latency_increase * 100,
"threshold": self.thresholds["latency_increase"] * 100,
"severity": "HIGH"
})
# Check success rate
success_drop = self.baseline.success_rate - current_metrics.success_rate
if success_drop > self.thresholds.get("success_rate_drop", 0.05): # 5pp drop
violations.append({
"metric": "success_rate",
"baseline": self.baseline.success_rate,
"current": current_metrics.success_rate,
"drop": success_drop,
"threshold": self.thresholds["success_rate_drop"],
"severity": "CRITICAL"
})
should_rollback = len([v for v in violations if v["severity"] == "CRITICAL"]) > 0
return {
"should_rollback": should_rollback,
"violations": violations,
"reasoning": self._generate_reasoning(should_rollback, violations)
}
def _generate_reasoning(
self,
should_rollback: bool,
violations: List[Dict]
) -> str:
"""
Generate human-readable reasoning.
"""
if not violations:
return "All metrics within acceptable thresholds. No rollback needed."
if should_rollback:
critical = [v for v in violations if v["severity"] == "CRITICAL"]
reasons = [f"{v['metric']} violated threshold" for v in critical]
return f"ROLLBACK RECOMMENDED: {', '.join(reasons)}"
else:
return f"Minor issues detected but below rollback threshold. Monitor closely."
class RollbackExecutor:
"""
Execute rollback procedure.
"""
def __init__(self, deployment_system: str = "kubernetes"):
self.deployment_system = deployment_system
def rollback(
self,
service_name: str,
previous_version: str,
preserve_evidence: bool = True
) -> Dict:
"""
Execute rollback to previous version.
Args:
service_name: Service to rollback
previous_version: Version to rollback to
preserve_evidence: Capture logs/metrics before rollback
Returns:
Rollback result
"""
print(f"Starting rollback: {service_name} -> {previous_version}")
# Step 1: Preserve evidence
if preserve_evidence:
evidence = self._preserve_evidence(service_name)
print(f"Evidence preserved: {evidence}")
# Step 2: Execute rollback
if self.deployment_system == "kubernetes":
result = self._rollback_kubernetes(service_name, previous_version)
elif self.deployment_system == "docker":
result = self._rollback_docker(service_name, previous_version)
else:
result = {"success": False, "error": "Unknown deployment system"}
return result
def _preserve_evidence(self, service_name: str) -> Dict:
"""
Capture logs and metrics before rollback.
"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# Capture logs (last 1000 lines)
log_file = f"/tmp/{service_name}_rollback_{timestamp}.log"
# Simplified: In production, use proper log aggregation
print(f"Capturing logs to {log_file}")
# Capture metrics snapshot
metrics_file = f"/tmp/{service_name}_metrics_{timestamp}.json"
print(f"Capturing metrics to {metrics_file}")
return {
"log_file": log_file,
"metrics_file": metrics_file,
"timestamp": timestamp
}
def _rollback_kubernetes(
self,
service_name: str,
version: str
) -> Dict:
"""
Rollback Kubernetes deployment.
"""
try:
            # Option 1: Roll back to the immediately previous revision.
            # Note: the `version` argument is informational here; kubectl only
            # uses it when rolling back to a specific revision (Option 2).
            cmd = f"kubectl rollout undo deployment/{service_name}"
            # Option 2: Roll back to a specific revision
            # cmd = f"kubectl rollout undo deployment/{service_name} --to-revision={version}"
result = subprocess.run(
cmd.split(),
capture_output=True,
text=True,
check=True
)
# Wait for rollout
wait_cmd = f"kubectl rollout status deployment/{service_name}"
subprocess.run(
wait_cmd.split(),
check=True,
timeout=300 # 5 min timeout
)
return {
"success": True,
"service": service_name,
"version": version,
"output": result.stdout
}
except subprocess.CalledProcessError as e:
return {
"success": False,
"error": str(e),
"output": e.stderr
}
def _rollback_docker(
self,
service_name: str,
version: str
) -> Dict:
"""
Rollback Docker service.
"""
try:
cmd = f"docker service update --image {service_name}:{version} {service_name}"
result = subprocess.run(
cmd.split(),
capture_output=True,
text=True,
check=True
)
return {
"success": True,
"service": service_name,
"version": version,
"output": result.stdout
}
except subprocess.CalledProcessError as e:
return {
"success": False,
"error": str(e),
"output": e.stderr
}
# Example usage
baseline = DeploymentMetrics(
error_rate=0.01,
latency_p95_ms=200,
success_rate=0.95,
throughput_qps=100,
cpu_usage_percent=50,
memory_usage_percent=60
)
thresholds = {
"error_rate": 0.05, # 5% error rate
"latency_increase": 0.25, # 25% increase
"success_rate_drop": 0.05 # 5pp drop
}
decider = RollbackDecider(baseline, thresholds)
# Simulate bad deployment
current = DeploymentMetrics(
error_rate=0.08, # High!
latency_p95_ms=300, # High!
success_rate=0.88, # Low!
throughput_qps=90,
cpu_usage_percent=70,
memory_usage_percent=65
)
decision = decider.should_rollback(current)
print(f"Should rollback: {decision['should_rollback']}")
print(f"Reasoning: {decision['reasoning']}")
if decision['should_rollback']:
executor = RollbackExecutor(deployment_system="kubernetes")
result = executor.rollback(
service_name="ml-api",
previous_version="v1.2.3",
preserve_evidence=True
)
print(f"Rollback result: {result}")
```
**Rollback checklist:**
- [ ] Preserve evidence (logs, metrics, traces)
- [ ] Document rollback reason
- [ ] Execute rollback (kubectl/docker/terraform)
- [ ] Verify metrics return to normal (see the sketch after this checklist)
- [ ] Notify team and stakeholders
- [ ] Schedule post-mortem
- [ ] Fix issue in development
- [ ] Re-deploy with fix
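
For the "verify metrics return to normal" step, the `RollbackDecider` defined above can be reused once the rollback completes. A minimal sketch, assuming `DeploymentMetrics` and the `decider` from the example above are in scope; the post-rollback values are illustrative:

```python
# Re-check post-rollback metrics against the same baseline and thresholds.
# Assumes `DeploymentMetrics` and `decider` from the example above are in scope.
post_rollback = DeploymentMetrics(
    error_rate=0.012,
    latency_p95_ms=210,
    success_rate=0.949,
    throughput_qps=98,
    cpu_usage_percent=52,
    memory_usage_percent=61
)

recovery = decider.should_rollback(post_rollback)
if not recovery["should_rollback"]:
    print("Rollback verified: metrics are back within thresholds of the baseline.")
else:
    print(f"Still degraded after rollback: {recovery['reasoning']}")
```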
### Part 8: Post-Mortem Process
**Goal:** Learn from incidents to prevent recurrence.
**Post-mortem is blameless:** Focus on systems and processes, not individuals.
```python
from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime
@dataclass
class IncidentTimeline:
"""
Timeline event during incident.
"""
timestamp: datetime
event: str
actor: str # Person, system, or automation
action: str
@dataclass
class ActionItem:
"""
Post-mortem action item.
"""
description: str
owner: str
due_date: datetime
priority: str # CRITICAL, HIGH, MEDIUM, LOW
status: str = "TODO" # TODO, IN_PROGRESS, DONE
class PostMortem:
"""
Structured post-mortem document.
"""
def __init__(
self,
incident_id: str,
title: str,
date: datetime,
severity: str,
duration_minutes: int
):
self.incident_id = incident_id
self.title = title
self.date = date
self.severity = severity
self.duration_minutes = duration_minutes
self.summary: str = ""
self.impact: Dict = {}
self.timeline: List[IncidentTimeline] = []
self.root_cause: str = ""
self.contributing_factors: List[str] = []
self.what_went_well: List[str] = []
self.what_went_wrong: List[str] = []
self.action_items: List[ActionItem] = []
def add_timeline_event(
self,
timestamp: datetime,
event: str,
actor: str,
action: str
):
"""
Add event to incident timeline.
"""
self.timeline.append(IncidentTimeline(
timestamp=timestamp,
event=event,
actor=actor,
action=action
))
def set_root_cause(self, root_cause: str):
"""
Document root cause.
"""
self.root_cause = root_cause
def add_contributing_factor(self, factor: str):
"""
Add contributing factor (not root cause but made it worse).
"""
self.contributing_factors.append(factor)
def add_action_item(
self,
description: str,
owner: str,
due_date: datetime,
priority: str = "HIGH"
):
"""
Add action item for prevention.
"""
self.action_items.append(ActionItem(
description=description,
owner=owner,
due_date=due_date,
priority=priority
))
def generate_report(self) -> str:
"""
Generate post-mortem report.
"""
report = f"""
# Post-Mortem: {self.title}
**Incident ID:** {self.incident_id}
**Date:** {self.date.strftime('%Y-%m-%d %H:%M UTC')}
**Severity:** {self.severity}
**Duration:** {self.duration_minutes} minutes
## Summary
{self.summary}
## Impact
{self._format_impact()}
## Timeline
{self._format_timeline()}
## Root Cause
{self.root_cause}
## Contributing Factors
{self._format_list(self.contributing_factors)}
## What Went Well
{self._format_list(self.what_went_well)}
## What Went Wrong
{self._format_list(self.what_went_wrong)}
## Action Items
{self._format_action_items()}
**Review:** This post-mortem should be reviewed by the team and approved by engineering leadership.
**Follow-up:** Track action items to completion. Schedule follow-up review in 30 days.
"""
return report
def _format_impact(self) -> str:
"""Format impact section."""
lines = []
for key, value in self.impact.items():
lines.append(f"- **{key}:** {value}")
return "\n".join(lines) if lines else "No impact documented."
def _format_timeline(self) -> str:
"""Format timeline section."""
lines = []
for event in sorted(self.timeline, key=lambda e: e.timestamp):
time_str = event.timestamp.strftime('%H:%M:%S')
lines.append(f"- **{time_str}** [{event.actor}] {event.event}{event.action}")
return "\n".join(lines) if lines else "No timeline documented."
def _format_list(self, items: List[str]) -> str:
"""Format list of items."""
return "\n".join(f"- {item}" for item in items) if items else "None."
def _format_action_items(self) -> str:
"""Format action items."""
if not self.action_items:
return "No action items."
        lines = []
        # Sort by explicit priority rank, not alphabetically
        priority_rank = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
        for item in sorted(self.action_items, key=lambda x: priority_rank.get(x.priority, len(priority_rank))):
            due = item.due_date.strftime('%Y-%m-%d')
            lines.append(f"- [{item.priority}] {item.description} (Owner: {item.owner}, Due: {due})")
return "\n".join(lines)
# Example post-mortem
from datetime import timedelta
pm = PostMortem(
incident_id="INC-2025-042",
title="API Latency Spike Causing Timeouts",
date=datetime(2025, 1, 15, 14, 30),
severity="HIGH",
duration_minutes=45
)
pm.summary = """
At 14:30 UTC, API latency spiked from 200ms to 5000ms, causing widespread timeouts.
Error rate increased from 0.5% to 15%. Incident was resolved by scaling up database
connection pool and restarting API servers. No data loss occurred.
"""
pm.impact = {
"Users affected": "~5,000 users (10% of active users)",
"Requests failed": "~15,000 requests",
"Revenue impact": "$2,500 (estimated)",
"Customer complaints": "23 support tickets"
}
# Timeline
pm.add_timeline_event(
datetime(2025, 1, 15, 14, 30),
"Latency spike detected",
"Monitoring System",
"Alert sent to on-call"
)
pm.add_timeline_event(
datetime(2025, 1, 15, 14, 32),
"On-call engineer acknowledged",
"Engineer A",
"Started investigation"
)
pm.add_timeline_event(
datetime(2025, 1, 15, 14, 40),
"Root cause identified: DB connection pool exhausted",
"Engineer A",
"Scaled connection pool from 10 to 50"
)
pm.add_timeline_event(
datetime(2025, 1, 15, 14, 45),
"Restarted API servers",
"Engineer A",
"Latency returned to normal"
)
pm.add_timeline_event(
datetime(2025, 1, 15, 15, 15),
"Incident resolved",
"Engineer A",
"Monitoring confirmed stability"
)
# Root cause and factors
pm.set_root_cause(
"Database connection pool size (10) was too small for peak traffic (100 concurrent requests). "
"Connection pool exhaustion caused requests to queue, leading to timeouts."
)
pm.add_contributing_factor("No monitoring for connection pool utilization")
pm.add_contributing_factor("Connection pool size not load tested")
pm.add_contributing_factor("No auto-scaling for database connections")
# What went well/wrong
pm.what_went_well = [
"Monitoring detected issue within 2 minutes",
"On-call responded quickly (2 min to acknowledgment)",
"Root cause identified in 10 minutes",
"No data loss or corruption"
]
pm.what_went_wrong = [
"Connection pool not sized for peak traffic",
"No monitoring for connection pool metrics",
"Load testing didn't include database connection limits",
"Incident affected 10% of users for 45 minutes"
]
# Action items
pm.add_action_item(
"Add monitoring and alerting for DB connection pool utilization (alert at 80%)",
"Engineer B",
datetime.now() + timedelta(days=3),
"CRITICAL"
)
pm.add_action_item(
"Implement auto-scaling for DB connection pool based on traffic",
"Engineer C",
datetime.now() + timedelta(days=7),
"HIGH"
)
pm.add_action_item(
"Update load testing to include DB connection limits",
"Engineer A",
datetime.now() + timedelta(days=7),
"HIGH"
)
pm.add_action_item(
"Document connection pool sizing guidelines for future services",
"Engineer D",
datetime.now() + timedelta(days=14),
"MEDIUM"
)
# Generate report
report = pm.generate_report()
print(report)
```
**Post-mortem checklist:**
- [ ] Schedule post-mortem meeting (within 48 hours)
- [ ] Invite all involved parties
- [ ] Document timeline (facts, not speculation)
- [ ] Identify root cause (not symptoms)
- [ ] List contributing factors
- [ ] What went well / what went wrong
- [ ] Create action items (owner, due date, priority)
- [ ] Review and approve report
- [ ] Track action items to completion (see the sketch after this section)
- [ ] Follow-up review in 30 days
**Key principles:**
- **Blameless:** Focus on systems, not people
- **Fact-based:** Use evidence, not opinions
- **Actionable:** Create concrete prevention measures
- **Timely:** Complete within 1 week of incident
- **Shared:** Distribute to entire team
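
The checklist item "track action items to completion" is the step most often dropped. A minimal sketch, assuming the `PostMortem`/`ActionItem` classes and the `pm` example above are in scope; the helper function is ours, not part of any framework:

```python
from datetime import datetime
from typing import List, Optional

def open_action_items(pm: "PostMortem", as_of: Optional[datetime] = None) -> List["ActionItem"]:
    """Return action items that are not DONE, overdue items first."""
    as_of = as_of or datetime.now()
    pending = [item for item in pm.action_items if item.status != "DONE"]
    # Overdue items (due_date < as_of) sort first, then by due date.
    return sorted(pending, key=lambda item: (item.due_date >= as_of, item.due_date))

for item in open_action_items(pm):
    flag = "OVERDUE" if item.due_date < datetime.now() else "open"
    print(f"[{flag}] {item.description} (Owner: {item.owner}, Due: {item.due_date:%Y-%m-%d})")
```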
### Part 9: Production Forensics (Traces, Logs, Metrics Correlation)
**Goal:** Correlate traces, logs, and metrics to understand incident.
```python
from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime, timedelta
import json
@dataclass
class Trace:
"""
Distributed trace span.
"""
trace_id: str
span_id: str
parent_span_id: Optional[str]
service_name: str
operation_name: str
start_time: datetime
duration_ms: float
status: str # OK, ERROR
tags: Dict[str, str]
@dataclass
class LogEntry:
"""
Structured log entry.
"""
timestamp: datetime
level: str
service: str
message: str
trace_id: Optional[str]
metadata: Dict
@dataclass
class MetricDataPoint:
"""
Time-series metric data point.
"""
timestamp: datetime
metric_name: str
value: float
tags: Dict[str, str]
class ProductionForensics:
"""
Correlate traces, logs, and metrics for incident investigation.
"""
def __init__(self):
self.traces: List[Trace] = []
self.logs: List[LogEntry] = []
self.metrics: List[MetricDataPoint] = []
def add_trace(self, trace: Trace):
self.traces.append(trace)
def add_log(self, log: LogEntry):
self.logs.append(log)
def add_metric(self, metric: MetricDataPoint):
self.metrics.append(metric)
def investigate_slow_request(
self,
trace_id: str
) -> Dict:
"""
Investigate slow request using trace, logs, and metrics.
Args:
trace_id: Trace ID of slow request
Returns:
Investigation results
"""
# Get trace spans
trace_spans = [t for t in self.traces if t.trace_id == trace_id]
if not trace_spans:
return {"error": "Trace not found"}
# Sort by start time
trace_spans.sort(key=lambda s: s.start_time)
# Calculate total duration
total_duration = sum(s.duration_ms for s in trace_spans if not s.parent_span_id)
# Find slowest span
slowest_span = max(trace_spans, key=lambda s: s.duration_ms)
# Get logs for this trace
trace_logs = [l for l in self.logs if l.trace_id == trace_id]
trace_logs.sort(key=lambda l: l.timestamp)
# Check for errors
error_logs = [l for l in trace_logs if l.level == "ERROR"]
# Get metrics during request time
start_time = trace_spans[0].start_time
end_time = start_time + timedelta(milliseconds=total_duration)
relevant_metrics = [
m for m in self.metrics
if start_time <= m.timestamp <= end_time
]
return {
"trace_id": trace_id,
"total_duration_ms": total_duration,
"num_spans": len(trace_spans),
"slowest_span": {
"service": slowest_span.service_name,
"operation": slowest_span.operation_name,
"duration_ms": slowest_span.duration_ms,
"percentage": (slowest_span.duration_ms / total_duration * 100) if total_duration > 0 else 0
},
"error_count": len(error_logs),
"errors": [
{"timestamp": l.timestamp.isoformat(), "message": l.message}
for l in error_logs
],
"trace_breakdown": [
{
"service": s.service_name,
"operation": s.operation_name,
"duration_ms": s.duration_ms,
"percentage": (s.duration_ms / total_duration * 100) if total_duration > 0 else 0
}
for s in trace_spans
],
"metrics_during_request": [
{
"metric": m.metric_name,
"value": m.value,
"timestamp": m.timestamp.isoformat()
}
for m in relevant_metrics
]
}
def find_correlated_errors(
self,
time_window_minutes: int = 10
) -> List[Dict]:
"""
Find errors that occurred around the same time.
Args:
time_window_minutes: Time window for correlation
Returns:
Clusters of correlated errors
"""
error_logs = [l for l in self.logs if l.level == "ERROR"]
error_logs.sort(key=lambda l: l.timestamp)
if not error_logs:
return []
# Cluster errors by time
clusters = []
current_cluster = [error_logs[0]]
for log in error_logs[1:]:
time_diff = (log.timestamp - current_cluster[-1].timestamp).total_seconds() / 60
if time_diff <= time_window_minutes:
current_cluster.append(log)
else:
if len(current_cluster) > 1:
clusters.append(current_cluster)
current_cluster = [log]
if len(current_cluster) > 1:
clusters.append(current_cluster)
# Analyze each cluster
results = []
for cluster in clusters:
services = set(l.service for l in cluster)
messages = set(l.message for l in cluster)
results.append({
"start_time": cluster[0].timestamp.isoformat(),
"end_time": cluster[-1].timestamp.isoformat(),
"error_count": len(cluster),
"services_affected": list(services),
"unique_errors": list(messages)
})
return results
def analyze_metric_anomaly(
self,
metric_name: str,
anomaly_time: datetime,
window_minutes: int = 5
) -> Dict:
"""
Analyze what happened around metric anomaly.
Args:
metric_name: Metric that had anomaly
anomaly_time: When anomaly occurred
window_minutes: Time window to analyze
Returns:
Analysis results
"""
start_time = anomaly_time - timedelta(minutes=window_minutes)
end_time = anomaly_time + timedelta(minutes=window_minutes)
# Get metric values
metric_values = [
m for m in self.metrics
if m.metric_name == metric_name and start_time <= m.timestamp <= end_time
]
# Get logs during this time
logs_during = [
l for l in self.logs
if start_time <= l.timestamp <= end_time
]
# Get traces during this time
traces_during = [
t for t in self.traces
if start_time <= t.start_time <= end_time
]
# Count errors
error_count = len([l for l in logs_during if l.level == "ERROR"])
failed_traces = len([t for t in traces_during if t.status == "ERROR"])
return {
"metric_name": metric_name,
"anomaly_time": anomaly_time.isoformat(),
"window_minutes": window_minutes,
"metric_values": [
{"timestamp": m.timestamp.isoformat(), "value": m.value}
for m in metric_values
],
"error_count_during_window": error_count,
"failed_traces_during_window": failed_traces,
"top_errors": self._get_top_errors(logs_during, limit=5),
"services_involved": list(set(t.service_name for t in traces_during))
}
def _get_top_errors(self, logs: List[LogEntry], limit: int = 5) -> List[Dict]:
"""
Get most common error messages.
"""
from collections import Counter
error_logs = [l for l in logs if l.level == "ERROR"]
error_messages = [l.message for l in error_logs]
counter = Counter(error_messages)
return [
{"message": msg, "count": count}
for msg, count in counter.most_common(limit)
]
# Example usage
forensics = ProductionForensics()
# Simulate data
trace_id = "trace-123"
# Add trace spans
forensics.add_trace(Trace(
trace_id=trace_id,
span_id="span-1",
parent_span_id=None,
service_name="api-gateway",
operation_name="POST /predict",
start_time=datetime(2025, 1, 15, 14, 30, 0),
duration_ms=5000,
status="OK",
tags={"user_id": "user_123"}
))
forensics.add_trace(Trace(
trace_id=trace_id,
span_id="span-2",
parent_span_id="span-1",
service_name="ml-service",
operation_name="model_inference",
start_time=datetime(2025, 1, 15, 14, 30, 0, 500000),
duration_ms=4500, # Slow!
status="OK",
tags={"model": "sentiment-classifier"}
))
# Add logs
forensics.add_log(LogEntry(
timestamp=datetime(2025, 1, 15, 14, 30, 3),
level="WARNING",
service="ml-service",
message="High inference latency detected",
trace_id=trace_id,
metadata={"latency_ms": 4500}
))
# Add metrics
forensics.add_metric(MetricDataPoint(
timestamp=datetime(2025, 1, 15, 14, 30, 0),
metric_name="api_latency_ms",
value=5000,
tags={"service": "api-gateway"}
))
# Investigate slow request
investigation = forensics.investigate_slow_request(trace_id)
print(json.dumps(investigation, indent=2))
```
**Forensics checklist:**
- [ ] Identify affected time window
- [ ] Collect traces for failed/slow requests
- [ ] Collect logs with matching trace IDs
- [ ] Collect metrics during time window
- [ ] Correlate traces + logs + metrics
- [ ] Identify slowest operations (trace breakdown)
- [ ] Find error patterns (log analysis)
- [ ] Check metric anomalies (spikes/drops)
- [ ] Build timeline of events (see the sketch below)
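
For the "build timeline of events" step, a minimal sketch that merges the traces, logs, and metrics already collected in `ProductionForensics` into one chronological view; the helper function is an assumption, not part of the class above:

```python
from typing import List

def build_timeline(forensics: "ProductionForensics") -> List[str]:
    """Merge trace spans, logs, and metrics into one chronological view."""
    events = []
    for span in forensics.traces:
        events.append((span.start_time,
                       f"[trace:{span.service_name}] {span.operation_name} "
                       f"({span.duration_ms:.0f}ms, {span.status})"))
    for log in forensics.logs:
        events.append((log.timestamp, f"[log:{log.service}] {log.level}: {log.message}"))
    for metric in forensics.metrics:
        events.append((metric.timestamp, f"[metric] {metric.metric_name}={metric.value}"))
    return [f"{ts.isoformat()} {text}" for ts, text in sorted(events, key=lambda e: e[0])]

for line in build_timeline(forensics):
    print(line)
```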
## Summary
**Production debugging is systematic investigation, not random guessing.**
**Core methodology:**
1. **Reproduce** → Create minimal, deterministic reproduction
2. **Profile** → Use data, not intuition (py-spy, torch.profiler)
3. **Diagnose** → Find root cause, not symptoms
4. **Fix** → Targeted fix verified by tests
5. **Verify** → Prove fix works in production
6. **Document** → Post-mortem for prevention
**Key principles:**
- **Evidence-based:** Collect data before forming hypothesis
- **Systematic:** Follow debugging framework, don't skip steps
- **Root cause:** Fix the cause, not symptoms
- **Verification:** Prove fix works before closing
- **Prevention:** Add monitoring, tests, and documentation
**Production debugging toolkit:**
- Performance profiling: py-spy, torch.profiler, cProfile
- Error analysis: Categorize, find patterns, identify root cause
- A/B test debugging: Statistical significance, Simpson's paradox
- Model debugging: Edge cases, input variations, robustness
- Logging: Structured, with trace IDs and context
- Rollback: Preserve evidence, rollback quickly, fix properly
- Post-mortems: Blameless, actionable, prevent recurrence
- Forensics: Correlate traces, logs, metrics
**Common pitfalls to avoid:**
- Random changes without reproduction
- Guessing bottlenecks without profiling
- Bad logging (no context, unstructured)
- Panic rollback without learning
- Skipping post-mortems
Without systematic debugging, you fight the same fires repeatedly. With systematic debugging, you prevent fires from starting.
## REFACTOR Phase: Pressure Tests
### Pressure Test 1: Random Changes Without Investigation
**Scenario:** Model latency spiked from 100ms to 500ms. Engineer makes random changes hoping to fix it.
**Test:** Verify skill prevents random changes and enforces systematic investigation.
**Expected behavior:**
- ✅ Refuse to make changes without reproduction
- ✅ Require profiling data before optimization
- ✅ Collect evidence (metrics, logs, traces)
- ✅ Form hypothesis based on data
- ✅ Verify hypothesis before implementing fix
**Failure mode:** Makes parameter changes without profiling or understanding root cause.
### Pressure Test 2: No Profiling Before Optimization
**Scenario:** API is slow. Engineer says "Database is probably the bottleneck, let's add caching."
**Test:** Verify skill requires profiling data before optimization.
**Expected behavior:**
- ✅ Demand profiling data (py-spy flamegraph, query profiler)
- ✅ Identify actual bottleneck from profile
- ✅ Verify bottleneck hypothesis
- ✅ Optimize proven bottleneck, not guessed one
**Failure mode:** Optimizes based on intuition without profiling data.
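
As a concrete example of "demand profiling data first", a minimal sketch using the standard-library `cProfile`/`pstats` (for an already-running process, py-spy can attach without code changes). `handle_request` is a hypothetical stand-in for the slow endpoint:

```python
import cProfile
import pstats

def profile_endpoint(handle_request, payload):
    """Capture a CPU profile of one request before deciding what to optimize."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request(payload)  # exercise the real code path (once or in a loop)
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(15)  # top 15 functions by cumulative time
    # Only after reading this output do we decide whether DB, CPU, or I/O is the bottleneck.
```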
### Pressure Test 3: Useless Logging
**Scenario:** Production error occurred but logs don't have enough context to debug.
**Test:** Verify skill enforces structured logging with context.
**Expected behavior:**
- ✅ Use structured logging (JSON format)
- ✅ Include trace/request IDs for correlation
- ✅ Log sufficient context (user_id, endpoint, input summary)
- ✅ Don't log sensitive data (passwords, PII)
**Failure mode:** Logs "Error occurred" with no context, making debugging impossible.
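
A minimal sketch of the expected behavior: structured JSON log entries that carry a trace ID and request context without sensitive data. The field names are illustrative, not a specific logging library's schema:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml-api")

def log_event(level: int, message: str, trace_id: str, **context):
    """Emit one structured log line; context must never contain secrets or raw PII."""
    record = {"message": message, "trace_id": trace_id, **context}
    logger.log(level, json.dumps(record))

log_event(
    logging.ERROR,
    "Inference failed",
    trace_id="trace-123",
    user_id="user_123",
    endpoint="/predict",
    input_length=512,
    error_type="TimeoutError",
)
```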
### Pressure Test 4: Immediate Rollback Without Evidence
**Scenario:** Error rate increased to 2%. Engineer wants to rollback immediately.
**Test:** Verify skill preserves evidence before rollback.
**Expected behavior:**
- ✅ Assess severity (2% error rate = investigate, not immediate rollback)
- ✅ Preserve evidence (logs, metrics, traces)
- ✅ Investigate root cause while monitoring
- ✅ Only roll back if critical thresholds are breached (>5% errors, cascading failures)
**Failure mode:** Rolls back immediately without preserving evidence or assessing severity.
### Pressure Test 5: No Root Cause Analysis
**Scenario:** API returns 500 errors. Engineer fixes symptom (restart service) but not root cause.
**Test:** Verify skill identifies and fixes root cause.
**Expected behavior:**
- ✅ Distinguish symptom ("500 errors") from root cause ("connection pool exhausted")
- ✅ Investigate why symptom occurred
- ✅ Fix root cause (increase pool size, add monitoring)
- ✅ Verify fix addresses root cause
**Failure mode:** Fixes symptom (restart) but root cause remains, issue repeats.
### Pressure Test 6: A/B Test Without Statistical Significance
**Scenario:** A/B test with 50 samples per variant shows 5% improvement. Engineer wants to ship.
**Test:** Verify skill requires statistical significance.
**Expected behavior:**
- ✅ Calculate required sample size (power analysis)
- ✅ Check statistical significance (p-value < 0.05)
- ✅ Reject if insufficient samples or not significant
- ✅ Check for Simpson's Paradox (segment analysis)
**Failure mode:** Ships based on insufficient data or non-significant results.
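
A minimal sketch of the significance check the skill should demand: a two-sided two-proportion z-test implemented with the standard library (with ~50 samples per variant, a 5% lift will almost never reach p < 0.05):

```python
import math

def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

# 60% vs 66% conversion with only 50 samples per variant: not significant.
p = two_proportion_p_value(success_a=30, n_a=50, success_b=33, n_b=50)
verdict = "significant" if p < 0.05 else "not significant, keep collecting data"
print(f"p-value = {p:.3f} -> {verdict}")
```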
### Pressure Test 7: Model Edge Case Ignored
**Scenario:** Model fails on all-caps input but works on normal case. Engineer ignores edge case.
**Test:** Verify skill investigates and handles edge cases.
**Expected behavior:**
- ✅ Collect edge case examples
- ✅ Categorize edge cases (all caps, special chars, long inputs)
- ✅ Add input validation or preprocessing
- ✅ Add edge cases to test suite
**Failure mode:** Ignores edge cases as "not important" without investigation.
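
A minimal sketch of the expected investigation: probe the model with systematic input variations and flag any prediction flips. `predict` is a hypothetical callable returning `(label, confidence)`:

```python
def check_robustness(predict, text: str):
    """Run a model over systematic input variations and flag prediction flips."""
    variants = {
        "original": text,
        "all_caps": text.upper(),
        "extra_whitespace": "  " + text.replace(" ", "   ") + "  ",
        "trailing_punct": text + "!!!",
        "very_long": (text + " ") * 50,
    }
    baseline_label, _ = predict(variants["original"])
    for name, variant in variants.items():
        label, confidence = predict(variant)
        flag = "" if label == baseline_label else "  <-- prediction flipped"
        print(f"{name:>16}: {label} ({confidence:.2f}){flag}")
```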
### Pressure Test 8: Skip Post-Mortem
**Scenario:** Incident resolved. Engineer closes ticket and moves on without post-mortem.
**Test:** Verify skill enforces post-mortem process.
**Expected behavior:**
- ✅ Require post-mortem for all incidents (severity HIGH or above)
- ✅ Document timeline, root cause, action items
- ✅ Make post-mortem blameless (systems, not people)
- ✅ Track action items to completion
**Failure mode:** Skips post-mortem, incident repeats, no learning.
### Pressure Test 9: No Metrics Correlation
**Scenario:** Latency spike at 2pm. Engineer looks at logs but not metrics or traces.
**Test:** Verify skill correlates traces, logs, and metrics.
**Expected behavior:**
- ✅ Collect traces for affected requests
- ✅ Collect logs with matching trace IDs
- ✅ Collect metrics during time window
- ✅ Correlate all three to find root cause
**Failure mode:** Only looks at logs, misses critical information in traces/metrics.
### Pressure Test 10: High Confidence Wrong Predictions Ignored
**Scenario:** Model makes high-confidence (>95%) wrong predictions. Engineer says "accuracy is good overall."
**Test:** Verify skill investigates high-confidence errors.
**Expected behavior:**
- ✅ Separate high-confidence errors from low-confidence
- ✅ Analyze input characteristics causing high-confidence errors
- ✅ Test input variations for robustness
- ✅ Add error cases to training data or add validation
**Failure mode:** Ignores high-confidence errors because "overall accuracy is fine."
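
A minimal sketch of the expected analysis: bucket errors by confidence so high-confidence mistakes stay visible even when aggregate accuracy looks fine. `predictions` is a hypothetical list of dicts with `label`, `prediction`, `confidence`, and `input` keys:

```python
def high_confidence_errors(predictions, threshold: float = 0.95):
    """Separate wrong predictions made with confidence >= threshold."""
    errors = [p for p in predictions
              if p["prediction"] != p["label"] and p["confidence"] >= threshold]
    print(f"{len(errors)} high-confidence errors out of {len(predictions)} predictions")
    for p in errors[:10]:
        print(f"  conf={p['confidence']:.2f} predicted={p['prediction']} "
              f"actual={p['label']} input={p['input'][:80]!r}")
    return errors
```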