# Production Debugging Techniques Skill

## When to Use This Skill

Use this skill when:

- Investigating production incidents or outages
- Debugging performance bottlenecks or latency spikes
- Analyzing model quality issues (wrong predictions, hallucinations)
- Investigating A/B test anomalies or statistical issues
- Performing post-incident analysis and root cause investigation
- Debugging edge cases or unexpected behavior
- Analyzing production logs, traces, and metrics

**When NOT to use:** Development debugging (use IDE debugger), unit test failures (use TDD), or pre-production validation.

## Core Principle

**Production debugging is forensic investigation, not random guessing.**

Without systematic debugging:

- You make random changes hoping to fix issues (doesn't address root cause)
- You guess bottlenecks without data (optimize the wrong things)
- You can't diagnose issues from logs (missing critical information)
- You panic and rollback without learning (incidents repeat)
- You skip post-mortems (no prevention, just reaction)

**Formula:** Reproduce → Profile → Diagnose → Fix → Verify → Document = Systematic resolution.

## Production Debugging Framework

```
┌─────────────────────────────────┐
│    Incident Detection/Report    │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│     Systematic Reproduction     │
│  Minimal repro, not speculation │
└──────────────────┬──────────────┘
                   │
      ┌────────────┼─────────────┐
      │            │             │
┌─────▼─────┐ ┌────▼─────┐ ┌─────▼─────┐
│Performance│ │  Error   │ │   Model   │
│ Profiling │ │ Analysis │ │ Debugging │
└─────┬─────┘ └────┬─────┘ └─────┬─────┘
      │            │             │
      └────────────┼─────────────┘
                   │
┌──────────────────▼──────────────┐
│    Root Cause Identification    │
│   Not symptoms, actual cause    │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│       Fix Implementation        │
│      Targeted, verified fix     │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│          Verification           │
│         Prove fix works         │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│    Post-Mortem & Prevention     │
│      Blameless, actionable      │
└─────────────────────────────────┘
```
|
|
|
|
|
|
## RED Phase: Common Debugging Anti-Patterns
|
|
|
|
### Anti-Pattern 1: Random Changes (No Systematic Debugging)
|
|
|
|
**Symptom:** "Let me try changing this parameter and see if it helps."
|
|
|
|
**Why it fails:**
|
|
- No reproduction of the issue (can't verify fix)
|
|
- No understanding of root cause (might fix symptom, not cause)
|
|
- No measurement of impact (did it actually help?)
|
|
- Creates more problems (unintended side effects)
|
|
|
|
**Example:**
|
|
|
|
```python
|
|
# WRONG: Random parameter changes without investigation
|
|
def fix_slow_inference():
|
|
# User reported slow inference, let's just try stuff
|
|
model.batch_size = 32 # Maybe this helps?
|
|
model.num_threads = 8 # Or this?
|
|
model.use_cache = True # Definitely cache!
|
|
# Did any of this help? Who knows!
|
|
```
|
|
|
|
**Consequences:**
|
|
- Issue not actually fixed (root cause still present)
|
|
- New issues introduced (different batch size breaks memory)
|
|
- Can't explain what fixed it (no learning)
|
|
- Incident repeats (no prevention)
|
|
|
|
### Anti-Pattern 2: No Profiling (Guess Bottlenecks)
|
|
|
|
**Symptom:** "The database is probably slow, let's add caching everywhere."
|
|
|
|
**Why it fails:**
|
|
- Optimize based on intuition, not data
|
|
- Miss actual bottleneck (CPU, not DB)
|
|
- Waste time on irrelevant optimizations
|
|
- No measurable improvement
|
|
|
|
**Example:**
|
|
|
|
```python
|
|
# WRONG: Adding caching without profiling
|
|
def optimize_without_profiling():
|
|
# Guess: Database is slow
|
|
@cache # Add caching everywhere
|
|
def get_user_data(user_id):
|
|
return db.query(user_id)
|
|
|
|
# Actual bottleneck: JSON serialization (not DB)
|
|
# Caching doesn't help!
|
|
```
|
|
|
|
**Consequences:**
|
|
- Latency still high (actual bottleneck not addressed)
|
|
- Increased complexity (caching layer adds bugs)
|
|
- Wasted optimization effort (wrong target)
|
|
- No improvement in metrics
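
The antidote is to measure before changing anything. Below is a minimal sketch using the standard library's `cProfile`; `handle_request` and its helpers are hypothetical stand-ins for the real code path.

```python
# Measure first, optimize second: a minimal cProfile sketch.
# handle_request and its helpers are placeholders for the real code path.
import cProfile
import io
import json
import pstats


def fetch_user_data(user_id: int) -> dict:
    return {"id": user_id, "history": list(range(5000))}


def serialize(payload: dict) -> str:
    # The real bottleneck often hides in "boring" steps like serialization
    return json.dumps(payload)


def handle_request(user_id: int) -> str:
    return serialize(fetch_user_data(user_id))


profiler = cProfile.Profile()
profiler.enable()
for uid in range(200):
    handle_request(uid)
profiler.disable()

# Rank functions by cumulative time; the top entries are the actual bottlenecks
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```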
|
|
|
|
### Anti-Pattern 3: Bad Logging (Can't Diagnose Issues)
|
|
|
|
**Symptom:** "An error occurred but I can't figure out what caused it."
|
|
|
|
**Why it fails:**
|
|
- Missing context (no user ID, request ID, timestamp)
|
|
- No structured logging (can't query or aggregate)
|
|
- Too much noise (logs everything, signal buried)
|
|
- No trace IDs (can't follow request across services)
|
|
|
|
**Example:**
|
|
|
|
```python
|
|
# WRONG: Useless logging
|
|
def process_request(request):
|
|
print("Processing request") # What request? When? By whom?
|
|
|
|
try:
|
|
result = model.predict(request.data)
|
|
except Exception as e:
|
|
print(f"Error: {e}") # No context, can't debug
|
|
|
|
print("Done") # Success or failure?
|
|
```
|
|
|
|
**Consequences:**
|
|
- Can't reproduce issues (missing critical context)
|
|
- Can't trace distributed requests (no correlation)
|
|
- Can't analyze patterns (unstructured data)
|
|
- Slow investigation (manual log digging)
|
|
|
|
### Anti-Pattern 4: Panic Rollback (Don't Learn from Incidents)
|
|
|
|
**Symptom:** "There's an error! Rollback immediately! Now!"
|
|
|
|
**Why it fails:**
|
|
- No evidence collection (can't do post-mortem)
|
|
- No root cause analysis (will happen again)
|
|
- Lose opportunity to learn (panic mode)
|
|
- No distinction between minor and critical issues
|
|
|
|
**Example:**
|
|
|
|
```python
|
|
# WRONG: Immediate rollback without investigation
|
|
def handle_incident():
|
|
    if error_rate > 0.1:  # Error rate above 10%? Panic, no investigation
|
|
# Rollback immediately!
|
|
deploy_previous_version()
|
|
# Wait, what was the error? We'll never know now...
|
|
```
|
|
|
|
**Consequences:**
|
|
- Issue repeats (root cause not fixed)
|
|
- Lost learning opportunity (no forensics)
|
|
- Unnecessary rollbacks (minor issues treated as critical)
|
|
- Team doesn't improve (no post-mortem)
|
|
|
|
### Anti-Pattern 5: No Post-Mortems
|
|
|
|
**Symptom:** "Incident resolved, let's move on to the next task."
|
|
|
|
**Why it fails:**
|
|
- No prevention (same incident repeats)
|
|
- No learning (team doesn't improve)
|
|
- No action items (nothing changes)
|
|
- Culture of blame (fear of investigation)
|
|
|
|
**Example:**
|
|
|
|
```python
|
|
# WRONG: No post-mortem process
|
|
def resolve_incident(incident):
|
|
fix_issue(incident)
|
|
close_ticket(incident)
|
|
# Done! What incident? Already forgot...
|
|
# No documentation, no prevention, no learning
|
|
```
|
|
|
|
**Consequences:**
|
|
- Incidents repeat (no prevention mechanisms)
|
|
- No improvement (same mistakes over and over)
|
|
- Low bus factor (knowledge not shared)
|
|
- Reactive culture (firefighting, not prevention)
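
What this anti-pattern never produces is a written record. A minimal blameless post-mortem record could look like the sketch below; the field names and values are illustrative, not a formal standard.

```python
# A minimal, blameless post-mortem record (field names are illustrative).
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortem:
    incident_id: str
    summary: str            # what happened, in one or two sentences
    impact: str             # who/what was affected and for how long
    timeline: List[str]     # timestamped events from detection to resolution
    root_cause: str         # the actual cause, not the symptom
    action_items: List[str] = field(default_factory=list)  # each should have an owner


post_mortem = PostMortem(
    incident_id="INC-2025-001",
    summary="API latency rose from 200ms to 2000ms for ~40 minutes.",
    impact="~15% of prediction requests timed out.",
    timeline=[
        "14:02 alert fired on p95 latency",
        "14:20 connection pool exhaustion confirmed",
        "14:41 pool size increased, latency recovered",
    ],
    root_cause="Connection pool size was never scaled with traffic growth.",
    action_items=[
        "Alert when pool utilization exceeds 80%",
        "Add load test to the release checklist",
    ],
)
print(post_mortem.summary)
```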
|
|
|
|
|
|
## GREEN Phase: Systematic Debugging Methodology
|
|
|
|
### Part 1: Systematic Debugging Framework
|
|
|
|
**Core principle:** Reproduce → Diagnose → Fix → Verify
|
|
|
|
**Step-by-step process:**
|
|
|
|
```python
|
|
from dataclasses import dataclass
|
|
from typing import Any, Dict, List, Optional
|
|
from datetime import datetime
|
|
import logging
|
|
|
|
@dataclass
|
|
class DebuggingSession:
|
|
"""
|
|
Structured debugging session with systematic methodology.
|
|
"""
|
|
incident_id: str
|
|
reported_by: str
|
|
reported_at: datetime
|
|
description: str
|
|
severity: str # CRITICAL, HIGH, MEDIUM, LOW
|
|
|
|
# Reproduction
|
|
reproduction_steps: List[str] = None
|
|
minimal_repro: str = None
|
|
reproduction_rate: float = 0.0 # 0.0 to 1.0
|
|
|
|
# Diagnosis
|
|
hypothesis: str = None
|
|
evidence: Dict[str, Any] = None
|
|
root_cause: str = None
|
|
|
|
# Fix
|
|
fix_description: str = None
|
|
fix_verification: str = None
|
|
|
|
# Prevention
|
|
prevention_measures: List[str] = None
|
|
|
|
    def __post_init__(self):
        # Only initialize fields the caller did not provide,
        # so constructor arguments are not silently discarded
        if self.reproduction_steps is None:
            self.reproduction_steps = []
        if self.evidence is None:
            self.evidence = {}
        if self.prevention_measures is None:
            self.prevention_measures = []
|
|
|
|
|
|
class SystematicDebugger:
|
|
"""
|
|
Systematic debugging methodology for production issues.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.logger = logging.getLogger(__name__)
|
|
self.sessions: Dict[str, DebuggingSession] = {}
|
|
|
|
def start_session(
|
|
self,
|
|
incident_id: str,
|
|
reported_by: str,
|
|
description: str,
|
|
severity: str
|
|
) -> DebuggingSession:
|
|
"""
|
|
Start a new debugging session.
|
|
|
|
Args:
|
|
incident_id: Unique incident identifier
|
|
reported_by: Who reported the issue
|
|
description: What is the problem
|
|
severity: CRITICAL, HIGH, MEDIUM, LOW
|
|
|
|
Returns:
|
|
DebuggingSession object
|
|
"""
|
|
session = DebuggingSession(
|
|
incident_id=incident_id,
|
|
reported_by=reported_by,
|
|
reported_at=datetime.now(),
|
|
description=description,
|
|
severity=severity
|
|
)
|
|
|
|
self.sessions[incident_id] = session
|
|
self.logger.info(
|
|
f"Started debugging session",
|
|
extra={
|
|
"incident_id": incident_id,
|
|
"severity": severity,
|
|
"description": description
|
|
}
|
|
)
|
|
|
|
return session
|
|
|
|
def reproduce_issue(
|
|
self,
|
|
session: DebuggingSession,
|
|
reproduction_steps: List[str]
|
|
) -> bool:
|
|
"""
|
|
Step 1: Reproduce the issue with minimal test case.
|
|
|
|
Goal: Create minimal, deterministic reproduction.
|
|
|
|
Args:
|
|
session: Debugging session
|
|
reproduction_steps: Steps to reproduce
|
|
|
|
Returns:
|
|
True if successfully reproduced
|
|
"""
|
|
session.reproduction_steps = reproduction_steps
|
|
|
|
        # Try to reproduce: run the steps several times and record the success rate
        attempts = 10
        successes = sum(
            1 for _ in range(attempts)
            if self._attempt_reproduction(reproduction_steps)
        )
        session.reproduction_rate = successes / attempts

        reproduced = session.reproduction_rate > 0.5
|
|
|
|
self.logger.info(
|
|
f"Reproduction attempt",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"reproduced": reproduced,
|
|
"reproduction_rate": session.reproduction_rate
|
|
}
|
|
)
|
|
|
|
return reproduced
|
|
|
|
def _attempt_reproduction(self, steps: List[str]) -> bool:
|
|
"""
|
|
Attempt to reproduce issue.
|
|
Implementation depends on issue type.
|
|
"""
|
|
# Override in subclass
|
|
return False
|
|
|
|
def collect_evidence(
|
|
self,
|
|
session: DebuggingSession,
|
|
evidence_type: str,
|
|
evidence_data: Any
|
|
):
|
|
"""
|
|
Step 2: Collect evidence from multiple sources.
|
|
|
|
Evidence types:
|
|
- logs: Application logs
|
|
- traces: Distributed traces
|
|
- metrics: Performance metrics
|
|
- profiles: CPU/memory profiles
|
|
- requests: Failed request data
|
|
"""
|
|
if evidence_type not in session.evidence:
|
|
session.evidence[evidence_type] = []
|
|
|
|
session.evidence[evidence_type].append({
|
|
"timestamp": datetime.now(),
|
|
"data": evidence_data
|
|
})
|
|
|
|
self.logger.info(
|
|
f"Collected evidence",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"evidence_type": evidence_type
|
|
}
|
|
)
|
|
|
|
def form_hypothesis(
|
|
self,
|
|
session: DebuggingSession,
|
|
hypothesis: str
|
|
):
|
|
"""
|
|
Step 3: Form hypothesis based on evidence.
|
|
|
|
Good hypothesis:
|
|
- Specific and testable
|
|
- Based on evidence, not intuition
|
|
- Explains all symptoms
|
|
"""
|
|
session.hypothesis = hypothesis
|
|
|
|
self.logger.info(
|
|
f"Formed hypothesis",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"hypothesis": hypothesis
|
|
}
|
|
)
|
|
|
|
def verify_hypothesis(
|
|
self,
|
|
session: DebuggingSession,
|
|
verification_test: str,
|
|
result: bool
|
|
) -> bool:
|
|
"""
|
|
Step 4: Verify hypothesis with targeted test.
|
|
|
|
Args:
|
|
session: Debugging session
|
|
verification_test: What test was run
|
|
result: Did hypothesis hold?
|
|
|
|
Returns:
|
|
True if hypothesis verified
|
|
"""
|
|
self.collect_evidence(
|
|
session,
|
|
"hypothesis_verification",
|
|
{
|
|
"test": verification_test,
|
|
"result": result,
|
|
"hypothesis": session.hypothesis
|
|
}
|
|
)
|
|
|
|
return result
|
|
|
|
def identify_root_cause(
|
|
self,
|
|
session: DebuggingSession,
|
|
root_cause: str
|
|
):
|
|
"""
|
|
Step 5: Identify root cause (not just symptoms).
|
|
|
|
Root cause vs symptom:
|
|
- Symptom: "API returns 500 errors"
|
|
- Root cause: "Connection pool exhausted due to connection leak"
|
|
"""
|
|
session.root_cause = root_cause
|
|
|
|
self.logger.info(
|
|
f"Identified root cause",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"root_cause": root_cause
|
|
}
|
|
)
|
|
|
|
def implement_fix(
|
|
self,
|
|
session: DebuggingSession,
|
|
fix_description: str,
|
|
fix_code: str = None
|
|
):
|
|
"""
|
|
Step 6: Implement targeted fix.
|
|
|
|
Good fix:
|
|
- Addresses root cause, not symptom
|
|
- Minimal changes (surgical fix)
|
|
- Includes verification test
|
|
"""
|
|
session.fix_description = fix_description
|
|
|
|
self.logger.info(
|
|
f"Implemented fix",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"fix_description": fix_description
|
|
}
|
|
)
|
|
|
|
def verify_fix(
|
|
self,
|
|
session: DebuggingSession,
|
|
verification_method: str,
|
|
verified: bool
|
|
) -> bool:
|
|
"""
|
|
Step 7: Verify fix resolves the issue.
|
|
|
|
Verification methods:
|
|
- Reproduction test no longer fails
|
|
- Metrics return to normal
|
|
- No new errors in logs
|
|
- A/B test shows improvement
|
|
"""
|
|
session.fix_verification = verification_method
|
|
|
|
self.logger.info(
|
|
f"Verified fix",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"verified": verified,
|
|
"verification_method": verification_method
|
|
}
|
|
)
|
|
|
|
return verified
|
|
|
|
def add_prevention_measure(
|
|
self,
|
|
session: DebuggingSession,
|
|
measure: str
|
|
):
|
|
"""
|
|
Step 8: Add prevention measures.
|
|
|
|
Prevention types:
|
|
- Monitoring: Alert on similar patterns
|
|
- Testing: Add regression test
|
|
- Validation: Input validation to prevent
|
|
- Documentation: Runbook for similar issues
|
|
"""
|
|
session.prevention_measures.append(measure)
|
|
|
|
self.logger.info(
|
|
f"Added prevention measure",
|
|
extra={
|
|
"incident_id": session.incident_id,
|
|
"measure": measure
|
|
}
|
|
)
|
|
|
|
|
|
# Example usage
|
|
debugger = SystematicDebugger()
|
|
|
|
# Start debugging session
|
|
session = debugger.start_session(
|
|
incident_id="INC-2025-001",
|
|
reported_by="oncall-engineer",
|
|
description="API latency spike from 200ms to 2000ms",
|
|
severity="HIGH"
|
|
)
|
|
|
|
# Step 1: Reproduce
|
|
reproduced = debugger.reproduce_issue(
|
|
session,
|
|
reproduction_steps=[
|
|
"Send 100 concurrent requests to /api/predict",
|
|
"Observe latency increase after 50 requests",
|
|
"Check connection pool metrics"
|
|
]
|
|
)
|
|
|
|
if reproduced:
|
|
# Step 2: Collect evidence
|
|
debugger.collect_evidence(session, "metrics", {
|
|
"latency_p50": 2000,
|
|
"latency_p95": 5000,
|
|
"connection_pool_size": 10,
|
|
"active_connections": 10,
|
|
"waiting_requests": 90
|
|
})
|
|
|
|
# Step 3: Form hypothesis
|
|
debugger.form_hypothesis(
|
|
session,
|
|
"Connection pool exhausted. Pool size (10) too small for load (100 concurrent)."
|
|
)
|
|
|
|
# Step 4: Verify hypothesis
|
|
verified = debugger.verify_hypothesis(
|
|
session,
|
|
"Increased pool size to 50, latency returned to normal",
|
|
True
|
|
)
|
|
|
|
if verified:
|
|
# Step 5: Root cause
|
|
debugger.identify_root_cause(
|
|
session,
|
|
"Connection pool size not scaled with traffic increase"
|
|
)
|
|
|
|
# Step 6: Implement fix
|
|
debugger.implement_fix(
|
|
session,
|
|
"Increase connection pool size to 50 and add auto-scaling"
|
|
)
|
|
|
|
# Step 7: Verify fix
|
|
debugger.verify_fix(
|
|
session,
|
|
"A/B test: latency p95 < 300ms for 1 hour",
|
|
True
|
|
)
|
|
|
|
# Step 8: Prevention
|
|
debugger.add_prevention_measure(
|
|
session,
|
|
"Alert when connection pool utilization > 80%"
|
|
)
|
|
debugger.add_prevention_measure(
|
|
session,
|
|
"Load test before deploying to production"
|
|
)
|
|
```
|
|
|
|
**Key principles:**
|
|
|
|
1. **Reproduce first:** Can't debug what you can't reproduce
|
|
2. **Evidence-based:** Collect data before forming hypothesis
|
|
3. **Root cause, not symptom:** Fix the actual cause
|
|
4. **Verify fix:** Prove it works before closing
|
|
5. **Prevent recurrence:** Add monitoring and regression tests (see the sketch below)
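
Principle 5 is easiest to enforce when the minimal reproduction becomes a permanent regression test. A pytest-style sketch, where `predict_client` is an assumed fixture wrapping the `/api/predict` endpoint from the example above:

```python
# Turn the minimal reproduction into a regression test (pytest-style sketch).
# `predict_client` is an assumed fixture wrapping the /api/predict endpoint.
import concurrent.futures
import time


def send_prediction_request(client) -> float:
    start = time.perf_counter()
    client.post("/api/predict", json={"text": "hello"})
    return (time.perf_counter() - start) * 1000  # latency in ms


def test_latency_under_concurrent_load(predict_client):
    # Replays the incident's reproduction: 100 concurrent requests
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        latencies = list(pool.map(
            lambda _: send_prediction_request(predict_client),
            range(100),
        ))

    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)]
    # The fix is only "done" while this stays true
    assert p95 < 300, f"p95 latency regressed: {p95:.0f}ms"
```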
|
|
|
|
|
|
### Part 2: Performance Profiling
|
|
|
|
**When to profile:**
|
|
- Latency spikes or slow responses
|
|
- High CPU or memory usage
|
|
- Resource exhaustion (connections, threads)
|
|
- Optimization opportunities
|
|
|
|
#### CPU Profiling with py-spy
|
|
|
|
```python
|
|
import subprocess
|
|
import signal
|
|
import time
|
|
from pathlib import Path
|
|
|
|
class ProductionProfiler:
|
|
"""
|
|
Non-intrusive profiling for production systems.
|
|
"""
|
|
|
|
def __init__(self, output_dir: str = "./profiles"):
|
|
self.output_dir = Path(output_dir)
|
|
self.output_dir.mkdir(exist_ok=True)
|
|
|
|
def profile_cpu(
|
|
self,
|
|
pid: int,
|
|
duration: int = 60,
|
|
rate: int = 100
|
|
) -> str:
|
|
"""
|
|
Profile CPU usage with py-spy (no code changes needed).
|
|
|
|
Args:
|
|
pid: Process ID to profile
|
|
duration: How long to profile (seconds)
|
|
rate: Sampling rate (samples/second)
|
|
|
|
Returns:
|
|
Path to flamegraph SVG
|
|
|
|
Usage:
|
|
# Install: pip install py-spy
|
|
# Run: sudo py-spy record -o profile.svg --pid 12345 --duration 60
|
|
"""
|
|
output_file = self.output_dir / f"cpu_profile_{pid}_{int(time.time())}.svg"
|
|
|
|
cmd = [
|
|
"py-spy", "record",
|
|
"-o", str(output_file),
|
|
"--pid", str(pid),
|
|
"--duration", str(duration),
|
|
"--rate", str(rate),
|
|
"--format", "flamegraph"
|
|
]
|
|
|
|
print(f"Profiling PID {pid} for {duration} seconds...")
|
|
subprocess.run(cmd, check=True)
|
|
|
|
print(f"Profile saved to: {output_file}")
|
|
return str(output_file)
|
|
|
|
def profile_memory(
|
|
self,
|
|
pid: int,
|
|
duration: int = 60
|
|
) -> str:
|
|
"""
|
|
Profile memory usage with memory_profiler.
|
|
|
|
Returns:
|
|
Path to memory profile
|
|
"""
|
|
output_file = self.output_dir / f"memory_profile_{pid}_{int(time.time())}.txt"
|
|
|
|
# Use memory_profiler for line-by-line analysis
|
|
cmd = [
|
|
"python", "-m", "memory_profiler",
|
|
"--backend", "psutil",
|
|
str(pid)
|
|
]
|
|
|
|
result = subprocess.run(
|
|
cmd,
|
|
capture_output=True,
|
|
text=True,
|
|
timeout=duration
|
|
)
|
|
|
|
output_file.write_text(result.stdout)
|
|
print(f"Memory profile saved to: {output_file}")
|
|
|
|
return str(output_file)
|
|
|
|
|
|
# Example: Profile production inference
|
|
profiler = ProductionProfiler()
|
|
|
|
# Get PID of running process
|
|
import os
|
|
pid = os.getpid()
|
|
|
|
# Profile for 60 seconds
|
|
flamegraph = profiler.profile_cpu(pid, duration=60)
|
|
print(f"View flamegraph: {flamegraph}")
|
|
|
|
# Analyze flamegraph:
|
|
# - Wide bars = most time spent (bottleneck)
|
|
# - Look for unexpected functions
|
|
# - Check for excessive I/O waits
|
|
```
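
When a process looks hung rather than merely slow, a one-off stack snapshot is often enough. The sketch below shells out to `py-spy dump`, which prints the current Python stack of each thread without stopping the process (it typically needs the same privileges as `py-spy record`).

```python
# One-off stack snapshot of a live process with `py-spy dump`.
import subprocess


def dump_stacks(pid: int) -> str:
    """Return the current Python stack traces of all threads in `pid`."""
    result = subprocess.run(
        ["py-spy", "dump", "--pid", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example: print(dump_stacks(12345))
```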
|
|
|
|
#### PyTorch Model Profiling
|
|
|
|
```python
|
|
import torch
|
|
import torch.profiler as profiler
|
|
from typing import Dict, List
|
|
import json
|
|
|
|
class ModelProfiler:
|
|
"""
|
|
Profile PyTorch model performance.
|
|
"""
|
|
|
|
def profile_model(
|
|
self,
|
|
model: torch.nn.Module,
|
|
sample_input: torch.Tensor,
|
|
num_steps: int = 100
|
|
) -> Dict[str, any]:
|
|
"""
|
|
Profile model inference with PyTorch profiler.
|
|
|
|
Args:
|
|
model: PyTorch model
|
|
sample_input: Sample input tensor
|
|
num_steps: Number of profiling steps
|
|
|
|
Returns:
|
|
Profiling results
|
|
"""
|
|
model.eval()
|
|
|
|
with profiler.profile(
|
|
activities=[
|
|
profiler.ProfilerActivity.CPU,
|
|
profiler.ProfilerActivity.CUDA,
|
|
],
|
|
record_shapes=True,
|
|
profile_memory=True,
|
|
with_stack=True
|
|
) as prof:
|
|
with profiler.record_function("model_inference"):
|
|
for _ in range(num_steps):
|
|
with torch.no_grad():
|
|
_ = model(sample_input)
|
|
|
|
# Print report
|
|
print(prof.key_averages().table(
|
|
sort_by="cuda_time_total",
|
|
row_limit=10
|
|
))
|
|
|
|
# Save trace
|
|
prof.export_chrome_trace("model_trace.json")
|
|
|
|
# Analyze results
|
|
results = self._analyze_profile(prof)
|
|
return results
|
|
|
|
def _analyze_profile(self, prof) -> Dict[str, any]:
|
|
"""
|
|
Analyze profiling results.
|
|
"""
|
|
events = prof.key_averages()
|
|
|
|
# Find bottlenecks
|
|
cpu_events = sorted(
|
|
[e for e in events if e.device_type == profiler.DeviceType.CPU],
|
|
key=lambda e: e.self_cpu_time_total,
|
|
reverse=True
|
|
)
|
|
|
|
cuda_events = sorted(
|
|
[e for e in events if e.device_type == profiler.DeviceType.CUDA],
|
|
key=lambda e: e.self_cuda_time_total,
|
|
reverse=True
|
|
)
|
|
|
|
results = {
|
|
"top_cpu_ops": [
|
|
{
|
|
"name": e.key,
|
|
"cpu_time_ms": e.self_cpu_time_total / 1000,
|
|
"calls": e.count
|
|
}
|
|
for e in cpu_events[:10]
|
|
],
|
|
"top_cuda_ops": [
|
|
{
|
|
"name": e.key,
|
|
"cuda_time_ms": e.self_cuda_time_total / 1000,
|
|
"calls": e.count
|
|
}
|
|
for e in cuda_events[:10]
|
|
],
|
|
"total_cpu_time_ms": sum(e.self_cpu_time_total for e in events) / 1000,
|
|
"total_cuda_time_ms": sum(e.self_cuda_time_total for e in events) / 1000,
|
|
}
|
|
|
|
return results
|
|
|
|
|
|
# Example usage
|
|
import torch.nn as nn
|
|
|
|
model = nn.Transformer(
|
|
d_model=512,
|
|
nhead=8,
|
|
num_encoder_layers=6,
|
|
num_decoder_layers=6
|
|
)
|
|
|
|
sample_input = torch.randn(10, 32, 512) # (seq_len, batch, d_model)
|
|
|
|
model_profiler = ModelProfiler()  # don't reuse the name "profiler", which is the torch.profiler import above
results = model_profiler.profile_model(model, sample_input)
|
|
|
|
print(json.dumps(results, indent=2))
|
|
|
|
# Identify bottlenecks:
|
|
# - Which operations take most time?
|
|
# - CPU vs GPU time (data transfer overhead?)
|
|
# - Memory usage patterns
|
|
```
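
For a continuously running inference service, profiling every step is too expensive; `torch.profiler`'s scheduling support samples a handful of steps instead. A minimal sketch with a toy model standing in for the real one:

```python
# Sample a few inference steps at a time instead of profiling continuously.
# The model and input below are toy placeholders.
import torch
import torch.nn as nn
import torch.profiler as profiler

model = nn.Linear(512, 128).eval()
sample_input = torch.randn(32, 512)

sampling_schedule = profiler.schedule(wait=5, warmup=2, active=3, repeat=1)

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],
    schedule=sampling_schedule,
    on_trace_ready=profiler.tensorboard_trace_handler("./profiles/inference"),
) as prof:
    for _ in range(20):                # stand-in for the serving loop
        with torch.no_grad():
            model(sample_input)
        prof.step()                    # tells the profiler a "step" has finished
```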
|
|
|
|
#### Database Query Profiling
|
|
|
|
```python
|
|
import time
|
|
from contextlib import contextmanager
|
|
from typing import Dict, List
|
|
import logging
|
|
|
|
class QueryProfiler:
|
|
"""
|
|
Profile database query performance.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.logger = logging.getLogger(__name__)
|
|
self.query_stats: List[Dict] = []
|
|
|
|
@contextmanager
|
|
def profile_query(self, query_name: str):
|
|
"""
|
|
Context manager to profile a query.
|
|
|
|
Usage:
|
|
with profiler.profile_query("get_user"):
|
|
user = db.query(User).filter_by(id=user_id).first()
|
|
"""
|
|
start = time.perf_counter()
|
|
|
|
try:
|
|
yield
|
|
finally:
|
|
duration = (time.perf_counter() - start) * 1000 # ms
|
|
|
|
self.query_stats.append({
|
|
"query": query_name,
|
|
"duration_ms": duration,
|
|
"timestamp": time.time()
|
|
})
|
|
|
|
if duration > 100: # Slow query threshold
|
|
self.logger.warning(
|
|
f"Slow query detected",
|
|
extra={
|
|
"query": query_name,
|
|
"duration_ms": duration
|
|
}
|
|
)
|
|
|
|
def get_slow_queries(self, threshold_ms: float = 100) -> List[Dict]:
|
|
"""
|
|
Get queries slower than threshold.
|
|
"""
|
|
return [
|
|
q for q in self.query_stats
|
|
if q["duration_ms"] > threshold_ms
|
|
]
|
|
|
|
def get_query_stats(self) -> Dict[str, Dict]:
|
|
"""
|
|
Get aggregate statistics per query.
|
|
"""
|
|
from collections import defaultdict
|
|
import statistics
|
|
|
|
stats_by_query = defaultdict(list)
|
|
|
|
for q in self.query_stats:
|
|
stats_by_query[q["query"]].append(q["duration_ms"])
|
|
|
|
result = {}
|
|
for query, durations in stats_by_query.items():
|
|
result[query] = {
|
|
"count": len(durations),
|
|
"mean_ms": statistics.mean(durations),
|
|
"median_ms": statistics.median(durations),
|
|
"p95_ms": sorted(durations)[int(len(durations) * 0.95)]
|
|
if len(durations) > 0 else 0,
|
|
"max_ms": max(durations)
|
|
}
|
|
|
|
return result
|
|
|
|
|
|
# Example usage
|
|
profiler = QueryProfiler()
|
|
|
|
# Profile queries
|
|
for user_id in range(100):
|
|
with profiler.profile_query("get_user"):
|
|
# user = db.query(User).filter_by(id=user_id).first()
|
|
time.sleep(0.05) # Simulate query
|
|
|
|
with profiler.profile_query("get_posts"):
|
|
# posts = db.query(Post).filter_by(user_id=user_id).all()
|
|
time.sleep(0.15) # Simulate slow query
|
|
|
|
# Analyze
|
|
slow_queries = profiler.get_slow_queries(threshold_ms=100)
|
|
print(f"Found {len(slow_queries)} slow queries")
|
|
|
|
stats = profiler.get_query_stats()
|
|
for query, metrics in stats.items():
|
|
print(f"{query}: {metrics}")
|
|
```
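
If the service uses SQLAlchemy, the same timing can be attached once at the engine level instead of wrapping every call site. A sketch of the standard cursor-event recipe; the `engine` object is assumed to already exist:

```python
# Log every slow statement at the driver level instead of wrapping call sites.
# Assumes an existing SQLAlchemy `engine`.
import logging
import time

from sqlalchemy import event

logger = logging.getLogger("db.slow_queries")
SLOW_QUERY_MS = 100


def register_slow_query_logging(engine):
    @event.listens_for(engine, "before_cursor_execute")
    def _start_timer(conn, cursor, statement, parameters, context, executemany):
        conn.info.setdefault("query_start", []).append(time.perf_counter())

    @event.listens_for(engine, "after_cursor_execute")
    def _log_if_slow(conn, cursor, statement, parameters, context, executemany):
        duration_ms = (time.perf_counter() - conn.info["query_start"].pop()) * 1000
        if duration_ms > SLOW_QUERY_MS:
            logger.warning(
                "Slow query",
                extra={"duration_ms": duration_ms, "statement": statement[:200]},
            )
```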
|
|
|
|
**Key profiling insights:**
|
|
|
|
| Profile Type | Tool | What to Look For |
|--------------|------|------------------|
| CPU | py-spy | Wide bars in flamegraph (bottlenecks) |
| Memory | memory_profiler | Memory leaks, large allocations |
| Model | torch.profiler | Slow operations, CPU-GPU transfer |
| Database | Query profiler | Slow queries, N+1 queries |
| Network | Distributed tracing | High-latency services, cascading failures |
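
Alongside `memory_profiler`, the standard library's `tracemalloc` can narrow a leak down to the allocation sites with no external tooling. A minimal sketch (the leaky list is a stand-in for the suspect workload):

```python
# Find where memory is being allocated using the standard library's tracemalloc.
import tracemalloc

tracemalloc.start()

# ... run the suspect workload here ...
leaky_cache = [bytes(10_000) for _ in range(1_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    # Each entry shows file:line plus total size and allocation count
    print(stat)
```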
|
|
|
|
|
|
### Part 3: Error Analysis and Root Cause Investigation
|
|
|
|
**Goal:** Categorize errors, find patterns, identify root cause (not symptoms).
|
|
|
|
```python
|
|
from dataclasses import dataclass
|
|
from typing import List, Dict, Optional
|
|
from collections import Counter, defaultdict
|
|
from datetime import datetime, timedelta
|
|
import re
|
|
|
|
@dataclass
|
|
class ErrorEvent:
|
|
"""
|
|
Structured error event.
|
|
"""
|
|
timestamp: datetime
|
|
error_type: str
|
|
error_message: str
|
|
stack_trace: str
|
|
user_id: Optional[str] = None
|
|
request_id: Optional[str] = None
|
|
endpoint: Optional[str] = None
|
|
severity: str = "ERROR" # DEBUG, INFO, WARNING, ERROR, CRITICAL
|
|
|
|
# Context
|
|
input_data: Optional[Dict] = None
|
|
system_state: Optional[Dict] = None
|
|
|
|
|
|
class ErrorAnalyzer:
|
|
"""
|
|
Analyze error patterns and identify root causes.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.errors: List[ErrorEvent] = []
|
|
|
|
def add_error(self, error: ErrorEvent):
|
|
"""Add error to analysis."""
|
|
self.errors.append(error)
|
|
|
|
def categorize_errors(self) -> Dict[str, List[ErrorEvent]]:
|
|
"""
|
|
Categorize errors by type.
|
|
|
|
Categories:
|
|
- Input validation errors
|
|
- Model inference errors
|
|
- Infrastructure errors (DB, network)
|
|
- Third-party API errors
|
|
- Resource exhaustion errors
|
|
"""
|
|
categories = defaultdict(list)
|
|
|
|
for error in self.errors:
|
|
category = self._categorize_single_error(error)
|
|
categories[category].append(error)
|
|
|
|
return dict(categories)
|
|
|
|
def _categorize_single_error(self, error: ErrorEvent) -> str:
|
|
"""
|
|
Categorize single error based on error message and type.
|
|
"""
|
|
msg = error.error_message.lower()
|
|
|
|
# Input validation
|
|
if any(keyword in msg for keyword in ["invalid", "validation", "schema"]):
|
|
return "input_validation"
|
|
|
|
# Model errors
|
|
if any(keyword in msg for keyword in ["model", "inference", "prediction"]):
|
|
return "model_inference"
|
|
|
|
# Infrastructure
|
|
if any(keyword in msg for keyword in ["connection", "timeout", "database"]):
|
|
return "infrastructure"
|
|
|
|
# Resource exhaustion
|
|
if any(keyword in msg for keyword in ["memory", "cpu", "quota", "limit"]):
|
|
return "resource_exhaustion"
|
|
|
|
# Third-party
|
|
if any(keyword in msg for keyword in ["api", "external", "service"]):
|
|
return "third_party"
|
|
|
|
return "unknown"
|
|
|
|
def find_error_patterns(self) -> List[Dict]:
|
|
"""
|
|
Find patterns in errors (temporal, user, endpoint).
|
|
"""
|
|
patterns = []
|
|
|
|
# Temporal clustering (errors spike at certain times?)
|
|
temporal = self._analyze_temporal_patterns()
|
|
if temporal:
|
|
patterns.append({
|
|
"type": "temporal",
|
|
"description": f"Error spike detected",
|
|
"details": temporal
|
|
})
|
|
|
|
# User clustering (errors for specific users?)
|
|
user_errors = defaultdict(int)
|
|
for error in self.errors:
|
|
if error.user_id:
|
|
user_errors[error.user_id] += 1
|
|
|
|
# Top 5 users with most errors
|
|
top_users = sorted(
|
|
user_errors.items(),
|
|
key=lambda x: x[1],
|
|
reverse=True
|
|
)[:5]
|
|
|
|
if top_users and top_users[0][1] > 10:
|
|
patterns.append({
|
|
"type": "user_specific",
|
|
"description": f"High error rate for specific users",
|
|
"details": {"top_users": top_users}
|
|
})
|
|
|
|
# Endpoint clustering
|
|
endpoint_errors = defaultdict(int)
|
|
for error in self.errors:
|
|
if error.endpoint:
|
|
endpoint_errors[error.endpoint] += 1
|
|
|
|
top_endpoints = sorted(
|
|
endpoint_errors.items(),
|
|
key=lambda x: x[1],
|
|
reverse=True
|
|
)[:5]
|
|
|
|
if top_endpoints:
|
|
patterns.append({
|
|
"type": "endpoint_specific",
|
|
"description": f"Errors concentrated in specific endpoints",
|
|
"details": {"top_endpoints": top_endpoints}
|
|
})
|
|
|
|
return patterns
|
|
|
|
def _analyze_temporal_patterns(self) -> Optional[Dict]:
|
|
"""
|
|
Detect temporal error patterns (spikes, periodicity).
|
|
"""
|
|
if len(self.errors) < 10:
|
|
return None
|
|
|
|
# Group by hour
|
|
errors_by_hour = defaultdict(int)
|
|
for error in self.errors:
|
|
hour_key = error.timestamp.replace(minute=0, second=0, microsecond=0)
|
|
errors_by_hour[hour_key] += 1
|
|
|
|
# Calculate average and detect spikes
|
|
error_counts = list(errors_by_hour.values())
|
|
avg_errors = sum(error_counts) / len(error_counts)
|
|
max_errors = max(error_counts)
|
|
|
|
if max_errors > avg_errors * 3: # 3x spike
|
|
spike_hour = max(errors_by_hour, key=errors_by_hour.get)
|
|
return {
|
|
"avg_errors_per_hour": avg_errors,
|
|
"max_errors_per_hour": max_errors,
|
|
"spike_time": spike_hour.isoformat(),
|
|
"spike_magnitude": max_errors / avg_errors
|
|
}
|
|
|
|
return None
|
|
|
|
def identify_root_cause(
|
|
self,
|
|
error_category: str,
|
|
errors: List[ErrorEvent]
|
|
) -> Dict:
|
|
"""
|
|
Identify root cause for category of errors.
|
|
|
|
Analysis steps:
|
|
1. Find common patterns in error messages
|
|
2. Analyze system state at error time
|
|
3. Check for external factors (deployment, traffic spike)
|
|
4. Identify root cause vs symptoms
|
|
"""
|
|
analysis = {
|
|
"category": error_category,
|
|
"total_errors": len(errors),
|
|
"time_range": {
|
|
"start": min(e.timestamp for e in errors).isoformat(),
|
|
"end": max(e.timestamp for e in errors).isoformat()
|
|
}
|
|
}
|
|
|
|
# Common error messages
|
|
error_messages = [e.error_message for e in errors]
|
|
message_counts = Counter(error_messages)
|
|
analysis["most_common_errors"] = message_counts.most_common(5)
|
|
|
|
# Stack trace analysis (find common frames)
|
|
common_frames = self._find_common_stack_frames(errors)
|
|
analysis["common_stack_frames"] = common_frames
|
|
|
|
# Hypothesis based on category
|
|
if error_category == "input_validation":
|
|
analysis["hypothesis"] = "Client sending invalid data. Check API contract."
|
|
analysis["action_items"] = [
|
|
"Add input validation at API layer",
|
|
"Return clear error messages to client",
|
|
"Add monitoring for validation failures"
|
|
]
|
|
|
|
elif error_category == "model_inference":
|
|
analysis["hypothesis"] = "Model failing on specific inputs. Check edge cases."
|
|
analysis["action_items"] = [
|
|
"Analyze failed inputs for patterns",
|
|
"Add input sanitization before inference",
|
|
"Add fallback for model failures",
|
|
"Retrain model with failed examples"
|
|
]
|
|
|
|
elif error_category == "infrastructure":
|
|
analysis["hypothesis"] = "Infrastructure issue (DB, network). Check external dependencies."
|
|
analysis["action_items"] = [
|
|
"Check database connection pool size",
|
|
"Check network connectivity to services",
|
|
"Add retry logic with exponential backoff",
|
|
"Add circuit breaker for failing services"
|
|
]
|
|
|
|
elif error_category == "resource_exhaustion":
|
|
analysis["hypothesis"] = "Resource limits exceeded. Scale up or optimize."
|
|
analysis["action_items"] = [
|
|
"Profile memory/CPU usage",
|
|
"Increase resource limits",
|
|
"Optimize hot paths",
|
|
"Add auto-scaling"
|
|
]
|
|
|
|
        else:
            analysis["hypothesis"] = "Uncategorized errors. Review the most common messages and stack frames manually."
            analysis["action_items"] = [
                "Inspect the most common error messages",
                "Improve error categorization keywords"
            ]

        return analysis
|
|
|
|
def _find_common_stack_frames(
|
|
self,
|
|
errors: List[ErrorEvent],
|
|
min_frequency: float = 0.5
|
|
) -> List[str]:
|
|
"""
|
|
Find stack frames common to most errors.
|
|
"""
|
|
frame_counts = Counter()
|
|
|
|
for error in errors:
|
|
# Extract function names from stack trace
|
|
frames = re.findall(r'File ".*", line \d+, in (\w+)', error.stack_trace)
|
|
frame_counts.update(frames)
|
|
|
|
# Find frames in at least 50% of errors
|
|
threshold = len(errors) * min_frequency
|
|
common_frames = [
|
|
frame for frame, count in frame_counts.items()
|
|
if count >= threshold
|
|
]
|
|
|
|
return common_frames
|
|
|
|
|
|
# Example usage
|
|
analyzer = ErrorAnalyzer()
|
|
|
|
# Simulate errors
|
|
for i in range(100):
|
|
if i % 10 == 0: # Pattern: every 10th request fails
|
|
analyzer.add_error(ErrorEvent(
|
|
timestamp=datetime.now() + timedelta(seconds=i),
|
|
error_type="ValueError",
|
|
error_message="Invalid input shape: expected (batch, 512), got (batch, 256)",
|
|
stack_trace='File "model.py", line 42, in predict\n result = self.model(input_tensor)',
|
|
user_id=f"user_{i % 5}", # Pattern: 5 users with issues
|
|
endpoint="/api/predict"
|
|
))
|
|
|
|
# Categorize errors
|
|
categories = analyzer.categorize_errors()
|
|
print(f"Error categories: {list(categories.keys())}")
|
|
|
|
# Find patterns
|
|
patterns = analyzer.find_error_patterns()
|
|
for pattern in patterns:
|
|
print(f"\nPattern: {pattern['type']}")
|
|
print(f" {pattern['description']}")
|
|
print(f" Details: {pattern['details']}")
|
|
|
|
# Root cause analysis
|
|
for category, errors in categories.items():
|
|
print(f"\n{'='*60}")
|
|
print(f"Root cause analysis: {category}")
|
|
print(f"{'='*60}")
|
|
|
|
analysis = analyzer.identify_root_cause(category, errors)
|
|
|
|
print(f"\nHypothesis: {analysis['hypothesis']}")
|
|
print(f"\nAction items:")
|
|
for item in analysis['action_items']:
|
|
print(f" - {item}")
|
|
```
|
|
|
|
**Root cause analysis checklist:**
|
|
|
|
- [ ] Reproduce error consistently
|
|
- [ ] Categorize error type (input, model, infrastructure, resource)
|
|
- [ ] Find error patterns (temporal, user, endpoint)
|
|
- [ ] Analyze system state at error time
|
|
- [ ] Check for external factors (deployment, traffic, dependencies; see the correlation sketch below)
|
|
- [ ] Distinguish root cause from symptoms
|
|
- [ ] Verify fix resolves root cause
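
For the external-factors item, one quick test of the "bad deployment" hypothesis is to line the error spike up against recent deployments. A sketch; both timestamp lists are assumed to come from your logs and deploy tooling:

```python
# Check whether an error spike started shortly after a deployment.
# Both timestamp lists are assumed to come from your logs / deploy tooling.
from datetime import datetime, timedelta
from typing import List


def deployments_preceding_spike(
    spike_start: datetime,
    deployment_times: List[datetime],
    window: timedelta = timedelta(minutes=30),
) -> List[datetime]:
    """Return deployments that finished within `window` before the spike began."""
    return [d for d in deployment_times if timedelta(0) <= spike_start - d <= window]


spike_start = datetime(2025, 1, 15, 14, 2)
deployments = [datetime(2025, 1, 15, 13, 50), datetime(2025, 1, 14, 9, 0)]

suspects = deployments_preceding_spike(spike_start, deployments)
print(f"Deployments within 30 min before the spike: {suspects}")
```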
|
|
|
|
|
|
### Part 4: A/B Test Debugging
|
|
|
|
**Common A/B test issues:**
|
|
- No statistical significance (insufficient sample size)
|
|
- Confounding factors (unbalanced segments)
|
|
- Simpson's paradox (aggregate vs segment differences)
|
|
- Selection bias (non-random assignment)
|
|
- Novelty effect (temporary impact)
|
|
|
|
```python
|
|
from dataclasses import dataclass
|
|
from typing import List, Dict, Optional
|
|
import numpy as np
|
|
from scipy import stats
|
|
|
|
@dataclass
|
|
class ABTestResult:
|
|
"""
|
|
A/B test variant result.
|
|
"""
|
|
variant: str
|
|
sample_size: int
|
|
success_count: int
|
|
metric_values: List[float]
|
|
|
|
@property
|
|
def success_rate(self) -> float:
|
|
return self.success_count / self.sample_size if self.sample_size > 0 else 0.0
|
|
|
|
@property
|
|
def mean_metric(self) -> float:
|
|
return np.mean(self.metric_values) if self.metric_values else 0.0
|
|
|
|
|
|
class ABTestDebugger:
|
|
"""
|
|
Debug A/B test issues and validate statistical significance.
|
|
"""
|
|
|
|
def validate_test_design(
|
|
self,
|
|
control: ABTestResult,
|
|
treatment: ABTestResult,
|
|
min_sample_size: int = 200
|
|
) -> Dict:
|
|
"""
|
|
Validate A/B test design and detect issues.
|
|
|
|
Returns:
|
|
Validation results with warnings
|
|
"""
|
|
issues = []
|
|
|
|
# Check 1: Sufficient sample size
|
|
if control.sample_size < min_sample_size:
|
|
issues.append({
|
|
"type": "insufficient_sample_size",
|
|
"severity": "CRITICAL",
|
|
"message": f"Control sample size ({control.sample_size}) < minimum ({min_sample_size})"
|
|
})
|
|
|
|
if treatment.sample_size < min_sample_size:
|
|
issues.append({
|
|
"type": "insufficient_sample_size",
|
|
"severity": "CRITICAL",
|
|
"message": f"Treatment sample size ({treatment.sample_size}) < minimum ({min_sample_size})"
|
|
})
|
|
|
|
# Check 2: Balanced sample sizes
|
|
ratio = control.sample_size / treatment.sample_size
|
|
if ratio < 0.8 or ratio > 1.25: # More than 20% imbalance
|
|
issues.append({
|
|
"type": "imbalanced_samples",
|
|
"severity": "WARNING",
|
|
"message": f"Sample size ratio {ratio:.2f} indicates imbalanced assignment"
|
|
})
|
|
|
|
# Check 3: Variance analysis
|
|
control_std = np.std(control.metric_values)
|
|
treatment_std = np.std(treatment.metric_values)
|
|
|
|
if control_std == 0 or treatment_std == 0:
|
|
issues.append({
|
|
"type": "no_variance",
|
|
"severity": "CRITICAL",
|
|
"message": "One variant has zero variance. Check data collection."
|
|
})
|
|
|
|
return {
|
|
"valid": len([i for i in issues if i["severity"] == "CRITICAL"]) == 0,
|
|
"issues": issues
|
|
}
|
|
|
|
def test_statistical_significance(
|
|
self,
|
|
control: ABTestResult,
|
|
treatment: ABTestResult,
|
|
alpha: float = 0.05
|
|
) -> Dict:
|
|
"""
|
|
Test statistical significance between variants.
|
|
|
|
Args:
|
|
control: Control variant results
|
|
treatment: Treatment variant results
|
|
alpha: Significance level (default 0.05)
|
|
|
|
Returns:
|
|
Statistical test results
|
|
"""
|
|
# Two-proportion z-test for success rates
|
|
n1, n2 = control.sample_size, treatment.sample_size
|
|
p1, p2 = control.success_rate, treatment.success_rate
|
|
|
|
# Pooled proportion
|
|
p_pool = (control.success_count + treatment.success_count) / (n1 + n2)
|
|
|
|
# Standard error
|
|
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
|
|
|
|
# Z-score
|
|
z_score = (p2 - p1) / se if se > 0 else 0
|
|
|
|
# P-value (two-tailed)
|
|
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
|
|
|
|
# Effect size (relative lift)
|
|
relative_lift = ((p2 - p1) / p1 * 100) if p1 > 0 else 0
|
|
|
|
# Confidence interval
|
|
ci_margin = stats.norm.ppf(1 - alpha/2) * se
|
|
ci_lower = (p2 - p1) - ci_margin
|
|
ci_upper = (p2 - p1) + ci_margin
|
|
|
|
return {
|
|
"statistically_significant": p_value < alpha,
|
|
"p_value": p_value,
|
|
"z_score": z_score,
|
|
"alpha": alpha,
|
|
"control_rate": p1,
|
|
"treatment_rate": p2,
|
|
"absolute_lift": p2 - p1,
|
|
"relative_lift_percent": relative_lift,
|
|
"confidence_interval": (ci_lower, ci_upper),
|
|
"interpretation": self._interpret_results(p_value, alpha, relative_lift)
|
|
}
|
|
|
|
def _interpret_results(
|
|
self,
|
|
p_value: float,
|
|
alpha: float,
|
|
relative_lift: float
|
|
) -> str:
|
|
"""
|
|
Interpret statistical test results.
|
|
"""
|
|
if p_value < alpha:
|
|
direction = "better" if relative_lift > 0 else "worse"
|
|
return f"Treatment is statistically significantly {direction} than control ({relative_lift:+.1f}% lift)"
|
|
else:
|
|
return f"No statistical significance detected (p={p_value:.3f} > {alpha}). Need more data or larger effect size."
|
|
|
|
def detect_simpsons_paradox(
|
|
self,
|
|
control_segments: Dict[str, ABTestResult],
|
|
treatment_segments: Dict[str, ABTestResult]
|
|
) -> Dict:
|
|
"""
|
|
Detect Simpson's Paradox in segmented data.
|
|
|
|
Simpson's Paradox: Treatment better in each segment but worse overall,
|
|
or vice versa. Caused by confounding variables.
|
|
|
|
Args:
|
|
control_segments: Control results per segment (e.g., by country, device)
|
|
treatment_segments: Treatment results per segment
|
|
|
|
Returns:
|
|
Detection results
|
|
"""
|
|
# Overall results
|
|
total_control = ABTestResult(
|
|
variant="control_total",
|
|
sample_size=sum(s.sample_size for s in control_segments.values()),
|
|
success_count=sum(s.success_count for s in control_segments.values()),
|
|
metric_values=[]
|
|
)
|
|
|
|
total_treatment = ABTestResult(
|
|
variant="treatment_total",
|
|
sample_size=sum(s.sample_size for s in treatment_segments.values()),
|
|
success_count=sum(s.success_count for s in treatment_segments.values()),
|
|
metric_values=[]
|
|
)
|
|
|
|
overall_direction = "treatment_better" if total_treatment.success_rate > total_control.success_rate else "control_better"
|
|
|
|
# Check each segment
|
|
segment_directions = {}
|
|
for segment in control_segments.keys():
|
|
ctrl = control_segments[segment]
|
|
treat = treatment_segments[segment]
|
|
|
|
segment_directions[segment] = "treatment_better" if treat.success_rate > ctrl.success_rate else "control_better"
|
|
|
|
# Detect paradox: overall direction differs from all segments
|
|
all_segments_agree = all(d == overall_direction for d in segment_directions.values())
|
|
|
|
paradox_detected = not all_segments_agree
|
|
|
|
return {
|
|
"paradox_detected": paradox_detected,
|
|
"overall_direction": overall_direction,
|
|
"segment_directions": segment_directions,
|
|
"explanation": self._explain_simpsons_paradox(
|
|
paradox_detected,
|
|
overall_direction,
|
|
segment_directions
|
|
)
|
|
}
|
|
|
|
def _explain_simpsons_paradox(
|
|
self,
|
|
detected: bool,
|
|
overall: str,
|
|
segments: Dict[str, str]
|
|
) -> str:
|
|
"""
|
|
Explain Simpson's Paradox if detected.
|
|
"""
|
|
if not detected:
|
|
return "No Simpson's Paradox detected. Segment and overall results agree."
|
|
|
|
return f"Simpson's Paradox detected! Overall: {overall}, but segments show: {segments}. This indicates a confounding variable. Review segment sizes and assignment."
|
|
|
|
def calculate_required_sample_size(
|
|
self,
|
|
baseline_rate: float,
|
|
minimum_detectable_effect: float,
|
|
alpha: float = 0.05,
|
|
power: float = 0.80
|
|
) -> int:
|
|
"""
|
|
Calculate required sample size per variant.
|
|
|
|
Args:
|
|
baseline_rate: Current conversion rate (e.g., 0.10 for 10%)
|
|
minimum_detectable_effect: Minimum relative change to detect (e.g., 0.10 for 10% improvement)
|
|
alpha: Significance level (default 0.05)
|
|
power: Statistical power (default 0.80)
|
|
|
|
Returns:
|
|
Required sample size per variant
|
|
"""
|
|
treatment_rate = baseline_rate * (1 + minimum_detectable_effect)
|
|
|
|
# Effect size (Cohen's h)
|
|
effect_size = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))
|
|
|
|
# Z-scores
|
|
z_alpha = stats.norm.ppf(1 - alpha/2)
|
|
z_beta = stats.norm.ppf(power)
|
|
|
|
# Sample size calculation
|
|
n = ((z_alpha + z_beta) / effect_size) ** 2
|
|
|
|
return int(np.ceil(n))
|
|
|
|
|
|
# Example: Debug A/B test
|
|
debugger = ABTestDebugger()
|
|
|
|
# Simulate test results
|
|
control = ABTestResult(
|
|
variant="control",
|
|
sample_size=500,
|
|
success_count=50, # 10% conversion
|
|
metric_values=np.random.normal(100, 20, 500).tolist()
|
|
)
|
|
|
|
treatment = ABTestResult(
|
|
variant="treatment",
|
|
sample_size=520,
|
|
success_count=62, # 11.9% conversion
|
|
metric_values=np.random.normal(105, 20, 520).tolist()
|
|
)
|
|
|
|
# Validate design
|
|
validation = debugger.validate_test_design(control, treatment)
|
|
print(f"Test valid: {validation['valid']}")
|
|
if validation['issues']:
|
|
for issue in validation['issues']:
|
|
print(f" [{issue['severity']}] {issue['message']}")
|
|
|
|
# Test significance
|
|
results = debugger.test_statistical_significance(control, treatment)
|
|
print(f"\nStatistical significance: {results['statistically_significant']}")
|
|
print(f"P-value: {results['p_value']:.4f}")
|
|
print(f"Relative lift: {results['relative_lift_percent']:.2f}%")
|
|
print(f"Interpretation: {results['interpretation']}")
|
|
|
|
# Check for Simpson's Paradox
|
|
control_segments = {
|
|
"US": ABTestResult("control_US", 300, 40, []),
|
|
"UK": ABTestResult("control_UK", 200, 10, [])
|
|
}
|
|
|
|
treatment_segments = {
|
|
"US": ABTestResult("treatment_US", 400, 48, []), # Better
|
|
"UK": ABTestResult("treatment_UK", 120, 14, []) # Better
|
|
}
|
|
|
|
paradox = debugger.detect_simpsons_paradox(control_segments, treatment_segments)
|
|
print(f"\nSimpson's Paradox: {paradox['paradox_detected']}")
|
|
print(f"Explanation: {paradox['explanation']}")
|
|
|
|
# Calculate required sample size
|
|
required_n = debugger.calculate_required_sample_size(
|
|
baseline_rate=0.10,
|
|
minimum_detectable_effect=0.10 # Detect 10% relative improvement
|
|
)
|
|
print(f"\nRequired sample size per variant: {required_n}")
|
|
```
|
|
|
|
**A/B test debugging checklist:**
|
|
|
|
- [ ] Sufficient sample size (use power analysis)
|
|
- [ ] Balanced assignment ratio (e.g., 50/50), consistent over time; check for sample ratio mismatch (see the SRM sketch below)
|
|
- [ ] Random assignment (no selection bias)
|
|
- [ ] Statistical significance (p < 0.05)
|
|
- [ ] Practical significance (meaningful effect size)
|
|
- [ ] Check for Simpson's Paradox (segment analysis)
|
|
- [ ] Monitor for novelty effect (long-term trends)
|
|
- [ ] Validate metrics (correct calculation, no bugs)
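
A concrete way to check the balanced-assignment item is a sample ratio mismatch (SRM) test: compare observed variant counts against the intended split with a chi-square test. A sketch using `scipy.stats.chisquare`:

```python
# Sample ratio mismatch (SRM) check: did the traffic split match the intended ratio?
from scipy import stats


def check_sample_ratio_mismatch(
    control_count: int,
    treatment_count: int,
    intended_control_share: float = 0.5,
    alpha: float = 0.001,  # SRM checks typically use a strict threshold
) -> dict:
    total = control_count + treatment_count
    expected = [total * intended_control_share, total * (1 - intended_control_share)]
    statistic, p_value = stats.chisquare([control_count, treatment_count], f_exp=expected)
    return {
        "srm_detected": p_value < alpha,
        "p_value": p_value,
        "chi2_statistic": statistic,
        "observed": [control_count, treatment_count],
        "expected": expected,
    }


# Example: a 50/50 test that actually served 500 vs 620 users
print(check_sample_ratio_mismatch(500, 620))
```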
|
|
|
|
|
|
### Part 5: Model Debugging (Wrong Predictions, Edge Cases)
|
|
|
|
**Common model issues:**
|
|
- Wrong predictions on edge cases
|
|
- High confidence wrong predictions
|
|
- Inconsistent behavior (same input, different output)
|
|
- Bias or fairness issues
|
|
- Input validation failures
|
|
|
|
```python
|
|
from dataclasses import dataclass
|
|
from typing import List, Dict, Any, Optional
|
|
import numpy as np
|
|
import torch
|
|
|
|
@dataclass
|
|
class PredictionError:
|
|
"""
|
|
Failed prediction for analysis.
|
|
"""
|
|
input_data: Any
|
|
true_label: Any
|
|
predicted_label: Any
|
|
confidence: float
|
|
error_type: str # wrong_class, low_confidence, edge_case, etc.
|
|
|
|
|
|
class ModelDebugger:
|
|
"""
|
|
Debug model prediction errors and edge cases.
|
|
"""
|
|
|
|
def __init__(self, model, tokenizer=None):
|
|
self.model = model
|
|
self.tokenizer = tokenizer
|
|
self.errors: List[PredictionError] = []
|
|
|
|
def add_error(self, error: PredictionError):
|
|
"""Add prediction error for analysis."""
|
|
self.errors.append(error)
|
|
|
|
def find_error_patterns(self) -> Dict[str, List[PredictionError]]:
|
|
"""
|
|
Find patterns in prediction errors.
|
|
|
|
Patterns:
|
|
- Errors on specific input types (long text, numbers, special chars)
|
|
- Errors on specific classes (class imbalance?)
|
|
- High-confidence errors (model overconfident)
|
|
- Consistent errors (model learned wrong pattern)
|
|
"""
|
|
patterns = {
|
|
"high_confidence_errors": [],
|
|
"low_confidence_errors": [],
|
|
"edge_cases": [],
|
|
"class_specific": {}
|
|
}
|
|
|
|
for error in self.errors:
|
|
# High confidence but wrong
|
|
if error.confidence > 0.9:
|
|
patterns["high_confidence_errors"].append(error)
|
|
|
|
# Low confidence (uncertain)
|
|
elif error.confidence < 0.6:
|
|
patterns["low_confidence_errors"].append(error)
|
|
|
|
# Edge cases
|
|
if error.error_type == "edge_case":
|
|
patterns["edge_cases"].append(error)
|
|
|
|
# Group by predicted class
|
|
pred_class = str(error.predicted_label)
|
|
if pred_class not in patterns["class_specific"]:
|
|
patterns["class_specific"][pred_class] = []
|
|
patterns["class_specific"][pred_class].append(error)
|
|
|
|
return patterns
|
|
|
|
def analyze_edge_cases(self) -> List[Dict]:
|
|
"""
|
|
Analyze edge cases to understand failure modes.
|
|
|
|
Edge case types:
|
|
- Out-of-distribution inputs
|
|
- Extreme values (very long, very short)
|
|
- Special characters or formatting
|
|
- Ambiguous inputs
|
|
"""
|
|
edge_cases = [e for e in self.errors if e.error_type == "edge_case"]
|
|
|
|
analyses = []
|
|
for error in edge_cases:
|
|
analysis = {
|
|
"input": error.input_data,
|
|
"true_label": error.true_label,
|
|
"predicted_label": error.predicted_label,
|
|
"confidence": error.confidence,
|
|
"characteristics": self._characterize_input(error.input_data)
|
|
}
|
|
analyses.append(analysis)
|
|
|
|
return analyses
|
|
|
|
def _characterize_input(self, input_data: Any) -> Dict:
|
|
"""
|
|
Characterize input to identify unusual features.
|
|
"""
|
|
if isinstance(input_data, str):
|
|
return {
|
|
"type": "text",
|
|
"length": len(input_data),
|
|
"has_numbers": any(c.isdigit() for c in input_data),
|
|
"has_special_chars": any(not c.isalnum() and not c.isspace() for c in input_data),
|
|
"all_caps": input_data.isupper(),
|
|
"all_lowercase": input_data.islower()
|
|
}
|
|
elif isinstance(input_data, (list, np.ndarray)):
|
|
return {
|
|
"type": "array",
|
|
"shape": np.array(input_data).shape,
|
|
"min": np.min(input_data),
|
|
"max": np.max(input_data),
|
|
"mean": np.mean(input_data)
|
|
}
|
|
else:
|
|
return {"type": str(type(input_data))}
|
|
|
|
def test_input_variations(
|
|
self,
|
|
input_data: Any,
|
|
variations: List[str]
|
|
) -> Dict[str, Any]:
|
|
"""
|
|
Test model on variations of input to check robustness.
|
|
|
|
Variations:
|
|
- case_change: Change case (upper/lower)
|
|
- whitespace: Add/remove whitespace
|
|
- typos: Introduce typos
|
|
- paraphrase: Rephrase input
|
|
|
|
Args:
|
|
input_data: Original input
|
|
variations: List of variation types to test
|
|
|
|
Returns:
|
|
Results for each variation
|
|
"""
|
|
results = {}
|
|
|
|
# Original prediction
|
|
original_pred = self._predict(input_data)
|
|
results["original"] = {
|
|
"input": input_data,
|
|
"prediction": original_pred
|
|
}
|
|
|
|
# Generate and test variations
|
|
for var_type in variations:
|
|
varied_input = self._generate_variation(input_data, var_type)
|
|
varied_pred = self._predict(varied_input)
|
|
|
|
results[var_type] = {
|
|
"input": varied_input,
|
|
"prediction": varied_pred,
|
|
"consistent": varied_pred["label"] == original_pred["label"]
|
|
}
|
|
|
|
# Check consistency
|
|
all_consistent = all(r.get("consistent", True) for r in results.values() if r != results["original"])
|
|
|
|
return {
|
|
"consistent": all_consistent,
|
|
"results": results
|
|
}
|
|
|
|
def _generate_variation(self, input_data: str, variation_type: str) -> str:
|
|
"""
|
|
Generate input variation.
|
|
"""
|
|
if variation_type == "case_change":
|
|
return input_data.upper() if input_data.islower() else input_data.lower()
|
|
|
|
elif variation_type == "whitespace":
|
|
return " ".join(input_data.split())
|
|
|
|
elif variation_type == "typos":
|
|
# Simple typo: swap two adjacent characters
|
|
if len(input_data) > 2:
|
|
idx = len(input_data) // 2
|
|
return input_data[:idx] + input_data[idx+1] + input_data[idx] + input_data[idx+2:]
|
|
return input_data
|
|
|
|
return input_data
|
|
|
|
def _predict(self, input_data: Any) -> Dict:
|
|
"""
|
|
Run model prediction.
|
|
"""
|
|
# Simplified prediction (adapt to your model)
|
|
# Example for text classification
|
|
if self.tokenizer:
|
|
inputs = self.tokenizer(input_data, return_tensors="pt")
|
|
with torch.no_grad():
|
|
outputs = self.model(**inputs)
|
|
|
|
probs = torch.softmax(outputs.logits, dim=-1)
|
|
pred_class = torch.argmax(probs, dim=-1).item()
|
|
confidence = probs[0, pred_class].item()
|
|
|
|
return {
|
|
"label": pred_class,
|
|
"confidence": confidence
|
|
}
|
|
|
|
return {"label": None, "confidence": 0.0}
|
|
|
|
def validate_inputs(self, inputs: List[Any]) -> List[Dict]:
|
|
"""
|
|
Validate inputs before inference.
|
|
|
|
Validation checks:
|
|
- Type correctness
|
|
- Value ranges
|
|
- Format compliance
|
|
- Size limits
|
|
"""
|
|
validation_results = []
|
|
|
|
for i, input_data in enumerate(inputs):
|
|
issues = []
|
|
|
|
if isinstance(input_data, str):
|
|
# Text validation
|
|
if len(input_data) == 0:
|
|
issues.append("Empty input")
|
|
elif len(input_data) > 10000:
|
|
issues.append("Input too long (>10k chars)")
|
|
|
|
if not input_data.strip():
|
|
issues.append("Only whitespace")
|
|
|
|
validation_results.append({
|
|
"index": i,
|
|
"valid": len(issues) == 0,
|
|
"issues": issues
|
|
})
|
|
|
|
return validation_results
|
|
|
|
|
|
# Example usage
|
|
class DummyModel:
|
|
def __call__(self, input_ids, attention_mask):
|
|
# Dummy model for demonstration
|
|
return type('obj', (object,), {
|
|
'logits': torch.randn(1, 3)
|
|
})
|
|
|
|
|
|
class DummyTokenizer:
|
|
def __call__(self, text, return_tensors=None):
|
|
return {
|
|
"input_ids": torch.randint(0, 1000, (1, 10)),
|
|
"attention_mask": torch.ones(1, 10)
|
|
}
|
|
|
|
|
|
model = DummyModel()
|
|
tokenizer = DummyTokenizer()
|
|
debugger = ModelDebugger(model, tokenizer)
|
|
|
|
# Add prediction errors
|
|
debugger.add_error(PredictionError(
|
|
input_data="This is a test",
|
|
true_label=1,
|
|
predicted_label=2,
|
|
confidence=0.95,
|
|
error_type="high_confidence"
|
|
))
|
|
|
|
debugger.add_error(PredictionError(
|
|
input_data="AAAAAAAAAAA", # Edge case: all same character
|
|
true_label=0,
|
|
predicted_label=1,
|
|
confidence=0.85,
|
|
error_type="edge_case"
|
|
))
|
|
|
|
# Find error patterns
|
|
patterns = debugger.find_error_patterns()
|
|
print(f"High confidence errors: {len(patterns['high_confidence_errors'])}")
|
|
print(f"Edge cases: {len(patterns['edge_cases'])}")
|
|
|
|
# Analyze edge cases
|
|
edge_analyses = debugger.analyze_edge_cases()
|
|
for analysis in edge_analyses:
|
|
print(f"\nEdge case: {analysis['input']}")
|
|
print(f"Characteristics: {analysis['characteristics']}")
|
|
|
|
# Test input variations
|
|
variations_result = debugger.test_input_variations(
|
|
"This is a test",
|
|
["case_change", "whitespace", "typos"]
|
|
)
|
|
print(f"\nInput variation consistency: {variations_result['consistent']}")
|
|
```
|
|
|
|
**Model debugging checklist:**
|
|
|
|
- [ ] Collect failed predictions with context
|
|
- [ ] Categorize errors (high confidence, edge cases, class-specific)
|
|
- [ ] Analyze input characteristics (what makes them fail?)
|
|
- [ ] Test input variations (robustness check)
|
|
- [ ] Validate inputs before inference (prevent bad inputs)
|
|
- [ ] Check for bias (fairness across groups)
|
|
- [ ] Add error cases to training data (improve model; see the export sketch below)
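
For the last item, the collected `PredictionError` records can be exported in a format a labeling or retraining pipeline can consume. A minimal JSONL export sketch; the output path is arbitrary:

```python
# Export collected prediction errors as JSONL for later labeling / retraining.
# Reuses the PredictionError records gathered by the ModelDebugger above.
import json
from pathlib import Path


def export_errors_for_retraining(errors, output_path: str = "failed_predictions.jsonl") -> int:
    path = Path(output_path)
    with path.open("w", encoding="utf-8") as f:
        for error in errors:
            f.write(json.dumps({
                "input": str(error.input_data),
                "true_label": error.true_label,
                "predicted_label": error.predicted_label,
                "confidence": error.confidence,
                "error_type": error.error_type,
            }) + "\n")
    return len(errors)


count = export_errors_for_retraining(debugger.errors)
print(f"Exported {count} failed predictions for review")
```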
|
|
|
|
|
|
### Part 6: Logging Best Practices
|
|
|
|
**Good logging enables debugging. Bad logging creates noise.**
|
|
|
|
```python
|
|
import logging
|
|
import json
|
|
import sys
|
|
from datetime import datetime
|
|
from contextvars import ContextVar
|
|
from typing import Dict, Any, Optional
|
|
import traceback
|
|
|
|
# Context variable for request/trace ID
|
|
request_id_var: ContextVar[Optional[str]] = ContextVar('request_id', default=None)
|
|
|
|
|
|
class StructuredLogger:
|
|
"""
|
|
Structured logging for production systems.
|
|
|
|
Best practices:
|
|
- JSON format (machine-readable)
|
|
- Include context (request_id, user_id, etc.)
|
|
- Log at appropriate levels
|
|
- Include timing information
|
|
- Don't log sensitive data
|
|
"""
|
|
|
|
def __init__(self, name: str):
|
|
self.logger = logging.getLogger(name)
|
|
self.logger.setLevel(logging.INFO)
|
|
|
|
# JSON formatter
|
|
handler = logging.StreamHandler(sys.stdout)
|
|
handler.setFormatter(self.JSONFormatter())
|
|
self.logger.addHandler(handler)
|
|
|
|
class JSONFormatter(logging.Formatter):
|
|
"""
|
|
Format logs as JSON.
|
|
"""
|
|
def format(self, record: logging.LogRecord) -> str:
|
|
log_data = {
|
|
"timestamp": datetime.utcnow().isoformat() + "Z",
|
|
"level": record.levelname,
|
|
"logger": record.name,
|
|
"message": record.getMessage(),
|
|
}
|
|
|
|
# Add request ID from context
|
|
request_id = request_id_var.get()
|
|
if request_id:
|
|
log_data["request_id"] = request_id
|
|
|
|
# Add extra fields
|
|
if hasattr(record, "extra"):
|
|
log_data.update(record.extra)
|
|
|
|
# Add exception info
|
|
if record.exc_info:
|
|
log_data["exception"] = {
|
|
"type": record.exc_info[0].__name__,
|
|
"message": str(record.exc_info[1]),
|
|
"traceback": traceback.format_exception(*record.exc_info)
|
|
}
|
|
|
|
return json.dumps(log_data)
|
|
|
|
def log(
|
|
self,
|
|
level: str,
|
|
message: str,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log with structured context.
|
|
|
|
Args:
|
|
level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
|
|
message: Log message
|
|
**kwargs: Additional context fields
|
|
"""
|
|
        log_method = getattr(self.logger, level.lower())

        # Pass exc_info through to the logging call so the JSON formatter can render
        # the traceback; everything else becomes structured context on the record
        exc_info = kwargs.pop("exc_info", None)
        log_method(message, extra={"extra": kwargs}, exc_info=exc_info)
|
|
|
|
def debug(self, message: str, **kwargs):
|
|
self.log("DEBUG", message, **kwargs)
|
|
|
|
def info(self, message: str, **kwargs):
|
|
self.log("INFO", message, **kwargs)
|
|
|
|
def warning(self, message: str, **kwargs):
|
|
self.log("WARNING", message, **kwargs)
|
|
|
|
def error(self, message: str, **kwargs):
|
|
self.log("ERROR", message, **kwargs)
|
|
|
|
def critical(self, message: str, **kwargs):
|
|
self.log("CRITICAL", message, **kwargs)
|
|
|
|
|
|
class RequestLogger:
|
|
"""
|
|
Log HTTP requests with full context.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.logger = StructuredLogger("api")
|
|
|
|
def log_request(
|
|
self,
|
|
request_id: str,
|
|
method: str,
|
|
path: str,
|
|
user_id: Optional[str] = None,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log incoming request.
|
|
"""
|
|
# Set request ID in context
|
|
request_id_var.set(request_id)
|
|
|
|
self.logger.info(
|
|
"Request started",
|
|
request_id=request_id,
|
|
method=method,
|
|
path=path,
|
|
user_id=user_id,
|
|
**kwargs
|
|
)
|
|
|
|
def log_response(
|
|
self,
|
|
request_id: str,
|
|
status_code: int,
|
|
duration_ms: float,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log response with timing.
|
|
"""
|
|
level = "INFO" if status_code < 400 else "ERROR"
|
|
|
|
self.logger.log(
|
|
level,
|
|
"Request completed",
|
|
request_id=request_id,
|
|
status_code=status_code,
|
|
duration_ms=duration_ms,
|
|
**kwargs
|
|
)
|
|
|
|
def log_error(
|
|
self,
|
|
request_id: str,
|
|
error: Exception,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log request error with full context.
|
|
"""
|
|
self.logger.error(
|
|
"Request failed",
|
|
request_id=request_id,
|
|
error_type=type(error).__name__,
|
|
error_message=str(error),
|
|
**kwargs,
|
|
exc_info=True
|
|
)
|
|
|
|
|
|
class ModelInferenceLogger:
|
|
"""
|
|
Log model inference with input/output context.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.logger = StructuredLogger("model")
|
|
|
|
def log_inference(
|
|
self,
|
|
model_name: str,
|
|
model_version: str,
|
|
input_shape: tuple,
|
|
output_shape: tuple,
|
|
duration_ms: float,
|
|
request_id: Optional[str] = None,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log model inference.
|
|
"""
|
|
self.logger.info(
|
|
"Model inference",
|
|
model_name=model_name,
|
|
model_version=model_version,
|
|
input_shape=input_shape,
|
|
output_shape=output_shape,
|
|
duration_ms=duration_ms,
|
|
request_id=request_id,
|
|
**kwargs
|
|
)
|
|
|
|
def log_prediction_error(
|
|
self,
|
|
model_name: str,
|
|
error: Exception,
|
|
input_sample: Any,
|
|
request_id: Optional[str] = None,
|
|
**kwargs
|
|
):
|
|
"""
|
|
Log prediction error with input context.
|
|
|
|
Note: Be careful not to log sensitive data!
|
|
"""
|
|
# Sanitize input (don't log full input if sensitive)
|
|
input_summary = self._summarize_input(input_sample)
|
|
|
|
self.logger.error(
|
|
"Prediction failed",
|
|
model_name=model_name,
|
|
error_type=type(error).__name__,
|
|
error_message=str(error),
|
|
input_summary=input_summary,
|
|
request_id=request_id,
|
|
**kwargs,
|
|
exc_info=True
|
|
)
|
|
|
|
def _summarize_input(self, input_sample: Any) -> Dict:
|
|
"""
|
|
Summarize input without logging sensitive data.
|
|
"""
|
|
if isinstance(input_sample, str):
|
|
return {
|
|
"type": "text",
|
|
"length": len(input_sample),
|
|
"preview": input_sample[:50] + "..." if len(input_sample) > 50 else input_sample
|
|
}
|
|
elif isinstance(input_sample, (list, tuple)):
|
|
return {
|
|
"type": "array",
|
|
"length": len(input_sample)
|
|
}
|
|
else:
|
|
return {
|
|
"type": str(type(input_sample))
|
|
}
|
|
|
|
|
|
# Example usage
|
|
request_logger = RequestLogger()
|
|
model_logger = ModelInferenceLogger()
|
|
|
|
# Log request
|
|
import uuid
|
|
import time
|
|
|
|
request_id = str(uuid.uuid4())
|
|
start_time = time.time()
|
|
|
|
request_logger.log_request(
|
|
request_id=request_id,
|
|
method="POST",
|
|
path="/api/predict",
|
|
user_id="user_123",
|
|
client_ip="192.168.1.100"
|
|
)
|
|
|
|
# Log model inference
|
|
model_logger.log_inference(
|
|
model_name="sentiment-classifier",
|
|
model_version="v2.1",
|
|
input_shape=(1, 512),
|
|
output_shape=(1, 3),
|
|
duration_ms=45.2,
|
|
request_id=request_id,
|
|
batch_size=1
|
|
)
|
|
|
|
# Log response
|
|
duration_ms = (time.time() - start_time) * 1000
|
|
request_logger.log_response(
|
|
request_id=request_id,
|
|
status_code=200,
|
|
duration_ms=duration_ms
|
|
)
|
|
|
|
# Log error (example)
|
|
try:
|
|
raise ValueError("Invalid input shape")
|
|
except Exception as e:
|
|
request_logger.log_error(request_id, e, endpoint="/api/predict")
|
|
```

**What to log:**

| Level | What to Log | Example |
|-------|-------------|---------|
| DEBUG | Detailed diagnostic info | Variable values, function entry/exit |
| INFO | Normal operations | Request started, prediction completed |
| WARNING | Unexpected but handled | Retry attempt, fallback used |
| ERROR | Error conditions | API error, prediction failed |
| CRITICAL | System failure | Database down, out of memory |

**What NOT to log:**

- Passwords, API keys, tokens
- Credit card numbers, SSNs
- Full user data (GDPR violation)
- Large payloads (log summary instead)

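One way to enforce the "what NOT to log" rules, rather than relying on reviewer discipline, is a redaction pass over the context dict before it reaches the logger. A minimal sketch; the deny-list keys and the 500-character payload cutoff are illustrative assumptions:

```python
from typing import Any, Dict

# Illustrative deny-list; extend for your own schema
SENSITIVE_KEYS = {"password", "api_key", "token", "credit_card", "ssn"}


def redact_fields(context: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the log context with sensitive fields masked."""
    redacted: Dict[str, Any] = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            redacted[key] = "[REDACTED]"
        elif isinstance(value, str) and len(value) > 500:
            # Large payloads: log a summary instead of the full value
            redacted[key] = f"<{len(value)} chars truncated>"
        else:
            redacted[key] = value
    return redacted


print(redact_fields({"user_id": "user_123", "api_key": "sk-secret", "body": "x" * 1000}))
# {'user_id': 'user_123', 'api_key': '[REDACTED]', 'body': '<1000 chars truncated>'}
```
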
**Logging checklist:**

- [ ] Use structured logging (JSON format)
- [ ] Include trace/request IDs (correlation; see the query sketch below)
- [ ] Log at appropriate levels
- [ ] Include timing information
- [ ] Don't log sensitive data
- [ ] Make logs queryable (structured fields)
- [ ] Include sufficient context for debugging
- [ ] Log errors with full stack traces

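Structured JSON logs with request IDs pay off at query time: you can pull every event for a single failing request across services. A minimal sketch that filters a JSON-lines log file by `request_id` (the field names follow the `StructuredLogger` output above; the file path in the usage comment is an assumption):

```python
import json
from typing import Dict, Iterator


def logs_for_request(log_path: str, request_id: str) -> Iterator[Dict]:
    """Yield all structured log entries belonging to one request."""
    with open(log_path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (e.g., stray stack trace fragments)
            if entry.get("request_id") == request_id:
                yield entry


# Usage: reconstruct a single request's story, in order
# for entry in logs_for_request("/var/log/api.jsonl", "a1b2c3"):
#     print(entry["timestamp"], entry["level"], entry["message"])
```
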
### Part 7: Rollback Procedures

**When to rollback:**

- Critical error rate spike (>5% errors)
- Significant metric regression (>10% drop)
- Security vulnerability discovered
- Cascading failures affecting downstream

**When NOT to rollback:**

- Minor errors (<1% error rate)
- Single user complaints (investigate first)
- Performance slightly worse (measure first)
- New feature not perfect (iterate instead)

```python
|
|
from dataclasses import dataclass
|
|
from typing import Dict, List, Optional
|
|
from datetime import datetime
|
|
import subprocess
|
|
|
|
@dataclass
|
|
class DeploymentMetrics:
|
|
"""
|
|
Metrics to monitor during deployment.
|
|
"""
|
|
error_rate: float
|
|
latency_p95_ms: float
|
|
success_rate: float
|
|
throughput_qps: float
|
|
cpu_usage_percent: float
|
|
memory_usage_percent: float
|
|
|
|
|
|
class RollbackDecider:
|
|
"""
|
|
Decide whether to rollback based on metrics.
|
|
"""
|
|
|
|
def __init__(
|
|
self,
|
|
baseline_metrics: DeploymentMetrics,
|
|
thresholds: Dict[str, float]
|
|
):
|
|
"""
|
|
Args:
|
|
baseline_metrics: Metrics from previous stable version
|
|
thresholds: Rollback thresholds (e.g., {"error_rate": 0.05})
|
|
"""
|
|
self.baseline = baseline_metrics
|
|
self.thresholds = thresholds
|
|
|
|
def should_rollback(
|
|
self,
|
|
current_metrics: DeploymentMetrics
|
|
) -> Dict:
|
|
"""
|
|
Decide if rollback is needed.
|
|
|
|
Returns:
|
|
Decision with reasoning
|
|
"""
|
|
violations = []
|
|
|
|
# Check error rate
|
|
if current_metrics.error_rate > self.thresholds.get("error_rate", 0.05):
|
|
violations.append({
|
|
"metric": "error_rate",
|
|
"baseline": self.baseline.error_rate,
|
|
"current": current_metrics.error_rate,
|
|
"threshold": self.thresholds["error_rate"],
|
|
"severity": "CRITICAL"
|
|
})
|
|
|
|
# Check latency
|
|
latency_increase = (current_metrics.latency_p95_ms - self.baseline.latency_p95_ms) / self.baseline.latency_p95_ms
|
|
if latency_increase > self.thresholds.get("latency_increase", 0.25): # 25% increase
|
|
violations.append({
|
|
"metric": "latency_p95_ms",
|
|
"baseline": self.baseline.latency_p95_ms,
|
|
"current": current_metrics.latency_p95_ms,
|
|
"increase_percent": latency_increase * 100,
|
|
"threshold": self.thresholds["latency_increase"] * 100,
|
|
"severity": "HIGH"
|
|
})
|
|
|
|
# Check success rate
|
|
success_drop = self.baseline.success_rate - current_metrics.success_rate
|
|
if success_drop > self.thresholds.get("success_rate_drop", 0.05): # 5pp drop
|
|
violations.append({
|
|
"metric": "success_rate",
|
|
"baseline": self.baseline.success_rate,
|
|
"current": current_metrics.success_rate,
|
|
"drop": success_drop,
|
|
"threshold": self.thresholds["success_rate_drop"],
|
|
"severity": "CRITICAL"
|
|
})
|
|
|
|
should_rollback = len([v for v in violations if v["severity"] == "CRITICAL"]) > 0
|
|
|
|
return {
|
|
"should_rollback": should_rollback,
|
|
"violations": violations,
|
|
"reasoning": self._generate_reasoning(should_rollback, violations)
|
|
}
|
|
|
|
def _generate_reasoning(
|
|
self,
|
|
should_rollback: bool,
|
|
violations: List[Dict]
|
|
) -> str:
|
|
"""
|
|
Generate human-readable reasoning.
|
|
"""
|
|
if not violations:
|
|
return "All metrics within acceptable thresholds. No rollback needed."
|
|
|
|
if should_rollback:
|
|
critical = [v for v in violations if v["severity"] == "CRITICAL"]
|
|
reasons = [f"{v['metric']} violated threshold" for v in critical]
|
|
return f"ROLLBACK RECOMMENDED: {', '.join(reasons)}"
|
|
else:
|
|
return f"Minor issues detected but below rollback threshold. Monitor closely."
|
|
|
|
|
|
class RollbackExecutor:
|
|
"""
|
|
Execute rollback procedure.
|
|
"""
|
|
|
|
def __init__(self, deployment_system: str = "kubernetes"):
|
|
self.deployment_system = deployment_system
|
|
|
|
def rollback(
|
|
self,
|
|
service_name: str,
|
|
previous_version: str,
|
|
preserve_evidence: bool = True
|
|
) -> Dict:
|
|
"""
|
|
Execute rollback to previous version.
|
|
|
|
Args:
|
|
service_name: Service to rollback
|
|
previous_version: Version to rollback to
|
|
preserve_evidence: Capture logs/metrics before rollback
|
|
|
|
Returns:
|
|
Rollback result
|
|
"""
|
|
print(f"Starting rollback: {service_name} -> {previous_version}")
|
|
|
|
# Step 1: Preserve evidence
|
|
if preserve_evidence:
|
|
evidence = self._preserve_evidence(service_name)
|
|
print(f"Evidence preserved: {evidence}")
|
|
|
|
# Step 2: Execute rollback
|
|
if self.deployment_system == "kubernetes":
|
|
result = self._rollback_kubernetes(service_name, previous_version)
|
|
elif self.deployment_system == "docker":
|
|
result = self._rollback_docker(service_name, previous_version)
|
|
else:
|
|
result = {"success": False, "error": "Unknown deployment system"}
|
|
|
|
return result
|
|
|
|
def _preserve_evidence(self, service_name: str) -> Dict:
|
|
"""
|
|
Capture logs and metrics before rollback.
|
|
"""
|
|
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
|
|
|
# Capture logs (last 1000 lines)
|
|
log_file = f"/tmp/{service_name}_rollback_{timestamp}.log"
|
|
|
|
# Simplified: In production, use proper log aggregation
|
|
print(f"Capturing logs to {log_file}")
|
|
|
|
# Capture metrics snapshot
|
|
metrics_file = f"/tmp/{service_name}_metrics_{timestamp}.json"
|
|
print(f"Capturing metrics to {metrics_file}")
|
|
|
|
return {
|
|
"log_file": log_file,
|
|
"metrics_file": metrics_file,
|
|
"timestamp": timestamp
|
|
}
|
|
|
|
def _rollback_kubernetes(
|
|
self,
|
|
service_name: str,
|
|
version: str
|
|
) -> Dict:
|
|
"""
|
|
Rollback Kubernetes deployment.
|
|
"""
|
|
try:
|
|
# Option 1: Rollback to previous revision
|
|
cmd = f"kubectl rollout undo deployment/{service_name}"
|
|
|
|
# Option 2: Rollback to specific version
|
|
# cmd = f"kubectl rollout undo deployment/{service_name} --to-revision={version}"
|
|
|
|
result = subprocess.run(
|
|
cmd.split(),
|
|
capture_output=True,
|
|
text=True,
|
|
check=True
|
|
)
|
|
|
|
# Wait for rollout
|
|
wait_cmd = f"kubectl rollout status deployment/{service_name}"
|
|
subprocess.run(
|
|
wait_cmd.split(),
|
|
check=True,
|
|
timeout=300 # 5 min timeout
|
|
)
|
|
|
|
return {
|
|
"success": True,
|
|
"service": service_name,
|
|
"version": version,
|
|
"output": result.stdout
|
|
}
|
|
|
|
except subprocess.CalledProcessError as e:
|
|
return {
|
|
"success": False,
|
|
"error": str(e),
|
|
"output": e.stderr
|
|
}
|
|
|
|
def _rollback_docker(
|
|
self,
|
|
service_name: str,
|
|
version: str
|
|
) -> Dict:
|
|
"""
|
|
Rollback Docker service.
|
|
"""
|
|
try:
|
|
cmd = f"docker service update --image {service_name}:{version} {service_name}"
|
|
|
|
result = subprocess.run(
|
|
cmd.split(),
|
|
capture_output=True,
|
|
text=True,
|
|
check=True
|
|
)
|
|
|
|
return {
|
|
"success": True,
|
|
"service": service_name,
|
|
"version": version,
|
|
"output": result.stdout
|
|
}
|
|
|
|
except subprocess.CalledProcessError as e:
|
|
return {
|
|
"success": False,
|
|
"error": str(e),
|
|
"output": e.stderr
|
|
}
|
|
|
|
|
|
# Example usage
|
|
baseline = DeploymentMetrics(
|
|
error_rate=0.01,
|
|
latency_p95_ms=200,
|
|
success_rate=0.95,
|
|
throughput_qps=100,
|
|
cpu_usage_percent=50,
|
|
memory_usage_percent=60
|
|
)
|
|
|
|
thresholds = {
|
|
"error_rate": 0.05, # 5% error rate
|
|
"latency_increase": 0.25, # 25% increase
|
|
"success_rate_drop": 0.05 # 5pp drop
|
|
}
|
|
|
|
decider = RollbackDecider(baseline, thresholds)
|
|
|
|
# Simulate bad deployment
|
|
current = DeploymentMetrics(
|
|
error_rate=0.08, # High!
|
|
latency_p95_ms=300, # High!
|
|
success_rate=0.88, # Low!
|
|
throughput_qps=90,
|
|
cpu_usage_percent=70,
|
|
memory_usage_percent=65
|
|
)
|
|
|
|
decision = decider.should_rollback(current)
|
|
print(f"Should rollback: {decision['should_rollback']}")
|
|
print(f"Reasoning: {decision['reasoning']}")
|
|
|
|
if decision['should_rollback']:
|
|
executor = RollbackExecutor(deployment_system="kubernetes")
|
|
result = executor.rollback(
|
|
service_name="ml-api",
|
|
previous_version="v1.2.3",
|
|
preserve_evidence=True
|
|
)
|
|
print(f"Rollback result: {result}")
|
|
```

**Rollback checklist:**

- [ ] Preserve evidence (logs, metrics, traces)
- [ ] Document rollback reason
- [ ] Execute rollback (kubectl/docker/terraform)
- [ ] Verify metrics return to normal (see the sketch below)
- [ ] Notify team and stakeholders
- [ ] Schedule post-mortem
- [ ] Fix issue in development
- [ ] Re-deploy with fix

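The "verify metrics return to normal" step can reuse the `RollbackDecider` and `DeploymentMetrics` classes defined above: poll current metrics until no thresholds are violated or a deadline passes. A minimal sketch; `get_current_metrics()` is a hypothetical hook into your metrics backend:

```python
import time
from typing import Callable


def verify_rollback(
    decider: RollbackDecider,
    get_current_metrics: Callable[[], DeploymentMetrics],  # hypothetical metrics hook
    check_interval_s: int = 30,
    timeout_s: int = 600,
) -> bool:
    """Return True once post-rollback metrics are back within thresholds, False on timeout."""
    deadline = time.time() + timeout_s

    while time.time() < deadline:
        decision = decider.should_rollback(get_current_metrics())
        if not decision["violations"]:
            print("Rollback verified: all metrics within thresholds")
            return True
        print(f"Still degraded: {decision['reasoning']}")
        time.sleep(check_interval_s)

    print("Rollback verification timed out -- escalate")
    return False
```
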
### Part 8: Post-Mortem Process

**Goal:** Learn from incidents to prevent recurrence.

**Post-mortem is blameless:** Focus on systems and processes, not individuals.

```python
|
|
from dataclasses import dataclass
|
|
from typing import List, Dict, Optional
|
|
from datetime import datetime
|
|
|
|
@dataclass
|
|
class IncidentTimeline:
|
|
"""
|
|
Timeline event during incident.
|
|
"""
|
|
timestamp: datetime
|
|
event: str
|
|
actor: str # Person, system, or automation
|
|
action: str
|
|
|
|
|
|
@dataclass
|
|
class ActionItem:
|
|
"""
|
|
Post-mortem action item.
|
|
"""
|
|
description: str
|
|
owner: str
|
|
due_date: datetime
|
|
priority: str # CRITICAL, HIGH, MEDIUM, LOW
|
|
status: str = "TODO" # TODO, IN_PROGRESS, DONE
|
|
|
|
|
|
class PostMortem:
|
|
"""
|
|
Structured post-mortem document.
|
|
"""
|
|
|
|
def __init__(
|
|
self,
|
|
incident_id: str,
|
|
title: str,
|
|
date: datetime,
|
|
severity: str,
|
|
duration_minutes: int
|
|
):
|
|
self.incident_id = incident_id
|
|
self.title = title
|
|
self.date = date
|
|
self.severity = severity
|
|
self.duration_minutes = duration_minutes
|
|
|
|
self.summary: str = ""
|
|
self.impact: Dict = {}
|
|
self.timeline: List[IncidentTimeline] = []
|
|
self.root_cause: str = ""
|
|
self.contributing_factors: List[str] = []
|
|
self.what_went_well: List[str] = []
|
|
self.what_went_wrong: List[str] = []
|
|
self.action_items: List[ActionItem] = []
|
|
|
|
def add_timeline_event(
|
|
self,
|
|
timestamp: datetime,
|
|
event: str,
|
|
actor: str,
|
|
action: str
|
|
):
|
|
"""
|
|
Add event to incident timeline.
|
|
"""
|
|
self.timeline.append(IncidentTimeline(
|
|
timestamp=timestamp,
|
|
event=event,
|
|
actor=actor,
|
|
action=action
|
|
))
|
|
|
|
def set_root_cause(self, root_cause: str):
|
|
"""
|
|
Document root cause.
|
|
"""
|
|
self.root_cause = root_cause
|
|
|
|
def add_contributing_factor(self, factor: str):
|
|
"""
|
|
Add contributing factor (not root cause but made it worse).
|
|
"""
|
|
self.contributing_factors.append(factor)
|
|
|
|
def add_action_item(
|
|
self,
|
|
description: str,
|
|
owner: str,
|
|
due_date: datetime,
|
|
priority: str = "HIGH"
|
|
):
|
|
"""
|
|
Add action item for prevention.
|
|
"""
|
|
self.action_items.append(ActionItem(
|
|
description=description,
|
|
owner=owner,
|
|
due_date=due_date,
|
|
priority=priority
|
|
))
|
|
|
|
def generate_report(self) -> str:
|
|
"""
|
|
Generate post-mortem report.
|
|
"""
|
|
report = f"""
|
|
# Post-Mortem: {self.title}
|
|
|
|
**Incident ID:** {self.incident_id}
|
|
**Date:** {self.date.strftime('%Y-%m-%d %H:%M UTC')}
|
|
**Severity:** {self.severity}
|
|
**Duration:** {self.duration_minutes} minutes
|
|
|
|
## Summary
|
|
|
|
{self.summary}
|
|
|
|
## Impact
|
|
|
|
{self._format_impact()}
|
|
|
|
## Timeline
|
|
|
|
{self._format_timeline()}
|
|
|
|
## Root Cause
|
|
|
|
{self.root_cause}
|
|
|
|
## Contributing Factors
|
|
|
|
{self._format_list(self.contributing_factors)}
|
|
|
|
## What Went Well
|
|
|
|
{self._format_list(self.what_went_well)}
|
|
|
|
## What Went Wrong
|
|
|
|
{self._format_list(self.what_went_wrong)}
|
|
|
|
## Action Items
|
|
|
|
{self._format_action_items()}
|
|
|
|
|
|
**Review:** This post-mortem should be reviewed by the team and approved by engineering leadership.
|
|
|
|
**Follow-up:** Track action items to completion. Schedule follow-up review in 30 days.
|
|
"""
|
|
return report
|
|
|
|
def _format_impact(self) -> str:
|
|
"""Format impact section."""
|
|
lines = []
|
|
for key, value in self.impact.items():
|
|
lines.append(f"- **{key}:** {value}")
|
|
return "\n".join(lines) if lines else "No impact documented."
|
|
|
|
def _format_timeline(self) -> str:
|
|
"""Format timeline section."""
|
|
lines = []
|
|
for event in sorted(self.timeline, key=lambda e: e.timestamp):
|
|
time_str = event.timestamp.strftime('%H:%M:%S')
|
|
lines.append(f"- **{time_str}** [{event.actor}] {event.event} → {event.action}")
|
|
return "\n".join(lines) if lines else "No timeline documented."
|
|
|
|
def _format_list(self, items: List[str]) -> str:
|
|
"""Format list of items."""
|
|
return "\n".join(f"- {item}" for item in items) if items else "None."
|
|
|
|
def _format_action_items(self) -> str:
|
|
"""Format action items."""
|
|
if not self.action_items:
|
|
return "No action items."
|
|
|
|
        lines = []

        # Sort by actual priority order (CRITICAL first), not alphabetically
        priority_rank = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
        for item in sorted(self.action_items, key=lambda x: priority_rank.get(x.priority, len(priority_rank))):
            due = item.due_date.strftime('%Y-%m-%d')
            lines.append(f"- [{item.priority}] {item.description} (Owner: {item.owner}, Due: {due})")
|
|
|
|
return "\n".join(lines)
|
|
|
|
|
|
# Example post-mortem
|
|
from datetime import timedelta
|
|
|
|
pm = PostMortem(
|
|
incident_id="INC-2025-042",
|
|
title="API Latency Spike Causing Timeouts",
|
|
date=datetime(2025, 1, 15, 14, 30),
|
|
severity="HIGH",
|
|
duration_minutes=45
|
|
)
|
|
|
|
pm.summary = """
|
|
At 14:30 UTC, API latency spiked from 200ms to 5000ms, causing widespread timeouts.
|
|
Error rate increased from 0.5% to 15%. Incident was resolved by scaling up database
|
|
connection pool and restarting API servers. No data loss occurred.
|
|
"""
|
|
|
|
pm.impact = {
|
|
"Users affected": "~5,000 users (10% of active users)",
|
|
"Requests failed": "~15,000 requests",
|
|
"Revenue impact": "$2,500 (estimated)",
|
|
"Customer complaints": "23 support tickets"
|
|
}
|
|
|
|
# Timeline
|
|
pm.add_timeline_event(
|
|
datetime(2025, 1, 15, 14, 30),
|
|
"Latency spike detected",
|
|
"Monitoring System",
|
|
"Alert sent to on-call"
|
|
)
|
|
|
|
pm.add_timeline_event(
|
|
datetime(2025, 1, 15, 14, 32),
|
|
"On-call engineer acknowledged",
|
|
"Engineer A",
|
|
"Started investigation"
|
|
)
|
|
|
|
pm.add_timeline_event(
|
|
datetime(2025, 1, 15, 14, 40),
|
|
"Root cause identified: DB connection pool exhausted",
|
|
"Engineer A",
|
|
"Scaled connection pool from 10 to 50"
|
|
)
|
|
|
|
pm.add_timeline_event(
|
|
datetime(2025, 1, 15, 14, 45),
|
|
"Restarted API servers",
|
|
"Engineer A",
|
|
"Latency returned to normal"
|
|
)
|
|
|
|
pm.add_timeline_event(
|
|
datetime(2025, 1, 15, 15, 15),
|
|
"Incident resolved",
|
|
"Engineer A",
|
|
"Monitoring confirmed stability"
|
|
)
|
|
|
|
# Root cause and factors
|
|
pm.set_root_cause(
|
|
"Database connection pool size (10) was too small for peak traffic (100 concurrent requests). "
|
|
"Connection pool exhaustion caused requests to queue, leading to timeouts."
|
|
)
|
|
|
|
pm.add_contributing_factor("No monitoring for connection pool utilization")
|
|
pm.add_contributing_factor("Connection pool size not load tested")
|
|
pm.add_contributing_factor("No auto-scaling for database connections")
|
|
|
|
# What went well/wrong
|
|
pm.what_went_well = [
|
|
"Monitoring detected issue within 2 minutes",
|
|
"On-call responded quickly (2 min to acknowledgment)",
|
|
"Root cause identified in 10 minutes",
|
|
"No data loss or corruption"
|
|
]
|
|
|
|
pm.what_went_wrong = [
|
|
"Connection pool not sized for peak traffic",
|
|
"No monitoring for connection pool metrics",
|
|
"Load testing didn't include database connection limits",
|
|
"Incident affected 10% of users for 45 minutes"
|
|
]
|
|
|
|
# Action items
|
|
pm.add_action_item(
|
|
"Add monitoring and alerting for DB connection pool utilization (alert at 80%)",
|
|
"Engineer B",
|
|
datetime.now() + timedelta(days=3),
|
|
"CRITICAL"
|
|
)
|
|
|
|
pm.add_action_item(
|
|
"Implement auto-scaling for DB connection pool based on traffic",
|
|
"Engineer C",
|
|
datetime.now() + timedelta(days=7),
|
|
"HIGH"
|
|
)
|
|
|
|
pm.add_action_item(
|
|
"Update load testing to include DB connection limits",
|
|
"Engineer A",
|
|
datetime.now() + timedelta(days=7),
|
|
"HIGH"
|
|
)
|
|
|
|
pm.add_action_item(
|
|
"Document connection pool sizing guidelines for future services",
|
|
"Engineer D",
|
|
datetime.now() + timedelta(days=14),
|
|
"MEDIUM"
|
|
)
|
|
|
|
# Generate report
|
|
report = pm.generate_report()
|
|
print(report)
|
|
```

**Post-mortem checklist:**

- [ ] Schedule post-mortem meeting (within 48 hours)
- [ ] Invite all involved parties
- [ ] Document timeline (facts, not speculation)
- [ ] Identify root cause (not symptoms)
- [ ] List contributing factors
- [ ] What went well / what went wrong
- [ ] Create action items (owner, due date, priority)
- [ ] Review and approve report
- [ ] Track action items to completion (see the sketch below)
- [ ] Follow-up review in 30 days

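The "track action items to completion" step can be automated against the `ActionItem` dataclass from the code above: anything past its due date and not done gets surfaced. A small sketch with illustrative items:

```python
from datetime import datetime, timedelta

# Reuses the ActionItem dataclass defined in the code above; items are illustrative
items = [
    ActionItem("Add pool-utilization alerting", "Engineer B", datetime.now() - timedelta(days=2), "CRITICAL"),
    ActionItem("Update load tests", "Engineer A", datetime.now() + timedelta(days=5), "HIGH", status="IN_PROGRESS"),
]

overdue = [i for i in items if i.status != "DONE" and i.due_date < datetime.now()]
for item in overdue:
    print(f"OVERDUE [{item.priority}] {item.description} (owner: {item.owner})")
```
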
**Key principles:**

- **Blameless:** Focus on systems, not people
- **Fact-based:** Use evidence, not opinions
- **Actionable:** Create concrete prevention measures
- **Timely:** Complete within 1 week of incident
- **Shared:** Distribute to entire team

### Part 9: Production Forensics (Traces, Logs, Metrics Correlation)

**Goal:** Correlate traces, logs, and metrics to understand the incident.

```python
|
|
from dataclasses import dataclass
|
|
from typing import List, Dict, Optional
|
|
from datetime import datetime, timedelta
|
|
import json
|
|
|
|
@dataclass
|
|
class Trace:
|
|
"""
|
|
Distributed trace span.
|
|
"""
|
|
trace_id: str
|
|
span_id: str
|
|
parent_span_id: Optional[str]
|
|
service_name: str
|
|
operation_name: str
|
|
start_time: datetime
|
|
duration_ms: float
|
|
status: str # OK, ERROR
|
|
tags: Dict[str, str]
|
|
|
|
|
|
@dataclass
|
|
class LogEntry:
|
|
"""
|
|
Structured log entry.
|
|
"""
|
|
timestamp: datetime
|
|
level: str
|
|
service: str
|
|
message: str
|
|
trace_id: Optional[str]
|
|
metadata: Dict
|
|
|
|
|
|
@dataclass
|
|
class MetricDataPoint:
|
|
"""
|
|
Time-series metric data point.
|
|
"""
|
|
timestamp: datetime
|
|
metric_name: str
|
|
value: float
|
|
tags: Dict[str, str]
|
|
|
|
|
|
class ProductionForensics:
|
|
"""
|
|
Correlate traces, logs, and metrics for incident investigation.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.traces: List[Trace] = []
|
|
self.logs: List[LogEntry] = []
|
|
self.metrics: List[MetricDataPoint] = []
|
|
|
|
def add_trace(self, trace: Trace):
|
|
self.traces.append(trace)
|
|
|
|
def add_log(self, log: LogEntry):
|
|
self.logs.append(log)
|
|
|
|
def add_metric(self, metric: MetricDataPoint):
|
|
self.metrics.append(metric)
|
|
|
|
def investigate_slow_request(
|
|
self,
|
|
trace_id: str
|
|
) -> Dict:
|
|
"""
|
|
Investigate slow request using trace, logs, and metrics.
|
|
|
|
Args:
|
|
trace_id: Trace ID of slow request
|
|
|
|
Returns:
|
|
Investigation results
|
|
"""
|
|
# Get trace spans
|
|
trace_spans = [t for t in self.traces if t.trace_id == trace_id]
|
|
|
|
if not trace_spans:
|
|
return {"error": "Trace not found"}
|
|
|
|
# Sort by start time
|
|
trace_spans.sort(key=lambda s: s.start_time)
|
|
|
|
# Calculate total duration
|
|
total_duration = sum(s.duration_ms for s in trace_spans if not s.parent_span_id)
|
|
|
|
# Find slowest span
|
|
slowest_span = max(trace_spans, key=lambda s: s.duration_ms)
|
|
|
|
# Get logs for this trace
|
|
trace_logs = [l for l in self.logs if l.trace_id == trace_id]
|
|
trace_logs.sort(key=lambda l: l.timestamp)
|
|
|
|
# Check for errors
|
|
error_logs = [l for l in trace_logs if l.level == "ERROR"]
|
|
|
|
# Get metrics during request time
|
|
start_time = trace_spans[0].start_time
|
|
end_time = start_time + timedelta(milliseconds=total_duration)
|
|
|
|
relevant_metrics = [
|
|
m for m in self.metrics
|
|
if start_time <= m.timestamp <= end_time
|
|
]
|
|
|
|
return {
|
|
"trace_id": trace_id,
|
|
"total_duration_ms": total_duration,
|
|
"num_spans": len(trace_spans),
|
|
"slowest_span": {
|
|
"service": slowest_span.service_name,
|
|
"operation": slowest_span.operation_name,
|
|
"duration_ms": slowest_span.duration_ms,
|
|
"percentage": (slowest_span.duration_ms / total_duration * 100) if total_duration > 0 else 0
|
|
},
|
|
"error_count": len(error_logs),
|
|
"errors": [
|
|
{"timestamp": l.timestamp.isoformat(), "message": l.message}
|
|
for l in error_logs
|
|
],
|
|
"trace_breakdown": [
|
|
{
|
|
"service": s.service_name,
|
|
"operation": s.operation_name,
|
|
"duration_ms": s.duration_ms,
|
|
"percentage": (s.duration_ms / total_duration * 100) if total_duration > 0 else 0
|
|
}
|
|
for s in trace_spans
|
|
],
|
|
"metrics_during_request": [
|
|
{
|
|
"metric": m.metric_name,
|
|
"value": m.value,
|
|
"timestamp": m.timestamp.isoformat()
|
|
}
|
|
for m in relevant_metrics
|
|
]
|
|
}
|
|
|
|
def find_correlated_errors(
|
|
self,
|
|
time_window_minutes: int = 10
|
|
) -> List[Dict]:
|
|
"""
|
|
Find errors that occurred around the same time.
|
|
|
|
Args:
|
|
time_window_minutes: Time window for correlation
|
|
|
|
Returns:
|
|
Clusters of correlated errors
|
|
"""
|
|
error_logs = [l for l in self.logs if l.level == "ERROR"]
|
|
error_logs.sort(key=lambda l: l.timestamp)
|
|
|
|
if not error_logs:
|
|
return []
|
|
|
|
# Cluster errors by time
|
|
clusters = []
|
|
current_cluster = [error_logs[0]]
|
|
|
|
for log in error_logs[1:]:
|
|
time_diff = (log.timestamp - current_cluster[-1].timestamp).total_seconds() / 60
|
|
|
|
if time_diff <= time_window_minutes:
|
|
current_cluster.append(log)
|
|
else:
|
|
if len(current_cluster) > 1:
|
|
clusters.append(current_cluster)
|
|
current_cluster = [log]
|
|
|
|
if len(current_cluster) > 1:
|
|
clusters.append(current_cluster)
|
|
|
|
# Analyze each cluster
|
|
results = []
|
|
for cluster in clusters:
|
|
services = set(l.service for l in cluster)
|
|
messages = set(l.message for l in cluster)
|
|
|
|
results.append({
|
|
"start_time": cluster[0].timestamp.isoformat(),
|
|
"end_time": cluster[-1].timestamp.isoformat(),
|
|
"error_count": len(cluster),
|
|
"services_affected": list(services),
|
|
"unique_errors": list(messages)
|
|
})
|
|
|
|
return results
|
|
|
|
def analyze_metric_anomaly(
|
|
self,
|
|
metric_name: str,
|
|
anomaly_time: datetime,
|
|
window_minutes: int = 5
|
|
) -> Dict:
|
|
"""
|
|
Analyze what happened around metric anomaly.
|
|
|
|
Args:
|
|
metric_name: Metric that had anomaly
|
|
anomaly_time: When anomaly occurred
|
|
window_minutes: Time window to analyze
|
|
|
|
Returns:
|
|
Analysis results
|
|
"""
|
|
start_time = anomaly_time - timedelta(minutes=window_minutes)
|
|
end_time = anomaly_time + timedelta(minutes=window_minutes)
|
|
|
|
# Get metric values
|
|
metric_values = [
|
|
m for m in self.metrics
|
|
if m.metric_name == metric_name and start_time <= m.timestamp <= end_time
|
|
]
|
|
|
|
# Get logs during this time
|
|
logs_during = [
|
|
l for l in self.logs
|
|
if start_time <= l.timestamp <= end_time
|
|
]
|
|
|
|
# Get traces during this time
|
|
traces_during = [
|
|
t for t in self.traces
|
|
if start_time <= t.start_time <= end_time
|
|
]
|
|
|
|
# Count errors
|
|
error_count = len([l for l in logs_during if l.level == "ERROR"])
|
|
failed_traces = len([t for t in traces_during if t.status == "ERROR"])
|
|
|
|
return {
|
|
"metric_name": metric_name,
|
|
"anomaly_time": anomaly_time.isoformat(),
|
|
"window_minutes": window_minutes,
|
|
"metric_values": [
|
|
{"timestamp": m.timestamp.isoformat(), "value": m.value}
|
|
for m in metric_values
|
|
],
|
|
"error_count_during_window": error_count,
|
|
"failed_traces_during_window": failed_traces,
|
|
"top_errors": self._get_top_errors(logs_during, limit=5),
|
|
"services_involved": list(set(t.service_name for t in traces_during))
|
|
}
|
|
|
|
def _get_top_errors(self, logs: List[LogEntry], limit: int = 5) -> List[Dict]:
|
|
"""
|
|
Get most common error messages.
|
|
"""
|
|
from collections import Counter
|
|
|
|
error_logs = [l for l in logs if l.level == "ERROR"]
|
|
error_messages = [l.message for l in error_logs]
|
|
|
|
counter = Counter(error_messages)
|
|
|
|
return [
|
|
{"message": msg, "count": count}
|
|
for msg, count in counter.most_common(limit)
|
|
]
|
|
|
|
|
|
# Example usage
|
|
forensics = ProductionForensics()
|
|
|
|
# Simulate data
|
|
trace_id = "trace-123"
|
|
|
|
# Add trace spans
|
|
forensics.add_trace(Trace(
|
|
trace_id=trace_id,
|
|
span_id="span-1",
|
|
parent_span_id=None,
|
|
service_name="api-gateway",
|
|
operation_name="POST /predict",
|
|
start_time=datetime(2025, 1, 15, 14, 30, 0),
|
|
duration_ms=5000,
|
|
status="OK",
|
|
tags={"user_id": "user_123"}
|
|
))
|
|
|
|
forensics.add_trace(Trace(
|
|
trace_id=trace_id,
|
|
span_id="span-2",
|
|
parent_span_id="span-1",
|
|
service_name="ml-service",
|
|
operation_name="model_inference",
|
|
start_time=datetime(2025, 1, 15, 14, 30, 0, 500000),
|
|
duration_ms=4500, # Slow!
|
|
status="OK",
|
|
tags={"model": "sentiment-classifier"}
|
|
))
|
|
|
|
# Add logs
|
|
forensics.add_log(LogEntry(
|
|
timestamp=datetime(2025, 1, 15, 14, 30, 3),
|
|
level="WARNING",
|
|
service="ml-service",
|
|
message="High inference latency detected",
|
|
trace_id=trace_id,
|
|
metadata={"latency_ms": 4500}
|
|
))
|
|
|
|
# Add metrics
|
|
forensics.add_metric(MetricDataPoint(
|
|
timestamp=datetime(2025, 1, 15, 14, 30, 0),
|
|
metric_name="api_latency_ms",
|
|
value=5000,
|
|
tags={"service": "api-gateway"}
|
|
))
|
|
|
|
# Investigate slow request
|
|
investigation = forensics.investigate_slow_request(trace_id)
|
|
print(json.dumps(investigation, indent=2))
|
|
```

**Forensics checklist:**

- [ ] Identify affected time window
- [ ] Collect traces for failed/slow requests
- [ ] Collect logs with matching trace IDs
- [ ] Collect metrics during time window
- [ ] Correlate traces + logs + metrics
- [ ] Identify slowest operations (trace breakdown)
- [ ] Find error patterns (log analysis)
- [ ] Check metric anomalies (spikes/drops)
- [ ] Build timeline of events (see the sketch below)

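The "build timeline of events" step can be done by merging the three data sources into one chronologically sorted stream. A minimal sketch against the `ProductionForensics` store defined above (it only uses fields that the `Trace`, `LogEntry`, and `MetricDataPoint` dataclasses already have):

```python
from datetime import datetime
from typing import List, Tuple


def build_timeline(store: ProductionForensics) -> List[Tuple[datetime, str]]:
    """Merge traces, logs, and metrics into one chronological event list."""
    events: List[Tuple[datetime, str]] = []

    for t in store.traces:
        events.append((t.start_time, f"TRACE  {t.service_name}.{t.operation_name} ({t.duration_ms:.0f}ms, {t.status})"))
    for log in store.logs:
        events.append((log.timestamp, f"LOG    [{log.level}] {log.service}: {log.message}"))
    for m in store.metrics:
        events.append((m.timestamp, f"METRIC {m.metric_name}={m.value}"))

    return sorted(events, key=lambda event: event[0])


# Usage with the forensics object populated in the example above
for ts, description in build_timeline(forensics):
    print(ts.isoformat(), description)
```
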
## Summary

**Production debugging is systematic investigation, not random guessing.**

**Core methodology:**

1. **Reproduce** → Create minimal, deterministic reproduction
2. **Profile** → Use data, not intuition (py-spy, torch.profiler)
3. **Diagnose** → Find root cause, not symptoms
4. **Fix** → Targeted fix verified by tests
5. **Verify** → Prove fix works in production
6. **Document** → Post-mortem for prevention

**Key principles:**

- **Evidence-based:** Collect data before forming hypothesis
- **Systematic:** Follow debugging framework, don't skip steps
- **Root cause:** Fix the cause, not symptoms
- **Verification:** Prove fix works before closing
- **Prevention:** Add monitoring, tests, and documentation

**Production debugging toolkit:**

- Performance profiling: py-spy, torch.profiler, cProfile
- Error analysis: Categorize, find patterns, identify root cause
- A/B test debugging: Statistical significance, Simpson's paradox
- Model debugging: Edge cases, input variations, robustness
- Logging: Structured, with trace IDs and context
- Rollback: Preserve evidence, rollback quickly, fix properly
- Post-mortems: Blameless, actionable, prevent recurrence
- Forensics: Correlate traces, logs, metrics

**Common pitfalls to avoid:**

- Random changes without reproduction
- Guessing bottlenecks without profiling
- Bad logging (no context, unstructured)
- Panic rollback without learning
- Skipping post-mortems

Without systematic debugging, you fight the same fires repeatedly. With systematic debugging, you prevent fires from starting.

## REFACTOR Phase: Pressure Tests

### Pressure Test 1: Random Changes Without Investigation

**Scenario:** Model latency spiked from 100ms to 500ms. Engineer makes random changes hoping to fix it.

**Test:** Verify skill prevents random changes and enforces systematic investigation.

**Expected behavior:**

- ✅ Refuse to make changes without reproduction
- ✅ Require profiling data before optimization
- ✅ Collect evidence (metrics, logs, traces)
- ✅ Form hypothesis based on data
- ✅ Verify hypothesis before implementing fix

**Failure mode:** Makes parameter changes without profiling or understanding root cause.

### Pressure Test 2: No Profiling Before Optimization

**Scenario:** API is slow. Engineer says "Database is probably the bottleneck, let's add caching."

**Test:** Verify skill requires profiling data before optimization.

**Expected behavior:**

- ✅ Demand profiling data (py-spy flamegraph, query profiler)
- ✅ Identify actual bottleneck from profile
- ✅ Verify bottleneck hypothesis
- ✅ Optimize proven bottleneck, not guessed one

**Failure mode:** Optimizes based on intuition without profiling data.

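For Pressure Test 2, "demand profiling data" can be as cheap as wrapping the suspect code path in the standard library's `cProfile` before proposing any fix. A minimal sketch; `handle_request` and its fake database call are stand-ins for whatever endpoint is actually slow:

```python
import cProfile
import pstats
import time


def query_database() -> list:
    time.sleep(0.01)           # stand-in for the suspected-slow DB call
    return [1, 2, 3]


def handle_request() -> list:  # hypothetical slow endpoint
    rows = query_database()
    return [row * 2 for row in rows]


profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    handle_request()
profiler.disable()

# Top 10 functions by cumulative time: optimize what this shows, not what you guess
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```
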
### Pressure Test 3: Useless Logging

**Scenario:** Production error occurred but logs don't have enough context to debug.

**Test:** Verify skill enforces structured logging with context.

**Expected behavior:**

- ✅ Use structured logging (JSON format)
- ✅ Include trace/request IDs for correlation
- ✅ Log sufficient context (user_id, endpoint, input summary)
- ✅ Don't log sensitive data (passwords, PII)

**Failure mode:** Logs "Error occurred" with no context, making debugging impossible.

### Pressure Test 4: Immediate Rollback Without Evidence

**Scenario:** Error rate increased to 2%. Engineer wants to rollback immediately.

**Test:** Verify skill preserves evidence before rollback.

**Expected behavior:**

- ✅ Assess severity (2% error rate = investigate, not immediate rollback)
- ✅ Preserve evidence (logs, metrics, traces)
- ✅ Investigate root cause while monitoring
- ✅ Only rollback if critical threshold (>5% errors, cascading failures)

**Failure mode:** Rolls back immediately without preserving evidence or assessing severity.

### Pressure Test 5: No Root Cause Analysis

**Scenario:** API returns 500 errors. Engineer fixes symptom (restart service) but not root cause.

**Test:** Verify skill identifies and fixes root cause.

**Expected behavior:**

- ✅ Distinguish symptom ("500 errors") from root cause ("connection pool exhausted")
- ✅ Investigate why symptom occurred
- ✅ Fix root cause (increase pool size, add monitoring)
- ✅ Verify fix addresses root cause

**Failure mode:** Fixes symptom (restart) but root cause remains, issue repeats.

### Pressure Test 6: A/B Test Without Statistical Significance

**Scenario:** A/B test with 50 samples per variant shows 5% improvement. Engineer wants to ship.

**Test:** Verify skill requires statistical significance.

**Expected behavior:**

- ✅ Calculate required sample size (power analysis)
- ✅ Check statistical significance (p-value < 0.05)
- ✅ Reject if insufficient samples or not significant
- ✅ Check for Simpson's Paradox (segment analysis)

**Failure mode:** Ships based on insufficient data or non-significant results.

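For Pressure Test 6, the significance check can be made concrete with a two-proportion z-test, hand-rolled here so no external stats library is assumed. The conversion counts are illustrative and mirror the scenario's 50-samples-per-variant setup:

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))


# 50 samples per variant, 40% vs 44% conversion: nowhere near significant
p_value = two_proportion_z_test(conv_a=20, n_a=50, conv_b=22, n_b=50)
print(f"p-value: {p_value:.2f}")  # ~0.69 -> do not ship on this evidence
```
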
### Pressure Test 7: Model Edge Case Ignored

**Scenario:** Model fails on all-caps input but works on normal case. Engineer ignores edge case.

**Test:** Verify skill investigates and handles edge cases.

**Expected behavior:**

- ✅ Collect edge case examples
- ✅ Categorize edge cases (all caps, special chars, long inputs)
- ✅ Add input validation or preprocessing
- ✅ Add edge cases to test suite

**Failure mode:** Ignores edge cases as "not important" without investigation.

### Pressure Test 8: Skip Post-Mortem

**Scenario:** Incident resolved. Engineer closes ticket and moves on without post-mortem.

**Test:** Verify skill enforces post-mortem process.

**Expected behavior:**

- ✅ Require post-mortem for all incidents (severity HIGH or above)
- ✅ Document timeline, root cause, action items
- ✅ Make post-mortem blameless (systems, not people)
- ✅ Track action items to completion

**Failure mode:** Skips post-mortem, incident repeats, no learning.

### Pressure Test 9: No Metrics Correlation

**Scenario:** Latency spike at 2pm. Engineer looks at logs but not metrics or traces.

**Test:** Verify skill correlates traces, logs, and metrics.

**Expected behavior:**

- ✅ Collect traces for affected requests
- ✅ Collect logs with matching trace IDs
- ✅ Collect metrics during time window
- ✅ Correlate all three to find root cause

**Failure mode:** Only looks at logs, misses critical information in traces/metrics.

### Pressure Test 10: High Confidence Wrong Predictions Ignored

**Scenario:** Model makes high-confidence (>95%) wrong predictions. Engineer says "accuracy is good overall."

**Test:** Verify skill investigates high-confidence errors.

**Expected behavior:**

- ✅ Separate high-confidence errors from low-confidence
- ✅ Analyze input characteristics causing high-confidence errors
- ✅ Test input variations for robustness
- ✅ Add error cases to training data or add validation

**Failure mode:** Ignores high-confidence errors because "overall accuracy is fine."
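
For Pressure Test 10, the first expected behavior is mechanical: split the errors by confidence before debating overall accuracy. A minimal sketch; the prediction-record fields and the 0.95 threshold are illustrative assumptions:

```python
from typing import Dict, List

# Illustrative prediction records (label, prediction, confidence)
predictions: List[Dict] = [
    {"input": "GREAT PRODUCT!!!", "label": "positive", "pred": "negative", "confidence": 0.97},
    {"input": "it was ok",        "label": "neutral",  "pred": "positive", "confidence": 0.55},
    {"input": "terrible",         "label": "negative", "pred": "negative", "confidence": 0.99},
]

errors = [p for p in predictions if p["pred"] != p["label"]]
high_conf_errors = [p for p in errors if p["confidence"] >= 0.95]
low_conf_errors = [p for p in errors if p["confidence"] < 0.95]

print(f"High-confidence errors: {len(high_conf_errors)}")  # investigate these first
print(f"Low-confidence errors:  {len(low_conf_errors)}")

# High-confidence errors are the dangerous ones: the model is sure and wrong,
# so downstream systems won't flag them. Inspect their inputs for shared traits.
for p in high_conf_errors:
    print(p["input"], "->", p["pred"], f"({p['confidence']:.2f})")
```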