# Deployment Strategies for AI Models ## When to Use This Skill Use this skill when: - Deploying new AI models to production - Comparing model versions in real traffic - Gradually rolling out model updates - Testing models without user impact (shadow mode) - Building automated rollback procedures - Validating model improvements with statistical rigor - Managing feature flags for model control **When NOT to use:** Development environments or single-user testing where gradual rollout isn't needed. ## Core Principle **Instant deployment breaks production. Gradual deployment with validation saves production.** Without safe deployment: - Instant 100% deployment: One bad model breaks all users - No A/B testing: Can't prove new model is better - Canary without metrics: Deploy blindly, detect issues after damage - Shadow mode forever: Never promote, wasted computation - No rollback plan: Scramble to fix when things break **Formula:** Shadow mode (validate without impact) → Canary 5% (detect issues early) → A/B test 50/50 (statistical proof) → Full rollout 100% (confident deployment) + Automated rollback (safety net) = Production-ready deployment. ## Deployment Framework ``` ┌─────────────────────────────────────────┐ │ 1. Shadow Mode (0% user impact) │ │ Run new model alongside old, compare │ │ Don't serve to users, collect metrics │ └──────────────┬──────────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ 2. Canary (5% → 25%) │ │ Serve to small % of traffic │ │ Monitor error rates, latency │ │ Auto-rollback if metrics degrade │ └──────────────┬──────────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ 3. A/B Test (50% / 50%) │ │ Split traffic evenly │ │ Measure statistical significance │ │ Compare quality, latency, cost │ └──────────────┬──────────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ 4. Full Rollout (100%) │ │ Promote winner to all traffic │ │ Keep old model warm for rollback │ └──────────────┬──────────────────────────┘ │ ▼ ┌─────────────────────────────────────────┐ │ 5. Monitor & Rollback │ │ Track metrics continuously │ │ Automatic rollback on regression │ └─────────────────────────────────────────┘ ``` ## RED Phase: What Goes Wrong (600-800 lines) ### Failure 1: Instant 100% Deployment **Context:** Deploy new model directly to 100% of traffic, no gradual rollout. **What Happens:** ```python # BAD: Instant deployment to all users def deploy_model_instant(): """Deploy new model to 100% traffic immediately.""" config.MODEL_VERSION = "v2.0" # Everyone gets new model NOW restart_service() print("Model v2.0 deployed to 100% traffic!") ``` **User does this:** ``` User: "Deploy the new sentiment model v2.0 to production." Agent: "Sure! I'll update the config and restart: def deploy_new_model(): update_config('MODEL_VERSION', 'v2.0') restart_service() The new model is now live for all users!" ``` **Why This Fails:** 1. **No validation:** What if v2.0 has a bug? 2. **All users affected:** One bad model breaks everyone 3. **No comparison:** Can't measure if v2.0 is better than v1.0 4. **Hard to rollback:** Need emergency deployment to revert **Real Production Impact:** ``` 09:00 - Deploy v2.0 to 100% 09:05 - Error rate spikes from 0.1% → 15% (150× worse!) 09:10 - Customer support flooded with complaints 09:15 - Engineers scramble to diagnose 09:30 - Emergency rollback deployed 09:45 - Service stabilizes, but 45 minutes of downtime Cost: 45 min downtime, 10,000 failed requests, angry customers ``` **What Agent Should Say:** "STOP! 
Never deploy instantly to 100%. Use gradual rollout:

1. **Shadow mode first** (0% user impact): Run v2.0 alongside v1.0, compare outputs
2. **Canary 5%**: Serve v2.0 to 5% of users, monitor metrics
3. **Expand to 25%** if metrics stay healthy
4. **A/B test 50/50** to measure improvement
5. **Full rollout 100%** only after statistical validation

**Implementation:**

```python
# GOOD: Gradual deployment with validation
deployment_plan = {
    'shadow':    {'duration': '24h', 'traffic': 0},
    'canary_5':  {'duration': '2h',  'traffic': 5},
    'canary_25': {'duration': '6h',  'traffic': 25},
    'ab_test':   {'duration': '48h', 'traffic': 50},
    'full':      {'traffic': 100}
}

def deploy_gradual(model_version, stage):
    set_traffic_split(old=100 - stage['traffic'], new=stage['traffic'])
    monitor_metrics(duration=stage['duration'])
    if metrics_acceptable():
        proceed_to_next_stage()
    else:
        rollback()
```

**Principle: Deploy gradually. Shadow → Canary → A/B → Full. Validate at each stage.**"

### Failure 2: A/B Test Without Statistical Validation

**Context:** Run an A/B test but never measure statistical significance; make the decision on a small sample.

**What Happens:**

```python
# BAD: No statistical testing
def compare_models():
    """Compare models without statistical rigor."""
    model_a_accuracy = 0.86
    model_b_accuracy = 0.88

    if model_b_accuracy > model_a_accuracy:
        print("Model B is better! Deploy it.")
        deploy(model_b)
```

**User does this:**

```
User: "Run A/B test for 1 hour. Model A got 86% accuracy, Model B got 88%. Deploy B."

Agent: "Model B has higher accuracy (88% vs 86%), so it's better. Deploying Model B:

def deploy_winner():
    if model_b_score > model_a_score:
        deploy(model_b)

Model B is now live!"
```

**Why This Fails:**

1. **No sample size calculation:** 1 hour might be 100 requests (too small)
2. **No confidence interval:** Is 88% vs 86% statistically significant?
3. **No variance consideration:** Model B might be 88% ± 5% (could be 83-93%)
4. **Could be random noise:** A 2% difference might be luck

**Real Production Impact:**

```
A/B test for 1 hour:
- Model A: 86% on 50 samples (43 correct, 7 wrong)
- Model B: 88% on 50 samples (44 correct, 6 wrong)
- Difference: 1 more correct prediction

Deploy Model B to 100%

After 1 week:
- Model B: 85% on 10,000 samples (actually worse!)
- Model A was better, but the sample size was too small to tell

Cost: Deployed worse model, degraded user experience, need to rollback
```

**Statistical Reality:**

```python
# Calculate statistical significance
from statsmodels.stats.proportion import proportions_ztest

# Model A: 43/50 = 86%, Model B: 44/50 = 88%
stat, p_value = proportions_ztest([43, 44], [50, 50])
print(f"p-value: {p_value:.3f}")  # p=0.766 (NOT significant!)

# Need p < 0.05 for significance
# With only 50 samples each, the difference is random noise
```

**What Agent Should Say:** "STOP! You can't conclude Model B is better from 100 samples.

**Problem 1: No sample size calculation**

```python
# Required sample size for 80% power, 5% significance, 2% absolute improvement
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.88, 0.86)  # Cohen's h for 86% vs 88%
required_n = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 5% significance level
    power=0.8,    # 80% power
)
print(f"Need {required_n:.0f} samples per model")  # ~4,400 samples!
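# Added illustration (assumed 86% baseline; the lift values are arbitrary):
# the required sample size grows rapidly as the minimum detectable lift shrinks.
for lift in (0.01, 0.02, 0.05):
    h = proportion_effectsize(0.86 + lift, 0.86)
    n = zt_ind_solve_power(effect_size=h, alpha=0.05, power=0.8)
    print(f"Detecting +{lift:.0%} needs ~{n:,.0f} samples per variant")
# Roughly 18,000 / 4,400 / 600 per variant - small lifts are expensive to verify.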
# You only have 50 samples per model - you need roughly 90× more data
```

**Problem 2: No significance testing**

```python
# Proper statistical test
def ab_test_significance(a_correct, a_total, b_correct, b_total):
    from statsmodels.stats.proportion import proportions_ztest
    stat, p_value = proportions_ztest(
        [a_correct, b_correct],
        [a_total, b_total]
    )

    if p_value < 0.05:
        return "Statistically significant"
    else:
        return f"NOT significant (p={p_value:.3f}). Need more data."

result = ab_test_significance(43, 50, 44, 50)
print(result)  # "NOT significant (p=0.766). Need more data."
```

**Problem 3: No confidence intervals**

```python
# Calculate confidence intervals
from statsmodels.stats.proportion import proportion_confint

ci_a = proportion_confint(43, 50, alpha=0.05, method='wilson')
ci_b = proportion_confint(44, 50, alpha=0.05, method='wilson')

print(f"Model A: 86% CI=[{ci_a[0]:.1%}, {ci_a[1]:.1%}]")  # ~[73.8%, 93.0%]
print(f"Model B: 88% CI=[{ci_b[0]:.1%}, {ci_b[1]:.1%}]")  # ~[76.2%, 94.4%]

# Confidence intervals overlap heavily! Can't distinguish the models.
```

**Fix: Proper A/B testing**

```python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

def run_ab_test_properly(model_a, model_b, min_samples=3000):
    # Calculate required sample size
    # (calculate_sample_size and route_request are infrastructure helpers)
    required_n = calculate_sample_size(
        effect_size=0.02,  # Minimum detectable difference
        alpha=0.05,        # Significance level
        power=0.8          # Statistical power
    )

    # Collect data
    results_a = []
    results_b = []

    while len(results_a) < max(required_n, min_samples):
        # 50/50 traffic split
        result = route_request(a_weight=0.5, b_weight=0.5)
        results_a.append(result['a'])
        results_b.append(result['b'])

    # Statistical test
    stat, p_value = proportions_ztest(
        [sum(results_a), sum(results_b)],
        [len(results_a), len(results_b)]
    )

    # Calculate effect size
    effect_size = (sum(results_b) / len(results_b) -
                   sum(results_a) / len(results_a))

    # Confidence intervals
    ci_a = proportion_confint(sum(results_a), len(results_a))
    ci_b = proportion_confint(sum(results_b), len(results_b))

    # Decision
    if p_value < 0.05 and effect_size > 0.01:
        # Significant AND meaningful
        return {
            'winner': 'Model B',
            'p_value': p_value,
            'effect_size': effect_size,
            'confidence': 'HIGH',
            'recommendation': 'Deploy Model B'
        }
    else:
        return {
            'winner': 'INCONCLUSIVE',
            'p_value': p_value,
            'effect_size': effect_size,
            'confidence': 'LOW',
            'recommendation': 'Keep Model A or collect more data'
        }
```

**Principle: A/B tests need statistical rigor. Calculate sample size, test significance, check confidence intervals. Don't trust small samples.**"

### Failure 3: Canary Deployment Without Metrics

**Context:** Deploy to 5% of users but don't monitor metrics; just "see if anyone complains."

**What Happens:**

```python
# BAD: Canary without monitoring
def canary_deploy():
    """Deploy to 5% with no monitoring."""
    set_traffic_split(old=95, new=5)
    print("Deployed to 5% of users. Let's wait and see...")

    time.sleep(3600)  # Wait 1 hour, no monitoring

    print("No complaints? Deploy to 100%!")
    set_traffic_split(old=0, new=100)
```

**User does this:**

```
User: "Start canary deployment at 5% for the new model."

Agent: "Done! I've set 5% of traffic to the new model:

def canary():
    route_traffic(old=95, new=5)
    wait(hours=2)
    if no_complaints():
        deploy_full()

The canary is running. I'll check in 2 hours."
```

**Why This Fails:**

1. **No automated metrics:** Relying on users to complain
2. **Users don't always complain:** They silently leave
3. **5% = small sample:** Might miss rare errors
4. **No rollback trigger:** Just "wait and see"
5.
**Silent failures:** API errors, slow responses unnoticed **Real Production Impact:** ``` 10:00 - Canary deploy to 5% (5,000 users/hour) 10:00-12:00 - No complaints received 12:00 - Deploy to 100% based on "no news is good news" 12:30 - Users report slow responses (2s → 8s latency) 13:00 - Error rate 10× higher (0.1% → 1%) 13:30 - Emergency rollback Post-mortem: - Canary phase HAD issues (8s latency for 5% of users) - Users didn't complain (they just refreshed) - No automated monitoring caught the problem - Full deployment amplified issue to 100% of users Cost: 1.5 hours of degraded service, lost user trust ``` **What metrics should have caught it:** ```python # Canary metrics that would have caught the issue canary_metrics = { 'latency_p50': 8.2, # vs 2.0s baseline (4× WORSE) ← RED FLAG 'latency_p95': 15.3, # vs 4.0s baseline (4× WORSE) ← RED FLAG 'error_rate': 1.2, # vs 0.1% baseline (12× WORSE) ← RED FLAG 'timeout_rate': 2.1, # vs 0.5% baseline (4× WORSE) ← RED FLAG } # Automated check would have rolled back immediately ``` **What Agent Should Say:** "STOP! Never run canary without automated metrics monitoring. **Problem: No observability** ``` Current: Deploy 5%, wait, 'no complaints' = success Reality: Users don't complain, they leave ``` **Fix: Canary with automated monitoring** ```python def canary_deploy_with_monitoring( new_model, canary_percent=5, duration_minutes=120, check_interval_minutes=5 ): \"\"\" Canary deployment with automated metrics monitoring. Monitors: - Latency (p50, p95, p99) - Error rate - Timeout rate - User satisfaction (if available) Auto-rollback if any metric degrades beyond threshold. \"\"\" # Baseline metrics from old model baseline = get_baseline_metrics(hours=24) # Start canary set_traffic_split(old=100-canary_percent, new=canary_percent) print(f"Canary started: {canary_percent}% traffic to new model") # Monitor for duration for elapsed in range(0, duration_minutes, check_interval_minutes): # Get canary metrics canary_metrics = get_canary_metrics(minutes=check_interval_minutes) # Compare to baseline checks = { 'latency_p50': canary_metrics['latency_p50'] < baseline['latency_p50'] * 1.2, # Allow 20% increase 'latency_p95': canary_metrics['latency_p95'] < baseline['latency_p95'] * 1.5, # Allow 50% increase 'error_rate': canary_metrics['error_rate'] < baseline['error_rate'] * 2.0, # Allow 2× increase 'timeout_rate': canary_metrics['timeout_rate'] < baseline['timeout_rate'] * 2.0, # Allow 2× increase } # Check for failures failed_checks = [k for k, v in checks.items() if not v] if failed_checks: print(f"ROLLBACK TRIGGERED! Failed checks: {failed_checks}") print(f"Baseline: {baseline}") print(f"Canary: {canary_metrics}") # Automatic rollback set_traffic_split(old=100, new=0) alert_team( message=f"Canary rollback: {failed_checks}", metrics={'baseline': baseline, 'canary': canary_metrics} ) return {'status': 'ROLLED_BACK', 'reason': failed_checks} print(f"Canary healthy at {elapsed + check_interval_minutes} min: {canary_metrics}") time.sleep(check_interval_minutes * 60) # Canary succeeded print("Canary succeeded! 
Ready to expand.") return {'status': 'SUCCESS', 'metrics': canary_metrics} # Usage result = canary_deploy_with_monitoring( new_model='v2.0', canary_percent=5, duration_minutes=120 ) if result['status'] == 'SUCCESS': # Expand to 25% canary_deploy_with_monitoring(new_model='v2.0', canary_percent=25) ``` **Key metrics to monitor:** | Metric | Threshold | Why | |--------|-----------|-----| | Latency p50 | < 1.2× baseline | Most users shouldn't notice slowdown | | Latency p95 | < 1.5× baseline | Tail latency matters for UX | | Error rate | < 2× baseline | Small increase acceptable, large = bug | | Timeout rate | < 2× baseline | Timeouts = bad UX | | CPU usage | < 1.5× baseline | Cost and capacity concern | | Memory usage | < 1.5× baseline | Memory leaks show up here | **Principle: Canary without metrics is hope, not engineering. Monitor automatically, rollback automatically.**" ### Failure 4: Shadow Mode Without Comparison **Context:** Run new model in shadow mode but never compare outputs to old model. **What Happens:** ```python # BAD: Shadow mode with no comparison def shadow_mode(): """Run new model in background, don't compare.""" old_result = model_v1(input) new_result = model_v2(input) # Compute but ignore return old_result # Always return old model # New model runs forever in shadow, never evaluated ``` **User does this:** ``` User: "Run the new model in shadow mode alongside the current model." Agent: "I'll set up shadow mode: def shadow_deploy(): # Serve old model to users result = old_model.predict(input) # Run new model in background (don't serve) new_model.predict(input) # Async, don't wait return result Shadow mode active! The new model is running but not served to users." ``` **Why This Fails:** 1. **No comparison:** Running two models but not comparing outputs 2. **Wasted compute:** New model runs but insights ignored 3. **No decision criteria:** When to promote new model? 4. **Stays in shadow forever:** No graduation path **Real Production Impact:** ``` Week 1: Shadow mode running, new model processes 1M requests Week 2: Still in shadow mode, 2M requests total Week 3: Engineer asks: "Should we promote the new model?" Answer: "No idea. We didn't compare outputs." Cost: 3 weeks of wasted compute (running but not learning) ``` **What Agent Should Say:** "Shadow mode without comparison is wasted compute. You need to COMPARE outputs and decide promotion criteria. **Fix: Shadow mode with comparison and decision criteria** ```python import numpy as np from typing import Dict, List from dataclasses import dataclass @dataclass class ShadowResult: input: str old_output: str new_output: str old_latency: float new_latency: float match: bool timestamp: float class ShadowModeComparison: \"\"\" Run new model in shadow, compare to old model, decide when to promote. \"\"\" def __init__(self, old_model, new_model, sample_rate=1.0): self.old_model = old_model self.new_model = new_model self.sample_rate = sample_rate self.results: List[ShadowResult] = [] def predict_with_shadow(self, input: str) -> str: \"\"\" Predict with old model, run new model in shadow for comparison. 
\"\"\" import time # Old model (served to users) start = time.time() old_output = self.old_model.predict(input) old_latency = time.time() - start # New model (shadow, not served) if np.random.random() < self.sample_rate: start = time.time() new_output = self.new_model.predict(input) new_latency = time.time() - start # Compare outputs match = self._compare_outputs(old_output, new_output) # Store for analysis self.results.append(ShadowResult( input=input, old_output=old_output, new_output=new_output, old_latency=old_latency, new_latency=new_latency, match=match, timestamp=time.time() )) return old_output # Always serve old model def _compare_outputs(self, old: str, new: str) -> bool: \"\"\"Compare outputs (exact match or semantic similarity).\"\"\" # For classification: exact match if old in ['positive', 'negative', 'neutral']: return old == new # For text generation: semantic similarity from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') old_emb = model.encode(old) new_emb = model.encode(new) similarity = np.dot(old_emb, new_emb) / ( np.linalg.norm(old_emb) * np.linalg.norm(new_emb) ) return similarity > 0.9 # 90% similar = match def get_analysis(self) -> Dict: \"\"\" Analyze shadow mode results and recommend promotion. \"\"\" if len(self.results) < 100: return { 'status': 'INSUFFICIENT_DATA', 'message': f'Only {len(self.results)} samples. Need 100+ for decision.', 'recommendation': 'Continue shadow mode' } # Calculate metrics agreement_rate = np.mean([r.match for r in self.results]) old_latency_p50 = np.median([r.old_latency for r in self.results]) new_latency_p50 = np.median([r.new_latency for r in self.results]) old_latency_p95 = np.percentile([r.old_latency for r in self.results], 95) new_latency_p95 = np.percentile([r.new_latency for r in self.results], 95) # Decision criteria latency_acceptable = new_latency_p95 < old_latency_p95 * 1.5 # Max 50% slower agreement_acceptable = agreement_rate > 0.85 # 85% agreement # Recommendation if latency_acceptable and agreement_acceptable: recommendation = 'PROMOTE_TO_CANARY' message = ( f'Shadow mode successful! ' f'Agreement: {agreement_rate:.1%}, ' f'Latency p95: {new_latency_p95:.3f}s vs {old_latency_p95:.3f}s' ) elif not latency_acceptable: recommendation = 'OPTIMIZE_LATENCY' message = ( f'New model too slow: ' f'{new_latency_p95:.3f}s vs {old_latency_p95:.3f}s (>{1.5:.1f}× threshold)' ) else: # not agreement_acceptable recommendation = 'INVESTIGATE_DISAGREEMENT' message = ( f'Low agreement: {agreement_rate:.1%}. ' f'Review disagreement cases before promoting.' ) return { 'status': 'ANALYSIS_COMPLETE', 'samples': len(self.results), 'agreement_rate': agreement_rate, 'old_latency_p50': old_latency_p50, 'new_latency_p50': new_latency_p50, 'old_latency_p95': old_latency_p95, 'new_latency_p95': new_latency_p95, 'recommendation': recommendation, 'message': message } # Usage shadow = ShadowModeComparison(old_model, new_model, sample_rate=0.1) # Run for 1 week for request in production_traffic: result = shadow.predict_with_shadow(request) serve_to_user(result) # Analyze after 1 week analysis = shadow.get_analysis() print(analysis) # Example output: { 'status': 'ANALYSIS_COMPLETE', 'samples': 10000, 'agreement_rate': 0.92, # 92% agree 'old_latency_p50': 0.15, 'new_latency_p50': 0.18, # 20% slower (acceptable) 'old_latency_p95': 0.45, 'new_latency_p95': 0.58, # 29% slower (acceptable < 50%) 'recommendation': 'PROMOTE_TO_CANARY', 'message': 'Shadow mode successful! 
Agreement: 92%, Latency p95: 0.58s vs 0.45s' } ``` **Principle: Shadow mode is for comparison, not just running. Compare outputs, latency, errors. Decide promotion criteria upfront.**" ### Failure 5: No Rollback Plan **Context:** Deploy new model but have no rollback procedure when things go wrong. **What Happens:** ```python # BAD: No rollback plan def deploy(): """Deploy with no rollback.""" config.MODEL = 'v2.0' restart_service() # Hope nothing breaks! ``` **User does this:** ``` User: "Deploy model v2.0." Agent: "Deployed! def deploy(): set_model('v2.0') restart() Model v2.0 is live." [30 minutes later] User: "Error rate is spiking! Roll back to v1.0!" Agent: "Um... let me see... I need to find the v1.0 model file... where is it? And I need to update config... and restart... this will take a few minutes..." [10 minutes of scrambling] Agent: "OK, rolled back. That took 40 minutes of downtime total." ``` **Why This Fails:** 1. **No automated rollback:** Manual process takes time 2. **Old model not warm:** Need to load and warm up 3. **Config changes manual:** Error-prone under pressure 4. **No rollback testing:** Don't know if rollback works until emergency 5. **Delayed response:** 40 minutes vs 30 seconds with automation **Real Production Impact:** ``` 10:00 - Deploy v2.0 to 100% 10:30 - Error rate spikes, need rollback 10:30 - Begin manual rollback 10:32 - Find v1.0 model artifact (2 min) 10:35 - Update config (3 min) 10:38 - Restart service (3 min) 10:40 - Wait for healthcheck (2 min) 10:40 - Service restored (10 minutes downtime) Better with automation: 10:30 - Error rate spikes, trigger automatic rollback 10:30:30 - Rollback complete (30 seconds) ``` **What Agent Should Say:** "NEVER deploy without automated rollback. Build rollback first, then deploy. **Fix: Automated rollback with blue-green deployment** ```python from enum import Enum from dataclasses import dataclass from typing import Optional import time class ModelVersion(Enum): BLUE = "blue" GREEN = "green" @dataclass class DeploymentState: active_version: ModelVersion blue_model: str green_model: str blue_weight: int green_weight: int class BlueGreenDeployment: \"\"\" Blue-green deployment with instant rollback. Strategy: - Blue = current production model - Green = new model being deployed - Traffic splits between blue and green - Rollback = instant traffic shift to blue \"\"\" def __init__(self, blue_model: str): self.state = DeploymentState( active_version=ModelVersion.BLUE, blue_model=blue_model, green_model=None, blue_weight=100, green_weight=0 ) # Keep both models warm self.models = { ModelVersion.BLUE: load_model(blue_model), ModelVersion.GREEN: None } def deploy_green(self, green_model: str): \"\"\"Deploy new model to green slot.\"\"\" print(f"Loading green model: {green_model}") self.models[ModelVersion.GREEN] = load_model(green_model) self.state.green_model = green_model print("Green model loaded and warm") def shift_traffic(self, blue_weight: int, green_weight: int): \"\"\"Shift traffic between blue and green.\"\"\" if blue_weight + green_weight != 100: raise ValueError("Weights must sum to 100") self.state.blue_weight = blue_weight self.state.green_weight = green_weight # Update load balancer update_load_balancer({ 'blue': blue_weight, 'green': green_weight }) print(f"Traffic split: Blue={blue_weight}%, Green={green_weight}%") def rollback(self, reason: str = "Manual rollback"): \"\"\" INSTANT rollback to blue (stable version). Takes ~1 second (just update load balancer). 
\"\"\" print(f"ROLLBACK TRIGGERED: {reason}") print(f"Shifting 100% traffic to Blue ({self.state.blue_model})") self.shift_traffic(blue_weight=100, green_weight=0) alert_team( message=f"Rollback executed: {reason}", old_state={'blue': self.state.blue_weight, 'green': self.state.green_weight}, new_state={'blue': 100, 'green': 0} ) print("Rollback complete (< 1 second)") def promote_green(self): \"\"\" Promote green to blue (make green the new stable). Process: 1. Green is at 100% traffic (already tested) 2. Swap blue ↔ green labels 3. Old blue becomes new green (ready for next deployment) \"\"\" print("Promoting green to blue") # Swap models old_blue = self.state.blue_model old_blue_model = self.models[ModelVersion.BLUE] self.state.blue_model = self.state.green_model self.state.green_model = old_blue self.models[ModelVersion.BLUE] = self.models[ModelVersion.GREEN] self.models[ModelVersion.GREEN] = old_blue_model # Update traffic (blue=100%, green=0%) self.state.blue_weight = 100 self.state.green_weight = 0 print(f"Promotion complete: {self.state.blue_model} is now stable") def gradual_rollout( self, green_model: str, stages: list = [5, 25, 50, 100], stage_duration_minutes: int = 60 ): \"\"\" Gradual rollout with automatic rollback on errors. \"\"\" # Deploy to green slot self.deploy_green(green_model) # Monitor metrics baseline_metrics = get_metrics(window_minutes=60) for stage in stages: print(f"\\n=== Stage: {stage}% to green ===") # Shift traffic self.shift_traffic(blue_weight=100-stage, green_weight=stage) # Monitor for duration print(f"Monitoring for {stage_duration_minutes} minutes...") for minute in range(stage_duration_minutes): time.sleep(60) # Check metrics every minute current_metrics = get_metrics(window_minutes=5) # Automated health check health = self._check_health(baseline_metrics, current_metrics) if not health['healthy']: print(f"Health check FAILED: {health['reason']}") self.rollback(reason=health['reason']) return {'status': 'ROLLED_BACK', 'reason': health['reason']} if (minute + 1) % 10 == 0: print(f" {minute + 1}/{stage_duration_minutes} min - Healthy") print(f"Stage {stage}% complete. Metrics healthy.") # All stages passed, promote green to blue self.promote_green() return {'status': 'SUCCESS', 'model': green_model} def _check_health(self, baseline: dict, current: dict) -> dict: \"\"\"Check if current metrics are healthy compared to baseline.\"\"\" checks = { 'error_rate': current['error_rate'] < baseline['error_rate'] * 2.0, 'latency_p95': current['latency_p95'] < baseline['latency_p95'] * 1.5, 'timeout_rate': current['timeout_rate'] < baseline['timeout_rate'] * 2.0, } failed = [k for k, v in checks.items() if not v] if failed: return { 'healthy': False, 'reason': f"Metrics degraded: {failed}. Current: {current}, Baseline: {baseline}" } return {'healthy': True} # Usage deployment = BlueGreenDeployment(blue_model='v1.0') # Deploy v2.0 with gradual rollout and automatic rollback result = deployment.gradual_rollout( green_model='v2.0', stages=[5, 25, 50, 100], # Canary 5% → 25% → A/B 50% → Full 100% stage_duration_minutes=60 ) # If any stage fails, automatic rollback to v1.0 (< 1 second) # If all stages pass, v2.0 promoted to stable print(result) # {'status': 'SUCCESS', 'model': 'v2.0'} # or # {'status': 'ROLLED_BACK', 'reason': 'Metrics degraded: error_rate. 
Current: {...}, Baseline: {...}'} ``` **Rollback timing comparison:** | Method | Rollback Time | Risk | |--------|---------------|------| | Manual | 5-10 minutes | High (human error, stress) | | Scripted | 2-3 minutes | Medium (still manual trigger) | | Automated | < 30 seconds | Low (instant, no human) | | Blue-green | < 1 second | Minimal (just traffic shift) | **Principle: Build rollback before deploying. Automated, instant, tested. Blue-green deployment makes rollback a config change, not a deploy.**" ## Summary of RED Phase Failures **5 Failures Covered:** 1. **Instant 100% deployment** → All users impacted by bugs 2. **A/B test without statistics** → Wrong conclusions from small samples 3. **Canary without metrics** → Silent failures go unnoticed 4. **Shadow mode without comparison** → Wasted compute, no learning 5. **No rollback plan** → Slow recovery from failures **Common themes:** - **No validation** → Hope-driven deployment - **No automation** → Manual processes fail under pressure - **No metrics** → Flying blind - **No gradual rollout** → All-or-nothing risk - **No rollback** → Long recovery time **Core insight:** Safe deployment requires automation, metrics, gradual rollout, and instant rollback. Each step must validate before proceeding. ## GREEN Phase: Safe Deployment Patterns (900-1200 lines) ### Pattern 1: A/B Testing with Statistical Validation **Goal:** Compare two models with statistical rigor to make confident decisions. **Complete Implementation:** ```python import numpy as np from scipy import stats from statsmodels.stats.power import zt_ind_solve_power from statsmodels.stats.proportion import proportion_confint from dataclasses import dataclass from typing import List, Dict, Optional, Tuple from enum import Enum import time class ABTestStatus(Enum): NOT_STARTED = "not_started" RUNNING = "running" INCONCLUSIVE = "inconclusive" A_WINS = "a_wins" B_WINS = "b_wins" TIE = "tie" @dataclass class ABTestConfig: """Configuration for A/B test.""" min_sample_size: int = 1000 # Minimum samples per variant significance_level: float = 0.05 # Alpha (5% significance) power: float = 0.8 # 80% statistical power min_effect_size: float = 0.02 # Minimum detectable effect (2%) traffic_split: float = 0.5 # 50/50 split @dataclass class ABTestResult: """Result of A/B test.""" status: ABTestStatus winner: Optional[str] p_value: float effect_size: float confidence_interval_a: Tuple[float, float] confidence_interval_b: Tuple[float, float] sample_size_a: int sample_size_b: int metric_a: float metric_b: float required_sample_size: int recommendation: str class ABTest: """ A/B testing framework with statistical validation. Features: - Sample size calculation (power analysis) - Statistical significance testing (z-test) - Confidence intervals - Effect size calculation - Multi-metric evaluation - Automatic decision making """ def __init__(self, model_a, model_b, config: ABTestConfig = None): self.model_a = model_a self.model_b = model_b self.config = config or ABTestConfig() self.results_a = [] self.results_b = [] self.metadata_a = [] self.metadata_b = [] def calculate_required_sample_size( self, baseline_rate: float = 0.5, effect_size: float = None ) -> int: """ Calculate required sample size for statistical power. 
Args: baseline_rate: Expected baseline conversion/success rate effect_size: Minimum detectable effect (default from config) Returns: Required sample size per variant """ effect_size = effect_size or self.config.min_effect_size # Convert effect size to Cohen's h p1 = baseline_rate p2 = baseline_rate + effect_size cohens_h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1))) # Calculate required sample size n = zt_ind_solve_power( effect_size=cohens_h, alpha=self.config.significance_level, power=self.config.power, ratio=1.0, # Equal sample sizes alternative='two-sided' ) return int(np.ceil(n)) def route_request(self, request) -> Tuple[str, any]: """ Route request to A or B based on traffic split. Returns: (variant, result) where variant is 'a' or 'b' """ if np.random.random() < self.config.traffic_split: variant = 'a' result = self.model_a.predict(request) else: variant = 'b' result = self.model_b.predict(request) return variant, result def record_result(self, variant: str, success: bool, metadata: dict = None): """ Record result for variant. Args: variant: 'a' or 'b' success: Whether the prediction was successful (1) or not (0) metadata: Optional metadata (latency, user_id, etc.) """ if variant == 'a': self.results_a.append(1 if success else 0) self.metadata_a.append(metadata or {}) else: self.results_b.append(1 if success else 0) self.metadata_b.append(metadata or {}) def test_significance(self) -> ABTestResult: """ Test statistical significance of results. Returns: ABTestResult with decision and metrics """ n_a = len(self.results_a) n_b = len(self.results_b) # Check minimum sample size required_n = self.calculate_required_sample_size() if n_a < required_n or n_b < required_n: return ABTestResult( status=ABTestStatus.INCONCLUSIVE, winner=None, p_value=1.0, effect_size=0.0, confidence_interval_a=(0.0, 0.0), confidence_interval_b=(0.0, 0.0), sample_size_a=n_a, sample_size_b=n_b, metric_a=0.0, metric_b=0.0, required_sample_size=required_n, recommendation=f"Continue test. Need {required_n - min(n_a, n_b)} more samples." ) # Calculate metrics successes_a = sum(self.results_a) successes_b = sum(self.results_b) rate_a = successes_a / n_a rate_b = successes_b / n_b # Statistical test (two-proportion z-test) from statsmodels.stats.proportion import proportions_ztest stat, p_value = proportions_ztest( [successes_a, successes_b], [n_a, n_b] ) # Confidence intervals ci_a = proportion_confint(successes_a, n_a, alpha=self.config.significance_level, method='wilson') ci_b = proportion_confint(successes_b, n_b, alpha=self.config.significance_level, method='wilson') # Effect size effect_size = rate_b - rate_a # Decision is_significant = p_value < self.config.significance_level is_meaningful = abs(effect_size) >= self.config.min_effect_size if is_significant and is_meaningful: if effect_size > 0: status = ABTestStatus.B_WINS winner = 'b' recommendation = f"Deploy Model B. {rate_b:.1%} vs {rate_a:.1%} (p={p_value:.4f})" else: status = ABTestStatus.A_WINS winner = 'a' recommendation = f"Keep Model A. {rate_a:.1%} vs {rate_b:.1%} (p={p_value:.4f})" elif is_significant and not is_meaningful: status = ABTestStatus.TIE winner = None recommendation = f"Models equivalent. Effect size {effect_size:.1%} below threshold {self.config.min_effect_size:.1%}." else: status = ABTestStatus.INCONCLUSIVE winner = None recommendation = f"No significant difference (p={p_value:.4f}). Consider longer test or accept tie." 
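        # Decision matrix applied above (added summary comment):
        #   significant AND meaningful effect   -> declare a winner (A or B)
        #   significant but below min_effect_size -> treat as a tie
        #   not significant                      -> inconclusive; keep Model A or extend the test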
return ABTestResult( status=status, winner=winner, p_value=p_value, effect_size=effect_size, confidence_interval_a=ci_a, confidence_interval_b=ci_b, sample_size_a=n_a, sample_size_b=n_b, metric_a=rate_a, metric_b=rate_b, required_sample_size=required_n, recommendation=recommendation ) def run_test( self, traffic_generator, max_duration_hours: int = 48, check_interval_minutes: int = 60 ) -> ABTestResult: """ Run A/B test with automatic stopping. Args: traffic_generator: Generator yielding (request, ground_truth) tuples max_duration_hours: Maximum test duration check_interval_minutes: How often to check for significance Returns: ABTestResult with final decision """ start_time = time.time() last_check = start_time print(f"Starting A/B test: Model A vs Model B") print(f"Config: {self.config}") print(f"Required sample size: {self.calculate_required_sample_size()} per variant") for request, ground_truth in traffic_generator: # Route request variant, prediction = self.route_request(request) # Evaluate success = self._evaluate(prediction, ground_truth) # Record with metadata metadata = { 'timestamp': time.time(), 'request': request, 'prediction': prediction, 'ground_truth': ground_truth } self.record_result(variant, success, metadata) # Check for significance periodically if time.time() - last_check > check_interval_minutes * 60: result = self.test_significance() print(f"\n=== Check at {len(self.results_a) + len(self.results_b)} samples ===") print(f"Model A: {result.metric_a:.1%} ({result.sample_size_a} samples)") print(f"Model B: {result.metric_b:.1%} ({result.sample_size_b} samples)") print(f"Status: {result.status.value}") print(f"p-value: {result.p_value:.4f}") print(f"Effect size: {result.effect_size:+.1%}") print(f"Recommendation: {result.recommendation}") # Stop if conclusive if result.status in [ABTestStatus.A_WINS, ABTestStatus.B_WINS, ABTestStatus.TIE]: print(f"\nTest concluded: {result.status.value}") return result last_check = time.time() # Stop if max duration reached if time.time() - start_time > max_duration_hours * 3600: print(f"\nMax duration ({max_duration_hours}h) reached") result = self.test_significance() return result # Test ended (traffic exhausted) return self.test_significance() def _evaluate(self, prediction, ground_truth) -> bool: """Evaluate if prediction matches ground truth.""" return prediction == ground_truth def analyze_segments(self, segment_key: str = 'user_type') -> Dict[str, ABTestResult]: """ Analyze results by segments (e.g., user type, geography). 
Args: segment_key: Key in metadata to segment by Returns: Dict mapping segment to ABTestResult """ # Group by segment segments_a = {} segments_b = {} for result, metadata in zip(self.results_a, self.metadata_a): segment = metadata.get(segment_key, 'unknown') if segment not in segments_a: segments_a[segment] = [] segments_a[segment].append(result) for result, metadata in zip(self.results_b, self.metadata_b): segment = metadata.get(segment_key, 'unknown') if segment not in segments_b: segments_b[segment] = [] segments_b[segment].append(result) # Analyze each segment segment_results = {} for segment in set(segments_a.keys()) | set(segments_b.keys()): results_a = segments_a.get(segment, []) results_b = segments_b.get(segment, []) # Create temporary AB test for segment segment_test = ABTest(self.model_a, self.model_b, self.config) segment_test.results_a = results_a segment_test.results_b = results_b segment_results[segment] = segment_test.test_significance() return segment_results # Example usage if __name__ == "__main__": # Mock models class ModelA: def predict(self, x): return "positive" if np.random.random() < 0.75 else "negative" class ModelB: def predict(self, x): return "positive" if np.random.random() < 0.78 else "negative" # 3% better # Traffic generator (mock) def traffic_generator(): for i in range(10000): request = f"Review {i}" ground_truth = "positive" if np.random.random() < 0.75 else "negative" yield request, ground_truth # Run A/B test ab_test = ABTest(ModelA(), ModelB()) result = ab_test.run_test( traffic_generator(), max_duration_hours=48, check_interval_minutes=60 ) print("\n" + "="*50) print("FINAL RESULT") print("="*50) print(f"Status: {result.status.value}") print(f"Winner: {result.winner}") print(f"Model A: {result.metric_a:.1%} CI=[{result.confidence_interval_a[0]:.1%}, {result.confidence_interval_a[1]:.1%}]") print(f"Model B: {result.metric_b:.1%} CI=[{result.confidence_interval_b[0]:.1%}, {result.confidence_interval_b[1]:.1%}]") print(f"Effect size: {result.effect_size:+.1%}") print(f"p-value: {result.p_value:.4f}") print(f"Recommendation: {result.recommendation}") ``` **Key Features:** 1. **Sample size calculation:** Power analysis ensures sufficient data 2. **Statistical testing:** Two-proportion z-test with significance level 3. **Confidence intervals:** Quantify uncertainty 4. **Effect size:** Measure practical significance 5. **Automatic stopping:** Stop when conclusive or time limit reached 6. **Segment analysis:** Analyze by user type, geography, etc. **Usage guidelines:** | Scenario | Min Sample Size | Duration | Traffic Split | |----------|-----------------|----------|---------------| | Small effect (2%) | 3,000/variant | 1-2 weeks | 50/50 | | Medium effect (5%) | 500/variant | 3-5 days | 50/50 | | Large effect (10%) | 200/variant | 1-2 days | 50/50 | ### Pattern 2: Canary Deployment with Automated Rollback **Goal:** Gradually increase traffic to new model while monitoring metrics and auto-rollback on regression. 
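Before the complete implementation, here is a minimal, illustrative sketch of the core rollback rule a canary applies: compare each canary metric to the baseline scaled by a tolerance multiplier and roll back on any violation. The metric names, baseline numbers, and multipliers below are assumptions for illustration, not values taken from the implementation that follows.

```python
# Minimal sketch of the canary health-check rule: flag any metric that
# exceeds its baseline value times a tolerance multiplier.
# Metric names, baseline numbers, and multipliers here are illustrative.
BASELINE = {"error_rate": 0.001, "latency_p95": 0.40, "timeout_rate": 0.005}
TOLERANCE = {"error_rate": 2.0, "latency_p95": 1.5, "timeout_rate": 2.0}

def failed_checks(current: dict) -> list:
    """Return the names of metrics that breach baseline * tolerance."""
    return [
        name for name, multiplier in TOLERANCE.items()
        if current[name] > BASELINE[name] * multiplier
    ]

# A canary running at 4x the baseline p95 latency should trigger a rollback:
print(failed_checks({"error_rate": 0.0012, "latency_p95": 1.6, "timeout_rate": 0.004}))
# -> ['latency_p95']
```

The full implementation below wraps this same rule in staged traffic shifts, periodic checks, and alerting.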
**Complete Implementation:** ```python import time import numpy as np from dataclasses import dataclass from typing import Dict, List, Callable, Optional from enum import Enum import logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) class CanaryStage(Enum): SHADOW = "shadow" # 0% user traffic CANARY_5 = "canary_5" # 5% traffic CANARY_25 = "canary_25" # 25% traffic AB_TEST = "ab_test" # 50% traffic FULL = "full" # 100% traffic class CanaryStatus(Enum): NOT_STARTED = "not_started" IN_PROGRESS = "in_progress" SUCCESS = "success" ROLLED_BACK = "rolled_back" @dataclass class CanaryConfig: """Configuration for canary deployment.""" stages: List[Dict] = None # List of {'stage': CanaryStage, 'duration_minutes': int} check_interval_minutes: int = 5 # How often to check metrics # Metric thresholds for rollback max_error_rate_multiplier: float = 2.0 # Allow 2× baseline error rate max_latency_p95_multiplier: float = 1.5 # Allow 1.5× baseline latency max_timeout_rate_multiplier: float = 2.0 # Allow 2× baseline timeout rate def __post_init__(self): if self.stages is None: self.stages = [ {'stage': CanaryStage.SHADOW, 'duration_minutes': 60}, {'stage': CanaryStage.CANARY_5, 'duration_minutes': 120}, {'stage': CanaryStage.CANARY_25, 'duration_minutes': 240}, {'stage': CanaryStage.AB_TEST, 'duration_minutes': 1440}, # 24 hours {'stage': CanaryStage.FULL, 'duration_minutes': 0}, # Indefinite ] @dataclass class Metrics: """Metrics for monitoring.""" error_rate: float latency_p50: float latency_p95: float latency_p99: float timeout_rate: float requests_per_second: float timestamp: float def __repr__(self): return ( f"Metrics(error_rate={self.error_rate:.2%}, " f"latency_p95={self.latency_p95:.3f}s, " f"timeout_rate={self.timeout_rate:.2%})" ) class MetricsCollector: """Collect and aggregate metrics.""" def __init__(self): self.results = [] self.latencies = [] def record(self, success: bool, latency: float, timeout: bool): """Record single request result.""" self.results.append({ 'success': success, 'latency': latency, 'timeout': timeout, 'timestamp': time.time() }) self.latencies.append(latency) def get_metrics(self, window_minutes: int = 5) -> Metrics: """Get metrics for recent window.""" cutoff = time.time() - window_minutes * 60 recent = [r for r in self.results if r['timestamp'] > cutoff] if not recent: return Metrics( error_rate=0.0, latency_p50=0.0, latency_p95=0.0, latency_p99=0.0, timeout_rate=0.0, requests_per_second=0.0, timestamp=time.time() ) recent_latencies = [r['latency'] for r in recent] return Metrics( error_rate=1 - np.mean([r['success'] for r in recent]), latency_p50=np.percentile(recent_latencies, 50), latency_p95=np.percentile(recent_latencies, 95), latency_p99=np.percentile(recent_latencies, 99), timeout_rate=np.mean([r['timeout'] for r in recent]), requests_per_second=len(recent) / (window_minutes * 60), timestamp=time.time() ) class CanaryDeployment: """ Canary deployment with automated monitoring and rollback. Stages: 1. Shadow (0%): Run alongside old model, compare outputs 2. Canary 5%: Serve to 5% of users, monitor closely 3. Canary 25%: Expand to 25% if healthy 4. A/B test (50%): Split traffic for statistical comparison 5. 
Full (100%): Promote to all traffic At each stage: - Monitor metrics (error rate, latency, timeouts) - Compare to baseline - Auto-rollback if metrics degrade beyond thresholds """ def __init__( self, old_model, new_model, config: CanaryConfig = None ): self.old_model = old_model self.new_model = new_model self.config = config or CanaryConfig() self.current_stage = None self.status = CanaryStatus.NOT_STARTED # Metrics collectors self.old_metrics = MetricsCollector() self.new_metrics = MetricsCollector() # Baseline metrics (from old model) self.baseline: Optional[Metrics] = None def set_baseline(self, duration_minutes: int = 60): """ Collect baseline metrics from old model. Run for specified duration to establish normal behavior. """ logger.info(f"Collecting baseline metrics for {duration_minutes} minutes") # In production, this would sample real traffic # For demo, we'll simulate start = time.time() while time.time() - start < duration_minutes * 60: # Simulate request success = np.random.random() > 0.001 # 0.1% error rate latency = np.random.exponential(0.2) # 200ms mean timeout = latency > 5.0 self.old_metrics.record(success, latency, timeout) time.sleep(0.1) # 10 req/sec self.baseline = self.old_metrics.get_metrics(window_minutes=duration_minutes) logger.info(f"Baseline established: {self.baseline}") def predict(self, request, stage: CanaryStage): """ Route request to old or new model based on stage. Returns: (model_used, result, latency) """ stage_traffic = { CanaryStage.SHADOW: 0.0, # 0% to new model (shadow only) CanaryStage.CANARY_5: 0.05, CanaryStage.CANARY_25: 0.25, CanaryStage.AB_TEST: 0.50, CanaryStage.FULL: 1.0, } new_model_probability = stage_traffic[stage] start = time.time() # Shadow mode: always run both if stage == CanaryStage.SHADOW: old_result = self.old_model.predict(request) new_result = self.new_model.predict(request) latency = time.time() - start return 'old', old_result, latency # Return old model result # Normal routing if np.random.random() < new_model_probability: result = self.new_model.predict(request) latency = time.time() - start return 'new', result, latency else: result = self.old_model.predict(request) latency = time.time() - start return 'old', result, latency def check_health(self, new_metrics: Metrics) -> Dict: """ Check if new model metrics are healthy compared to baseline. 
Returns: {'healthy': bool, 'reason': str, 'metrics': dict} """ if self.baseline is None: return {'healthy': True, 'reason': 'No baseline set'} checks = { 'error_rate': new_metrics.error_rate <= self.baseline.error_rate * self.config.max_error_rate_multiplier, 'latency_p95': new_metrics.latency_p95 <= self.baseline.latency_p95 * self.config.max_latency_p95_multiplier, 'timeout_rate': new_metrics.timeout_rate <= self.baseline.timeout_rate * self.config.max_timeout_rate_multiplier, } failed = [k for k, v in checks.items() if not v] if failed: return { 'healthy': False, 'reason': f"Metrics degraded: {failed}", 'metrics': { 'baseline': self.baseline, 'current': new_metrics, 'thresholds': { 'error_rate': self.baseline.error_rate * self.config.max_error_rate_multiplier, 'latency_p95': self.baseline.latency_p95 * self.config.max_latency_p95_multiplier, 'timeout_rate': self.baseline.timeout_rate * self.config.max_timeout_rate_multiplier, } } } return {'healthy': True, 'reason': 'All metrics within thresholds'} def rollback(self, reason: str): """Rollback to old model.""" logger.error(f"ROLLBACK TRIGGERED: {reason}") self.status = CanaryStatus.ROLLED_BACK self.current_stage = None # Alert team alert_team({ 'event': 'CANARY_ROLLBACK', 'reason': reason, 'stage': self.current_stage.value if self.current_stage else 'unknown', 'baseline': self.baseline, 'current_metrics': self.new_metrics.get_metrics() }) def run_stage( self, stage: Dict, traffic_generator: Callable ) -> bool: """ Run single canary stage. Returns: True if stage succeeded, False if rolled back """ stage_name = stage['stage'] duration = stage['duration_minutes'] logger.info(f"\n{'='*60}") logger.info(f"Starting stage: {stage_name.value} ({duration} minutes)") logger.info(f"{'='*60}") self.current_stage = stage_name start_time = time.time() last_check = start_time # Run for duration while time.time() - start_time < duration * 60: # Process request request, ground_truth = next(traffic_generator) model_used, prediction, latency = self.predict(request, stage_name) # Evaluate success = prediction == ground_truth timeout = latency > 5.0 # Record metrics if model_used == 'new' or stage_name == CanaryStage.SHADOW: self.new_metrics.record(success, latency, timeout) if model_used == 'old': self.old_metrics.record(success, latency, timeout) # Check health periodically if time.time() - last_check > self.config.check_interval_minutes * 60: new_metrics = self.new_metrics.get_metrics( window_minutes=self.config.check_interval_minutes ) logger.info(f"Health check: {new_metrics}") health = self.check_health(new_metrics) if not health['healthy']: logger.error(f"Health check FAILED: {health['reason']}") logger.error(f"Metrics: {health['metrics']}") self.rollback(health['reason']) return False logger.info("Health check PASSED") last_check = time.time() logger.info(f"Stage {stage_name.value} completed successfully") return True def deploy(self, traffic_generator: Callable) -> Dict: """ Run full canary deployment. 
Args: traffic_generator: Generator yielding (request, ground_truth) tuples Returns: {'status': CanaryStatus, 'final_stage': CanaryStage} """ logger.info("Starting canary deployment") # Set baseline if not already set if self.baseline is None: logger.info("No baseline set, collecting baseline metrics...") self.set_baseline(duration_minutes=60) self.status = CanaryStatus.IN_PROGRESS # Run each stage for stage in self.config.stages: success = self.run_stage(stage, traffic_generator) if not success: return { 'status': CanaryStatus.ROLLED_BACK, 'final_stage': self.current_stage } # Stop at full deployment if stage['stage'] == CanaryStage.FULL: break logger.info("Canary deployment completed successfully!") self.status = CanaryStatus.SUCCESS return { 'status': CanaryStatus.SUCCESS, 'final_stage': CanaryStage.FULL } # Helper function for production use def alert_team(payload: Dict): """Send alert to team (Slack, PagerDuty, etc.).""" logger.warning(f"ALERT: {payload}") # In production: send to Slack, PagerDuty, etc. # Example usage if __name__ == "__main__": # Mock models class OldModel: def predict(self, x): time.sleep(np.random.exponential(0.2)) # 200ms avg if np.random.random() < 0.001: # 0.1% error rate raise Exception("Prediction failed") return "positive" if np.random.random() < 0.75 else "negative" class NewModel: def predict(self, x): time.sleep(np.random.exponential(0.18)) # 180ms avg (10% faster) if np.random.random() < 0.0008: # 0.08% error rate (20% better) raise Exception("Prediction failed") return "positive" if np.random.random() < 0.78 else "negative" # 3% better # Traffic generator def traffic_generator(): while True: request = f"Review" ground_truth = "positive" if np.random.random() < 0.75 else "negative" yield request, ground_truth # Run canary deployment canary = CanaryDeployment( old_model=OldModel(), new_model=NewModel(), config=CanaryConfig( stages=[ {'stage': CanaryStage.CANARY_5, 'duration_minutes': 5}, {'stage': CanaryStage.CANARY_25, 'duration_minutes': 10}, {'stage': CanaryStage.FULL, 'duration_minutes': 0}, ], check_interval_minutes=1 ) ) result = canary.deploy(traffic_generator()) print(f"\nDeployment result: {result}") ``` **Key Features:** 1. **Staged rollout:** Shadow → 5% → 25% → 50% → 100% 2. **Automated monitoring:** Check metrics every N minutes 3. **Health checks:** Compare to baseline with thresholds 4. **Auto-rollback:** Instant rollback if metrics degrade 5. **Alerting:** Notify team on rollback **Monitoring thresholds:** | Metric | Threshold | Rationale | |--------|-----------|-----------| | Error rate | 2× baseline | Small increase OK, large = bug | | Latency p95 | 1.5× baseline | Tail latency impacts UX | | Timeout rate | 2× baseline | Timeouts frustrate users | ### Pattern 3: Shadow Mode with Output Comparison **Goal:** Run new model alongside production model without user impact to validate behavior. **Complete Implementation:** ```python from dataclasses import dataclass from typing import Any, Dict, List, Optional, Callable import time import numpy as np from collections import defaultdict import logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class ShadowComparison: """Result of comparing old and new model outputs.""" input: Any old_output: Any new_output: Any match: bool similarity_score: Optional[float] old_latency: float new_latency: float timestamp: float class ShadowMode: """ Shadow mode deployment: run new model alongside old without user impact. Process: 1. 
Serve old model to users (production traffic) 2. Run new model in parallel (shadow) 3. Compare outputs (exact match or similarity) 4. Collect metrics (agreement rate, latency, errors) 5. Decide promotion based on criteria Promotion criteria: - Agreement rate > 85% (outputs match most of the time) - Latency p95 < 1.5× old model (not too slow) - Error rate < 2× old model (not more buggy) - Minimum 1000 samples (statistical confidence) """ def __init__( self, old_model, new_model, comparison_fn: Optional[Callable] = None, sample_rate: float = 1.0 ): self.old_model = old_model self.new_model = new_model self.comparison_fn = comparison_fn or self._default_comparison self.sample_rate = sample_rate self.comparisons: List[ShadowComparison] = [] self.old_errors = [] self.new_errors = [] def _default_comparison(self, old_output: Any, new_output: Any) -> tuple[bool, float]: """ Default comparison: exact match. Returns: (match: bool, similarity_score: float) """ match = old_output == new_output similarity = 1.0 if match else 0.0 return match, similarity def predict(self, input: Any) -> Any: """ Predict with old model (production), run new model in shadow. Returns: Old model output (served to user) """ # Old model (production) start = time.time() try: old_output = self.old_model.predict(input) old_latency = time.time() - start old_error = None except Exception as e: old_latency = time.time() - start old_output = None old_error = str(e) self.old_errors.append({'input': input, 'error': old_error, 'timestamp': time.time()}) # New model (shadow) - sample rate to reduce load if np.random.random() < self.sample_rate: start = time.time() try: new_output = self.new_model.predict(input) new_latency = time.time() - start new_error = None except Exception as e: new_latency = time.time() - start new_output = None new_error = str(e) self.new_errors.append({'input': input, 'error': new_error, 'timestamp': time.time()}) # Compare outputs if old_output is not None and new_output is not None: match, similarity = self.comparison_fn(old_output, new_output) self.comparisons.append(ShadowComparison( input=input, old_output=old_output, new_output=new_output, match=match, similarity_score=similarity, old_latency=old_latency, new_latency=new_latency, timestamp=time.time() )) return old_output # Always return old model (production) def get_analysis(self, min_samples: int = 1000) -> Dict: """ Analyze shadow mode results and recommend next steps. Returns: Analysis dict with recommendation """ n_comparisons = len(self.comparisons) if n_comparisons < min_samples: return { 'status': 'INSUFFICIENT_DATA', 'samples': n_comparisons, 'required': min_samples, 'recommendation': f'Continue shadow mode. Need {min_samples - n_comparisons} more samples.', 'message': f'Only {n_comparisons}/{min_samples} samples collected.' 
} # Calculate metrics agreement_rate = np.mean([c.match for c in self.comparisons]) avg_similarity = np.mean([c.similarity_score for c in self.comparisons if c.similarity_score is not None]) old_latency_p50 = np.median([c.old_latency for c in self.comparisons]) new_latency_p50 = np.median([c.new_latency for c in self.comparisons]) old_latency_p95 = np.percentile([c.old_latency for c in self.comparisons], 95) new_latency_p95 = np.percentile([c.new_latency for c in self.comparisons], 95) old_error_rate = len(self.old_errors) / (n_comparisons + len(self.old_errors)) new_error_rate = len(self.new_errors) / (n_comparisons + len(self.new_errors)) # Decision criteria latency_acceptable = new_latency_p95 < old_latency_p95 * 1.5 # Max 50% slower agreement_acceptable = agreement_rate > 0.85 # 85%+ agreement error_rate_acceptable = new_error_rate < old_error_rate * 2.0 # Max 2× errors # Recommendation if latency_acceptable and agreement_acceptable and error_rate_acceptable: recommendation = 'PROMOTE_TO_CANARY' status = 'SUCCESS' message = ( f'Shadow mode successful! ' f'Agreement: {agreement_rate:.1%}, ' f'Latency p95: {new_latency_p95:.3f}s ({new_latency_p95/old_latency_p95:.1f}× baseline), ' f'Error rate: {new_error_rate:.2%}' ) elif not latency_acceptable: recommendation = 'OPTIMIZE_LATENCY' status = 'NEEDS_IMPROVEMENT' message = ( f'New model too slow: ' f'p95 {new_latency_p95:.3f}s vs {old_latency_p95:.3f}s ' f'({new_latency_p95/old_latency_p95:.1f}× > 1.5× threshold)' ) elif not agreement_acceptable: recommendation = 'INVESTIGATE_DISAGREEMENT' status = 'NEEDS_IMPROVEMENT' message = ( f'Low agreement: {agreement_rate:.1%} < 85% threshold. ' f'Review {len([c for c in self.comparisons if not c.match])} disagreement cases.' ) else: # not error_rate_acceptable recommendation = 'FIX_ERRORS' status = 'NEEDS_IMPROVEMENT' message = ( f'High error rate: {new_error_rate:.2%} vs {old_error_rate:.2%} (>{2.0:.1f}× threshold). ' f'Fix {len(self.new_errors)} errors before promoting.' ) return { 'status': status, 'samples': n_comparisons, 'agreement_rate': agreement_rate, 'avg_similarity': avg_similarity, 'old_latency_p50': old_latency_p50, 'new_latency_p50': new_latency_p50, 'old_latency_p95': old_latency_p95, 'new_latency_p95': new_latency_p95, 'old_error_rate': old_error_rate, 'new_error_rate': new_error_rate, 'latency_acceptable': latency_acceptable, 'agreement_acceptable': agreement_acceptable, 'error_rate_acceptable': error_rate_acceptable, 'recommendation': recommendation, 'message': message } def get_disagreement_examples(self, n: int = 10) -> List[ShadowComparison]: """Get examples where models disagree.""" disagreements = [c for c in self.comparisons if not c.match] return disagreements[:n] def get_latency_outliers(self, threshold_multiplier: float = 3.0, n: int = 10) -> List[ShadowComparison]: """Get examples where new model is much slower.""" median_latency_ratio = np.median([c.new_latency / c.old_latency for c in self.comparisons]) outliers = [ c for c in self.comparisons if c.new_latency / c.old_latency > median_latency_ratio * threshold_multiplier ] return sorted(outliers, key=lambda x: x.new_latency / x.old_latency, reverse=True)[:n] # Example usage with semantic similarity comparison def semantic_comparison(old_output: str, new_output: str) -> tuple[bool, float]: """ Compare outputs using semantic similarity (for text generation). 
Returns: (match: bool, similarity_score: float) """ # For demo, use simple token overlap # In production, use sentence transformers or LLM-as-judge old_tokens = set(old_output.lower().split()) new_tokens = set(new_output.lower().split()) if not old_tokens and not new_tokens: return True, 1.0 overlap = len(old_tokens & new_tokens) union = len(old_tokens | new_tokens) similarity = overlap / union if union > 0 else 0.0 match = similarity > 0.8 # 80% token overlap = match return match, similarity if __name__ == "__main__": # Mock models class OldModel: def predict(self, x): time.sleep(np.random.exponential(0.2)) return "positive" if np.random.random() < 0.75 else "negative" class NewModel: def predict(self, x): time.sleep(np.random.exponential(0.18)) # Slightly faster return "positive" if np.random.random() < 0.78 else "negative" # Slightly more positive # Run shadow mode shadow = ShadowMode( old_model=OldModel(), new_model=NewModel(), sample_rate=1.0 # Shadow 100% of traffic ) # Process traffic logger.info("Running shadow mode...") for i in range(2000): request = f"Review {i}" result = shadow.predict(request) # Serve old model to user if (i + 1) % 500 == 0: logger.info(f"Processed {i + 1} requests") # Analyze results logger.info("\n" + "="*60) logger.info("SHADOW MODE ANALYSIS") logger.info("="*60) analysis = shadow.get_analysis() for key, value in analysis.items(): if isinstance(value, float): logger.info(f"{key}: {value:.4f}") else: logger.info(f"{key}: {value}") # Show disagreement examples logger.info("\nDisagreement examples:") for i, comp in enumerate(shadow.get_disagreement_examples(5), 1): logger.info(f"{i}. Old: {comp.old_output}, New: {comp.new_output}, Similarity: {comp.similarity_score:.2f}") # Show latency outliers logger.info("\nLatency outliers:") for i, comp in enumerate(shadow.get_latency_outliers(2.0, 5), 1): ratio = comp.new_latency / comp.old_latency logger.info(f"{i}. Old: {comp.old_latency:.3f}s, New: {comp.new_latency:.3f}s ({ratio:.1f}×)") ``` **Key Features:** 1. **Zero user impact:** New model runs but outputs not served 2. **Output comparison:** Exact match or semantic similarity 3. **Latency comparison:** Measure performance difference 4. **Error tracking:** Count errors in both models 5. **Sampling:** Sample % of traffic to reduce shadow load 6. **Decision criteria:** Automated promotion recommendation **Promotion criteria:** | Criterion | Threshold | Why | |-----------|-----------|-----| | Agreement rate | > 85% | Models should mostly agree | | Latency p95 | < 1.5× old | Can't be too slow | | Error rate | < 2× old | Can't be more buggy | | Sample size | ≥ 1000 | Statistical confidence | ### Pattern 4: Blue-Green Deployment with Feature Flags **Goal:** Zero-downtime deployment with instant rollback capability using traffic switching. **Complete Implementation:** ```python from enum import Enum from dataclasses import dataclass from typing import Dict, Optional import time class Environment(Enum): BLUE = "blue" GREEN = "green" @dataclass class DeploymentConfig: """Configuration for blue-green deployment.""" blue_model_path: str green_model_path: Optional[str] = None active_environment: Environment = Environment.BLUE blue_weight: int = 100 # % of traffic green_weight: int = 0 class FeatureFlag: """ Feature flag for model selection and gradual rollout. 
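    Typical rollout lifecycle (values are illustrative, not prescribed):

        flag = FeatureFlag("new_model_v2")
        flag.enable_for_segment("beta_testers")  # internal users first
        flag.set_percentage(5)                   # canary
        flag.set_percentage(50)                  # A/B test
        flag.set_percentage(100)                 # full rollout
        flag.activate_kill_switch()              # emergency: everyone back to old model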
Allows: - Enable/disable models per user segment - Percentage-based rollout - A/B testing by user ID - Kill switch for instant rollback """ def __init__(self, name: str, default_enabled: bool = False): self.name = name self.default_enabled = default_enabled # Rollout rules self.percentage_rollout: Optional[int] = None # 0-100 self.enabled_users: set = set() self.disabled_users: set = set() self.enabled_segments: set = set() # e.g., {'premium', 'beta_testers'} self.kill_switch: bool = False # Emergency disable def is_enabled(self, user_id: str = None, segment: str = None) -> bool: """Check if feature is enabled for user/segment.""" # Kill switch overrides everything if self.kill_switch: return False # Explicit user enable/disable if user_id: if user_id in self.disabled_users: return False if user_id in self.enabled_users: return True # Segment-based if segment and segment in self.enabled_segments: return True # Percentage-based rollout if self.percentage_rollout is not None: if user_id: # Deterministic: same user always gets same result user_hash = hash(user_id) % 100 return user_hash < self.percentage_rollout else: # Random (for anonymous users) import random return random.randint(0, 99) < self.percentage_rollout return self.default_enabled def enable_for_user(self, user_id: str): """Enable feature for specific user.""" self.enabled_users.add(user_id) self.disabled_users.discard(user_id) def disable_for_user(self, user_id: str): """Disable feature for specific user.""" self.disabled_users.add(user_id) self.enabled_users.discard(user_id) def set_percentage(self, percentage: int): """Set percentage rollout (0-100).""" if not 0 <= percentage <= 100: raise ValueError("Percentage must be 0-100") self.percentage_rollout = percentage def enable_for_segment(self, segment: str): """Enable for user segment (e.g., 'premium', 'beta').""" self.enabled_segments.add(segment) def activate_kill_switch(self): """Emergency disable (overrides everything).""" self.kill_switch = True def deactivate_kill_switch(self): """Re-enable after kill switch.""" self.kill_switch = False class BlueGreenDeployment: """ Blue-green deployment with feature flags for model management. Architecture: - Blue: Current production model (stable) - Green: New model being deployed - Traffic routing via feature flags - Instant rollback by switching active environment - Both environments always warm and ready """ def __init__(self, config: DeploymentConfig): self.config = config # Load models self.models = { Environment.BLUE: self._load_model(config.blue_model_path), Environment.GREEN: None } # Feature flag for model selection self.model_flag = FeatureFlag(name="new_model_v2", default_enabled=False) def _load_model(self, model_path: str): """Load model from path.""" # In production: load actual model print(f"Loading model from {model_path}") return MockModel(model_path) def deploy_green(self, model_path: str): """Deploy new model to green environment.""" print(f"Deploying green model: {model_path}") self.models[Environment.GREEN] = self._load_model(model_path) self.config.green_model_path = model_path print("Green model loaded and warm") def predict(self, request, user_id: str = None, segment: str = None): """ Route request to blue or green based on feature flag. Args: request: Input request user_id: User ID for deterministic routing segment: User segment (premium, beta, etc.) 
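        Note:
            Routing is deterministic per user because FeatureFlag buckets users
            with hash(user_id) % 100. Python's built-in hash() for strings is
            randomized per process (PYTHONHASHSEED), so the bucketing is only
            stable within a single process; a multi-process or multi-host
            deployment would need a stable hash (e.g. hashlib.sha256 of the
            user_id) to keep each user pinned to the same model.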
Returns: Prediction result """ # Check feature flag use_green = self.model_flag.is_enabled(user_id=user_id, segment=segment) if use_green and self.models[Environment.GREEN] is not None: environment = Environment.GREEN else: environment = Environment.BLUE return self.models[environment].predict(request) def gradual_rollout( self, model_path: str, stages: list = [5, 25, 50, 100] ): """ Gradually roll out new model using percentage-based feature flag. Stages: [5%, 25%, 50%, 100%] """ # Deploy to green self.deploy_green(model_path) for percentage in stages: print(f"\n=== Rolling out to {percentage}% ===") # Update feature flag self.model_flag.set_percentage(percentage) # In production: monitor metrics here # For demo: just wait print(f"Monitoring {percentage}% rollout...") time.sleep(5) # Simulate monitoring period # Check health (mock) healthy = self._check_health() if not healthy: print(f"Health check FAILED at {percentage}%") self.rollback() return {'status': 'ROLLED_BACK', 'stage': f'{percentage}%'} print(f"{percentage}% rollout successful") # Full rollout successful, promote green to blue self.promote_green() return {'status': 'SUCCESS'} def promote_green(self): """ Promote green to blue (make green the new production). Process: 1. Green is at 100% traffic (fully tested) 2. Swap blue ↔ green 3. Old blue model can be reused for next deployment """ print("Promoting green to blue...") # Swap models old_blue = self.models[Environment.BLUE] self.models[Environment.BLUE] = self.models[Environment.GREEN] self.models[Environment.GREEN] = old_blue # Update config old_blue_path = self.config.blue_model_path self.config.blue_model_path = self.config.green_model_path self.config.green_model_path = old_blue_path # Reset feature flag (blue is now the promoted model) self.model_flag.set_percentage(0) self.model_flag.deactivate_kill_switch() print("Promotion complete") def rollback(self): """ Instant rollback to blue environment. Time: < 1 second (just activate kill switch) """ print("ROLLBACK: Activating kill switch") # Kill switch disables green model immediately self.model_flag.activate_kill_switch() print("Rollback complete: 100% traffic to blue (stable model)") def _check_health(self) -> bool: """Mock health check.""" # In production: check actual metrics import random return random.random() > 0.1 # 90% success rate class MockModel: """Mock model for demonstration.""" def __init__(self, path: str): self.path = path def predict(self, x): return f"prediction from {self.path}" # Example usage if __name__ == "__main__": # Initial deployment: blue model v1.0 config = DeploymentConfig( blue_model_path="s3://models/v1.0", active_environment=Environment.BLUE ) deployment = BlueGreenDeployment(config) # Gradual rollout of v2.0 print("=== Deploying v2.0 ===") result = deployment.gradual_rollout( model_path="s3://models/v2.0", stages=[5, 25, 50, 100] ) print(f"\n=== Final Result: {result} ===") # Test predictions print("\n=== Testing predictions ===") for i in range(5): user_id = f"user_{i}" prediction = deployment.predict("test input", user_id=user_id) print(f"User {user_id}: {prediction}") ``` **Key Features:** 1. **Feature flags:** Control model selection per user/segment 2. **Gradual rollout:** Percentage-based traffic splitting 3. **Instant rollback:** Kill switch disables new model immediately 4. **Zero downtime:** Both models always warm 5. 
**Deterministic routing:** Same user always sees same model **Rollout stages:** | Stage | Percentage | Duration | Purpose | |-------|------------|----------|---------| | Canary | 5% | 2 hours | Early error detection | | Expansion | 25% | 6 hours | Wider validation | | A/B test | 50% | 24 hours | Statistical comparison | | Full | 100% | Ongoing | Complete rollout | ## Summary of GREEN Phase **4 Patterns Covered:** 1. **A/B Testing** → Statistical validation of model improvements 2. **Canary Deployment** → Gradual rollout with automated monitoring 3. **Shadow Mode** → Zero-impact validation of new models 4. **Blue-Green + Feature Flags** → Instant rollback with traffic control **Common principles:** - **Gradual rollout:** Shadow → 5% → 25% → 50% → 100% - **Automated monitoring:** Metrics checked every N minutes - **Health checks:** Compare to baseline with thresholds - **Instant rollback:** < 1 second via traffic switching - **Statistical rigor:** Sample size calculation, significance testing **Key metrics monitored:** | Metric | Threshold | Action on Violation | |--------|-----------|---------------------| | Error rate | 2× baseline | Rollback | | Latency p95 | 1.5× baseline | Rollback | | Timeout rate | 2× baseline | Rollback | | Agreement rate (shadow) | > 85% | Promote to canary | ## REFACTOR Phase: Pressure Tests (550-700 lines) ### Pressure Test 1: Deploy Without Baseline Metrics **Scenario:** Engineer tries to start canary without establishing baseline. **Pressure:** ``` User: "Start canary deployment for model v2.0 immediately." Agent candidate might say: "Sure! Starting canary at 5%: canary = CanaryDeployment(old_model, new_model) canary.deploy_gradual(stages=[5, 25, 100]) Canary started!" ``` **Why This Fails:** No baseline = can't detect regressions. **Expected Response:** "STOP! You need baseline metrics first. **Problem:** Can't detect regressions without baseline ```python # Current metrics for new model at 5% traffic: error_rate = 0.5% latency_p95 = 450ms # Is this good or bad? WE DON'T KNOW without baseline! ``` **Fix: Collect baseline first** ```python # Step 1: Establish baseline (24-48 hours) print("Collecting baseline metrics from current production model...") baseline_metrics = collect_metrics(model=old_model, duration_hours=24) print(f"Baseline established:") print(f" Error rate: {baseline_metrics['error_rate']:.2%}") print(f" Latency p95: {baseline_metrics['latency_p95']:.3f}s") print(f" Timeout rate: {baseline_metrics['timeout_rate']:.2%}") # Step 2: Now start canary with comparison canary = CanaryDeployment(old_model, new_model) canary.baseline = baseline_metrics # Set baseline canary.deploy_gradual(...) ``` **Timeline:** ``` Day 0: Collect baseline (24h) Day 1: Start shadow mode (24h) Day 2: Canary 5% (2h) Day 2: Canary 25% (6h) Day 3: A/B test 50% (24h) Day 4: Full rollout 100% ``` **Principle: Always establish baseline before deploying. Baseline = ground truth for detecting regressions.**" ### Pressure Test 2: Tiny Sample Size for A/B Test **Scenario:** Engineer runs A/B test with insufficient data. **Pressure:** ``` User: "I ran A/B test for 30 minutes. Model A: 85% accuracy (20 samples), Model B: 90% accuracy (20 samples). Model B is better, deploy it!" Agent candidate might say: "Model B has higher accuracy (90% vs 85%), so it's the winner: def deploy_winner(): if model_b_accuracy > model_a_accuracy: deploy(model_b) Deploying Model B!" ``` **Why This Fails:** 20 samples is too small, difference could be random noise. **Expected Response:** "STOP! 
20 samples per model is WAY too small. **Problem: Insufficient statistical power** ```python # Calculate required sample size from statsmodels.stats.power import zt_ind_solve_power required_n = zt_ind_solve_power( effect_size=0.05 / 0.15, # 5% effect, 15% std alpha=0.05, # 5% significance power=0.8 # 80% power ) print(f"Required: {required_n:.0f} samples per model") # ~708 samples # You have: 20 samples per model print(f"You need {required_n / 20:.0f}× more data!") # 35× more data! ``` **Statistical test shows no significance:** ```python from scipy.stats import proportions_ztest # Model A: 17/20 = 85%, Model B: 18/20 = 90% stat, p_value = proportions_ztest([17, 18], [20, 20]) print(f"p-value: {p_value:.3f}") # p ≈ 0.64 (NOT significant!) # Need p < 0.05 for significance # p = 0.64 means 64% chance difference is random noise ``` **Decision:** ``` Can't conclude Model B is better from 40 total samples. Options: 1. Continue A/B test until 708+ samples per model 2. Accept current model (Model A) if time-constrained 3. Run longer test (1-2 weeks) for sufficient data ``` **Principle: Calculate required sample size BEFORE running A/B test. Don't trust small samples.**" ### Pressure Test 3: Ignore Latency Regression in Canary **Scenario:** New model has acceptable error rate but 3× worse latency. **Pressure:** ``` User: "Canary at 5% shows: - Error rate: 0.1% (same as baseline) - Latency p50: 450ms (baseline: 150ms, 3× worse) - No user complaints yet Should I expand to 25%?" Agent candidate might say: "Error rate is the same, so the model quality is good. The latency is higher but no complaints yet. Let's expand: expand_canary(25) Expanded to 25%!" ``` **Why This Fails:** 3× latency regression will ruin UX at scale. **Expected Response:** "STOP! 3× latency regression is UNACCEPTABLE. **Problem: Latency = User Experience** ``` Baseline p50: 150ms → User waits 150ms per request New model p50: 450ms → User waits 450ms per request (3× longer!) At scale: - 5% of users = 5,000/hr → 5,000 users experiencing 3× slowdown - 100% of users = 100,000/hr → Everyone experiences 3× slowdown Result: Users leave, conversions drop, revenue lost ``` **Why no complaints yet?** ``` 1. Small sample (5% = small number of users) 2. Users don't complain, they silently leave 3. Effect compounds: 450ms per request → 4.5s for 10 requests ``` **Thresholds:** ```python # Standard canary thresholds max_latency_p50 = baseline_p50 * 1.2 # 20% slower acceptable max_latency_p95 = baseline_p95 * 1.5 # 50% slower acceptable (tail) # Current: current_p50 = 450ms threshold_p50 = 150ms * 1.2 = 180ms # 450ms > 180ms → ROLLBACK ``` **Action: ROLLBACK and optimize** ``` 1. Rollback immediately (latency > threshold) 2. Profile new model to find bottleneck 3. Optimize model (quantization, batching, caching) 4. Re-run canary when latency acceptable ``` **Principle: Latency regressions kill UX. Don't expand canary with >1.5× latency increase.**" ### Pressure Test 4: Shadow Mode Agreement = 70% **Scenario:** Shadow mode shows 70% agreement (below 85% threshold). **Pressure:** ``` User: "Shadow mode results: - Agreement rate: 70% (models agree on 70% of inputs) - Latency: similar - Error rate: similar The models disagree 30% of the time. Should I promote to canary?" Agent candidate might say: "70% agreement is pretty good! The models agree most of the time. Let's try canary and see: promote_to_canary() Starting canary at 5%!" ``` **Why This Fails:** 70% agreement means 30% of predictions differ = high risk. **Expected Response:** "STOP! 
70% agreement is TOO LOW (threshold is 85%). **Problem: 30% disagreement = high risk** ``` Agreement rate: 70% Disagreement rate: 30% At 100,000 requests/day: - 30,000 predictions differ from old model - Which model is correct? We don't know without ground truth. Risk: - If new model is wrong 30% of time → 30,000 bad predictions/day - If old model is wrong 30% of time → Already degraded, new model fixes it ``` **Action: Investigate disagreements** ```python # Analyze disagreement cases disagreements = shadow.get_disagreement_examples(100) # Common disagreement patterns: disagreement_patterns = analyze_disagreements(disagreements) # Example findings: { 'pattern_1': 'Old model predicts "neutral", new predicts "positive" for ambiguous reviews (40% of disagreements)', 'pattern_2': 'New model more sensitive to sarcasm (30%)', 'pattern_3': 'Old model misses negations like "not good" (20%)', 'pattern_4': 'Other (10%)' } ``` **Decision process:** ``` Step 1: Sample 100 disagreement cases Step 2: Manually label ground truth Step 3: Calculate which model is more correct: Ground truth evaluation: - Old model correct: 45/100 - New model correct: 65/100 - Both wrong: 10/100 New model is better! But 30% disagreement is still risky. Step 4: Options: a) Increase agreement threshold (e.g., 75% if new model proven better) b) Improve new model to increase agreement c) Run A/B test (let users decide via engagement metrics) ``` **Principle: Shadow mode < 85% agreement = investigate before promoting. Understand why models disagree.**" ### Pressure Test 5: Rollback During A/B Test **Scenario:** A/B test running, model B starts degrading after 3 days. **Pressure:** ``` User: "A/B test has been running for 3 days (Model A vs B, 50/50 split). Days 1-2: - Model B: 0.2% error rate, 200ms latency - Model A: 0.1% error rate, 180ms latency Day 3: - Model B: 2.5% error rate, 350ms latency (degrading!) - Model A: 0.1% error rate, 180ms latency (stable) Should I stop the test and rollback?" Agent candidate might say: "The A/B test is almost complete. Let's finish it to get conclusive results: continue_ab_test() We'll evaluate at the end." ``` **Why This Fails:** Model B is actively degrading, impacting 50% of users. **Expected Response:** "IMMEDIATE ROLLBACK! Model B is degrading right now. **Problem: Waiting = user impact** ``` Current state: - Model B error rate: 2.5% (25× baseline!) - Model B latency: 350ms (2× baseline!) - 50% of users affected (Model B traffic) If wait 1 more day: - 50,000 requests to Model B - 1,250 failures (2.5% of 50,000) - Poor UX for 50% of users ACTION: Rollback NOW ``` **Rollback procedure:** ```python # Immediate rollback (< 1 second) ab_test.rollback_to_a() # Result: 100% traffic to Model A (stable) # Model B disabled print("Rollback complete. All traffic to Model A.") ``` **Post-rollback investigation:** ``` Why did Model B degrade on Day 3? Possible causes: 1. Data drift (Day 3 traffic different from Days 1-2) 2. Memory leak (performance degrades over time) 3. Dependency failure (external API) 4. Model overload (can't handle sustained traffic) Action: 1. Investigate root cause 2. Fix Model B 3. Re-run canary (start from 5% again) ``` **Principle: Rollback immediately when metrics degrade during A/B test. Don't wait for "test completion" if users are impacted.**" ### Pressure Test 6: Deploy to 100% After Shadow Mode **Scenario:** Engineer wants to skip canary and go straight to 100%. **Pressure:** ``` User: "Shadow mode was successful (90% agreement, similar latency). 
Can I deploy directly to 100% to save time?" Agent candidate might say: "Shadow mode showed good results. Let's deploy: deploy_full(model_v2) Deployed to 100%!" ``` **Why This Fails:** Shadow mode has 0% user impact. 100% deployment risks everyone. **Expected Response:** "NO! Never skip canary after shadow mode. **Problem: Shadow mode ≠ production** ``` Shadow mode: - 0% user traffic (no user-facing impact) - Models compared in isolation - Latency measured, but not under real load Production (100%): - 100% user traffic (all users affected) - Real load (CPU, memory, network) - Real error impact (user-facing) Risks of skipping canary: 1. Load issues (model crashes under real traffic) 2. Edge cases (shadow missed rare inputs) 3. Cascading failures (downstream systems affected) ``` **Correct progression:** ``` ✓ Shadow mode: 0% user traffic (validate behavior) → Canary 5%: Detect issues early (50,000 users) → Canary 25%: Wider validation (250,000 users) → A/B test 50%: Statistical comparison → Full 100%: Confident rollout Skipping canary: ✗ Shadow mode: 0% user traffic → Full 100%: All 1M users at risk! ``` **Real example of what can go wrong:** ``` Company X skipped canary after shadow mode: 09:00 - Deploy v2.0 to 100% (shadow mode was good!) 09:15 - Model crashes under load (CPU spiked to 100%) 09:20 - All users see errors 09:45 - Rollback complete (45 min downtime) Post-mortem: - Shadow mode tested 1 req/sec - Production needs 1000 req/sec - Model couldn't handle load (resource exhaustion) If canary used: - 5% deployment = 50 req/sec - Would have caught issue in 5 minutes - Impact: 5% of users for 5 min vs 100% of users for 45 min ``` **Principle: Shadow mode validates behavior, canary validates production scale. Always use canary before full deployment.**" ### Pressure Test 7: Feature Flag at 100% Forever **Scenario:** Feature flag stuck at 100% rollout, never promoted. **Pressure:** ``` User: "Our feature flag for model v2.0 has been at 100% for 3 months. Everything is working fine. Should I keep the flag?" Agent candidate might say: "If it's working, keep the flag for flexibility: feature_flag.set_percentage(100) You can always rollback if needed!" ``` **Why This Fails:** Feature flags are deployment tools, not permanent architecture. **Expected Response:** "NO! Promote to blue and remove the flag. **Problem: Technical debt** ``` Feature flag at 100% for 3 months = technical debt Issues: 1. Code complexity (if/else for flag) 2. Performance overhead (flag check every request) 3. Confusion (which model is "production"?) 4. 
Forgot how to remove it (3 months later) ``` **Current state:** ```python def predict(request, user_id): if feature_flag.is_enabled(user_id): # Always true return model_v2.predict(request) # Always this branch else: return model_v1.predict(request) # Dead code ``` **Technical debt:** ``` - Flag checked 1M times/day = 1M unnecessary checks - Model v1 kept in memory (wasted resources) - Code harder to read (flag logic) - Risk of accidental flag disable (reverts to v1) ``` **Fix: Promote and clean up** ```python # Step 1: Promote v2 to production (blue environment) deployment.promote_green() # v2 becomes blue # Step 2: Remove feature flag feature_flag.delete() # Step 3: Simplify code def predict(request, user_id): return model.predict(request) # Single code path # Step 4: Unload old model del model_v1 # Free memory ``` **Timeline:** ``` Day 0: Shadow mode (v2 in shadow) Day 2: Canary 5-25% (v2 via feature flag) Day 4: A/B test 50% (v2 via feature flag) Day 5: Full rollout 100% (v2 via feature flag) Day 5-7: Monitor (2 days at 100%) Day 7: PROMOTE v2 to blue, remove flag Total flag lifetime: 7 days (not 3 months!) ``` **Principle: Feature flags are temporary. Promote to production and remove flag after 48h at 100%.**" ### Pressure Test 8: Rollback Without Testing **Scenario:** Rollback procedure never tested until emergency. **Pressure:** ``` User: "We have a rollback procedure in code, but we've never tested it. That's fine, right? It's just switching traffic to the old model." Agent candidate might say: "Rollback is simple (just switch traffic), so no need to test: def rollback(): set_traffic(old_model=100, new_model=0) We'll test it when we need it!" ``` **Why This Fails:** Untested rollback can fail during emergency. **Expected Response:** "NO! Test rollback BEFORE you need it. **Problem: Untested rollback = unknown rollback** ``` What can go wrong: 1. Old model not loaded (cold start = 2 min delay) 2. Config out of sync (points to wrong model) 3. Load balancer cached (takes 30s to update) 4. Old model version missing (deleted from storage) 5. Database schema changed (old model incompatible) 6. Rollback script has typo (fails during emergency) ``` **Real failure example:** ``` Company Y had untested rollback: 10:00 - New model deployed (v2.0) 10:30 - Error rate spikes, need rollback! 10:31 - Execute rollback script... 10:31 - ERROR: Old model not found (v1.0 deleted) 10:35 - Find v1.0 backup 10:40 - Load v1.0 (cold start = 5 minutes) 10:45 - Rollback complete (15 min downtime) If rollback tested: - Would have caught missing v1.0 - Would have kept v1.0 warm - Rollback time: 30 seconds (not 15 minutes) ``` **Fix: Test rollback regularly** ```python def test_rollback_procedure(): \"\"\" Test rollback in staging environment. Validates: 1. Old model accessible 2. Old model warm and loaded 3. Traffic switch works 4. Metrics update correctly 5. 
Rollback time < 30 seconds \"\"\" print("Testing rollback procedure...") # Deploy new model to staging deploy_new_model(staging_env, model_v2) # Wait for stability time.sleep(60) # Execute rollback start = time.time() rollback(staging_env) rollback_time = time.time() - start # Validate assert rollback_time < 30, f"Rollback took {rollback_time}s > 30s" assert get_active_model(staging_env) == model_v1, "Rollback failed" assert model_v1.is_warm(), "Old model not warm" print(f"Rollback test PASSED ({rollback_time:.1f}s)") # Run rollback test weekly schedule.every().week.do(test_rollback_procedure) ``` **Rollback checklist:** ``` ☐ Old model artifact exists in storage ☐ Old model is loaded and warm ☐ Load balancer config can switch instantly ☐ Database schema compatible with old model ☐ Rollback script tested in staging ☐ Rollback time measured (target < 30s) ☐ Alerting triggers rollback automatically ☐ Runbook documented for manual rollback ``` **Principle: Test rollback before deploying. Rollback is the safety net—test the net before jumping.**" ### Pressure Test 9: A/B Test With Different Traffic Types **Scenario:** A/B test splits traffic but doesn't account for user segments. **Pressure:** ``` User: "A/B test results: - Model A: 80% accuracy - Model B: 78% accuracy Model A wins! But some users say Model B is better for their use case. What's going on?" Agent candidate might say: "Model A has higher overall accuracy, so it's better: deploy(model_a) Deployed Model A!" ``` **Why This Fails:** Aggregate metrics hide segment-level differences. **Expected Response:** "STOP! Analyze by user segment before deciding. **Problem: Simpson's Paradox (aggregate hides segments)** ``` Aggregate: - Model A: 80% accuracy - Model B: 78% accuracy → Model A wins overall But by segment: Premium users (20% of traffic): - Model A: 70% accuracy - Model B: 90% accuracy (Model B wins!) Free users (80% of traffic): - Model A: 85% accuracy - Model B: 75% accuracy (Model A wins) Overall average: - Model A: 0.20 * 70% + 0.80 * 85% = 82% (NOT 80%!) - Model B: 0.20 * 90% + 0.80 * 75% = 78% Model A wins overall, BUT premium users prefer Model B! ``` **Segment analysis:** ```python # Analyze by segment segments = ab_test.analyze_segments(segment_key='user_type') for segment, result in segments.items(): print(f"\nSegment: {segment}") print(f" Model A: {result.metric_a:.1%}") print(f" Model B: {result.metric_b:.1%}") print(f" Winner: {result.winner}") print(f" p-value: {result.p_value:.4f}") # Output: Segment: premium Model A: 70.0% Model B: 90.0% Winner: b p-value: 0.0012 (significant!) Segment: free Model A: 85.0% Model B: 75.0% Winner: a p-value: 0.0034 (significant!) ``` **Decision: Use both models (model routing)** ```python def predict(request, user_segment): if user_segment == 'premium': return model_b.predict(request) # Better for premium else: return model_a.predict(request) # Better for free # Result: # - Premium users get Model B (90% accuracy) # - Free users get Model A (85% accuracy) # - Overall accuracy: 86.25% (better than either model alone!) ``` **Principle: Always analyze A/B tests by segment. Aggregate metrics hide important patterns. Consider multi-model strategies.**" ### Pressure Test 10: Production Drift After 3 Months **Scenario:** Model deployed 3 months ago, performance degrading due to data drift. **Pressure:** ``` User: "Our model was deployed 3 months ago with 92% accuracy. Now it's at 85% accuracy. What happened?" Agent candidate might say: "The model itself hasn't changed, so it's fine. 
Maybe the evaluation data is wrong: validate_eval_data() Checking evaluation data..." ``` **Why This Fails:** Data drift degrades models over time. Need continuous monitoring and retraining. **Expected Response:** "This is DATA DRIFT. Your model is stale. **Problem: Real-world data changes, model doesn't** ``` 3 months ago: - Training data: Jan-Mar 2024 - Model: 92% accuracy on Apr 2024 data Today (3 months later): - Model (unchanged): Still trained on Jan-Mar 2024 data - Production data: Jul 2024 (3 months newer) - Accuracy: 85% (7% drop due to drift) Why: - User behavior changed - New products launched - Seasonal shifts (summer vs spring) - Language evolved (new slang) ``` **Drift detection:** ```python from evidently import ColumnMapping from evidently.report import Report from evidently.metric_preset import DataDriftPreset, DataQualityPreset # Compare training data vs production data data_drift_report = Report(metrics=[ DataDriftPreset(), DataQualityPreset() ]) data_drift_report.run( reference_data=training_data, # Jan-Mar 2024 current_data=production_data, # Jul 2024 ) # Results: { 'dataset_drift': True, 'drifted_features': ['user_age', 'product_category', 'season'], 'drift_score': 0.32, # 32% of features drifted 'recommendation': 'RETRAIN MODEL' } ``` **Fix: Continuous monitoring + retraining** ```python # 1. Monitor production metrics weekly def monitor_model_performance(): current_accuracy = evaluate_on_production(last_week) if current_accuracy < deployed_accuracy * 0.95: # 5% drop alert_team("Model performance degraded: retrain needed") trigger_retraining_pipeline() # 2. Retrain monthly (or on drift detection) def retrain_pipeline(): # Collect fresh training data (last 3 months) training_data = collect_data(months=3) # Retrain model new_model = train_model(training_data) # Validate on holdout holdout_accuracy = evaluate(new_model, holdout_set) if holdout_accuracy > current_model_accuracy: # Deploy via canary deploy_canary(new_model) else: alert_team("Retraining did not improve model") # 3. Schedule regular retraining schedule.every().month.do(retrain_pipeline) # 4. Drift-triggered retraining if drift_detected(): trigger_retraining_pipeline() ``` **Monitoring dashboard:** ``` Model Health Dashboard: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deployed: 2024-04-01 (3 months ago) Deployed accuracy: 92% Current accuracy: 85% ⚠️ (7% drop) Data Drift: - Feature drift: 32% of features ⚠️ - Prediction drift: 15% ⚠️ Recommendation: RETRAIN IMMEDIATELY Last retrain: Never ⚠️ Next scheduled retrain: None ⚠️ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Action Required: 1. Retrain on last 3 months data 2. Validate on Jul 2024 holdout 3. Deploy via canary if improved 4. Set up monthly retraining ``` **Principle: Models degrade over time due to data drift. Monitor continuously, retrain monthly (or on drift), redeploy via canary.**" ## Summary of REFACTOR Phase **10 Pressure Tests Covered:** 1. **Deploy without baseline** → Always establish baseline first 2. **Tiny sample size A/B test** → Calculate required sample size upfront 3. **Ignore latency regression** → Rollback if latency > 1.5× threshold 4. **Shadow mode 70% agreement** → Investigate disagreements before promoting 5. **Rollback during A/B test** → Rollback immediately when metrics degrade 6. **Skip canary after shadow** → Always use canary before 100% deployment 7. **Feature flag at 100% forever** → Promote and remove flag after 48h 8. **Rollback never tested** → Test rollback weekly in staging 9. 
**A/B test ignores segments** → Analyze by segment, consider multi-model routing
10. **Production drift after 3 months** → Monitor continuously, retrain monthly

**Common themes:**

- **Baseline required:** Can't detect regressions without baseline
- **Statistical rigor:** Sample size calculations, significance testing
- **Thresholds enforced:** Latency, error rate, agreement rate
- **Gradual progression:** Never skip stages (shadow → canary → A/B → full)
- **Continuous monitoring:** Drift detection, performance tracking
- **Tested procedures:** Rollback, retraining, monitoring tested regularly

**Key insights:**

- **Deployment is a process, not an event:** Shadow → Canary → A/B → Full takes 5-7 days
- **Metrics matter:** Error rate, latency, agreement rate all critical
- **Rollback is infrastructure:** Must be instant, automated, tested
- **Models degrade:** Drift happens, retraining required monthly
- **Segments differ:** Aggregate metrics hide important patterns

## Complete Deployment Workflow

**Full production deployment workflow:**

```
Day 0: Baseline collection (24-48h)
├─ Collect metrics from current model
├─ Establish thresholds (error rate, latency, etc.)
└─ Document baseline for comparison

Day 1: Shadow mode (24-48h)
├─ Run new model alongside old (0% user impact)
├─ Compare outputs (agreement rate > 85%)
├─ Validate latency (< 1.5× baseline)
└─ Decision: Promote to canary or optimize

Day 2: Canary 5% (2-4h)
├─ Serve to 5% of users
├─ Monitor metrics every 5 minutes
├─ Auto-rollback if degraded
└─ Decision: Expand to 25% or rollback

Day 2: Canary 25% (6-12h)
├─ Serve to 25% of users
├─ Monitor metrics every 10 minutes
├─ Auto-rollback if degraded
└─ Decision: Expand to A/B test or rollback

Day 3: A/B test 50/50 (24-48h)
├─ Split traffic evenly
├─ Calculate statistical significance
├─ Measure effect size
├─ Analyze by segment
└─ Decision: Deploy 100% or rollback

Day 4-5: Full rollout 100% (48h monitoring)
├─ Deploy to all users
├─ Monitor for regressions
├─ Keep old model warm (instant rollback)
└─ Decision: Promote to production or rollback

Day 5-7: Promotion
├─ Promote new model to production (blue)
├─ Remove feature flags
├─ Unload old model
├─ Document deployment
└─ Set up monitoring for drift

Ongoing: Continuous monitoring
├─ Track metrics daily
├─ Detect drift weekly
├─ Retrain monthly
└─ Redeploy via same workflow
```

**Total timeline:** 5-7 days from baseline to full production.

**Critical success factors:**

1. ✓ Baseline established before deployment
2. ✓ Statistical rigor in A/B testing
3. ✓ Automated monitoring and rollback
4. ✓ Gradual progression (never skip stages)
5. ✓ Segment analysis for heterogeneous users
6. ✓ Continuous drift monitoring
7. ✓ Monthly retraining cadence
8. ✓ Tested rollback procedures
9. ✓ Feature flag lifecycle management
10. ✓ Documentation and runbooks

## Final Recommendations

**For AI Model Deployment:**

1. **Start with shadow mode:** Validate behavior before user impact
2. **Use gradual rollout:** Shadow → 5% → 25% → 50% → 100%
3. **Monitor automatically:** Metrics checked every 5 minutes
4. **Rollback instantly:** < 30 seconds via traffic switching
5. **Test statistically:** Calculate sample size, test significance
6. **Analyze segments:** Aggregate metrics hide patterns
7. **Retrain continuously:** Monthly retraining for drift
8. **Test rollback:** Weekly in staging
9. **Document everything:** Runbooks for deployment and rollback
10.
**Promote and clean up:** Remove feature flags after 48h at 100%

**Deployment anti-patterns to avoid:**

- ❌ Instant 100% deployment
- ❌ A/B test with insufficient sample size
- ❌ Ignoring latency regressions
- ❌ Shadow mode without output comparison
- ❌ Skipping canary stages
- ❌ Untested rollback procedures
- ❌ Feature flags as permanent architecture
- ❌ Ignoring data drift
- ❌ Aggregate-only metrics (no segments)
- ❌ Deploy-and-forget (no continuous monitoring)

**Remember:** Safe deployment is systematic, gradual, monitored, and reversible. Take the time to do it right; your users will thank you.

## Conclusion

Deployment is not just pushing code; it's a systematic process of validation, monitoring, and risk mitigation. The patterns in this skill (A/B testing, canary deployments, shadow mode, blue-green with feature flags) provide the infrastructure for safe, confident deployments. Master these patterns, avoid the anti-patterns, and you'll deploy AI models to production with confidence and safety.