---
name: testing-in-production
description: Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks
---

# Testing in Production

## Overview

**Core principle:** Minimize blast radius, maximize observability, always have a kill switch.

**Rule:** Testing in production is safe when you control exposure and can roll back instantly.

**Regulated industries (healthcare, finance, government):** Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult the compliance team before your first production test.

## Technique Selection Decision Tree

| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|-----------|----------------|----------------------|-----|
| Test feature with specific users | Low | Feature flag service | **Feature Flags** |
| Validate deployment safety | Medium | Load balancer, multiple instances | **Canary Deployment** |
| Compare old vs new performance | Low | Traffic duplication | **Shadow Traffic** |
| Measure business impact | Medium | A/B testing framework, analytics | **A/B Testing** |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | **Dark Launch** |

**First technique:** Feature flags (lowest infrastructure requirement, highest control)

## Anti-Patterns Catalog

### ❌ Nested Feature Flags

**Symptom:** Flags controlling other flags, creating combinatorial complexity

**Why bad:** 2^N combinations to test, impossible to validate all paths, technical debt accumulates

**Fix:** Maximum 1 level of flag nesting, delete flags after rollout

```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            # 8 possible combinations for 3 flags
            ...

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for full feature
    return new_checkout_with_all_options()
```

---

### ❌ Canary Without Session Affinity

**Symptom:** Users switch between old and new versions across requests because routing is decided independently per request

**Why bad:** Inconsistent experience, state corruption, false-negative metrics

**Fix:** Route each user to the same version for the entire session

```nginx
# ✅ Good - Consistent routing
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}
```

---

### ❌ No Statistical Validation

**Symptom:** Making rollout decisions on small sample sizes without confidence intervals

**Why bad:** Random variance mistaken for real effects, premature rollback or expansion

**Fix:** Minimum sample size, statistical significance testing

```python
# ✅ Good - Statistical validation
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"

    # Two-proportion z-test on error counts vs. sample sizes
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)]
    )
    # Treat "no significant difference in error rate" as safe to roll out
    return p_value > 0.05, f"p-value: {p_value}"
```

---

### ❌ Testing Without Rollback

**Symptom:** Deploying feature flags or canaries without an instant kill switch

**Why bad:** When issues are detected, you can't stop the impact immediately

**Fix:** Kill switch tested before the first production test (a minimal sketch follows)
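A minimal sketch of what that kill switch can look like, assuming the same generic `feature_flags` client used elsewhere in this document; the function and flag names are illustrative:

```python
# Illustrative names - adapt to your flag provider and services
def handle_checkout(user_id):
    # The rollout flag itself is the kill switch: turning it off in the
    # flag dashboard reverts all traffic to the old path with no deploy.
    try:
        use_new_path = feature_flags.enabled("new-checkout", user_id)
    except Exception:
        use_new_path = False  # Flag service unreachable: fail closed to old behavior
    if use_new_path:
        return new_checkout_flow(user_id)
    return old_checkout_flow(user_id)
```

Rehearse the flip (and confirm traffic actually moves back) before the experiment starts, not during the first incident.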
---

### ❌ Insufficient Monitoring

**Symptom:** Monitoring only error rates, missing business/user metrics

**Why bad:** Technical success but business failure (e.g., lower conversion)

**Fix:** Monitor technical + business + user experience metrics

## Blast Radius Control Framework

**Progressive rollout schedule:**

| Phase | Exposure | Duration | Abort If | Continue If |
|-------|----------|----------|----------|-------------|
| **1. Internal** | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| **2. Canary** | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| **3. Small** | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **4. Medium** | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **5. Majority** | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| **6. Full** | 100% production | Monitor indefinitely | Business metrics drop | Clean up old code |

**Minimum dwell time:** Each phase needs a minimum observation period to catch delayed issues

**Rollback at any phase:** If metrics degrade, revert to the previous phase

## Kill Switch Criteria

**Immediate rollback triggers (automated):**

| Metric | Threshold | Why |
|--------|-----------|-----|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |

**Warning triggers (manual investigation):**

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | > 3 similar reports | Halt rollout, investigate |

**Statistical validation:**

```python
# Sample size for 95% confidence, 80% power
# Minimum 1000 samples per variant for most A/B tests
# For low-traffic features: wait 24-48 hours regardless
```

## Monitoring Quick Reference

**Required metrics (all tests):**

| Category | Metrics | Alert Threshold |
|----------|---------|-----------------|
| **Errors** | Error rate, exception count, 5xx responses | > +5% vs baseline |
| **Performance** | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| **Business** | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| **User Experience** | Client errors, page load, bounce rate | > +20% vs baseline |

**Baseline calculation:**

```python
# Collect baseline from the previous 7-14 days
import numpy as np

baseline_p99 = np.percentile(historical_latencies, 99)
current_p99 = np.percentile(current_latencies, 99)

if current_p99 > baseline_p99 * 1.5:  # 50% increase
    rollback()
```

## Implementation Patterns

### Feature Flags Pattern

```python
# Using LaunchDarkly, Split.io, or similar (LaunchDarkly's Python SDK shown)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).build()
    if client.variation("new-checkout", context, False):  # Default to old behavior
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)
```

**Best practices:**

- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging (see the sketch below)
- Delete flags within 30 days of full rollout
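A small sketch combining the safe-default and logging practices in one wrapper, assuming the `client` from the pattern above; the wrapper name and log format are illustrative:

```python
import logging

logger = logging.getLogger("feature_flags")

def flag_enabled(flag_key, context, default=False):
    """Evaluate a flag with a safe default and an audit log line."""
    try:
        value = client.variation(flag_key, context, default)
    except Exception:
        # Flag service unreachable: fall back to old behavior
        logger.exception("flag evaluation failed: %s", flag_key)
        return default
    logger.info("flag %s evaluated to %s for %s", flag_key, value, context.key)
    return value
```

Logged evaluations let you reconstruct exactly who saw the new path when an incident is investigated.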
### Canary Deployment Pattern

```yaml
# Kubernetes with Istio
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: my-service
        subset: v2
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 95
    - destination:
        host: my-service
        subset: v2
      weight: 5
```

### Shadow Traffic Pattern

```python
# Duplicate requests to the new service, ignore responses
import asyncio

async def handle_request(request):
    # Primary: serve the user from the old service
    response = await old_service(request)

    # Shadow: send to the new service, don't wait
    asyncio.create_task(new_service(request.copy()))  # Fire and forget

    return response  # User sees the old service response
```

## Tool Ecosystem Quick Reference

| Tool Category | Options | When to Use |
|---------------|---------|-------------|
| **Feature Flags** | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| **Canary/Blue-Green** | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| **A/B Testing** | Optimizely, VWO, Google Optimize | Business metric validation |
| **Observability** | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| **Statistical Analysis** | Statsig, Eppo, GrowthBook | Automated significance testing |

**Recommendation for starting:** Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability

## Your First Production Test

**Goal:** Safely test a small feature with feature flags

**Week 1: Setup**

1. **Choose feature flag tool**
   - Self-hosted: Flagsmith (free, open source)
   - SaaS: LaunchDarkly (free tier: 1000 MAU)

2. **Instrument code**

   ```python
   if feature_flags.enabled("my-first-test", user_id):
       return new_feature(user_id)
   else:
       return old_feature(user_id)
   ```

3. **Set up monitoring**
   - Error rate dashboard
   - Latency percentiles (p50, p95, p99)
   - Business metric (conversion, completion rate)

4. **Define rollback criteria** (a sketch of automating these checks follows this list)
   - Error rate > +5%
   - p99 latency > +50%
   - Business metric < -10%
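A minimal sketch of step 4 as an automated check, reading each threshold as a relative change against the 7-14 day baseline; the `get_metric` helper and metric names are hypothetical:

```python
def should_rollback(get_metric):
    """get_metric(name) -> (current_value, baseline_value); hypothetical helper."""
    error_now, error_base = get_metric("error_rate")
    p99_now, p99_base = get_metric("latency_p99")
    conv_now, conv_base = get_metric("conversion_rate")

    return (
        error_now > error_base * 1.05   # Error rate more than +5% vs baseline
        or p99_now > p99_base * 1.50    # p99 latency more than +50% vs baseline
        or conv_now < conv_base * 0.90  # Business metric down more than 10%
    )
```

Run this on the hourly check during the canary phase below; if it returns `True`, flip the kill switch first and investigate afterwards.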
**Week 2: Test Execution**

**Day 1-2:** Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback

**Day 3-5:** Canary (1% of users)
- Enable for 1% random sample
- Monitor metrics every hour
- Rollback if any threshold exceeded

**Day 6-8:** Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring

**Day 9-14:** Full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%

**Week 3: Cleanup**
- Remove flag from code
- Archive flag in dashboard
- Document learnings

## Common Mistakes

### ❌ Expanding Rollout Too Fast

**Fix:** Follow minimum dwell times (24 hours per phase)

---

### ❌ Monitoring Only After Issues

**Fix:** Dashboard ready before first rollout, alerts configured

---

### ❌ No Rollback Practice

**Fix:** Test rollback in staging before production

---

### ❌ Ignoring Business Metrics

**Fix:** Technical metrics AND business metrics required for go/no-go decisions

## Quick Reference

**Technique Selection:**
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing

**Blast Radius Progression:** Internal → 1% → 5% → 25% → 50% → 100%

**Kill Switch Thresholds:**
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%

**Minimum Sample Sizes:**
- A/B test: 1000 samples per variant
- Canary: 24 hours observation

**Tool Recommendations:**
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana

## Bottom Line

**Production testing is safe with three controls: exposure limits, observability, instant rollback.**

Start with feature flags, use progressive rollout (1% → 5% → 25% → 100%), monitor technical + business metrics, and always have a kill switch.
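To make that concrete, a closing sketch of the progressive rollout loop under those three controls; `set_exposure`, `collect_metrics_for`, and `metrics_degraded` are hypothetical helpers, and the schedule loosely mirrors the blast radius table above:

```python
# Exposure percentage and minimum dwell time in hours per phase
# (internal dogfooding happens before this loop)
PHASES = [(1, 24), (5, 48), (25, 72), (50, 168), (100, 168)]

def progressive_rollout(flag_key):
    for exposure, dwell_hours in PHASES:
        set_exposure(flag_key, exposure)            # Raise the flag's rollout percentage
        metrics = collect_metrics_for(dwell_hours)  # Observe for the minimum dwell time
        if metrics_degraded(metrics):               # Kill switch criteria from above
            set_exposure(flag_key, 0)               # Instant rollback: 0% exposure
            return False
    return True  # Stable at 100%: schedule flag cleanup within 30 days
```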