---
name: testing-in-production
description: Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks
---
# Testing in Production
## Overview
**Core principle:** Minimize blast radius, maximize observability, always have a kill switch.
**Rule:** Testing in production is safe when you control exposure and can roll back instantly.
**Regulated industries (healthcare, finance, government):** Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult compliance team before first production test.
## Technique Selection Decision Tree
| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|-----------|----------------|----------------------|-----|
| Test feature with specific users | Low | Feature flag service | **Feature Flags** |
| Validate deployment safety | Medium | Load balancer, multiple instances | **Canary Deployment** |
| Compare old vs new performance | Low | Traffic duplication | **Shadow Traffic** |
| Measure business impact | Medium | A/B testing framework, analytics | **A/B Testing** |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | **Dark Launch** |
**First technique:** Feature flags (lowest infrastructure requirement, highest control)
## Anti-Patterns Catalog
### ❌ Nested Feature Flags
**Symptom:** Flags controlling other flags, creating combinatorial complexity
**Why bad:** 2^N combinations to test, impossible to validate all paths, technical debt accumulates
**Fix:** Maximum 1 level of flag nesting, delete flags after rollout
```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            ...  # 8 possible combinations for 3 flags

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for the full feature
    return new_checkout_with_all_options()
```
---
### ❌ Canary Without Sticky Routing
**Symptom:** Users bounce between old and new versions across requests because routing ignores session affinity
**Why bad:** Inconsistent experience, state corruption, false negative metrics
**Fix:** Route user to same version for entire session
```nginx
# ✅ Good - Consistent routing
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}
```
---
### ❌ No Statistical Validation
**Symptom:** Making rollout decisions on small sample sizes without confidence intervals
**Why bad:** Random variance mistaken for real effects, premature rollback or expansion
**Fix:** Minimum sample size, statistical significance testing
```python
# ✅ Good - Statistical validation
# Note: proportions_ztest lives in statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    """control_errors / treatment_errors: arrays of 0/1 error indicators per request."""
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"
    # Two-proportion z-test on error rates
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)],
    )
    # Safe only if there is no statistically significant difference in error rate
    return p_value > 0.05, f"p-value: {p_value:.4f}"
```
---
### ❌ Testing Without Rollback
**Symptom:** Deploying feature flags or canaries without instant kill switch
**Why bad:** When issues detected, can't stop impact immediately
**Fix:** Kill switch tested before first production test
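A minimal kill-switch sketch, assuming a hypothetical `override_store` (e.g. Redis or a config service) and a generic `flags` client; the point is that a single value, checked before the experiment flag, reverts all traffic instantly and that flag-service failures fall back to the old behavior:
```python
def enabled_with_kill_switch(flags, flag_name, user_id, override_store):
    # Explicit override beats everything: set once, all traffic reverts on the next request
    if override_store.get(f"kill:{flag_name}"):
        return False
    try:
        return flags.enabled(flag_name, user_id)
    except Exception:
        return False  # flag-service failure -> fail closed to old behavior
```
Rehearse flipping the override in staging so the rollback path is proven before the first production test.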
---
### ❌ Insufficient Monitoring
**Symptom:** Monitoring only error rates, missing business/user metrics
**Why bad:** Technical success but business failure (e.g., lower conversion)
**Fix:** Monitor technical + business + user experience metrics
## Blast Radius Control Framework
**Progressive rollout schedule:**
| Phase | Exposure | Duration | Abort If | Continue If |
|-------|----------|----------|----------|-------------|
| **1. Internal** | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| **2. Canary** | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| **3. Small** | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **4. Medium** | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **5. Majority** | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| **6. Full** | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |
**Minimum dwell time:** Each phase needs a minimum observation period to catch delayed issues
**Rollback at any phase:** If metrics degrade, revert to previous phase
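The schedule above can be captured as data so one gate function applies the same logic at every phase. A sketch using the thresholds from the table (names and structure are illustrative; the internal and 100% phases use the qualitative criteria listed above):
```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    traffic_pct: int
    min_dwell_hours: int     # minimum observation before advancing
    max_error_delta: float   # abort if error rate rises more than this vs baseline
    max_latency_delta: float # abort if p99 latency rises more than this vs baseline

SCHEDULE = [
    Phase("canary",    1,  4, 0.02, 0.10),
    Phase("small",     5, 24, 0.05, 0.25),
    Phase("medium",   25, 48, 0.05, 0.25),
    Phase("majority", 50, 72, 0.05, 0.25),
]

def gate(phase, hours_in_phase, error_delta, latency_delta):
    """Return 'abort', 'hold', or 'advance' for the current phase."""
    if error_delta > phase.max_error_delta or latency_delta > phase.max_latency_delta:
        return "abort"    # revert to the previous phase
    if hours_in_phase < phase.min_dwell_hours:
        return "hold"     # respect the minimum dwell time
    return "advance"
```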
## Kill Switch Criteria
**Immediate rollback triggers (automated):**
| Metric | Threshold | Why |
|--------|-----------|-----|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |
**Warning triggers (manual investigation):**
| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | >3 similar reports | Halt rollout, investigate |
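These trigger tables translate directly into an automated decision. A sketch, assuming error and 5xx rates are fractions, latencies are in milliseconds, "increase above baseline" means percentage points for error rate and a relative ratio for latency, and the metric names are illustrative:
```python
def kill_switch_decision(baseline, current):
    """Classify current metrics vs baseline: ROLLBACK, HALT, WATCH, or CONTINUE."""
    err_delta = current["error_rate"] - baseline["error_rate"]
    p99_ratio = current["p99_ms"] / baseline["p99_ms"]
    p95_ratio = current["p95_ms"] / baseline["p95_ms"]
    biz_drop = (baseline["conversion"] - current["conversion"]) / baseline["conversion"]

    # Immediate rollback triggers (automated)
    if err_delta > 0.05 or p99_ratio > 1.50 or current["rate_5xx"] > 0.001 or biz_drop > 0.10:
        return "ROLLBACK"
    # Warning triggers (manual investigation)
    if err_delta > 0.02:
        return "HALT"   # halt rollout, investigate
    if p95_ratio > 1.25:
        return "WATCH"  # monitor closely
    return "CONTINUE"
```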
**Statistical validation:**
```python
# Sample size for 95% confidence, 80% power
# Minimum 1000 samples per variant for most A/B tests
# For low-traffic features: wait 24-48 hours regardless
```
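To turn those rules of thumb into a concrete number, a power calculation gives the required sample size per variant. A sketch using statsmodels; the baseline rate and the minimum detectable change are illustrative assumptions:
```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.020    # e.g. current 2% error (or conversion) rate
detectable_rate = 0.025  # smallest shift worth detecting

effect_size = abs(proportion_effectsize(baseline_rate, detectable_rate))
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
)
print(f"~{int(n_per_variant)} samples per variant")  # small effects need far more than 1000
```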
## Monitoring Quick Reference
**Required metrics (all tests):**
| Category | Metrics | Alert Threshold |
|----------|---------|-----------------|
| **Errors** | Error rate, exception count, 5xx responses | > +5% vs baseline |
| **Performance** | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| **Business** | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| **User Experience** | Client errors, page load, bounce rate | > +20% vs baseline |
**Baseline calculation:**
```python
# Collect baseline from the previous 7-14 days
import numpy as np

baseline_p99 = np.percentile(historical_latencies, 99)  # latencies from your metrics store
current_p99 = np.percentile(current_latencies, 99)

if current_p99 > baseline_p99 * 1.5:  # 50% increase over baseline
    rollback()  # your kill-switch / rollback hook
```
## Implementation Patterns
### Feature Flags Pattern
```python
# Using LaunchDarkly (Python package: ldclient), Split.io, or similar
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).build()
    if client.variation("new-checkout", context, False):  # default False = old behavior
        return new_checkout_flow(user_id)
    return old_checkout_flow(user_id)
```
**Best practices** (a wrapper sketch follows the list):
- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging
- Delete flags within 30 days of full rollout
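A thin wrapper can enforce the first three practices in one place. A sketch, building on the `client.variation` pattern above (the logger name and helper are illustrative):
```python
import logging

logger = logging.getLogger("feature_flags")

def flag_enabled(client, flag_key, context, default=False):
    """Evaluate a flag, log the result, and fail closed to the old behavior."""
    try:
        value = client.variation(flag_key, context, default)
    except Exception:
        logger.exception("flag evaluation failed: %s", flag_key)
        return default  # flag-service issues never break the request path
    logger.info("flag %s -> %s for context %s", flag_key, value, context.key)
    return value
```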
### Canary Deployment Pattern
```yaml
# Kubernetes with Istio (subsets v1/v2 must be defined in a matching DestinationRule)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: my-service
            subset: v2
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 95
        - destination:
            host: my-service
            subset: v2
          weight: 5
```
### Shadow Traffic Pattern
```python
# Duplicate requests to the new service, ignore its responses
import asyncio

async def handle_request(request):
    # Primary: serve the user from the old service
    response = await old_service(request)
    # Shadow: send a copy to the new service, don't wait for it
    asyncio.create_task(new_service(request.copy()))  # Fire and forget
    return response  # User sees the old service's response
```
## Tool Ecosystem Quick Reference
| Tool Category | Options | When to Use |
|---------------|---------|-------------|
| **Feature Flags** | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| **Canary/Blue-Green** | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| **A/B Testing** | Optimizely, VWO, Google Optimize | Business metric validation |
| **Observability** | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| **Statistical Analysis** | Statsig, Eppo, GrowthBook | Automated significance testing |
**Recommendation for starting:** Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability
## Your First Production Test
**Goal:** Safely test a small feature with feature flags
**Week 1: Setup**
1. **Choose feature flag tool**
- Self-hosted: Flagsmith (free, open source)
- SaaS: LaunchDarkly (free tier: 1000 MAU)
2. **Instrument code**
```python
if feature_flags.enabled("my-first-test", user_id):
    return new_feature(user_id)
else:
    return old_feature(user_id)
```
3. **Set up monitoring**
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
- Business metric (conversion, completion rate)
4. **Define rollback criteria**
- Error rate > +5%
- p99 latency > +50%
- Business metric < -10%
**Week 2: Test Execution**
**Day 1-2:** Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback
**Day 3-5:** Canary (1% of users)
- Enable for 1% random sample
- Monitor metrics every hour
- Rollback if any threshold exceeded
**Day 6-8:** Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring
**Day 9-14:** Ramp to full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%
**Week 3: Cleanup**
- Remove flag from code
- Archive flag in dashboard
- Document learnings
## Common Mistakes
### ❌ Expanding Rollout Too Fast
**Fix:** Follow the minimum dwell times from the blast radius schedule (roughly 24 hours per phase)
---
### ❌ Monitoring Only After Issues
**Fix:** Dashboard ready before first rollout, alerts configured
---
### ❌ No Rollback Practice
**Fix:** Test rollback in staging before production
---
### ❌ Ignoring Business Metrics
**Fix:** Technical metrics AND business metrics required for go/no-go decisions
## Quick Reference
**Technique Selection:**
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing
**Blast Radius Progression:**
Internal → 1% → 5% → 25% → 50% → 100%
**Kill Switch Thresholds:**
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%
**Minimum Sample Sizes:**
- A/B test: 1000 samples per variant
- Canary: 24 hours observation
**Tool Recommendations:**
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana
## Bottom Line
**Production testing is safe with three controls: exposure limits, observability, instant rollback.**
Start with feature flags, use progressive rollout (1% → 5% → 25% → 100%), monitor technical + business metrics, and always have a kill switch.