---
name: testing-in-production
description: Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks
---

# Testing in Production

## Overview

**Core principle:** Minimize blast radius, maximize observability, always have a kill switch.

**Rule:** Testing in production is safe when you control exposure and can roll back instantly.

**Regulated industries (healthcare, finance, government):** Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult the compliance team before your first production test.

## Technique Selection Decision Tree

| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|-----------|----------------|-----------------------|-----|
| Test feature with specific users | Low | Feature flag service | **Feature Flags** |
| Validate deployment safety | Medium | Load balancer, multiple instances | **Canary Deployment** |
| Compare old vs new performance | Low | Traffic duplication | **Shadow Traffic** |
| Measure business impact | Medium | A/B testing framework, analytics | **A/B Testing** |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | **Dark Launch** |

**First technique:** Feature flags (lowest infrastructure requirement, highest control)

## Anti-Patterns Catalog

### ❌ Nested Feature Flags
**Symptom:** Flags controlling other flags, creating combinatorial complexity

**Why bad:** 2^N combinations to test, impossible to validate all paths, technical debt accumulates

**Fix:** Maximum 1 level of flag nesting, delete flags after rollout

```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            ...  # 2^3 = 8 possible combinations for 3 flags

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for the full feature
    return new_checkout_with_all_options()
```

---

### ❌ Canary Without Session Affinity
**Symptom:** Users bounce between old and new versions across requests because routing has no session affinity

**Why bad:** Inconsistent experience, state corruption, false-negative metrics

**Fix:** Route each user to the same version for their entire session

```nginx
# ✅ Good - Consistent routing
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}
```

---

### ❌ No Statistical Validation
**Symptom:** Making rollout decisions on small sample sizes without confidence intervals

**Why bad:** Random variance mistaken for real effects, premature rollback or expansion

**Fix:** Minimum sample size, statistical significance testing

```python
# ✅ Good - Statistical validation
from statsmodels.stats.proportion import proportions_ztest


def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    # control_errors / treatment_errors: arrays of 0/1 flags, one per request
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"

    # Two-proportion z-test on error counts vs. request counts
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)],
    )

    # No statistically significant difference in error rates -> safe to continue
    return p_value > 0.05, f"p-value: {p_value:.4f}"
```

---

### ❌ Testing Without Rollback
**Symptom:** Deploying feature flags or canaries without an instant kill switch

**Why bad:** When issues are detected, you can't stop the impact immediately

**Fix:** Test the kill switch before the first production test (see the sketch below)

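A minimal sketch of an automated kill switch, assuming hypothetical `metrics`, `feature_flags`, and `alerts` helpers wired to your own stack (run it on a short schedule during the rollout window):

```python
def kill_switch_check(flag_key: str, baseline_error_rate: float) -> None:
    # Hypothetical helpers: metrics.error_rate, feature_flags.disable, alerts.page
    current = metrics.error_rate(flag_key, window_minutes=5)
    if current > baseline_error_rate * 1.05:  # > +5% vs baseline (see Kill Switch Criteria)
        feature_flags.disable(flag_key)       # exposure drops to zero instantly
        alerts.page("on-call", f"Kill switch fired for {flag_key}")
```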

---

### ❌ Insufficient Monitoring
**Symptom:** Monitoring only error rates, missing business/user metrics

**Why bad:** Technical success but business failure (e.g., lower conversion)

**Fix:** Monitor technical + business + user experience metrics

## Blast Radius Control Framework

**Progressive rollout schedule:**

| Phase | Exposure | Duration | Abort If | Continue If |
|-------|----------|----------|----------|-------------|
| **1. Internal** | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| **2. Canary** | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| **3. Small** | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **4. Medium** | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| **5. Majority** | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| **6. Full** | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |

**Minimum dwell time:** Each phase needs a minimum observation period to catch delayed issues

**Rollback at any phase:** If metrics degrade, revert to the previous phase

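The schedule above can also be encoded as data so rollout automation and reviewers share one source of truth. A minimal sketch (thresholds mirror the table; business-metric checks omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    exposure_pct: float       # share of production traffic
    min_dwell_hours: int      # minimum observation period before expanding
    max_error_delta: float    # abort if error rate rises above baseline by this fraction
    max_latency_delta: float  # abort if p99 latency rises above baseline by this fraction

ROLLOUT_PLAN = [
    Phase("internal",   0.0,  24, 0.00, 0.00),  # internal users only; any error aborts
    Phase("canary",     1.0,   4, 0.02, 0.10),
    Phase("small",      5.0,  24, 0.05, 0.25),
    Phase("medium",    25.0,  48, 0.05, 0.25),
    Phase("majority",  50.0,  72, 0.05, 0.25),
    Phase("full",     100.0, 168, 0.05, 0.25),
]
```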
## Kill Switch Criteria

**Immediate rollback triggers (automated):**

| Metric | Threshold | Why |
|--------|-----------|-----|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |

**Warning triggers (manual investigation):**

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | > 3 similar reports | Halt rollout, investigate |

**Statistical validation:**

```python
# Sample size for 95% confidence, 80% power
# Minimum 1000 samples per variant for most A/B tests
# For low-traffic features: wait 24-48 hours regardless
```
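To sanity-check the rule of thumb above, the required sample size can be computed with statsmodels power analysis; the 5% → 4% error-rate change below is purely illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Smallest per-variant sample to detect a 5% -> 4% error-rate change
# at alpha = 0.05 (95% confidence) and power = 0.8
effect = proportion_effectsize(0.05, 0.04)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_per_variant))
```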
## Monitoring Quick Reference

**Required metrics (all tests):**

| Category | Metrics | Alert Threshold |
|----------|---------|-----------------|
| **Errors** | Error rate, exception count, 5xx responses | > +5% vs baseline |
| **Performance** | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| **Business** | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| **User Experience** | Client errors, page load, bounce rate | > +20% vs baseline |

**Baseline calculation:**

```python
import numpy as np

# Collect baseline from the previous 7-14 days of latency samples
# (historical_latencies, current_latencies, and rollback() come from your own stack)
baseline_p99 = np.percentile(historical_latencies, 99)
current_p99 = np.percentile(current_latencies, 99)

if current_p99 > baseline_p99 * 1.5:  # 50% increase over baseline
    rollback()
```
## Implementation Patterns

### Feature Flags Pattern

```python
# Using LaunchDarkly, Split.io, or similar (LaunchDarkly Python SDK shown)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()


def handle_request(user_id):
    context = Context.builder(user_id).build()

    if client.variation("new-checkout", context, False):
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)
```

**Best practices:**
- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging (see the sketch below)
- Delete flags within 30 days of full rollout

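A minimal sketch of the flag-evaluation logging mentioned above, wrapping the client from the previous block (the `evaluate` wrapper and log format are illustrative, not part of any SDK):

```python
import logging

logger = logging.getLogger("feature_flags")

def evaluate(flag_key, context, default=False):
    value = client.variation(flag_key, context, default)
    # Record which variation each context saw; useful for debugging and incident review
    logger.info("flag=%s context=%s value=%s", flag_key, context.key, value)
    return value
```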
### Canary Deployment Pattern

```yaml
# Kubernetes with Istio
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: my-service
        subset: v2
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 95
    - destination:
        host: my-service
        subset: v2
      weight: 5
```

The `v1`/`v2` subsets referenced here must be defined in a matching Istio `DestinationRule` that maps each subset to pod labels.

### Shadow Traffic Pattern

```python
# Duplicate requests to the new service, ignore its responses
import asyncio


async def handle_request(request):
    # Primary: serve the user from the old service
    response = await old_service(request)

    # Shadow: send a copy to the new service, don't wait (fire and forget)
    asyncio.create_task(new_service(request.copy()))

    return response  # User sees the old service's response
```
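Because the point of shadow traffic is comparing old vs. new behavior, it usually pays to record the shadow call's outcome rather than discard it. A sketch, assuming a hypothetical `record_shadow_result` metrics helper:

```python
import time

async def shadow_call(request):
    start = time.monotonic()
    try:
        await new_service(request)
        record_shadow_result(ok=True, latency=time.monotonic() - start)
    except Exception as exc:
        # Shadow failures are recorded for comparison, never surfaced to the user
        record_shadow_result(ok=False, latency=time.monotonic() - start, error=repr(exc))
```

Schedule it with `asyncio.create_task(shadow_call(request.copy()))` in place of the bare `new_service(...)` call above.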

## Tool Ecosystem Quick Reference

| Tool Category | Options | When to Use |
|---------------|---------|-------------|
| **Feature Flags** | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| **Canary/Blue-Green** | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| **A/B Testing** | Optimizely, VWO, Google Optimize | Business metric validation |
| **Observability** | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| **Statistical Analysis** | Statsig, Eppo, GrowthBook | Automated significance testing |

**Recommendation for starting:** Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability

## Your First Production Test

**Goal:** Safely test a small feature with feature flags

**Week 1: Setup**

1. **Choose feature flag tool**
   - Self-hosted: Flagsmith (free, open source)
   - SaaS: LaunchDarkly (free tier: 1000 MAU)

2. **Instrument code**
   ```python
   if feature_flags.enabled("my-first-test", user_id):
       return new_feature(user_id)
   else:
       return old_feature(user_id)
   ```

3. **Set up monitoring**
   - Error rate dashboard
   - Latency percentiles (p50, p95, p99)
   - Business metric (conversion, completion rate)

4. **Define rollback criteria**
   - Error rate > +5%
   - p99 latency > +50%
   - Business metric < -10%

**Week 2: Test Execution**

**Day 1-2:** Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback

**Day 3-5:** Canary (1% of users)
- Enable for a 1% random sample (see the bucketing sketch below)
- Monitor metrics every hour
- Roll back if any threshold is exceeded

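A deterministic way to pick that 1% (flag platforms do this bucketing internally; the helper below is a hand-rolled sketch, not a specific SDK):

```python
import hashlib

def in_rollout(user_id: str, flag_key: str, percent: float) -> bool:
    # Hash user + flag together so each flag gets an independent but stable sample
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in 0..9999
    return bucket < percent * 100         # percent=1 -> 1% of users
```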

**Day 6-8:** Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring

**Day 9-14:** Full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%

**Week 3: Cleanup**

- Remove flag from code
- Archive flag in dashboard
- Document learnings

## Common Mistakes

### ❌ Expanding Rollout Too Fast
**Fix:** Follow minimum dwell times (24 hours per phase)

---

### ❌ Monitoring Only After Issues
**Fix:** Dashboard ready before first rollout, alerts configured

---

### ❌ No Rollback Practice
**Fix:** Test rollback in staging before production

---

### ❌ Ignoring Business Metrics
**Fix:** Technical metrics AND business metrics required for go/no-go decisions

## Quick Reference

**Technique Selection:**
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing

**Blast Radius Progression:**
Internal → 1% → 5% → 25% → 50% → 100%

**Kill Switch Thresholds:**
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%

**Minimum Sample Sizes:**
- A/B test: 1000 samples per variant
- Canary: 24 hours observation

**Tool Recommendations:**
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana

## Bottom Line

**Production testing is safe with three controls: exposure limits, observability, instant rollback.**

Start with feature flags, use progressive rollout (1% → 5% → 25% → 100%), monitor technical + business metrics, and always have a kill switch.