# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
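To make the layering concrete, here is a minimal Python sketch (the thresholds are the hypothetical values from the example above) that classifies a measured availability SLI against the internal SLO and the contractual SLA:

```python
def classify_availability(sli: float, slo: float = 99.9, sla: float = 99.5) -> str:
    """Classify a measured availability percentage against SLO and SLA."""
    if sli < sla:
        return "SLA breached: contractual penalties apply"
    if sli < slo:
        return "SLO missed: internal target violated, SLA still intact"
    return "SLO met"

print(classify_availability(99.92))  # SLO met
print(classify_availability(99.85))  # SLO missed: ...
print(classify_availability(99.40))  # SLA breached: ...
```

The ordering matters: because the SLA floor sits below the SLO target, an SLA breach always implies the SLO was missed first.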
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= 5xx_errors / total_requests
(4xx client errors are usually excluded, since they reflect caller mistakes)
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
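These ratio and percentile SLIs translate directly into code. A short Python sketch (function names and sample data are illustrative) computing a success-rate SLI and a p95 latency SLI from raw observations:

```python
from statistics import quantiles

def success_rate(successful: int, total: int) -> float:
    """Request success ratio: successful_requests / total_requests."""
    return successful / total

def p95(latencies_ms: list[float]) -> float:
    """95th percentile of response times."""
    return quantiles(latencies_ms, n=100)[94]  # cut points for p1..p99

latencies = [120, 95, 430, 210, 180, 520, 150, 300, 250, 110] * 100
print(f"Success rate: {success_rate(999_500, 1_000_000):.4%}")  # 99.9500%
print(f"p95 latency: {p95(latencies):.0f} ms")
```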
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
3. **Set initial SLO**: Slightly better than worst performer
4. **Iterate**: Tighten or loosen based on feasibility
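As a sketch of step 2, the baseline percentiles can be computed directly from collected latency samples. The data below is synthetic stand-in data; substitute your 30 days of measurements:

```python
import random
from statistics import quantiles

def baseline_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize a latency baseline at the percentiles used for SLO selection."""
    cuts = quantiles(latencies_ms, n=1000)  # 999 cut points: p0.1 .. p99.9
    return {"p50": cuts[499], "p95": cuts[949],
            "p99": cuts[989], "p99.9": cuts[998]}

random.seed(0)
sample = [random.lognormvariate(5, 0.8) for _ in range(100_000)]  # fake data
for name, value in baseline_summary(sample).items():
    print(f"{name}: {value:.0f} ms")
```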
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly looser than the current 450ms)
Error rate: < 0.1% (double the current rate)
Availability: 99.9% (slightly looser than the current 99.95%)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
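The downtime allowances in the availability list above follow from simple arithmetic. A small Python sketch deriving them from the target:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 365) -> float:
    """Downtime allowed per period for a given availability target."""
    return (1 - slo_percent / 100) * period_days * 24 * 60

for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}%: {minutes / 60:.2f} hours/year ({minutes:.1f} minutes)")
```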
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
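Both the budget size and its consumption reduce to a few lines of arithmetic. A Python sketch reproducing the two worked examples above:

```python
def error_budget_minutes(slo: float, period_days: float = 30) -> float:
    """Time-based budget: minutes of allowed downtime per period."""
    return (1 - slo) * period_days * 24 * 60

def budget_consumed(failed: int, total: int, slo: float) -> float:
    """Request-based consumption as a fraction of allowed failures."""
    allowed_failures = (1 - slo) * total
    return failed / allowed_failures

print(f"{error_budget_minutes(0.999):.1f} minutes/month")        # 43.2
print(f"{budget_consumed(500, 1_000_000, 0.999):.0%} consumed")  # 50%
```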
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
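A policy like this can be encoded as a deploy-gating check. A minimal Python sketch whose thresholds mirror the example policy (adapt them to your own):

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction) to a deploy posture."""
    if budget_remaining < 0:
        return "freeze all deploys except rollbacks; escalate"
    if budget_remaining < 0.20:
        return "freeze non-critical deploys; focus on reliability"
    if budget_remaining < 0.50:
        return "deploy normally; increase testing and monitoring"
    return "deploy freely; take calculated risks"

print(deploy_policy(0.73))  # deploy freely; take calculated risks
print(deploy_policy(0.15))  # freeze non-critical deploys; ...
```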
---
## Error Budget Burn Rate
### Concept
Burn rate = rate of error budget consumption
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
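A Python sketch of the same calculation, reproducing the 5x / 6-day example:

```python
def burn_rate(actual_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return actual_error_rate / (1 - slo)

def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until the budget is gone at the current burn rate."""
    return period_days / rate

rate = burn_rate(0.005, 0.999)  # 0.5% errors vs 0.1% allowed
print(f"Burn rate: {rate:.0f}x")                                   # 5x
print(f"Budget exhausted in {days_to_exhaustion(rate):.0f} days")  # 6
```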
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
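The window thresholds above come from the relationship between burn rate and time to exhaustion: a 14.4x burn empties a 30-day budget in roughly 2 days, which is equivalent to consuming 2% of the budget per hour. A Python sketch of that arithmetic:

```python
def days_to_exhaust(burn_rate: float, period_days: float = 30) -> float:
    """Days until a period's budget is gone at a constant burn rate."""
    return period_days / burn_rate

def budget_consumed_in_window(burn_rate: float, window_hours: float,
                              period_days: float = 30) -> float:
    """Fraction of the period's budget consumed during one window."""
    return burn_rate * window_hours / (period_days * 24)

print(f"{days_to_exhaust(14.4):.1f} days to exhaustion")               # 2.1
print(f"{budget_consumed_in_window(14.4, 1):.0%} of budget per hour")  # 2%
print(f"{days_to_exhaust(6):.0f} days at 6x")                          # 5
```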
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget will be exhausted in 2 days at current rate"
# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
annotations:
summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅ │
│ Error Budget Remaining: 73% 🟢 │
│ Burn Rate: 0.8x 🟢 │
└─────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 47%
- Downtime: 20 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly, on the first of each month, for the previous month.
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
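Credit tiers like these are typically applied as a simple threshold lookup. A Python sketch whose tiers mirror the example table:

```python
def service_credit(monthly_uptime_percent: float) -> int:
    """Return the service credit as a percentage of the monthly fee."""
    if monthly_uptime_percent < 95.0:
        return 50
    if monthly_uptime_percent < 99.0:
        return 25
    if monthly_uptime_percent < 99.9:
        return 10
    return 0

print(service_credit(99.87))  # 10
print(service_credit(98.20))  # 25
print(service_credit(99.95))  # 0
```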
**Claiming Credits**:
```
To claim a credit, the customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests
Credits are applied to the next month's invoice and
cannot exceed the monthly fee.
```
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
- name: sli_recording
interval: 30s
rules:
# SLI: Request success rate
- record: sli:request_success:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI: Request latency (p95)
- record: sli:request_latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error budget burn rate (1h window)
- record: slo:error_budget_burn_rate:1h
expr: |
(1 - sli:request_success:ratio) / 0.001
```