# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= (4xx_errors + 5xx_errors) / total_requests
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
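To make the formulas above concrete, here is a minimal Python sketch that computes a success-rate SLI and a latency-percentile SLI from raw request samples. The `Request` record and function names are illustrative, not part of any particular monitoring stack:
```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency observed by the user

def request_success_rate(requests: list[Request]) -> float:
    """SLI: successful_requests / total_requests (5xx counted as failures)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_percentile(requests: list[Request], pct: float) -> float:
    """SLI: pct-th percentile of response times (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(0, round(pct / 100 * len(latencies)) - 1)
    return latencies[rank]

# Tiny synthetic sample to show the shape of the calculation
sample = [Request(200, 120), Request(200, 480), Request(500, 900), Request(200, 210)]
print(request_success_rate(sample))    # 0.75
print(latency_percentile(sample, 95))  # 900.0
```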
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
3. **Set initial SLO**: Slightly looser than current performance, so it is comfortably achievable
4. **Iterate**: Tighten or loosen based on feasibility
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly looser than the current 450ms)
Error rate: < 0.1% (double the current rate)
Availability: 99.9% (slightly looser than the current 99.95%)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
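The same "start loose" heuristic can be sketched in code. This is only an illustration of the process above, under assumed rules (round the observed p95 up to the next 50 ms step, double the observed error rate, subtract 0.05 percentage points from observed availability); the helper names are made up:
```python
import math

def percentile(sorted_vals: list[float], pct: float) -> float:
    """Nearest-rank percentile over pre-sorted values."""
    rank = max(0, math.ceil(pct / 100 * len(sorted_vals)) - 1)
    return sorted_vals[rank]

def propose_initial_slos(latencies_ms: list[float], error_rate: float,
                         availability: float) -> dict:
    """Suggest deliberately loose starting targets from a 30-day baseline."""
    p95 = percentile(sorted(latencies_ms), 95)
    return {
        # Round the observed p95 up to the next 50 ms step so the target is looser
        "latency_p95_ms": (math.floor(p95 / 50) + 1) * 50,
        # Double the observed error rate
        "max_error_rate": error_rate * 2,
        # Shave 0.05 percentage points off observed availability
        "availability": round(availability - 0.0005, 4),
    }

# Baseline from the example above: p95 = 450 ms, 0.05% errors, 99.95% availability
print(propose_initial_slos([450.0], 0.0005, 0.9995))
# {'latency_p95_ms': 500, 'max_error_rate': 0.001, 'availability': 0.999}
```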
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
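The downtime allowances quoted above follow directly from the availability target. A small sketch that reproduces them, assuming a 365-day year and a 30-day month:
```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200

def allowed_downtime(slo: float) -> tuple[float, float]:
    """Return (minutes/year, minutes/month) of downtime permitted by an availability SLO."""
    budget = 1 - slo
    return budget * MINUTES_PER_YEAR, budget * MINUTES_PER_MONTH

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999):
    per_year, per_month = allowed_downtime(slo)
    print(f"{slo:.3%}: {per_year / 60:7.2f} h/year, {per_month:6.1f} min/month")
# 99.000%: 87.60 h/year (3.65 days) ... 99.900%: 8.76 h/year, 43.2 min/month ...
```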
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
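Both forms of the calculation are easy to verify in a few lines; the function names here are illustrative:
```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Time-based budget: (1 - SLO) x period length, in minutes."""
    return (1 - slo) * days * 24 * 60

def error_budget_requests(slo: float, expected_requests: int) -> int:
    """Request-based budget: (1 - SLO) x expected request volume."""
    return round((1 - slo) * expected_requests)

print(f"{error_budget_minutes(0.999):.1f} minutes")  # 43.2 minutes per 30-day month
print(error_budget_requests(0.999, 10_000_000))      # 10000 failed requests allowed
```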
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
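A quick sketch of the consumption formula, reproducing the 50% figure from the example:
```python
def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget used: actual_errors / allowed_errors."""
    allowed_failures = (1 - slo) * total_requests
    return failed_requests / allowed_failures

consumed = budget_consumed(slo=0.999, total_requests=1_000_000, failed_requests=500)
print(f"Budget consumed: {consumed:.0%}, remaining: {1 - consumed:.0%}")
# Budget consumed: 50%, remaining: 50%
```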
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
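A policy like this is easiest to enforce when deploy tooling can evaluate it automatically. A minimal sketch that maps remaining budget to the tiers above (thresholds and wording mirror the example policy; the function itself is hypothetical):
```python
def deploy_policy(budget_remaining: float) -> str:
    """Translate remaining error budget (as a fraction, 0.0-1.0) into a policy tier."""
    if budget_remaining < 0:
        return "freeze: rollbacks only, all hands on reliability"
    if budget_remaining < 0.20:
        return "restricted: freeze non-critical deploys, focus on reliability"
    if budget_remaining < 0.50:
        return "cautious: deploy once per day, increase testing"
    return "normal: deploy freely, take calculated risks"

print(deploy_policy(0.73))  # normal: deploy freely, take calculated risks
print(deploy_policy(0.15))  # restricted: freeze non-critical deploys, ...
```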
---
## Error Budget Burn Rate
### Concept
Burn rate = how fast the error budget is being consumed. A 1x burn rate spends the budget exactly over the full SLO period; higher multiples exhaust it early.
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
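The same arithmetic in code, reproducing the 5x burn rate and 6-day exhaustion time from the example:
```python
def burn_rate(current_error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is burning."""
    allowed_error_rate = 1 - slo
    return current_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """Assumes the burn rate stays constant for the rest of the period."""
    return period_days / rate

rate = burn_rate(current_error_rate=0.005, slo=0.999)
print(f"Burn rate: {rate:.0f}x, budget gone in {days_to_exhaustion(rate):.0f} days")
# Burn rate: 5x, budget gone in 6 days
```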
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
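The 14.4x, 6x, and 1x thresholds above are what you get if you allow roughly 2%, 5%, and 10% of a 30-day budget to be spent within each window, the approach popularized by the Google SRE Workbook. A short sketch of that derivation:
```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period

def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent within the window."""
    return budget_fraction * PERIOD_HOURS / window_hours

# Spending 2% of the monthly budget in 1h, 5% in 6h, or 10% in 3 days:
print(burn_rate_threshold(0.02, 1))       # 14.4 -> fast-burn page
print(burn_rate_threshold(0.05, 6))       # 6.0  -> moderate-burn ticket
print(burn_rate_threshold(0.10, 3 * 24))  # 1.0  -> slow-burn monitoring
```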
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for a 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in ~2 days at the current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for a 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌────────────────────────────────────────┐
│ SLO Compliance: 99.92%              ✅ │
│ Error Budget Remaining: 73%         🟢 │
│ Burn Rate: 0.8x                     🟢 │
└────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 47%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 53%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
**Claiming Credits**:
```
To claim a credit, the customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests
Credits are applied to the next month's invoice and cannot exceed the monthly fee.
```
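Credit tiers like these are simple to automate at billing time. A sketch that maps measured monthly uptime to the credit table above (the function is illustrative):
```python
def service_credit(monthly_uptime: float) -> float:
    """Credit, as a fraction of the monthly fee, per the table above."""
    if monthly_uptime < 0.95:
        return 0.50
    if monthly_uptime < 0.99:
        return 0.25
    if monthly_uptime < 0.999:  # below the 99.9% SLA target
        return 0.10
    return 0.0

print(service_credit(0.9985))  # 0.10 -> 10% credit
print(service_credit(0.97))    # 0.25 -> 25% credit
```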
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate (based on the 5m success-rate SLI, 99.9% SLO)
      - record: slo:error_budget_burn_rate:5m
        expr: |
          (1 - sli:request_success:ratio) / 0.001
```