---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests at or below the latency threshold (0.5 s) / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```promql
# Successful writes / Total writes over the SLO window
# increase() is used so that counter resets do not skew the ratio
sum(increase(storage_writes_successful_total[28d]))
/
sum(increase(storage_writes_total[28d]))
```

**Reference:** See `references/slo-definitions.md`

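
As a rough illustration of what the ratio queries above compute, the sketch below derives an availability SLI and a latency SLI from raw counts. The function names and sample numbers are hypothetical; they are not part of Prometheus or any client library.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_within_threshold: int, total_requests: int) -> float:
    """Fraction of requests served within the latency threshold (e.g. 0.5 s)."""
    if total_requests == 0:
        return 1.0
    return requests_within_threshold / total_requests


# Hypothetical counts for a 28-day window
print(f"availability: {availability_sli(1_000_000, 500):.4%}")
print(f"latency:      {latency_sli(990_000, 1_000_000):.4%}")
```

Both SLIs are "good events / total events" ratios, which keeps them directly comparable to a percentage target.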
## Setting SLO Targets

### Availability SLO Examples

| SLO %  | Downtime/Month (30 days) | Downtime/Year |
|--------|--------------------------|---------------|
| 99%    | 7.2 hours                | 3.65 days     |
| 99.9%  | 43.2 minutes             | 8.76 hours    |
| 99.95% | 21.6 minutes             | 4.38 hours    |
| 99.99% | 4.32 minutes             | 52.56 minutes |

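
The downtime figures in the table follow from multiplying the window length by the allowed failure fraction. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is hypothetical):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a window can absorb while still meeting the SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(target, window_days=30)
    yearly = allowed_downtime_minutes(target, window_days=365)
    print(f"{target}%: {monthly:.2f} min/month, {yearly:.2f} min/year")
```

Passing `window_days=28` gives the budget for the 28-day window used in the queries, which is slightly smaller than the monthly figure.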
### Choose Appropriate SLOs

**Consider:**

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**

```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  # 99% of requests complete within 500 ms
  - name: api_latency
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**

- SLO: 99.9% availability
- Error budget: 0.1% = 43.2 minutes per 30-day month
- Budget consumed so far: 0.05% = 21.6 minutes
- Remaining budget: 50%

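
The arithmetic in the example can be sketched as follows. The function name is hypothetical, and the SLO target and observed error are expressed as fractions rather than percentages.

```python
def error_budget_status(slo_target: float, observed_error_ratio: float):
    """Return (budget, consumed_fraction, remaining_fraction) for one window."""
    budget = 1.0 - slo_target                  # e.g. 1 - 0.999 = 0.001
    consumed = observed_error_ratio / budget   # share of the budget already spent
    remaining = 1.0 - consumed                 # negative once the SLO is violated
    return budget, consumed, remaining


budget, consumed, remaining = error_budget_status(0.999, 0.0005)
print(f"budget={budget:.4%} consumed={consumed:.0%} remaining={remaining:.0%}")
```

A negative `remaining` value means the service has spent more than its budget for the window.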
### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```

**Reference:** See `references/error-budget.md`

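
One way to apply the policy programmatically is to pick the action for the lowest threshold that the remaining budget has reached. This is a sketch under an assumption the policy itself does not state: each action applies once the remaining budget is at or below its threshold. The `POLICY` table and `policy_action` function are hypothetical names.

```python
# (threshold in percent, action), ordered from most to least severe
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
    (100.0, "Normal development velocity"),
]


def policy_action(remaining_percent: float) -> str:
    """Return the action for the first threshold the remaining budget is at or below."""
    for threshold, action in POLICY:
        if remaining_percent <= threshold:
            return action
    # Above 100% cannot normally happen; fall back to the least severe action.
    return POLICY[-1][1]


for remaining in (65.0, 50.0, 8.0, -5.0):
    print(f"{remaining:>5}% remaining -> {policy_action(remaining)}")
```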
## SLO Implementation

### Prometheus Recording Rules

```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests <= 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (observed error ratio / allowed error ratio)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Longer burn-rate windows referenced by the burn-rate alerts
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

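
The burn-rate rule divides the observed error ratio by the error ratio the SLO allows. A small sketch of the same calculation, with hypothetical request counts and a hypothetical function name:

```python
def burn_rate(good_requests: int, total_requests: int, slo_target: float = 0.999) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO target)."""
    if total_requests == 0:
        # Assumption: no traffic in the window burns no budget.
        return 0.0
    error_ratio = 1.0 - good_requests / total_requests
    return error_ratio / (1.0 - slo_target)


# 144 failures out of 10,000 requests against a 99.9% target
print(f"burn rate: {burn_rate(9_856, 10_000):.1f}x")
```

A burn rate of 1 spends the budget exactly over the SLO window; anything above 1 exhausts it early.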
### SLO Alerting Rules

```yaml
# Assumes recording rules exist for the 5m, 30m, 1h, and 6h burn-rate windows
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% of a 30-day error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% of a 30-day error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```

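
The 14.4x and 6x thresholds follow from the share of the budget an alert should catch and the length of its window. The sketch below shows the relationship; the function name is hypothetical, and the defaults assume a 30-day SLO window, which is what reproduces those two numbers.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


print(f"fast: {burn_rate_threshold(0.02, 1):.2f}x")       # 2% of budget in 1 hour
print(f"slow: {burn_rate_threshold(0.05, 6):.2f}x")       # 5% of budget in 6 hours
print(f"fast, 28d window: {burn_rate_threshold(0.02, 1, 28):.2f}x")
```

With a 28-day window the same thresholds catch a slightly larger share of the budget, so the values may need adjusting if the percentages matter exactly.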
## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLO compliance (SLI as a percentage)
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at the 28-day average burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
```

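
The days-until-exhaustion query amounts to dividing the remaining share of the budget, scaled to the window length, by a burn rate. A minimal sketch with hypothetical names and inputs:

```python
def days_until_exhausted(remaining_fraction: float, burn_rate: float,
                         window_days: int = 28) -> float:
    """Days left before the budget hits zero if the burn rate stays constant."""
    if remaining_fraction <= 0:
        return 0.0            # budget is already spent
    if burn_rate <= 0:
        return float("inf")   # nothing is being burned
    # A burn rate of 1 spends the whole budget in exactly `window_days`.
    return remaining_fraction * window_days / burn_rate


print(f"{days_until_exhausted(0.65, 2.0):.1f} days left")
```

Which burn rate to plug in is a design choice: a long-window average gives a stable estimate, while a short-window rate reacts faster to an ongoing incident.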
## Multi-Window Burn Rate Alerts

```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```

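
The logic of the combined expression can be sketched in plain code: each pair fires only when both its long and short window exceed the threshold, so a brief spike or an incident that has already recovered does not page. The function and dictionary keys below are hypothetical.

```python
def should_alert(burn_rates: dict) -> bool:
    """Evaluate the multi-window condition on burn rates keyed by window name."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


# A spike that has already recovered: the long window is high, the short one is not
recovered = {"5m": 0.5, "30m": 2.0, "1h": 15.0, "6h": 3.0}
# A sustained incident: both fast-burn windows are above the threshold
ongoing = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0}

print(should_alert(recovered))
print(should_alert(ongoing))
```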
## SLO Review Process

### Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Reference Files

- `assets/slo-template.md` - SLO definition template
- `references/slo-definitions.md` - SLO definition patterns
- `references/error-budget.md` - Error budget calculations

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization