Initial commit
skills/slo-implementation/SKILL.md
---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI
```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI
```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI
```promql
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```
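
All three SLIs above share one shape: good events divided by total events. The sketch below shows that ratio in Python. The function name `ratio_sli`, the sample counts, and the choice to treat a window with no traffic as fully compliant are assumptions made for illustration, not part of the skill.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return the SLI as a fraction in [0, 1]: good events / total events.

    Assumption: a window with no traffic counts as fully compliant (1.0),
    because no user could have observed a failure.
    """
    if total_events <= 0:
        return 1.0
    return good_events / total_events


# Hypothetical counts over one SLO window.
availability = ratio_sli(good_events=999_500, total_events=1_000_000)
latency = ratio_sli(good_events=992_000, total_events=1_000_000)
print(f"availability={availability:.4%} latency={latency:.4%}")
```

In Prometheus the same division is done over `rate()` of counters, which also handles counter resets; the Python version only illustrates the arithmetic.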

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

| SLO %  | Downtime/Month (30 days) | Downtime/Year |
|--------|--------------------------|---------------|
| 99%    | 7.2 hours                | 3.65 days     |
| 99.9%  | 43.2 minutes             | 8.76 hours    |
| 99.95% | 21.6 minutes             | 4.38 hours    |
| 99.99% | 4.32 minutes             | 52.56 minutes |
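
The downtime figures in the table follow from multiplying the window length by `1 - SLO`. A minimal sketch, assuming a 30-day month and a 365-day year (the helper name `allowed_downtime` is hypothetical):

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO over the given window."""
    return window * (1 - slo_percent / 100)


month = timedelta(days=30)   # assumption: 30-day month
year = timedelta(days=365)   # assumption: non-leap year

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month_min = allowed_downtime(slo, month).total_seconds() / 60
    per_year_h = allowed_downtime(slo, year).total_seconds() / 3600
    print(f"{slo}%: {per_month_min:.2f} min/month, {per_year_h:.2f} h/year")
```

The same helper works for a 28-day window, where the allowances are slightly smaller than the monthly column.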

### Choose Appropriate SLOs

**Consider:**
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**
```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    # Target: 99% of requests complete within 500 ms
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month (30-day month)
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
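
The example above can be reproduced with a few lines of arithmetic. This is a sketch only; the function name `error_budget_remaining` is an assumption, and the inputs are the fractions from the example (0.999 target, 0.9995 measured).

```python
def error_budget_remaining(slo_target: float, measured_sli: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    Both arguments are fractions, e.g. 0.999 for a 99.9% SLO.
    """
    budget = 1 - slo_target        # allowed failure fraction
    consumed = 1 - measured_sli    # observed failure fraction
    return (budget - consumed) / budget


remaining = error_budget_remaining(slo_target=0.999, measured_sli=0.9995)
print(f"remaining budget: {remaining:.0%}")
```

A result below zero means the SLO has been violated for the window, which is the condition the "budget exhausted" alert checks for.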

### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```
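
The policy lists thresholds but does not say how values between them are handled. The sketch below assumes one possible reading: each action applies once the remaining budget is at or below its threshold, and anything above 50% counts as normal velocity. The names `POLICY` and `policy_action` are hypothetical.

```python
# (ceiling in percent, action), ordered from most to least severe.
# Assumption: an action applies when remaining budget <= its ceiling.
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
]
DEFAULT_ACTION = "Normal development velocity"


def policy_action(remaining_percent: float) -> str:
    """Map remaining error budget (percent) to the policy action."""
    for ceiling, action in POLICY:
        if remaining_percent <= ceiling:
            return action
    return DEFAULT_ACTION


for remaining in (65, 50, 8, -3):
    print(f"{remaining:>4}% remaining: {policy_action(remaining)}")
```

Teams that prefer a different interpretation of the thresholds only need to change the ceilings or the comparison.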

**Reference:** See `references/error-budget.md`

## SLO Implementation

### Prometheus Recording Rules

```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests <= 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (observed error ratio / allowed error ratio)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Burn rates over the longer windows used by the alerting rules
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

### SLO Alerting Rules

```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes ~2% of the error budget in 1 hour (exactly 2% for a 30-day window)
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes ~5% of the error budget in 6 hours (exactly 5% for a 30-day window)
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```
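
The 14.4x and 6x thresholds come from deciding how much of the budget an incident may consume within the alert window. The sketch below shows that relationship; the function names are hypothetical, and the 30-day figures are used because that is where 14.4 and 6 come out exactly. With the 28-day window used elsewhere in this skill, the same thresholds correspond to slightly more budget.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_days: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def budget_consumed(burn_rate: float,
                    alert_window_hours: float,
                    slo_window_days: float) -> float:
    """Fraction of the total error budget spent at `burn_rate` over the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


# Thresholds for a 30-day SLO window: 2% in 1 hour, 5% in 6 hours.
print(burn_rate_threshold(0.02, alert_window_hours=1, slo_window_days=30))
print(burn_rate_threshold(0.05, alert_window_hours=6, slo_window_days=30))

# The same thresholds applied to a 28-day window spend a little more budget.
print(f"{budget_consumed(14.4, 1, 28):.2%} in 1 hour")
print(f"{budget_consumed(6, 6, 28):.2%} in 6 hours")
```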

## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLI (percent), to compare against the SLO target
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at the 28-day average burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
```
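
The "days until error budget exhausted" query divides the remaining budget by the burn rate and scales by the window length. A small sketch of that projection, with a hypothetical function name and made-up inputs:

```python
def days_until_exhausted(remaining_fraction: float,
                         burn_rate: float,
                         window_days: float = 28) -> float:
    """Project how many days the remaining budget lasts at a constant burn rate.

    A burn rate of 1.0 spends exactly one full budget per window.
    """
    if remaining_fraction <= 0:
        return 0.0
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return remaining_fraction * window_days / burn_rate


# Half the budget left, burning at twice the sustainable rate.
print(days_until_exhausted(remaining_fraction=0.5, burn_rate=2.0))
```

Feeding in a short-window burn rate instead of the 28-day average makes the projection react faster to an ongoing incident, at the cost of more noise.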

## Multi-Window Burn Rate Alerts

```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```
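
To see why pairing a long window with a short one cuts false positives, the rule's boolean logic can be replayed on sample burn rates. This is an illustrative sketch: the function `should_alert` and the sample values are assumptions, and the thresholds mirror the rule above.

```python
def should_alert(burn: dict[str, float]) -> bool:
    """Evaluate the multi-window condition on burn rates keyed by window."""
    fast = burn["1h"] > 14.4 and burn["5m"] > 14.4   # severe and still happening
    slow = burn["6h"] > 6 and burn["30m"] > 6        # sustained and still happening
    return fast or slow


# A spike that already ended: the long window is still high, the short one is not.
recovered = {"5m": 0.5, "30m": 1.0, "1h": 20.0, "6h": 4.0}
# An ongoing outage: both the long and short windows exceed the threshold.
ongoing = {"5m": 25.0, "30m": 22.0, "1h": 20.0, "6h": 4.0}

print(should_alert(recovered))
print(should_alert(ongoing))
```

The short window acts as a "still burning" check, so the alert clears soon after the incident stops instead of staying active until the long window drains.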

## SLO Review Process

### Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Reference Files

- `assets/slo-template.md` - SLO definition template
- `references/slo-definitions.md` - SLO definition patterns
- `references/error-budget.md` - Error budget calculations

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization