--- name: slo-implementation description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance. --- # SLO Implementation Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. ## Purpose Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity. ## When to Use - Define service reliability targets - Measure user-perceived reliability - Implement error budgets - Create SLO-based alerts - Track reliability goals ## SLI/SLO/SLA Hierarchy ``` SLA (Service Level Agreement) ↓ Contract with customers SLO (Service Level Objective) ↓ Internal reliability target SLI (Service Level Indicator) ↓ Actual measurement ``` ## Defining SLIs ### Common SLI Types #### 1. Availability SLI ```promql # Successful requests / Total requests sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) ``` #### 2. Latency SLI ```promql # Requests below latency threshold / Total requests sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) ``` #### 3. Durability SLI ``` # Successful writes / Total writes sum(storage_writes_successful_total) / sum(storage_writes_total) ``` **Reference:** See `references/slo-definitions.md` ## Setting SLO Targets ### Availability SLO Examples | SLO % | Downtime/Month | Downtime/Year | |-------|----------------|---------------| | 99% | 7.2 hours | 3.65 days | | 99.9% | 43.2 minutes | 8.76 hours | | 99.95%| 21.6 minutes | 4.38 hours | | 99.99%| 4.32 minutes | 52.56 minutes | ### Choose Appropriate SLOs **Consider:** - User expectations - Business requirements - Current performance - Cost of reliability - Competitor benchmarks **Example SLOs:** ```yaml slos: - name: api_availability target: 99.9 window: 28d sli: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) - name: api_latency_p95 target: 99 window: 28d sli: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) ``` ## Error Budget Calculation ### Error Budget Formula ``` Error Budget = 1 - SLO Target ``` **Example:** - SLO: 99.9% availability - Error Budget: 0.1% = 43.2 minutes/month - Current Error: 0.05% = 21.6 minutes/month - Remaining Budget: 50% ### Error Budget Policy ```yaml error_budget_policy: - remaining_budget: 100% action: Normal development velocity - remaining_budget: 50% action: Consider postponing risky changes - remaining_budget: 10% action: Freeze non-critical changes - remaining_budget: 0% action: Feature freeze, focus on reliability ``` **Reference:** See `references/error-budget.md` ## SLO Implementation ### Prometheus Recording Rules ```yaml # SLI Recording Rules groups: - name: sli_rules interval: 30s rules: # Availability SLI - record: sli:http_availability:ratio expr: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) # Latency SLI (requests < 500ms) - record: sli:http_latency:ratio expr: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) - name: slo_rules interval: 5m rules: # SLO compliance (1 = meeting SLO, 0 = violating) - record: slo:http_availability:compliance expr: sli:http_availability:ratio >= bool 0.999 - record: slo:http_latency:compliance expr: sli:http_latency:ratio >= bool 0.99 # Error budget remaining (percentage) - record: slo:http_availability:error_budget_remaining expr: | (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100 # Error budget burn rate - record: slo:http_availability:burn_rate_5m expr: | (1 - ( sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) )) / (1 - 0.999) ``` ### SLO Alerting Rules ```yaml groups: - name: slo_alerts interval: 1m rules: # Fast burn: 14.4x rate, 1 hour window # Consumes 2% error budget in 1 hour - alert: SLOErrorBudgetBurnFast expr: | slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 for: 2m labels: severity: critical annotations: summary: "Fast error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Slow burn: 6x rate, 6 hour window # Consumes 5% error budget in 6 hours - alert: SLOErrorBudgetBurnSlow expr: | slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 for: 15m labels: severity: warning annotations: summary: "Slow error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Error budget exhausted - alert: SLOErrorBudgetExhausted expr: slo:http_availability:error_budget_remaining < 0 for: 5m labels: severity: critical annotations: summary: "SLO error budget exhausted" description: "Error budget remaining: {{ $value }}%" ``` ## SLO Dashboard **Grafana Dashboard Structure:** ``` ┌────────────────────────────────────┐ │ SLO Compliance (Current) │ │ ✓ 99.95% (Target: 99.9%) │ ├────────────────────────────────────┤ │ Error Budget Remaining: 65% │ │ ████████░░ 65% │ ├────────────────────────────────────┤ │ SLI Trend (28 days) │ │ [Time series graph] │ ├────────────────────────────────────┤ │ Burn Rate Analysis │ │ [Burn rate by time window] │ └────────────────────────────────────┘ ``` **Example Queries:** ```promql # Current SLO compliance sli:http_availability:ratio * 100 # Error budget remaining slo:http_availability:error_budget_remaining # Days until error budget exhausted (at current burn rate) (slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999) ``` ## Multi-Window Burn Rate Alerts ```yaml # Combination of short and long windows reduces false positives rules: - alert: SLOBurnRateHigh expr: | ( slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 ) or ( slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 ) labels: severity: critical ``` ## SLO Review Process ### Weekly Review - Current SLO compliance - Error budget status - Trend analysis - Incident impact ### Monthly Review - SLO achievement - Error budget usage - Incident postmortems - SLO adjustments ### Quarterly Review - SLO relevance - Target adjustments - Process improvements - Tooling enhancements ## Best Practices 1. **Start with user-facing services** 2. **Use multiple SLIs** (availability, latency, etc.) 3. **Set achievable SLOs** (don't aim for 100%) 4. **Implement multi-window alerts** to reduce noise 5. **Track error budget** consistently 6. **Review SLOs regularly** 7. **Document SLO decisions** 8. **Align with business goals** 9. **Automate SLO reporting** 10. **Use SLOs for prioritization** ## Reference Files - `assets/slo-template.md` - SLO definition template - `references/slo-definitions.md` - SLO definition patterns - `references/error-budget.md` - Error budget calculations ## Related Skills - `prometheus-configuration` - For metric collection - `grafana-dashboards` - For SLO visualization