# SLI, SLO, and SLA Guide

## Definitions

### SLI (Service Level Indicator)

**What**: A quantitative measure of service quality

**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)

### SLO (Service Level Objective)

**What**: Target value or range for an SLI

**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"

### SLA (Service Level Agreement)

**What**: Business contract with consequences for missing the agreed service levels

**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"

### Relationship

```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)

Example:
  SLI: Actual availability this month = 99.92%
  SLO: Target availability = 99.9%
  SLA: Guaranteed availability = 99.5% (with penalties)
```

---

## Choosing SLIs

### The Four Golden Signals as SLIs

1. **Latency SLIs**
   - Request duration (p50, p95, p99)
   - Time to first byte
   - Page load time

2. **Availability/Success SLIs**
   - % of successful requests
   - % uptime
   - % of requests completing

3. **Throughput SLIs** (less common)
   - Requests per second
   - Transactions per second

4. **Saturation SLIs** (internal only)
   - Resource utilization
   - Queue depth

### SLI Selection Criteria

✅ **Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness

❌ **Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently

### Examples by Service Type

**Web Application**:
```
SLI 1: Request Success Rate
  = successful_requests / total_requests

SLI 2: Request Latency (p95)
  = 95th percentile of response times

SLI 3: Availability
  = time_service_responding / total_time
```

**API Service**:
```
SLI 1: Error Rate
  = (4xx_errors + 5xx_errors) / total_requests

SLI 2: Response Time (p99)
  = 99th percentile latency

SLI 3: Throughput
  = requests_per_second
```

**Batch Processing**:
```
SLI 1: Job Success Rate
  = successful_jobs / total_jobs

SLI 2: Processing Latency
  = time_from_submission_to_completion

SLI 3: Freshness
  = age_of_oldest_unprocessed_item
```

**Storage Service**:
```
SLI 1: Durability
  = data_not_lost / total_data

SLI 2: Read Latency (p99)
  = 99th percentile read time

SLI 3: Write Success Rate
  = successful_writes / total_writes
```

---

## Setting SLO Targets

### Start with Current Performance

1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
3. **Set initial SLO**: Slightly looser than current performance
4. **Iterate**: Tighten or loosen based on feasibility

### Example Process

**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms

Error rate: 0.05%
Availability: 99.95%
```

**Initial SLOs**:
```
Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)
```

**Rationale**: Start loose to prevent false alarms, then tighten over time

### Common SLO Targets

**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical

**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases

**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services

---

## Error Budgets

### Concept

Error budget = (100% - SLO target)

If SLO is 99.9%, error budget is 0.1%

**Purpose**: Balance reliability with feature velocity

### Calculation

**For availability**:
```
Monthly error budget = (1 - SLO) × time_period

Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```

**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests

Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```

### Error Budget Consumption

**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%

Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000

Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```

### Error Budget Policy

**Example policy**:

```markdown
## Error Budget Policy

### If error budget remaining > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents

### If error budget remaining 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely

### If error budget remaining < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity

### If error budget exhausted (remaining ≤ 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```

---

## Error Budget Burn Rate

### Concept

Burn rate = how fast the error budget is being consumed, relative to the rate that would use up exactly the whole budget over the SLO period (1x)

**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days

### Burn Rate Calculation

```
Burn rate = (actual_error_rate / allowed_error_rate)

Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%

Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```

### Multi-Window Alerting

Alert on burn rate across multiple time windows:

**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in about 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```

**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```

**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```

### Implementation

**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in about 2 days at current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"
```

---

## SLO Reporting

### Dashboard Structure

**Overall Health**:
```
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅               │
│ Error Budget Remaining: 73% 🟢          │
│ Burn Rate: 0.8x 🟢                      │
└─────────────────────────────────────────┘
```

**SLI Performance**:
```
Latency p95:   420ms  (Target: 500ms)   ✅
Error Rate:    0.08%  (Target: < 0.1%)  ✅
Availability:  99.95% (Target: > 99.9%) ✅
```

**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```

### Monthly SLO Report

**Template**:
```markdown
# SLO Report: October 2024

## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%

## SLO Performance

### Availability SLO: 99.9%
- Actual: 99.92%
- Status: ✅ Met
- Error budget consumed: 33%
- Downtime: 23 minutes (allowed: 43.2 minutes)

### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold

### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%

## Incidents

### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist

### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation

### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker

## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%

## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```

---

## SLA Structure

### Components

**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```

**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```

**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```

**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses

Calculated monthly (first of month for previous month).
```

**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```

**Service Credits**:
```
Monthly Uptime   | Service Credit
---------------- | --------------
< 99.9% (SLA)    | 10%
< 99.0%          | 25%
< 95.0%          | 50%
```

**Claiming Credits**:
```
Customer must:
1. Report violation within 30 days
2. Provide ticket numbers for support requests

Terms:
- Credits applied to next month's invoice
- Credits do not exceed monthly fee
```

### Example SLAs by Industry

**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```

**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```

**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```

---

## Best Practices

### 1. SLOs Should Be User-Centric

❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"

### 2. Start Loose, Tighten Over Time

- Begin with achievable targets
- Build reliability culture
- Gradually raise bar

### 3. Fewer, Better SLOs

- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl

### 4. SLAs More Conservative Than SLOs

```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```

### 5. Make Error Budgets Actionable

- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings

### 6. Document Everything

- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics

### 7. Review Regularly

- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation

---

## Common Pitfalls

### 1. Too Many SLOs

❌ 20 different SLOs per service
✅ 2-3 critical SLOs

### 2. Unrealistic Targets

❌ 99.999% for non-critical service
✅ 99.9% with room to improve

### 3. SLOs Without Error Budgets

❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"

### 4. No Consequences

❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted

### 5. SLA Equals SLO

❌ Promise exactly what you target
✅ SLA more conservative than SLO

### 6. Ignoring User Experience

❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"

### 7. Static Targets

❌ Set once, never revisit
✅ Quarterly reviews and adjustments

---

## Tools and Automation

### SLO Tracking Tools

**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance

**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting

**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages

**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring

### Example: Prometheus Recording Rules

```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate (5m window)
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate (1h window), computed from 1h rates so the
      # window matches the rule name; 0.001 is the budget for a 99.9% SLO
      - record: slo:error_budget_burn_rate:1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / 0.001
```
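
### Example: Error Budget Arithmetic in Python

The formulas from the Error Budgets and Error Budget Burn Rate sections can be checked with a few lines of Python. This is a minimal sketch for sanity-checking numbers before they go into alert rules or reports; it is not part of any tool listed above, and the function names (`allowed_downtime_minutes`, `allowed_failures`, `budget_consumed_pct`, `burn_rate`, `days_to_exhaust`) are hypothetical helpers introduced only for illustration. It assumes a 30-day SLO window by default.

```python
"""Error budget arithmetic from this guide (illustrative sketch only)."""


def allowed_downtime_minutes(slo: float, window_days: float = 30) -> float:
    """Time-based error budget: (1 - SLO) x period, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def allowed_failures(slo: float, total_requests: int) -> float:
    """Request-based error budget: (1 - SLO) x total requests."""
    return (1 - slo) * total_requests


def budget_consumed_pct(failed: int, slo: float, total_requests: int) -> float:
    """Budget consumed = actual errors / allowed errors x 100%."""
    return failed / allowed_failures(slo, total_requests) * 100


def burn_rate(actual_error_rate: float, slo: float) -> float:
    """Burn rate = actual error rate / allowed error rate."""
    return actual_error_rate / (1 - slo)


def days_to_exhaust(rate: float, window_days: float = 30) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate


if __name__ == "__main__":
    slo = 0.999  # 99.9% SLO, as in the examples above

    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} min")
    print(f"Allowed failures: {allowed_failures(slo, 10_000_000):.0f}")
    print(f"Budget consumed:  {budget_consumed_pct(500, slo, 1_000_000):.1f}%")

    rate = burn_rate(0.005, slo)  # current error rate of 0.5%
    print(f"Burn rate:        {rate:.1f}x")
    print(f"Days to exhaust:  {days_to_exhaust(rate):.1f}")
```

Results are floating-point values, so compare them with a tolerance or format them for display rather than testing for exact equality.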