SLI, SLO, and SLA Guide
Definitions
SLI (Service Level Indicator)
What: A quantitative measure of service quality
Examples:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
SLO (Service Level Objective)
What: Target value or range for an SLI
Examples:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
SLA (Service Level Agreement)
What: Business contract with consequences for SLO violations
Examples:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
Relationship
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
Choosing SLIs
The Four Golden Signals as SLIs
1. Latency SLIs
   - Request duration (p50, p95, p99)
   - Time to first byte
   - Page load time
2. Availability/Success SLIs
   - % of successful requests
   - % uptime
   - % of requests completing
3. Throughput SLIs (less common)
   - Requests per second
   - Transactions per second
4. Saturation SLIs (internal only)
   - Resource utilization
   - Queue depth
SLI Selection Criteria
✅ Good SLIs:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
❌ Bad SLIs:
- Purely internal metrics with no user-facing impact
- Hard to measure consistently
Examples by Service Type
Web Application:
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
API Service:
SLI 1: Error Rate
= 5xx_errors / total_requests (4xx responses are usually client errors and are typically excluded)
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
Batch Processing:
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
Storage Service:
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
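As a concrete illustration, here is a minimal Python sketch (with made-up request records; in practice these would come from your metrics pipeline or access logs) that computes the success-rate and latency SLIs above:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    duration_ms: float

# Hypothetical sample; real data would come from logs or metrics.
requests = [
    Request(200, 120.0), Request(200, 300.0), Request(500, 450.0),
    Request(200, 80.0), Request(404, 95.0), Request(200, 610.0),
]

# SLI: request success rate (5xx responses count as failures)
successful = sum(1 for r in requests if r.status < 500)
success_rate = successful / len(requests)

# SLI: request latency p95, using the simple nearest-rank method
durations = sorted(r.duration_ms for r in requests)
p95 = durations[max(0, round(0.95 * len(durations)) - 1)]

print(f"Success rate: {success_rate:.2%}")  # 83.33%
print(f"p95 latency:  {p95:.0f}ms")         # 610ms
```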
Setting SLO Targets
Start with Current Performance
- Measure baseline: Collect 30 days of data
- Analyze distribution: Look at p50, p95, p99, p99.9
- Set initial SLO: Slightly looser than current performance
- Iterate: Tighten or loosen based on feasibility
Example Process
Current Performance (30 days):
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
Initial SLOs:
Latency: p95 < 500ms (looser than the current 450ms)
Error rate: < 0.1% (double the current rate)
Availability: 99.9% (below the current 99.95%)
Rationale: Start loose, prevent false alarms, tighten over time
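A minimal sketch of this baselining step, assuming you can export the last 30 days of latency samples (in milliseconds) from your monitoring system:

```python
import statistics

# Hypothetical latency samples exported from monitoring (ms).
samples = [120, 95, 450, 1200, 130, 88, 300, 510, 75, 3500, 140, 220]

# 999 cut points; index 499 = p50, 949 = p95, 989 = p99.
q = statistics.quantiles(samples, n=1000)
p50, p95, p99 = q[499], q[949], q[989]

# Initial SLO: round the observed p95 up to the next 100ms so the
# target starts slightly looser than current performance.
initial_p95_slo = -(-p95 // 100) * 100

print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
print(f"Proposed initial SLO: p95 < {initial_p95_slo:.0f}ms")
```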
Common SLO Targets
Availability:
- 99% (3.65 days downtime/year): Internal tools
- 99.5% (1.83 days/year): Non-critical services
- 99.9% (8.76 hours/year): Standard production
- 99.95% (4.38 hours/year): Critical services
- 99.99% (52 minutes/year): High availability
- 99.999% (5 minutes/year): Mission critical
Latency:
- p50 < 100ms: Excellent responsiveness
- p95 < 500ms: Standard web applications
- p99 < 1s: Acceptable for most users
- p99.9 < 5s: Acceptable for rare edge cases
Error Rate:
- < 0.01% (99.99% success): Critical operations
- < 0.1% (99.9% success): Standard production
- < 1% (99% success): Non-critical services
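The downtime figures in the availability list follow directly from the target; a quick sketch of the conversion:

```python
def allowed_downtime_hours(slo_percent: float, period_hours: float) -> float:
    """Allowed downtime within a period for a given availability SLO."""
    return (1 - slo_percent / 100) * period_hours

for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    yearly = allowed_downtime_hours(slo, 24 * 365)
    monthly = allowed_downtime_hours(slo, 24 * 30)
    print(f"{slo}%: {yearly:.2f} h/year, {monthly * 60:.1f} min/month")
```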
Error Budgets
Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
Purpose: Balance reliability with feature velocity
Calculation
For availability:
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
For request-based SLIs:
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
Error Budget Consumption
Formula:
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
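Both the budget and its consumption are simple arithmetic; a sketch combining the two request-based examples above:

```python
def request_error_budget(slo: float, total_requests: int) -> int:
    """Allowed failed requests for a request-based SLO (e.g. slo=0.999)."""
    return int((1 - slo) * total_requests)

def budget_consumed(failed: int, allowed: int) -> float:
    """Fraction of the error budget already spent."""
    return failed / allowed

allowed = request_error_budget(0.999, 1_000_000)
consumed = budget_consumed(500, allowed)
print(f"Allowed failures: {allowed}")           # 1000
print(f"Budget consumed:  {consumed:.0%}")      # 50%
print(f"Budget remaining: {1 - consumed:.0%}")  # 50%
```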
Error Budget Policy
Example policy:
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
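One way to make such a policy actionable is a threshold lookup that deploy tooling can query; a hypothetical sketch mirroring the tiers above:

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to a deploy posture.

    Thresholds mirror the example policy above; tune them per service.
    """
    if budget_remaining <= 0:
        return "freeze: rollbacks only, all hands on reliability"
    if budget_remaining < 0.20:
        return "restricted: freeze non-critical deploys"
    if budget_remaining < 0.50:
        return "cautious: deploy once per day, extra testing"
    return "normal: deploy freely, take calculated risks"

print(deploy_policy(0.73))  # normal
print(deploy_policy(0.15))  # restricted
```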
Error Budget Burn Rate
Concept
Burn rate = rate of error budget consumption
Example:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
Burn Rate Calculation
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
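The same arithmetic as a short sketch, using the values from the example above:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget burns relative to the sustainable rate."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until a full budget is gone at the current burn rate."""
    return period_days / rate

rate = burn_rate(error_rate=0.005, slo=0.999)
print(f"Burn rate: {rate:.1f}x")                            # 5.0x
print(f"Exhausted in {days_to_exhaustion(rate):.0f} days")  # 6 days
```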
Multi-Window Alerting
Alert on burn rate across multiple time windows:
Fast burn (1 hour window):
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
Moderate burn (6 hour window):
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
Slow burn (3 day window):
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
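These thresholds fall out of deciding what fraction of the monthly budget you are willing to spend inside each window. The 2%/5%/10% split below follows the common multiwindow convention popularized by the Google SRE Workbook; treat the exact fractions as an assumption:

```python
def burn_threshold(budget_fraction: float, window_hours: float,
                   period_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the budget in one window."""
    return budget_fraction * period_hours / window_hours

print(burn_threshold(0.02, 1))       # 14.4  (fast burn, 1h window)
print(burn_threshold(0.05, 6))       # 6.0   (moderate burn, 6h window)
print(burn_threshold(0.10, 3 * 24))  # 1.0   (slow burn, 3d window)
```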
Implementation
Prometheus:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for a 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in ~2 days at the current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for a 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"
```
SLO Reporting
Dashboard Structure
Overall Health:
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅ │
│ Error Budget Remaining: 73% 🟢 │
│ Burn Rate: 0.8x 🟢 │
└─────────────────────────────────────────┘
SLI Performance:
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
Error Budget Trend:
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
Monthly SLO Report
Template:
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 53% of the availability error budget
- Error budget remaining: 47%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.92%
- Status: ✅ Met
- Error budget consumed: 53%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 12%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 7%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
SLA Structure
Components
Service Description:
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
Covered Metrics:
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
SLA Targets:
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
Measurement:
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
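As an illustration, the uptime figure can be derived directly from probe results; a minimal sketch assuming one health check per minute over a 30-day month:

```python
def monthly_uptime(probe_results: list[bool]) -> float:
    """Uptime as the fraction of successful health-check probes."""
    return sum(probe_results) / len(probe_results)

# Hypothetical: 43,200 one-minute probes, 23 of which failed.
probes = [True] * (43_200 - 23) + [False] * 23
print(f"Uptime: {monthly_uptime(probes):.4%}")  # 99.9468%
```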
Exclusions:
SLA does not cover:
- Scheduled maintenance (with 7 days' notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
Service Credits:
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
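Credit determination is then a tier lookup against the measured monthly uptime; a hypothetical sketch of the table above:

```python
def service_credit(uptime_percent: float) -> int:
    """Credit percentage owed for a month's measured uptime."""
    if uptime_percent < 95.0:
        return 50
    if uptime_percent < 99.0:
        return 25
    if uptime_percent < 99.9:
        return 10
    return 0  # SLA met, no credit due

print(service_credit(99.95))  # 0
print(service_credit(99.4))   # 10
print(service_credit(98.2))   # 25
```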
Claiming Credits:
Customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests
Credit terms:
- Credits are applied to the next month's invoice
- Credits do not exceed the monthly fee
Example SLAs by Industry
E-commerce:
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
Financial Services:
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
Media/Content:
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
Best Practices
1. SLOs Should Be User-Centric
❌ "Database queries < 100ms" ✅ "API response time p95 < 500ms"
2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
4. SLAs More Conservative Than SLOs
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
Common Pitfalls
1. Too Many SLOs
❌ 20 different SLOs per service ✅ 2-3 critical SLOs
2. Unrealistic Targets
❌ 99.999% for non-critical service ✅ 99.9% with room to improve
3. SLOs Without Error Budgets
❌ "Must always be 99.9%" ✅ "Budget for 0.1% errors"
4. No Consequences
❌ Missing SLO has no impact ✅ Deploy freeze when budget exhausted
5. SLA Equals SLO
❌ Promise exactly what you target ✅ SLA more conservative than SLO
6. Ignoring User Experience
❌ "Our servers are up 99.99%" ✅ "Users can complete actions 99.9% of the time"
7. Static Targets
❌ Set once, never revisit ✅ Quarterly reviews and adjustments
Tools and Automation
SLO Tracking Tools
Prometheus + Grafana:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
Google Cloud SLO Monitoring:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
Datadog SLOs:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
Custom Tools:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
Example: Prometheus Recording Rules
```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate over a 1h window for a 99.9% SLO
      - record: slo:error_budget_burn_rate:1h
        expr: |
          (
            1 -
            sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) / 0.001
```