SLI, SLO, and SLA Guide

Definitions

SLI (Service Level Indicator)

What: A quantitative measure of service quality

Examples:

  • Request latency (ms)
  • Error rate (%)
  • Availability (%)
  • Throughput (requests/sec)

SLO (Service Level Objective)

What: Target value or range for an SLI

Examples:

  • "99.9% of requests return in < 500ms"
  • "99.95% availability"
  • "Error rate < 0.1%"

SLA (Service Level Agreement)

What: Business contract with consequences for SLO violations

Examples:

  • "99.9% uptime or 10% monthly credit"
  • "p95 latency < 1s or refund"

Relationship

SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)

Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)

Choosing SLIs

The Four Golden Signals as SLIs

  1. Latency SLIs

    • Request duration (p50, p95, p99)
    • Time to first byte
    • Page load time
  2. Availability/Success SLIs

    • % of successful requests
    • % uptime
    • % of requests completing
  3. Throughput SLIs (less common)

    • Requests per second
    • Transactions per second
  4. Saturation SLIs (internal only)

    • Resource utilization
    • Queue depth

SLI Selection Criteria

Good SLIs:

  • Measured from user perspective
  • Directly impact user experience
  • Aggregatable across instances
  • Proportional to user happiness

Bad SLIs:

  • Internal metrics only
  • Not user-facing
  • Hard to measure consistently

Examples by Service Type

Web Application:

SLI 1: Request Success Rate
  = successful_requests / total_requests

SLI 2: Request Latency (p95)
  = 95th percentile of response times

SLI 3: Availability
  = time_service_responding / total_time
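As a minimal sketch (with made-up request data), the first two SLIs can be computed directly from raw request records:

```python
import math

# Hypothetical request log: (http_status, latency_ms) pairs.
requests = [
    (200, 120), (200, 95), (500, 800), (200, 140),
    (404, 60), (200, 450), (200, 110), (503, 1200),
]

# SLI 1: request success rate (5xx counted as failures).
successes = sum(1 for status, _ in requests if status < 500)
success_rate = successes / len(requests)

# SLI 2: p95 latency via the nearest-rank method.
latencies = sorted(lat for _, lat in requests)
rank = max(0, math.ceil(0.95 * len(latencies)) - 1)
p95_ms = latencies[rank]

print(f"success rate: {success_rate:.1%}, p95: {p95_ms}ms")
```

In production these aggregations would come from your metrics pipeline rather than an in-memory list, but the arithmetic is the same.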

API Service:

SLI 1: Error Rate
  = 5xx_errors / total_requests
  (count 4xx only if they reflect service bugs; most 4xx are client errors)

SLI 2: Response Time (p99)
  = 99th percentile latency

SLI 3: Throughput
  = requests_per_second

Batch Processing:

SLI 1: Job Success Rate
  = successful_jobs / total_jobs

SLI 2: Processing Latency
  = time_from_submission_to_completion

SLI 3: Freshness
  = age_of_oldest_unprocessed_item
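Freshness, for instance, reduces to the age of the oldest queued item; a sketch with hypothetical timestamps:

```python
# Hypothetical unprocessed queue items with enqueue times (epoch seconds).
now = 1_700_000_000
enqueue_times = [now - 30, now - 120, now - 45]

# SLI: age of the oldest unprocessed item.
freshness_seconds = now - min(enqueue_times)
print(freshness_seconds)  # 120
```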

Storage Service:

SLI 1: Durability
  = data_not_lost / total_data

SLI 2: Read Latency (p99)
  = 99th percentile read time

SLI 3: Write Success Rate
  = successful_writes / total_writes

Setting SLO Targets

Start with Current Performance

  1. Measure baseline: Collect 30 days of data
  2. Analyze distribution: Look at p50, p95, p99, p99.9
  3. Set initial SLO: Slightly looser than current performance
  4. Iterate: Tighten or loosen based on feasibility

Example Process

Current Performance (30 days):

p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms

Error rate: 0.05%
Availability: 99.95%

Initial SLOs:

Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)

Rationale: Start loose, prevent false alarms, tighten over time
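One way to sketch the baseline step, using synthetic latency samples in place of real monitoring data:

```python
import math
import random
import statistics

# Synthetic stand-in for 30 days of latency samples (ms).
random.seed(42)
samples = [random.lognormvariate(4.8, 0.6) for _ in range(10_000)]

# statistics.quantiles(n=100) returns the 99 percentile cut points p1..p99.
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# Initial SLO: round the observed p95 up to a clean threshold (start loose).
slo_p95_ms = math.ceil(p95 / 50) * 50
print(f"p50={p50:.0f} p95={p95:.0f} p99={p99:.0f} -> initial SLO: p95 < {slo_p95_ms}ms")
```

The rounding rule is illustrative; the point is that the first target sits just outside observed performance, not at an aspirational number.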

Common SLO Targets

Availability:

  • 99% (3.65 days downtime/year): Internal tools
  • 99.5% (1.83 days/year): Non-critical services
  • 99.9% (8.76 hours/year): Standard production
  • 99.95% (4.38 hours/year): Critical services
  • 99.99% (52 minutes/year): High availability
  • 99.999% (5 minutes/year): Mission critical

Latency:

  • p50 < 100ms: Excellent responsiveness
  • p95 < 500ms: Standard web applications
  • p99 < 1s: Acceptable for most users
  • p99.9 < 5s: Acceptable for rare edge cases

Error Rate:

  • < 0.01% (99.99% success): Critical operations
  • < 0.1% (99.9% success): Standard production
  • < 1% (99% success): Non-critical services

Error Budgets

Concept

Error budget = (100% - SLO target)

If SLO is 99.9%, error budget is 0.1%

Purpose: Balance reliability with feature velocity

Calculation

For availability:

Monthly error budget = (1 - SLO) × time_period

Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes

For request-based SLIs:

Error budget = (1 - SLO) × total_requests

Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests

Error Budget Consumption

Formula:

Budget consumed = actual_errors / allowed_errors × 100%

Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000

Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%

Error Budget Policy

Example policy:

## Error Budget Policy

### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents

### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely

### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity

### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
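If the thresholds are automated, the policy lookup might be sketched like this (tier names are illustrative, not from any standard tool):

```python
def deploy_policy(budget_remaining_pct):
    """Map remaining error budget (%) to the policy tiers above."""
    if budget_remaining_pct < 0:
        return "freeze: rollbacks only"
    if budget_remaining_pct < 20:
        return "freeze non-critical deploys"
    if budget_remaining_pct <= 50:
        return "deploy normally, monitor closely"
    return "deploy freely"

print(deploy_policy(73))   # deploy freely
print(deploy_policy(15))   # freeze non-critical deploys
```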

Error Budget Burn Rate

Concept

Burn rate = rate of error budget consumption

Example:

  • Monthly budget: 43.2 minutes (99.9% SLO)
  • If consuming at 2x rate: Budget exhausted in 15 days
  • If consuming at 10x rate: Budget exhausted in 3 days

Burn Rate Calculation

Burn rate = (actual_error_rate / allowed_error_rate)

Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%

Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days

Multi-Window Alerting

Alert on burn rate across multiple time windows:

Fast burn (1 hour window):

Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)

Moderate burn (6 hour window):

Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)

Slow burn (3 day window):

Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)

Implementation

Prometheus:

# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in 2 days at current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"

SLO Reporting

Dashboard Structure

Overall Health:

┌─────────────────────────────────────────┐
│  SLO Compliance: 99.92% ✅              │
│  Error Budget Remaining: 73% 🟢         │
│  Burn Rate: 0.8x 🟢                     │
└─────────────────────────────────────────┘

SLI Performance:

Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅

Error Budget Trend:

Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid

Monthly SLO Report

Template:

# SLO Report: October 2024

## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%

## SLO Performance

### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 47%
- Downtime: 20 minutes (allowed: 43.2 minutes)

### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold

### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%

## Incidents

### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist

### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation

### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker

## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%

## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths

SLA Structure

Components

Service Description:

The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.

Covered Metrics:

- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors

SLA Targets:

Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%

Measurement:

Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses

Calculated monthly (first of month for previous month).
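Probe-based uptime, for example, reduces to a simple ratio (probe data here is fabricated):

```python
# Hypothetical month of health-check results, one probe per minute: True = success.
probes = [True] * 43_150 + [False] * 50

uptime = sum(probes) / len(probes)
print(f"uptime: {uptime:.3%}")
```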

Exclusions:

SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse

Service Credits:

Monthly Uptime    | Service Credit
----------------  | --------------
< 99.9% (SLA)     | 10%
< 99.0%           | 25%
< 95.0%           | 50%

Claiming Credits:

Customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests

Credit terms:
- Credits are applied to the next month's invoice
- Credits do not exceed the monthly fee

Example SLAs by Industry

E-commerce:

- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target

Financial Services:

- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime

Media/Content:

- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)

Best Practices

1. SLOs Should Be User-Centric

❌ "Database queries < 100ms" (internal metric)
✅ "API response time p95 < 500ms" (what users experience)

2. Start Loose, Tighten Over Time

  • Begin with achievable targets
  • Build reliability culture
  • Gradually raise bar

3. Fewer, Better SLOs

  • 1-3 SLOs per service
  • Focus on user impact
  • Avoid SLO sprawl

4. SLAs More Conservative Than SLOs

Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer

5. Make Error Budgets Actionable

  • Define policies at different thresholds
  • Empower teams to make tradeoffs
  • Review in planning meetings

6. Document Everything

  • How SLIs are measured
  • Why targets were chosen
  • Who owns each SLO
  • How to interpret metrics

7. Review Regularly

  • Monthly SLO reviews
  • Quarterly SLO adjustments
  • Annual SLA renegotiation

Common Pitfalls

1. Too Many SLOs

❌ 20 different SLOs per service
✅ 2-3 critical SLOs

2. Unrealistic Targets

❌ 99.999% for a non-critical service
✅ 99.9% with room to improve

3. SLOs Without Error Budgets

❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"

4. No Consequences

❌ Missing the SLO has no impact
✅ Deploy freeze when budget is exhausted

5. SLA Equals SLO

❌ Promise exactly what you target
✅ SLA more conservative than SLO

6. Ignoring User Experience

❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"

7. Static Targets

❌ Set once, never revisited
✅ Quarterly reviews and adjustments


Tools and Automation

SLO Tracking Tools

Prometheus + Grafana:

  • Use recording rules for SLIs
  • Alert on burn rates
  • Dashboard for compliance

Google Cloud SLO Monitoring:

  • Built-in SLO tracking
  • Automatic error budget calculation
  • Integration with alerting

Datadog SLOs:

  • UI for SLO definition
  • Automatic burn rate alerts
  • Status pages

Custom Tools:

  • sloth: Generate Prometheus rules from SLO definitions
  • slo-libsonnet: Jsonnet library for SLO monitoring

Example: Prometheus Recording Rules

groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate (from the 5m SLI above; derive longer-window
      # variants from longer-range rates for multi-window alerting)
      - record: slo:error_budget_burn_rate:5m
        expr: |
          (1 - sli:request_success:ratio) / 0.001  # 0.001 = allowed error rate for a 99.9% SLO