# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= (4xx_errors + 5xx_errors) / total_requests
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
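To make the formulas above concrete, here is a minimal Python sketch that computes a success-rate SLI and a latency-percentile SLI from raw request samples. The `Request` record and function names are illustrative, not part of any particular monitoring stack:
```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency observed by the user

def request_success_rate(requests: list[Request]) -> float:
    """SLI: successful_requests / total_requests (5xx counted as failures)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_percentile(requests: list[Request], pct: float) -> float:
    """SLI: pct-th percentile of response times (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(0, round(pct / 100 * len(latencies)) - 1)
    return latencies[rank]

# Tiny synthetic sample to show the shape of the calculation
sample = [Request(200, 120), Request(200, 480), Request(500, 900), Request(200, 210)]
print(request_success_rate(sample))    # 0.75
print(latency_percentile(sample, 95))  # 900.0
```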
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
3. **Set initial SLO**: Slightly looser than current performance, so it is comfortably achievable
4. **Iterate**: Tighten or loosen based on feasibility
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly looser than the current 450ms)
Error rate: < 0.1% (double the current rate)
Availability: 99.9% (slightly looser than the current 99.95%)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
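The same "start loose" heuristic can be sketched in code. This is only an illustration of the process above, under assumed rules (round the observed p95 up to the next 50 ms step, double the observed error rate, subtract 0.05 percentage points from observed availability); the helper names are made up:
```python
import math

def percentile(sorted_vals: list[float], pct: float) -> float:
    """Nearest-rank percentile over pre-sorted values."""
    rank = max(0, math.ceil(pct / 100 * len(sorted_vals)) - 1)
    return sorted_vals[rank]

def propose_initial_slos(latencies_ms: list[float], error_rate: float,
                         availability: float) -> dict:
    """Suggest deliberately loose starting targets from a 30-day baseline."""
    p95 = percentile(sorted(latencies_ms), 95)
    return {
        # Round the observed p95 up to the next 50 ms step so the target is looser
        "latency_p95_ms": (math.floor(p95 / 50) + 1) * 50,
        # Double the observed error rate
        "max_error_rate": error_rate * 2,
        # Shave 0.05 percentage points off observed availability
        "availability": round(availability - 0.0005, 4),
    }

# Baseline from the example above: p95 = 450 ms, 0.05% errors, 99.95% availability
print(propose_initial_slos([450.0], 0.0005, 0.9995))
# {'latency_p95_ms': 500, 'max_error_rate': 0.001, 'availability': 0.999}
```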
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
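The downtime allowances quoted above follow directly from the availability target. A small sketch that reproduces them, assuming a 365-day year and a 30-day month:
```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200

def allowed_downtime(slo: float) -> tuple[float, float]:
    """Return (minutes/year, minutes/month) of downtime permitted by an availability SLO."""
    budget = 1 - slo
    return budget * MINUTES_PER_YEAR, budget * MINUTES_PER_MONTH

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999):
    per_year, per_month = allowed_downtime(slo)
    print(f"{slo:.3%}: {per_year / 60:7.2f} h/year, {per_month:6.1f} min/month")
# 99.000%: 87.60 h/year (3.65 days) ... 99.900%: 8.76 h/year, 43.2 min/month ...
```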
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
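Both forms of the calculation are easy to verify in a few lines; the function names here are illustrative:
```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Time-based budget: (1 - SLO) x period length, in minutes."""
    return (1 - slo) * days * 24 * 60

def error_budget_requests(slo: float, expected_requests: int) -> int:
    """Request-based budget: (1 - SLO) x expected request volume."""
    return round((1 - slo) * expected_requests)

print(f"{error_budget_minutes(0.999):.1f} minutes")  # 43.2 minutes per 30-day month
print(error_budget_requests(0.999, 10_000_000))      # 10000 failed requests allowed
```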
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
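A quick sketch of the consumption formula, reproducing the 50% figure from the example:
```python
def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget used: actual_errors / allowed_errors."""
    allowed_failures = (1 - slo) * total_requests
    return failed_requests / allowed_failures

consumed = budget_consumed(slo=0.999, total_requests=1_000_000, failed_requests=500)
print(f"Budget consumed: {consumed:.0%}, remaining: {1 - consumed:.0%}")
# Budget consumed: 50%, remaining: 50%
```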
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
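A policy like this is easiest to enforce when deploy tooling can evaluate it automatically. A minimal sketch that maps remaining budget to the tiers above (thresholds and wording mirror the example policy; the function itself is hypothetical):
```python
def deploy_policy(budget_remaining: float) -> str:
    """Translate remaining error budget (as a fraction, 0.0-1.0) into a policy tier."""
    if budget_remaining < 0:
        return "freeze: rollbacks only, all hands on reliability"
    if budget_remaining < 0.20:
        return "restricted: freeze non-critical deploys, focus on reliability"
    if budget_remaining < 0.50:
        return "cautious: deploy once per day, increase testing"
    return "normal: deploy freely, take calculated risks"

print(deploy_policy(0.73))  # normal: deploy freely, take calculated risks
print(deploy_policy(0.15))  # restricted: freeze non-critical deploys, ...
```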
---
## Error Budget Burn Rate
### Concept
Burn rate = how fast the error budget is being consumed. A 1x burn rate spends the budget exactly over the full SLO period; higher multiples exhaust it early.
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
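The same arithmetic in code, reproducing the 5x burn rate and 6-day exhaustion time from the example:
```python
def burn_rate(current_error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is burning."""
    allowed_error_rate = 1 - slo
    return current_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """Assumes the burn rate stays constant for the rest of the period."""
    return period_days / rate

rate = burn_rate(current_error_rate=0.005, slo=0.999)
print(f"Burn rate: {rate:.0f}x, budget gone in {days_to_exhaustion(rate):.0f} days")
# Burn rate: 5x, budget gone in 6 days
```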
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
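The 14.4x, 6x, and 1x thresholds above are what you get if you allow roughly 2%, 5%, and 10% of a 30-day budget to be spent within each window, the approach popularized by the Google SRE Workbook. A short sketch of that derivation:
```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period

def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent within the window."""
    return budget_fraction * PERIOD_HOURS / window_hours

# Spending 2% of the monthly budget in 1h, 5% in 6h, or 10% in 3 days:
print(burn_rate_threshold(0.02, 1))       # 14.4 -> fast-burn page
print(burn_rate_threshold(0.05, 6))       # 6.0  -> moderate-burn ticket
print(burn_rate_threshold(0.10, 3 * 24))  # 1.0  -> slow-burn monitoring
```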
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for a 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in ~2 days at the current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for a 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌────────────────────────────────────────┐
│ SLO Compliance: 99.92%              ✅ │
│ Error Budget Remaining: 73%         🟢 │
│ Burn Rate: 0.8x                     🟢 │
└────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 47%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 53%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
**Claiming Credits**:
```
To claim a credit, the customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests
Credits are applied to the next month's invoice and cannot exceed the monthly fee.
```
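Credit tiers like these are simple to automate at billing time. A sketch that maps measured monthly uptime to the credit table above (the function is illustrative):
```python
def service_credit(monthly_uptime: float) -> float:
    """Credit, as a fraction of the monthly fee, per the table above."""
    if monthly_uptime < 0.95:
        return 0.50
    if monthly_uptime < 0.99:
        return 0.25
    if monthly_uptime < 0.999:  # below the 99.9% SLA target
        return 0.10
    return 0.0

print(service_credit(0.9985))  # 0.10 -> 10% credit
print(service_credit(0.97))    # 0.25 -> 25% credit
```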
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate (based on the 5m success-rate SLI, 99.9% SLO)
      - record: slo:error_budget_burn_rate:5m
        expr: |
          (1 - sli:request_success:ratio) / 0.001
```