SLI, SLO, and SLA Guide

Definitions

SLI (Service Level Indicator)

What: A quantitative measure of service quality

Examples:

  • Request latency (ms)
  • Error rate (%)
  • Availability (%)
  • Throughput (requests/sec)

SLO (Service Level Objective)

What: Target value or range for an SLI

Examples:

  • "99.9% of requests return in < 500ms"
  • "99.95% availability"
  • "Error rate < 0.1%"

SLA (Service Level Agreement)

What: Business contract with consequences for SLO violations

Examples:

  • "99.9% uptime or 10% monthly credit"
  • "p95 latency < 1s or refund"

Relationship

SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)

Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)

Choosing SLIs

The Four Golden Signals as SLIs

  1. Latency SLIs

    • Request duration (p50, p95, p99)
    • Time to first byte
    • Page load time
  2. Availability/Success SLIs

    • % of successful requests
    • % uptime
    • % of requests completing
  3. Throughput SLIs (less common)

    • Requests per second
    • Transactions per second
  4. Saturation SLIs (internal only)

    • Resource utilization
    • Queue depth

SLI Selection Criteria

Good SLIs:

  • Measured from user perspective
  • Directly impact user experience
  • Aggregatable across instances
  • Proportional to user happiness

Bad SLIs:

  • Internal metrics only
  • Not user-facing
  • Hard to measure consistently

Examples by Service Type

Web Application:

SLI 1: Request Success Rate
  = successful_requests / total_requests

SLI 2: Request Latency (p95)
  = 95th percentile of response times

SLI 3: Availability
  = time_service_responding / total_time
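As a minimal sketch (with made-up request data), the first two SLIs can be computed directly from raw request records:

```python
import math

# Hypothetical request log: (http_status, latency_ms) pairs.
requests = [
    (200, 120), (200, 95), (500, 800), (200, 140),
    (404, 60), (200, 450), (200, 110), (503, 1200),
]

# SLI 1: request success rate (5xx counted as failures).
successes = sum(1 for status, _ in requests if status < 500)
success_rate = successes / len(requests)

# SLI 2: p95 latency via the nearest-rank method.
latencies = sorted(lat for _, lat in requests)
rank = max(0, math.ceil(0.95 * len(latencies)) - 1)
p95_ms = latencies[rank]

print(f"success rate: {success_rate:.1%}, p95: {p95_ms}ms")
```

In production these aggregations would come from your metrics pipeline rather than an in-memory list, but the arithmetic is the same.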

API Service:

SLI 1: Error Rate
  = 5xx_errors / total_requests
  (count 4xx only if they reflect service bugs; most 4xx are client errors)

SLI 2: Response Time (p99)
  = 99th percentile latency

SLI 3: Throughput
  = requests_per_second

Batch Processing:

SLI 1: Job Success Rate
  = successful_jobs / total_jobs

SLI 2: Processing Latency
  = time_from_submission_to_completion

SLI 3: Freshness
  = age_of_oldest_unprocessed_item
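Freshness, for instance, reduces to the age of the oldest queued item; a sketch with hypothetical timestamps:

```python
# Hypothetical unprocessed queue items with enqueue times (epoch seconds).
now = 1_700_000_000
enqueue_times = [now - 30, now - 120, now - 45]

# SLI: age of the oldest unprocessed item.
freshness_seconds = now - min(enqueue_times)
print(freshness_seconds)  # 120
```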

Storage Service:

SLI 1: Durability
  = data_not_lost / total_data

SLI 2: Read Latency (p99)
  = 99th percentile read time

SLI 3: Write Success Rate
  = successful_writes / total_writes

Setting SLO Targets

Start with Current Performance

  1. Measure baseline: Collect 30 days of data
  2. Analyze distribution: Look at p50, p95, p99, p99.9
  3. Set initial SLO: Slightly looser than current performance
  4. Iterate: Tighten or loosen based on feasibility

Example Process

Current Performance (30 days):

p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms

Error rate: 0.05%
Availability: 99.95%

Initial SLOs:

Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)

Rationale: Start loose, prevent false alarms, tighten over time
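One way to sketch the baseline step, using synthetic latency samples in place of real monitoring data:

```python
import math
import random
import statistics

# Synthetic stand-in for 30 days of latency samples (ms).
random.seed(42)
samples = [random.lognormvariate(4.8, 0.6) for _ in range(10_000)]

# statistics.quantiles(n=100) returns the 99 percentile cut points p1..p99.
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# Initial SLO: round the observed p95 up to a clean threshold (start loose).
slo_p95_ms = math.ceil(p95 / 50) * 50
print(f"p50={p50:.0f} p95={p95:.0f} p99={p99:.0f} -> initial SLO: p95 < {slo_p95_ms}ms")
```

The rounding rule is illustrative; the point is that the first target sits just outside observed performance, not at an aspirational number.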

Common SLO Targets

Availability:

  • 99% (3.65 days downtime/year): Internal tools
  • 99.5% (1.83 days/year): Non-critical services
  • 99.9% (8.76 hours/year): Standard production
  • 99.95% (4.38 hours/year): Critical services
  • 99.99% (52 minutes/year): High availability
  • 99.999% (5 minutes/year): Mission critical

Latency:

  • p50 < 100ms: Excellent responsiveness
  • p95 < 500ms: Standard web applications
  • p99 < 1s: Acceptable for most users
  • p99.9 < 5s: Acceptable for rare edge cases

Error Rate:

  • < 0.01% (99.99% success): Critical operations
  • < 0.1% (99.9% success): Standard production
  • < 1% (99% success): Non-critical services

Error Budgets

Concept

Error budget = (100% - SLO target)

If SLO is 99.9%, error budget is 0.1%

Purpose: Balance reliability with feature velocity

Calculation

For availability:

Monthly error budget = (1 - SLO) × time_period

Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes

For request-based SLIs:

Error budget = (1 - SLO) × total_requests

Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests

Error Budget Consumption

Formula:

Budget consumed = actual_errors / allowed_errors × 100%

Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000

Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%

Error Budget Policy

Example policy:

## Error Budget Policy

### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents

### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely

### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity

### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
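If the thresholds are automated, the policy lookup might be sketched like this (tier names are illustrative, not from any standard tool):

```python
def deploy_policy(budget_remaining_pct):
    """Map remaining error budget (%) to the policy tiers above."""
    if budget_remaining_pct < 0:
        return "freeze: rollbacks only"
    if budget_remaining_pct < 20:
        return "freeze non-critical deploys"
    if budget_remaining_pct <= 50:
        return "deploy normally, monitor closely"
    return "deploy freely"

print(deploy_policy(73))   # deploy freely
print(deploy_policy(15))   # freeze non-critical deploys
```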

Error Budget Burn Rate

Concept

Burn rate = rate of error budget consumption

Example:

  • Monthly budget: 43.2 minutes (99.9% SLO)
  • If consuming at 2x rate: Budget exhausted in 15 days
  • If consuming at 10x rate: Budget exhausted in 3 days

Burn Rate Calculation

Burn rate = (actual_error_rate / allowed_error_rate)

Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%

Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days

Multi-Window Alerting

Alert on burn rate across multiple time windows:

Fast burn (1 hour window):

Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)

Moderate burn (6 hour window):

Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)

Slow burn (3 day window):

Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)

Implementation

Prometheus:

# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn detected"
    description: "Error budget will be exhausted in 2 days at current rate"

# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6x burn rate for 99.9% SLO
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Elevated error budget burn detected"

SLO Reporting

Dashboard Structure

Overall Health:

┌─────────────────────────────────────────┐
│  SLO Compliance: 99.92% ✅              │
│  Error Budget Remaining: 73% 🟢         │
│  Burn Rate: 0.8x 🟢                     │
└─────────────────────────────────────────┘

SLI Performance:

Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅

Error Budget Trend:

Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid

Monthly SLO Report

Template:

# SLO Report: October 2024

## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%

## SLO Performance

### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 47%
- Downtime: 20 minutes (allowed: 43.2 minutes)

### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold

### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%

## Incidents

### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist

### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation

### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker

## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%

## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths

SLA Structure

Components

Service Description:

The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.

Covered Metrics:

- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors

SLA Targets:

Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%

Measurement:

Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses

Calculated monthly (first of month for previous month).
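Probe-based uptime, for example, reduces to a simple ratio (probe data here is fabricated):

```python
# Hypothetical month of health-check results, one probe per minute: True = success.
probes = [True] * 43_150 + [False] * 50

uptime = sum(probes) / len(probes)
print(f"uptime: {uptime:.3%}")
```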

Exclusions:

SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse

Service Credits:

Monthly Uptime    | Service Credit
----------------  | --------------
< 99.9% (SLA)     | 10%
< 99.0%           | 25%
< 95.0%           | 50%

Claiming Credits:

Customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for related support requests

Credit terms:
- Credits are applied to the next month's invoice
- Credits do not exceed the monthly fee

Example SLAs by Industry

E-commerce:

- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target

Financial Services:

- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime

Media/Content:

- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)

Best Practices

1. SLOs Should Be User-Centric

❌ "Database queries < 100ms" (internal metric)
✅ "API response time p95 < 500ms" (what users experience)

2. Start Loose, Tighten Over Time

  • Begin with achievable targets
  • Build reliability culture
  • Gradually raise bar

3. Fewer, Better SLOs

  • 1-3 SLOs per service
  • Focus on user impact
  • Avoid SLO sprawl

4. SLAs More Conservative Than SLOs

Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer

5. Make Error Budgets Actionable

  • Define policies at different thresholds
  • Empower teams to make tradeoffs
  • Review in planning meetings

6. Document Everything

  • How SLIs are measured
  • Why targets were chosen
  • Who owns each SLO
  • How to interpret metrics

7. Review Regularly

  • Monthly SLO reviews
  • Quarterly SLO adjustments
  • Annual SLA renegotiation

Common Pitfalls

1. Too Many SLOs

❌ 20 different SLOs per service
✅ 2-3 critical SLOs

2. Unrealistic Targets

❌ 99.999% for a non-critical service
✅ 99.9% with room to improve

3. SLOs Without Error Budgets

❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"

4. No Consequences

❌ Missing the SLO has no impact
✅ Deploy freeze when budget is exhausted

5. SLA Equals SLO

❌ Promise exactly what you target
✅ SLA more conservative than SLO

6. Ignoring User Experience

❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"

7. Static Targets

❌ Set once, never revisited
✅ Quarterly reviews and adjustments


Tools and Automation

SLO Tracking Tools

Prometheus + Grafana:

  • Use recording rules for SLIs
  • Alert on burn rates
  • Dashboard for compliance

Google Cloud SLO Monitoring:

  • Built-in SLO tracking
  • Automatic error budget calculation
  • Integration with alerting

Datadog SLOs:

  • UI for SLO definition
  • Automatic burn rate alerts
  • Status pages

Custom Tools:

  • sloth: Generate Prometheus rules from SLO definitions
  • slo-libsonnet: Jsonnet library for SLO monitoring

Example: Prometheus Recording Rules

groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate (from the 5m SLI above; derive longer-window
      # variants from longer-range rates for multi-window alerting)
      - record: slo:error_budget_burn_rate:5m
        expr: |
          (1 - sli:request_success:ratio) / 0.001  # 0.001 = allowed error rate for a 99.9% SLO