zhongwei/gh-ahmedasmar-devops-claude-skills-monitoring-observability

Files

Zhongwei Li 23753b435e Initial commit

2025-11-29 17:51:22 +08:00

14 KiB

Raw Blame History

Monitoring Tools Comparison

Overview Matrix

Tool	Type	Best For	Complexity	Cost	Cloud/Self-Hosted
Prometheus	Metrics	Kubernetes, time-series	Medium	Free	Self-hosted
Grafana	Visualization	Dashboards, multi-source	Low-Medium	Free	Both
Datadog	Full-stack	Ease of use, APM	Low	High	Cloud
New Relic	Full-stack	APM, traces	Low	High	Cloud
Elasticsearch (ELK)	Logs	Log search, analysis	High	Medium	Both
Grafana Loki	Logs	Cost-effective logs	Medium	Free	Both
CloudWatch	AWS-native	AWS infrastructure	Low	Medium	Cloud
Jaeger	Tracing	Distributed tracing	Medium	Free	Self-hosted
Grafana Tempo	Tracing	Cost-effective tracing	Medium	Free	Self-hosted

Metrics Platforms

Prometheus

Type: Open-source time-series database

Strengths:

✅ Industry standard for Kubernetes
✅ Powerful query language (PromQL)
✅ Pull-based model (no agent config)
✅ Service discovery
✅ Free and open source
✅ Huge ecosystem (exporters for everything)

Weaknesses:

❌ No built-in dashboards (need Grafana)
❌ Single-node only (no HA without federation)
❌ Limited long-term storage (need Thanos/Cortex)
❌ Steep learning curve for PromQL

Best For:

Kubernetes monitoring
Infrastructure metrics
Custom application metrics
Organizations that need control

Pricing: Free (open source)

Setup Complexity: Medium

Example:

# prometheus.yml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:8080']

Datadog

Type: SaaS monitoring platform

Strengths:

✅ Easy to set up (install agent, done)
✅ Beautiful pre-built dashboards
✅ APM, logs, metrics, traces in one platform
✅ Great anomaly detection
✅ Excellent integrations (500+)
✅ Good mobile app

Weaknesses:

❌ Very expensive at scale
❌ Vendor lock-in
❌ Cost can be unpredictable (per-host pricing)
❌ Limited PromQL support

Best For:

Teams that want quick setup
Companies prioritizing ease of use over cost
Organizations needing full observability

Pricing: $15-$31/host/month + custom metrics fees

Setup Complexity: Low

Example:

# Install agent
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

New Relic

Type: SaaS application performance monitoring

Strengths:

✅ Excellent APM capabilities
✅ User-friendly interface
✅ Good transaction tracing
✅ Comprehensive alerting
✅ Generous free tier

Weaknesses:

❌ Can get expensive at scale
❌ Vendor lock-in
❌ Query language less powerful than PromQL
❌ Limited customization

Best For:

Application performance monitoring
Teams focused on APM over infrastructure
Startups (free tier is generous)

Pricing: Free up to 100GB/month, then $0.30/GB

Setup Complexity: Low

Example:

import newrelic.agent
newrelic.agent.initialize('newrelic.ini')

CloudWatch

Type: AWS-native monitoring

Strengths:

✅ Zero setup for AWS services
✅ Native integration with AWS
✅ Automatic dashboards for AWS resources
✅ Tightly integrated with other AWS services
✅ Good for cost if already on AWS

Weaknesses:

❌ AWS-only (not multi-cloud)
❌ Limited query capabilities
❌ High costs for custom metrics
❌ Basic visualization
❌ 1-minute minimum resolution

Best For:

AWS-centric infrastructure
Quick setup for AWS services
Organizations already invested in AWS

Pricing:

First 10 custom metrics: Free
Additional: $0.30/metric/month
API calls: $0.01/1000 requests

Setup Complexity: Low (for AWS), Medium (for custom metrics)

Example:

import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
)

Grafana Cloud / Mimir

Type: Managed Prometheus-compatible

Strengths:

✅ Prometheus-compatible (PromQL)
✅ Managed service (no ops burden)
✅ Good cost model (pay for what you use)
✅ Grafana dashboards included
✅ Long-term storage

Weaknesses:

❌ Relatively new (less mature)
❌ Some Prometheus features missing
❌ Requires Grafana for visualization

Best For:

Teams wanting Prometheus without ops overhead
Multi-cloud environments
Organizations already using Grafana

Pricing: $8/month + $0.29/1M samples

Setup Complexity: Low-Medium

Logging Platforms

Elasticsearch (ELK Stack)

Type: Open-source log search and analytics

Full Stack: Elasticsearch + Logstash + Kibana

Strengths:

✅ Powerful search capabilities
✅ Rich query language
✅ Great for log analysis
✅ Mature ecosystem
✅ Can handle large volumes
✅ Flexible data model

Weaknesses:

❌ Complex to operate
❌ Resource intensive (RAM hungry)
❌ Expensive at scale
❌ Requires dedicated ops team
❌ Slow for high-cardinality queries

Best For:

Large organizations with ops teams
Deep log analysis needs
Search-heavy use cases

Pricing: Free (open source) + infrastructure costs

Infrastructure: ~$500-2000/month for medium scale

Setup Complexity: High

Example:

PUT /logs-2024.10/_doc/1
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment failed"
}

Grafana Loki

Type: Log aggregation system

Strengths:

✅ Cost-effective (labels only, not full-text indexing)
✅ Easy to operate
✅ Prometheus-like label model
✅ Great Grafana integration
✅ Low resource usage
✅ Fast time-range queries

Weaknesses:

❌ Limited full-text search
❌ Requires careful label design
❌ Younger ecosystem than ELK
❌ Not ideal for complex queries

Best For:

Cost-conscious organizations
Kubernetes environments
Teams already using Prometheus
Time-series log queries

Pricing: Free (open source) + infrastructure costs

Infrastructure: ~$100-500/month for medium scale

Setup Complexity: Medium

Example:

{job="api", environment="prod"} |= "error" | json | level="error"

Splunk

Type: Enterprise log management

Strengths:

✅ Extremely powerful search
✅ Great for security/compliance
✅ Mature platform
✅ Enterprise support
✅ Machine learning features

Weaknesses:

❌ Very expensive
❌ Complex pricing (per GB ingested)
❌ Steep learning curve
❌ Heavy resource usage

Best For:

Large enterprises
Security operations centers (SOCs)
Compliance-heavy industries

Pricing: $150-$1800/GB/month (depending on tier)

Setup Complexity: Medium-High

CloudWatch Logs

Type: AWS-native log management

Strengths:

✅ Zero setup for AWS services
✅ Integrated with AWS ecosystem
✅ CloudWatch Insights for queries
✅ Reasonable cost for low volume

Weaknesses:

❌ AWS-only
❌ Limited query capabilities
❌ Expensive at high volume
❌ Basic visualization

Best For:

AWS-centric applications
Low-volume logging
Simple log aggregation

Pricing: Tiered (as of May 2025)

Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
Standard logs: $0.50/GB flat
Storage: $0.03/GB

Setup Complexity: Low (AWS), Medium (custom)

Sumo Logic

Type: SaaS log management

Strengths:

✅ Easy to use
✅ Good for cloud-native apps
✅ Real-time analytics
✅ Good compliance features

Weaknesses:

❌ Expensive at scale
❌ Vendor lock-in
❌ Limited customization

Best For:

Cloud-native applications
Teams wanting managed solution
Security and compliance use cases

Pricing: $90-$180/GB/month

Setup Complexity: Low

Tracing Platforms

Jaeger

Type: Open-source distributed tracing

Strengths:

✅ Industry standard
✅ CNCF graduated project
✅ Supports OpenTelemetry
✅ Good UI
✅ Free and open source

Weaknesses:

❌ Requires separate storage backend
❌ Limited query capabilities
❌ No built-in analytics

Best For:

Microservices architectures
Kubernetes environments
OpenTelemetry users

Pricing: Free (open source) + storage costs

Setup Complexity: Medium

Grafana Tempo

Type: Open-source distributed tracing

Strengths:

✅ Cost-effective (object storage)
✅ Easy to operate
✅ Great Grafana integration
✅ TraceQL query language
✅ Supports OpenTelemetry

Weaknesses:

❌ Younger than Jaeger
❌ Limited third-party integrations
❌ Requires Grafana for UI

Best For:

Cost-conscious organizations
Teams using Grafana stack
High trace volumes

Pricing: Free (open source) + storage costs

Setup Complexity: Medium

Datadog APM

Type: SaaS application performance monitoring

Strengths:

✅ Easy to set up
✅ Excellent trace visualization
✅ Integrated with metrics/logs
✅ Automatic service map
✅ Good profiling features

Weaknesses:

❌ Expensive ($31/host/month)
❌ Vendor lock-in
❌ Limited sampling control

Best For:

Teams wanting ease of use
Organizations already using Datadog
Complex microservices

Pricing: $31/host/month + $1.70/million spans

Setup Complexity: Low

AWS X-Ray

Type: AWS-native distributed tracing

Strengths:

✅ Native AWS integration
✅ Automatic instrumentation for AWS services
✅ Low cost

Weaknesses:

❌ AWS-only
❌ Basic UI
❌ Limited query capabilities

Best For:

AWS-centric applications
Serverless architectures (Lambda)
Cost-sensitive projects

Pricing: $5/million traces, first 100k free/month

Setup Complexity: Low (AWS), Medium (custom)

Full-Stack Observability

Datadog (Full Platform)

Components: Metrics, logs, traces, RUM, synthetics

Strengths:

✅ Everything in one platform
✅ Excellent user experience
✅ Correlation across signals
✅ Great for teams

Weaknesses:

❌ Very expensive ($50-100+/host/month)
❌ Vendor lock-in
❌ Unpredictable costs

Total Cost (example 100 hosts):

Infrastructure: $3,100/month
APM: $3,100/month
Logs: ~$2,000/month
Total: ~$8,000/month

Grafana Stack (LGTM)

Components: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)

Strengths:

✅ Open source and cost-effective
✅ Unified visualization
✅ Prometheus-compatible
✅ Great for cloud-native

Weaknesses:

❌ Requires self-hosting or Grafana Cloud
❌ More ops burden
❌ Less polished than commercial tools

Total Cost (self-hosted, 100 hosts):

Infrastructure: ~$1,500/month
Ops time: Variable
Total: ~$1,500-3,000/month

Elastic Observability

Components: Elasticsearch (logs), Kibana (viz), APM, metrics

Strengths:

✅ Powerful search
✅ Mature platform
✅ Good for log-heavy use cases

Weaknesses:

❌ Complex to operate
❌ Expensive infrastructure
❌ Resource intensive

Total Cost (self-hosted, 100 hosts):

Infrastructure: ~$3,000-5,000/month
Ops time: High
Total: ~$4,000-7,000/month

New Relic One

Components: Metrics, logs, traces, synthetics

Strengths:

✅ Generous free tier (100GB)
✅ User-friendly
✅ Good for startups

Weaknesses:

❌ Costs increase quickly after free tier
❌ Vendor lock-in

Total Cost:

Free: up to 100GB/month
Paid: $0.30/GB beyond 100GB

Cloud Provider Native

AWS (CloudWatch + X-Ray)

Use When:

Primarily on AWS
Simple monitoring needs
Want minimal setup

Avoid When:

Multi-cloud environment
Need advanced features
High log volume (expensive)

Cost (example):

100 EC2 instances with basic metrics: ~$150/month
1TB logs: ~$500/month ingestion + storage
X-Ray: ~$50/month

GCP (Cloud Monitoring + Cloud Trace)

Use When:

Primarily on GCP
Using GKE
Want tight GCP integration

Avoid When:

Multi-cloud environment
Need advanced querying

Cost (example):

First 150MB/month per resource: Free
Additional: $0.2508/MB

Azure (Azure Monitor)

Use When:

Primarily on Azure
Using AKS
Need Azure integration

Avoid When:

Multi-cloud
Need advanced features

Cost (example):

First 5GB: Free
Additional: $2.76/GB

Decision Matrix

Choose Prometheus + Grafana If:

✅ Using Kubernetes
✅ Want control and customization
✅ Have ops capacity
✅ Budget-conscious
✅ Need Prometheus ecosystem

Choose Datadog If:

✅ Want ease of use
✅ Need full observability now
✅ Budget allows ($8k+/month for 100 hosts)
✅ Limited ops team
✅ Need excellent UX

Choose ELK If:

✅ Heavy log analysis needs
✅ Need powerful search
✅ Have dedicated ops team
✅ Compliance requirements
✅ Willing to invest in infrastructure

Choose Grafana Stack (LGTM) If:

✅ Want open source full stack
✅ Cost-effective solution
✅ Cloud-native architecture
✅ Already using Prometheus
✅ Have some ops capacity

Choose New Relic If:

✅ Startup with free tier
✅ APM is priority
✅ Want easy setup
✅ Don't need heavy customization

Choose Cloud Native (CloudWatch/etc) If:

✅ Single cloud provider
✅ Simple needs
✅ Want minimal setup
✅ Low to medium scale

Cost Comparison

Example: 100 hosts, 1TB logs/month, 1M spans/day

Solution	Monthly Cost	Setup	Ops Burden
Prometheus + Loki + Tempo	$1,500	Medium	Medium
Grafana Cloud	$3,000	Low	Low
Datadog	$8,000	Low	None
New Relic	$3,500	Low	None
ELK Stack	$4,000	High	High
CloudWatch	$2,000	Low	Low

Recommendations by Company Size

Startup (< 10 engineers)

Recommendation: New Relic or Grafana Cloud

Minimal ops burden
Good free tiers
Easy to get started

Small Company (10-50 engineers)

Recommendation: Prometheus + Grafana + Loki (self-hosted or cloud)

Cost-effective
Growing ops capacity
Flexibility

Medium Company (50-200 engineers)

Recommendation: Datadog or Grafana Stack

Datadog if budget allows
Grafana Stack if cost-conscious

Large Enterprise (200+ engineers)

Recommendation: Build observability platform

Mix of tools based on needs
Dedicated observability team
Custom integrations

14 KiB Raw Blame History

Monitoring Tools Comparison

Overview Matrix

Metrics Platforms

Prometheus

Datadog

New Relic

CloudWatch

Grafana Cloud / Mimir

Logging Platforms

Elasticsearch (ELK Stack)

Grafana Loki

Splunk

CloudWatch Logs

Sumo Logic

Tracing Platforms

Jaeger

Grafana Tempo

Datadog APM

AWS X-Ray

Full-Stack Observability

Datadog (Full Platform)

Grafana Stack (LGTM)

Elastic Observability

New Relic One

Cloud Provider Native

AWS (CloudWatch + X-Ray)

GCP (Cloud Monitoring + Cloud Trace)

Azure (Azure Monitor)

Decision Matrix

Choose Prometheus + Grafana If:

Choose Datadog If:

Choose ELK If:

Choose Grafana Stack (LGTM) If:

Choose New Relic If:

Choose Cloud Native (CloudWatch/etc) If:

Cost Comparison

Recommendations by Company Size

Startup (< 10 engineers)

Small Company (10-50 engineers)

Medium Company (50-200 engineers)

Large Enterprise (200+ engineers)

14 KiB

Raw Blame History