Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:51:22 +08:00
commit 23753b435e
24 changed files with 9837 additions and 0 deletions

View File

@@ -0,0 +1,697 @@
# Monitoring Tools Comparison
## Overview Matrix
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|------|------|----------|------------|------|-------------------|
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
---
## Metrics Platforms
### Prometheus
**Type**: Open-source time-series database
**Strengths**:
- ✅ Industry standard for Kubernetes
- ✅ Powerful query language (PromQL)
- ✅ Pull-based model (no agent config)
- ✅ Service discovery
- ✅ Free and open source
- ✅ Huge ecosystem (exporters for everything)
**Weaknesses**:
- ❌ No built-in dashboards (need Grafana)
- ❌ Single-node only (no HA without federation)
- ❌ Limited long-term storage (need Thanos/Cortex)
- ❌ Steep learning curve for PromQL
**Best For**:
- Kubernetes monitoring
- Infrastructure metrics
- Custom application metrics
- Organizations that need control
**Pricing**: Free (open source)
**Setup Complexity**: Medium
**Example**:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'app'
static_configs:
- targets: ['localhost:8080']
```
---
### Datadog
**Type**: SaaS monitoring platform
**Strengths**:
- ✅ Easy to set up (install agent, done)
- ✅ Beautiful pre-built dashboards
- ✅ APM, logs, metrics, traces in one platform
- ✅ Great anomaly detection
- ✅ Excellent integrations (500+)
- ✅ Good mobile app
**Weaknesses**:
- ❌ Very expensive at scale
- ❌ Vendor lock-in
- ❌ Cost can be unpredictable (per-host pricing)
- ❌ Limited PromQL support
**Best For**:
- Teams that want quick setup
- Companies prioritizing ease of use over cost
- Organizations needing full observability
**Pricing**: $15-$31/host/month + custom metrics fees
**Setup Complexity**: Low
**Example**:
```bash
# Install agent
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```
---
### New Relic
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Excellent APM capabilities
- ✅ User-friendly interface
- ✅ Good transaction tracing
- ✅ Comprehensive alerting
- ✅ Generous free tier
**Weaknesses**:
- ❌ Can get expensive at scale
- ❌ Vendor lock-in
- ❌ Query language less powerful than PromQL
- ❌ Limited customization
**Best For**:
- Application performance monitoring
- Teams focused on APM over infrastructure
- Startups (free tier is generous)
**Pricing**: Free up to 100GB/month, then $0.30/GB
**Setup Complexity**: Low
**Example**:
```python
import newrelic.agent
newrelic.agent.initialize('newrelic.ini')
```
---
### CloudWatch
**Type**: AWS-native monitoring
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Native integration with AWS
- ✅ Automatic dashboards for AWS resources
- ✅ Tightly integrated with other AWS services
- ✅ Good for cost if already on AWS
**Weaknesses**:
- ❌ AWS-only (not multi-cloud)
- ❌ Limited query capabilities
- ❌ High costs for custom metrics
- ❌ Basic visualization
- ❌ 1-minute minimum resolution
**Best For**:
- AWS-centric infrastructure
- Quick setup for AWS services
- Organizations already invested in AWS
**Pricing**:
- First 10 custom metrics: Free
- Additional: $0.30/metric/month
- API calls: $0.01/1000 requests
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
**Example**:
```python
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
)
```
---
### Grafana Cloud / Mimir
**Type**: Managed Prometheus-compatible
**Strengths**:
- ✅ Prometheus-compatible (PromQL)
- ✅ Managed service (no ops burden)
- ✅ Good cost model (pay for what you use)
- ✅ Grafana dashboards included
- ✅ Long-term storage
**Weaknesses**:
- ❌ Relatively new (less mature)
- ❌ Some Prometheus features missing
- ❌ Requires Grafana for visualization
**Best For**:
- Teams wanting Prometheus without ops overhead
- Multi-cloud environments
- Organizations already using Grafana
**Pricing**: $8/month + $0.29/1M samples
**Setup Complexity**: Low-Medium
---
## Logging Platforms
### Elasticsearch (ELK Stack)
**Type**: Open-source log search and analytics
**Full Stack**: Elasticsearch + Logstash + Kibana
**Strengths**:
- ✅ Powerful search capabilities
- ✅ Rich query language
- ✅ Great for log analysis
- ✅ Mature ecosystem
- ✅ Can handle large volumes
- ✅ Flexible data model
**Weaknesses**:
- ❌ Complex to operate
- ❌ Resource intensive (RAM hungry)
- ❌ Expensive at scale
- ❌ Requires dedicated ops team
- ❌ Slow for high-cardinality queries
**Best For**:
- Large organizations with ops teams
- Deep log analysis needs
- Search-heavy use cases
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$500-2000/month for medium scale
**Setup Complexity**: High
**Example**:
```json
PUT /logs-2024.10/_doc/1
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment failed"
}
```
---
### Grafana Loki
**Type**: Log aggregation system
**Strengths**:
- ✅ Cost-effective (labels only, not full-text indexing)
- ✅ Easy to operate
- ✅ Prometheus-like label model
- ✅ Great Grafana integration
- ✅ Low resource usage
- ✅ Fast time-range queries
**Weaknesses**:
- ❌ Limited full-text search
- ❌ Requires careful label design
- ❌ Younger ecosystem than ELK
- ❌ Not ideal for complex queries
**Best For**:
- Cost-conscious organizations
- Kubernetes environments
- Teams already using Prometheus
- Time-series log queries
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$100-500/month for medium scale
**Setup Complexity**: Medium
**Example**:
```logql
{job="api", environment="prod"} |= "error" | json | level="error"
```
---
### Splunk
**Type**: Enterprise log management
**Strengths**:
- ✅ Extremely powerful search
- ✅ Great for security/compliance
- ✅ Mature platform
- ✅ Enterprise support
- ✅ Machine learning features
**Weaknesses**:
- ❌ Very expensive
- ❌ Complex pricing (per GB ingested)
- ❌ Steep learning curve
- ❌ Heavy resource usage
**Best For**:
- Large enterprises
- Security operations centers (SOCs)
- Compliance-heavy industries
**Pricing**: $150-$1800/GB/month (depending on tier)
**Setup Complexity**: Medium-High
---
### CloudWatch Logs
**Type**: AWS-native log management
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Integrated with AWS ecosystem
- ✅ CloudWatch Insights for queries
- ✅ Reasonable cost for low volume
**Weaknesses**:
- ❌ AWS-only
- ❌ Limited query capabilities
- ❌ Expensive at high volume
- ❌ Basic visualization
**Best For**:
- AWS-centric applications
- Low-volume logging
- Simple log aggregation
**Pricing**: Tiered (as of May 2025)
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
- Standard logs: $0.50/GB flat
- Storage: $0.03/GB
**Setup Complexity**: Low (AWS), Medium (custom)
---
### Sumo Logic
**Type**: SaaS log management
**Strengths**:
- ✅ Easy to use
- ✅ Good for cloud-native apps
- ✅ Real-time analytics
- ✅ Good compliance features
**Weaknesses**:
- ❌ Expensive at scale
- ❌ Vendor lock-in
- ❌ Limited customization
**Best For**:
- Cloud-native applications
- Teams wanting managed solution
- Security and compliance use cases
**Pricing**: $90-$180/GB/month
**Setup Complexity**: Low
---
## Tracing Platforms
### Jaeger
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Industry standard
- ✅ CNCF graduated project
- ✅ Supports OpenTelemetry
- ✅ Good UI
- ✅ Free and open source
**Weaknesses**:
- ❌ Requires separate storage backend
- ❌ Limited query capabilities
- ❌ No built-in analytics
**Best For**:
- Microservices architectures
- Kubernetes environments
- OpenTelemetry users
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Grafana Tempo
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Cost-effective (object storage)
- ✅ Easy to operate
- ✅ Great Grafana integration
- ✅ TraceQL query language
- ✅ Supports OpenTelemetry
**Weaknesses**:
- ❌ Younger than Jaeger
- ❌ Limited third-party integrations
- ❌ Requires Grafana for UI
**Best For**:
- Cost-conscious organizations
- Teams using Grafana stack
- High trace volumes
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Datadog APM
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Easy to set up
- ✅ Excellent trace visualization
- ✅ Integrated with metrics/logs
- ✅ Automatic service map
- ✅ Good profiling features
**Weaknesses**:
- ❌ Expensive ($31/host/month)
- ❌ Vendor lock-in
- ❌ Limited sampling control
**Best For**:
- Teams wanting ease of use
- Organizations already using Datadog
- Complex microservices
**Pricing**: $31/host/month + $1.70/million spans
**Setup Complexity**: Low
---
### AWS X-Ray
**Type**: AWS-native distributed tracing
**Strengths**:
- ✅ Native AWS integration
- ✅ Automatic instrumentation for AWS services
- ✅ Low cost
**Weaknesses**:
- ❌ AWS-only
- ❌ Basic UI
- ❌ Limited query capabilities
**Best For**:
- AWS-centric applications
- Serverless architectures (Lambda)
- Cost-sensitive projects
**Pricing**: $5/million traces, first 100k free/month
**Setup Complexity**: Low (AWS), Medium (custom)
---
## Full-Stack Observability
### Datadog (Full Platform)
**Components**: Metrics, logs, traces, RUM, synthetics
**Strengths**:
- ✅ Everything in one platform
- ✅ Excellent user experience
- ✅ Correlation across signals
- ✅ Great for teams
**Weaknesses**:
- ❌ Very expensive ($50-100+/host/month)
- ❌ Vendor lock-in
- ❌ Unpredictable costs
**Total Cost** (example 100 hosts):
- Infrastructure: $3,100/month
- APM: $3,100/month
- Logs: ~$2,000/month
- **Total: ~$8,000/month**
---
### Grafana Stack (LGTM)
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
**Strengths**:
- ✅ Open source and cost-effective
- ✅ Unified visualization
- ✅ Prometheus-compatible
- ✅ Great for cloud-native
**Weaknesses**:
- ❌ Requires self-hosting or Grafana Cloud
- ❌ More ops burden
- ❌ Less polished than commercial tools
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$1,500/month
- Ops time: Variable
- **Total: ~$1,500-3,000/month**
---
### Elastic Observability
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
**Strengths**:
- ✅ Powerful search
- ✅ Mature platform
- ✅ Good for log-heavy use cases
**Weaknesses**:
- ❌ Complex to operate
- ❌ Expensive infrastructure
- ❌ Resource intensive
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$3,000-5,000/month
- Ops time: High
- **Total: ~$4,000-7,000/month**
---
### New Relic One
**Components**: Metrics, logs, traces, synthetics
**Strengths**:
- ✅ Generous free tier (100GB)
- ✅ User-friendly
- ✅ Good for startups
**Weaknesses**:
- ❌ Costs increase quickly after free tier
- ❌ Vendor lock-in
**Total Cost**:
- Free: up to 100GB/month
- Paid: $0.30/GB beyond 100GB
---
## Cloud Provider Native
### AWS (CloudWatch + X-Ray)
**Use When**:
- Primarily on AWS
- Simple monitoring needs
- Want minimal setup
**Avoid When**:
- Multi-cloud environment
- Need advanced features
- High log volume (expensive)
**Cost** (example):
- 100 EC2 instances with basic metrics: ~$150/month
- 1TB logs: ~$500/month ingestion + storage
- X-Ray: ~$50/month
---
### GCP (Cloud Monitoring + Cloud Trace)
**Use When**:
- Primarily on GCP
- Using GKE
- Want tight GCP integration
**Avoid When**:
- Multi-cloud environment
- Need advanced querying
**Cost** (example):
- First 150MB/month per resource: Free
- Additional: $0.2508/MB
---
### Azure (Azure Monitor)
**Use When**:
- Primarily on Azure
- Using AKS
- Need Azure integration
**Avoid When**:
- Multi-cloud
- Need advanced features
**Cost** (example):
- First 5GB: Free
- Additional: $2.76/GB
---
## Decision Matrix
### Choose Prometheus + Grafana If:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
- ✅ Need Prometheus ecosystem
### Choose Datadog If:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
- ✅ Limited ops team
- ✅ Need excellent UX
### Choose ELK If:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
- ✅ Compliance requirements
- ✅ Willing to invest in infrastructure
### Choose Grafana Stack (LGTM) If:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
- ✅ Already using Prometheus
- ✅ Have some ops capacity
### Choose New Relic If:
- ✅ Startup with free tier
- ✅ APM is priority
- ✅ Want easy setup
- ✅ Don't need heavy customization
### Choose Cloud Native (CloudWatch/etc) If:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
- ✅ Low to medium scale
---
## Cost Comparison
**Example: 100 hosts, 1TB logs/month, 1M spans/day**
| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
---
## Recommendations by Company Size
### Startup (< 10 engineers)
**Recommendation**: New Relic or Grafana Cloud
- Minimal ops burden
- Good free tiers
- Easy to get started
### Small Company (10-50 engineers)
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
- Cost-effective
- Growing ops capacity
- Flexibility
### Medium Company (50-200 engineers)
**Recommendation**: Datadog or Grafana Stack
- Datadog if budget allows
- Grafana Stack if cost-conscious
### Large Enterprise (200+ engineers)
**Recommendation**: Build observability platform
- Mix of tools based on needs
- Dedicated observability team
- Custom integrations