Initial commit

2025-11-29 17:51:22 +08:00
commit 23753b435e
24 changed files with 9837 additions and 0 deletions
--- a/references/tool_comparison.md
+++ b/references/tool_comparison.md
@@ -0,0 +1,697 @@
+# Monitoring Tools Comparison
+
+## Overview Matrix
+
+| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
+|------|------|----------|------------|------|-------------------|
+| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
+| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
+| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
+| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
+| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
+| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
+| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
+| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
+| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
+
+---
+
+## Metrics Platforms
+
+### Prometheus
+
+**Type**: Open-source time-series database
+
+**Strengths**:
+- ✅ Industry standard for Kubernetes
+- ✅ Powerful query language (PromQL)
+- ✅ Pull-based model (Prometheus scrapes HTTP endpoints, so targets need no push configuration)
+- ✅ Service discovery
+- ✅ Free and open source
+- ✅ Huge ecosystem (exporters for everything)
+
+**Weaknesses**:
+- ❌ No built-in dashboards (need Grafana)
+- ❌ No built-in clustering (HA means running duplicate replicas; Thanos/Cortex can deduplicate them)
+- ❌ Limited long-term storage (need Thanos/Cortex)
+- ❌ Steep learning curve for PromQL
+
+**Best For**:
+- Kubernetes monitoring
+- Infrastructure metrics
+- Custom application metrics
+- Organizations that need control
+
+**Pricing**: Free (open source)
+
+**Setup Complexity**: Medium
+
+**Example**:
+```yaml
+# prometheus.yml
+scrape_configs:
+  - job_name: 'app'
+    static_configs:
+      - targets: ['localhost:8080']
+```
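+
+To make the pull model concrete, the sketch below renders the plain-text payload that a scrape target such as `localhost:8080` would return on its metrics endpoint. The metric name `app_requests_total`, its labels, and the helper `render_counter` are hypothetical; a real service would normally rely on an official client library rather than hand-rolled formatting.
+
+```python
+def render_counter(name, help_text, samples):
+    """Render one counter in the Prometheus text exposition format.
+
+    `samples` maps a tuple of (label, value) pairs to the counter value.
+    """
+    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
+    for labels, value in samples.items():
+        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
+        # Omit the braces entirely when a sample has no labels.
+        selector = f"{{{label_str}}}" if label_str else ""
+        lines.append(f"{name}{selector} {value}")
+    return "\n".join(lines) + "\n"
+
+
+# Hypothetical metric and labels, for illustration only.
+body = render_counter(
+    "app_requests_total",
+    "Total HTTP requests handled.",
+    {(("method", "GET"), ("status", "200")): 42},
+)
+```
+
+Serving `body` over HTTP is all a target has to do; Prometheus decides when and how often to fetch it.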
+
+---
+
+### Datadog
+
+**Type**: SaaS monitoring platform
+
+**Strengths**:
+- ✅ Easy to set up (install agent, done)
+- ✅ Beautiful pre-built dashboards
+- ✅ APM, logs, metrics, traces in one platform
+- ✅ Great anomaly detection
+- ✅ Excellent integrations (500+)
+- ✅ Good mobile app
+
+**Weaknesses**:
+- ❌ Very expensive at scale
+- ❌ Vendor lock-in
+- ❌ Cost can be unpredictable (per-host pricing plus usage-based fees such as custom metrics)
+- ❌ Limited PromQL support
+
+**Best For**:
+- Teams that want quick setup
+- Companies prioritizing ease of use over cost
+- Organizations needing full observability
+
+**Pricing**: $15-$31/host/month + custom metrics fees
+
+**Setup Complexity**: Low
+
+**Example**:
+```bash
+# Install agent
+DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
+```
+
+---
+
+### New Relic
+
+**Type**: SaaS application performance monitoring
+
+**Strengths**:
+- ✅ Excellent APM capabilities
+- ✅ User-friendly interface
+- ✅ Good transaction tracing
+- ✅ Comprehensive alerting
+- ✅ Generous free tier
+
+**Weaknesses**:
+- ❌ Can get expensive at scale
+- ❌ Vendor lock-in
+- ❌ Query language less powerful than PromQL
+- ❌ Limited customization
+
+**Best For**:
+- Application performance monitoring
+- Teams focused on APM over infrastructure
+- Startups (free tier is generous)
+
+**Pricing**: Free up to 100GB/month, then $0.30/GB
+
+**Setup Complexity**: Low
+
+**Example**:
+```python
+import newrelic.agent
+newrelic.agent.initialize('newrelic.ini')
+```
+
+---
+
+### CloudWatch
+
+**Type**: AWS-native monitoring
+
+**Strengths**:
+- ✅ Zero setup for AWS services
+- ✅ Native integration with AWS
+- ✅ Automatic dashboards for AWS resources
+- ✅ Tightly integrated with other AWS services
+- ✅ Cost-effective if you are already on AWS
+
+**Weaknesses**:
+- ❌ AWS-only (not multi-cloud)
+- ❌ Limited query capabilities
+- ❌ High costs for custom metrics
+- ❌ Basic visualization
+- ❌ 1-minute resolution by default (high-resolution custom metrics go down to 1 second)
+
+**Best For**:
+- AWS-centric infrastructure
+- Quick setup for AWS services
+- Organizations already invested in AWS
+
+**Pricing**:
+- First 10 custom metrics: Free
+- Additional: $0.30/metric/month
+- API calls: $0.01/1000 requests
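+
+As a rough illustration of how these rates combine, the sketch below estimates a monthly bill from the three figures listed above. The function name `cloudwatch_monthly_cost` and the sample workload are assumptions; actual AWS billing has more tiers and regional variation.
+
+```python
+FREE_CUSTOM_METRICS = 10          # first 10 custom metrics are free
+PRICE_PER_METRIC = 0.30           # USD per additional metric per month
+PRICE_PER_1000_API_CALLS = 0.01   # USD per 1000 API requests
+
+
+def cloudwatch_monthly_cost(custom_metrics, api_calls):
+    """Estimate the monthly cost in USD using only the rates listed above."""
+    billable_metrics = max(0, custom_metrics - FREE_CUSTOM_METRICS)
+    metric_cost = billable_metrics * PRICE_PER_METRIC
+    api_cost = api_calls / 1000 * PRICE_PER_1000_API_CALLS
+    return round(metric_cost + api_cost, 2)
+
+
+# Hypothetical workload: 210 custom metrics and 1M API requests per month.
+estimate = cloudwatch_monthly_cost(custom_metrics=210, api_calls=1_000_000)
+```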
+
+**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
+
+**Example**:
+```python
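+import boto3
+
+# Assumes AWS credentials and a default region are already configured.
+cloudwatch = boto3.client('cloudwatch')
+
+# Publish one data point for a custom metric in the 'MyApp' namespace.
+cloudwatch.put_metric_data(
+    Namespace='MyApp',
+    MetricData=[{'MetricName': 'RequestCount', 'Value': 1, 'Unit': 'Count'}]
+)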
+```
+
+---
+
+### Grafana Cloud / Mimir
+
+**Type**: Managed Prometheus-compatible
+
+**Strengths**:
+- ✅ Prometheus-compatible (PromQL)
+- ✅ Managed service (no ops burden)
+- ✅ Good cost model (pay for what you use)
+- ✅ Grafana dashboards included
+- ✅ Long-term storage
+
+**Weaknesses**:
+- ❌ Relatively new (less mature)
+- ❌ Some Prometheus features missing
+- ❌ Requires Grafana for visualization
+
+**Best For**:
+- Teams wanting Prometheus without ops overhead
+- Multi-cloud environments
+- Organizations already using Grafana
+
+**Pricing**: $8/month + $0.29/1M samples
+
+**Setup Complexity**: Low-Medium
+
+---
+
+## Logging Platforms
+
+### Elasticsearch (ELK Stack)
+
+**Type**: Open-source log search and analytics
+
+**Full Stack**: Elasticsearch + Logstash + Kibana
+
+**Strengths**:
+- ✅ Powerful search capabilities
+- ✅ Rich query language
+- ✅ Great for log analysis
+- ✅ Mature ecosystem
+- ✅ Can handle large volumes
+- ✅ Flexible data model
+
+**Weaknesses**:
+- ❌ Complex to operate
+- ❌ Resource intensive (RAM hungry)
+- ❌ Expensive at scale
+- ❌ Requires dedicated ops team
+- ❌ Slow for high-cardinality queries
+
+**Best For**:
+- Large organizations with ops teams
+- Deep log analysis needs
+- Search-heavy use cases
+
+**Pricing**: Free (open source) + infrastructure costs
+
+**Infrastructure**: ~$500-2000/month for medium scale
+
+**Setup Complexity**: High
+
+**Example**:
+```json
+PUT /logs-2024.10/_doc/1
+{
+  "timestamp": "2024-10-28T14:32:15Z",
+  "level": "error",
+  "message": "Payment failed"
+}
+```
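+
+The document above can also be assembled in application code before it is shipped to the cluster. The following is a minimal sketch using only the standard library; the helper `make_log_document` is hypothetical, and the monthly index naming simply mirrors the `logs-2024.10` pattern from the request above.
+
+```python
+import json
+from datetime import datetime, timezone
+
+
+def make_log_document(level, message, when=None):
+    """Build a log document with the same three fields as the example."""
+    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
+    return {
+        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
+        "level": level,
+        "message": message,
+    }
+
+
+doc = make_log_document(
+    "error",
+    "Payment failed",
+    datetime(2024, 10, 28, 14, 32, 15, tzinfo=timezone.utc),
+)
+# Derive a monthly index name (assumed convention: logs-YYYY.MM).
+index_name = "logs-" + doc["timestamp"][:7].replace("-", ".")
+payload = json.dumps(doc)  # request body to send to Elasticsearch
+```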
+
+---
+
+### Grafana Loki
+
+**Type**: Log aggregation system
+
+**Strengths**:
+- ✅ Cost-effective (labels only, not full-text indexing)
+- ✅ Easy to operate
+- ✅ Prometheus-like label model
+- ✅ Great Grafana integration
+- ✅ Low resource usage
+- ✅ Fast time-range queries
+
+**Weaknesses**:
+- ❌ Limited full-text search
+- ❌ Requires careful label design
+- ❌ Younger ecosystem than ELK
+- ❌ Not ideal for complex queries
+
+**Best For**:
+- Cost-conscious organizations
+- Kubernetes environments
+- Teams already using Prometheus
+- Time-range log queries
+
+**Pricing**: Free (open source) + infrastructure costs
+
+**Infrastructure**: ~$100-500/month for medium scale
+
+**Setup Complexity**: Medium
+
+**Example**:
+```logql
+{job="api", environment="prod"} |= "error" | json | level="error"
+```
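+
+To show what each stage of that query does, here is a plain-Python emulation of the same pipeline. It is only an illustration of the semantics (stream selector, line filter, JSON parser, label filter), not how Loki executes queries; the function name `matches_query` is hypothetical.
+
+```python
+import json
+
+
+def matches_query(stream_labels, line):
+    """Return True if a log line would pass every stage of the query above."""
+    # Stream selector: {job="api", environment="prod"}
+    if stream_labels.get("job") != "api":
+        return False
+    if stream_labels.get("environment") != "prod":
+        return False
+    # Line filter: |= "error" (case-sensitive substring match)
+    if "error" not in line:
+        return False
+    # Parser stage: | json (extract fields from the line)
+    try:
+        fields = json.loads(line)
+    except json.JSONDecodeError:
+        return False
+    # Label filter: | level="error"
+    return isinstance(fields, dict) and fields.get("level") == "error"
+```
+
+Only the stream selector uses indexed labels; the remaining stages scan line content at query time, which is why label design matters so much for performance.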
+
+---
+
+### Splunk
+
+**Type**: Enterprise log management
+
+**Strengths**:
+- ✅ Extremely powerful search
+- ✅ Great for security/compliance
+- ✅ Mature platform
+- ✅ Enterprise support
+- ✅ Machine learning features
+
+**Weaknesses**:
+- ❌ Very expensive
+- ❌ Complex pricing (per GB ingested)
+- ❌ Steep learning curve
+- ❌ Heavy resource usage
+
+**Best For**:
+- Large enterprises
+- Security operations centers (SOCs)
+- Compliance-heavy industries
+
+**Pricing**: $150-$1800/GB/month (depending on tier)
+
+**Setup Complexity**: Medium-High
+
+---
+
+### CloudWatch Logs
+
+**Type**: AWS-native log management
+
+**Strengths**:
+- ✅ Zero setup for AWS services
+- ✅ Integrated with AWS ecosystem
+- ✅ CloudWatch Insights for queries
+- ✅ Reasonable cost for low volume
+
+**Weaknesses**:
+- ❌ AWS-only
+- ❌ Limited query capabilities
+- ❌ Expensive at high volume
+- ❌ Basic visualization
+
+**Best For**:
+- AWS-centric applications
+- Low-volume logging
+- Simple log aggregation
+
+**Pricing**: Tiered (as of May 2025)
+- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
+- Standard logs: $0.50/GB flat
+- Storage: $0.03/GB
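+
+Tiered pricing is easy to misread, so the sketch below walks a volume through the two vended-log tiers listed above. It assumes 1 TB = 1024 GB and deliberately refuses to price anything beyond 30 TB, because the lower-tier rates are not given here; the function name is hypothetical.
+
+```python
+GB_PER_TB = 1024  # assumption: binary terabytes
+
+# (tier size in GB, USD per GB), in the order the tiers are consumed.
+VENDED_TIERS = [
+    (10 * GB_PER_TB, 0.50),
+    (20 * GB_PER_TB, 0.25),
+]
+
+
+def vended_logs_ingestion_cost(gb):
+    """Price vended-log ingestion using only the tiers listed above."""
+    remaining = gb
+    cost = 0.0
+    for tier_size, rate in VENDED_TIERS:
+        used = min(remaining, tier_size)
+        cost += used * rate
+        remaining -= used
+        if remaining <= 0:
+            return cost
+    raise ValueError("volume exceeds the tiers listed; lower-tier rates are not given")
+
+
+# 15 TB: the first 10 TB at $0.50/GB, the remaining 5 TB at $0.25/GB.
+cost_15_tb = vended_logs_ingestion_cost(15 * GB_PER_TB)
+```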
+
+**Setup Complexity**: Low (AWS), Medium (custom)
+
+---
+
+### Sumo Logic
+
+**Type**: SaaS log management
+
+**Strengths**:
+- ✅ Easy to use
+- ✅ Good for cloud-native apps
+- ✅ Real-time analytics
+- ✅ Good compliance features
+
+**Weaknesses**:
+- ❌ Expensive at scale
+- ❌ Vendor lock-in
+- ❌ Limited customization
+
+**Best For**:
+- Cloud-native applications
+- Teams wanting managed solution
+- Security and compliance use cases
+
+**Pricing**: $90-$180/GB/month
+
+**Setup Complexity**: Low
+
+---
+
+## Tracing Platforms
+
+### Jaeger
+
+**Type**: Open-source distributed tracing
+
+**Strengths**:
+- ✅ Industry standard
+- ✅ CNCF graduated project
+- ✅ Supports OpenTelemetry
+- ✅ Good UI
+- ✅ Free and open source
+
+**Weaknesses**:
+- ❌ Requires separate storage backend
+- ❌ Limited query capabilities
+- ❌ No built-in analytics
+
+**Best For**:
+- Microservices architectures
+- Kubernetes environments
+- OpenTelemetry users
+
+**Pricing**: Free (open source) + storage costs
+
+**Setup Complexity**: Medium
+
+---
+
+### Grafana Tempo
+
+**Type**: Open-source distributed tracing
+
+**Strengths**:
+- ✅ Cost-effective (object storage)
+- ✅ Easy to operate
+- ✅ Great Grafana integration
+- ✅ TraceQL query language
+- ✅ Supports OpenTelemetry
+
+**Weaknesses**:
+- ❌ Younger than Jaeger
+- ❌ Limited third-party integrations
+- ❌ Requires Grafana for UI
+
+**Best For**:
+- Cost-conscious organizations
+- Teams using Grafana stack
+- High trace volumes
+
+**Pricing**: Free (open source) + storage costs
+
+**Setup Complexity**: Medium
+
+---
+
+### Datadog APM
+
+**Type**: SaaS application performance monitoring
+
+**Strengths**:
+- ✅ Easy to set up
+- ✅ Excellent trace visualization
+- ✅ Integrated with metrics/logs
+- ✅ Automatic service map
+- ✅ Good profiling features
+
+**Weaknesses**:
+- ❌ Expensive ($31/host/month)
+- ❌ Vendor lock-in
+- ❌ Limited sampling control
+
+**Best For**:
+- Teams wanting ease of use
+- Organizations already using Datadog
+- Complex microservices
+
+**Pricing**: $31/host/month + $1.70/million spans
+
+**Setup Complexity**: Low
+
+---
+
+### AWS X-Ray
+
+**Type**: AWS-native distributed tracing
+
+**Strengths**:
+- ✅ Native AWS integration
+- ✅ Automatic instrumentation for AWS services
+- ✅ Low cost
+
+**Weaknesses**:
+- ❌ AWS-only
+- ❌ Basic UI
+- ❌ Limited query capabilities
+
+**Best For**:
+- AWS-centric applications
+- Serverless architectures (Lambda)
+- Cost-sensitive projects
+
+**Pricing**: $5/million traces, first 100k free/month
+
+**Setup Complexity**: Low (AWS), Medium (custom)
+
+---
+
+## Full-Stack Observability
+
+### Datadog (Full Platform)
+
+**Components**: Metrics, logs, traces, RUM, synthetics
+
+**Strengths**:
+- ✅ Everything in one platform
+- ✅ Excellent user experience
+- ✅ Correlation across signals
+- ✅ Great for teams
+
+**Weaknesses**:
+- ❌ Very expensive ($50-100+/host/month)
+- ❌ Vendor lock-in
+- ❌ Unpredictable costs
+
+**Total Cost** (example 100 hosts):
+- Infrastructure: $3,100/month
+- APM: $3,100/month
+- Logs: ~$2,000/month
+- **Total: ~$8,000/month**
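+
+The line items above can be reproduced with a few lines of arithmetic. This sketch assumes the $31/host rate for both infrastructure and APM that the figures imply, and treats the log cost as a flat estimate; the variable names are illustrative.
+
+```python
+HOSTS = 100
+INFRA_PER_HOST = 31    # USD/host/month, implied by $3,100 for 100 hosts
+APM_PER_HOST = 31      # USD/host/month
+LOGS_ESTIMATE = 2000   # USD/month, approximate
+
+line_items = {
+    "Infrastructure": HOSTS * INFRA_PER_HOST,
+    "APM": HOSTS * APM_PER_HOST,
+    "Logs": LOGS_ESTIMATE,
+}
+# Exact sum before rounding to a headline figure.
+total = sum(line_items.values())
+per_host = total / HOSTS
+```
+
+The per-host figure lands inside the $50-100+/host/month range quoted under Weaknesses.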
+
+---
+
+### Grafana Stack (LGTM)
+
+**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
+
+**Strengths**:
+- ✅ Open source and cost-effective
+- ✅ Unified visualization
+- ✅ Prometheus-compatible
+- ✅ Great for cloud-native
+
+**Weaknesses**:
+- ❌ Requires self-hosting or Grafana Cloud
+- ❌ More ops burden
+- ❌ Less polished than commercial tools
+
+**Total Cost** (self-hosted, 100 hosts):
+- Infrastructure: ~$1,500/month
+- Ops time: Variable
+- **Total: ~$1,500-3,000/month**
+
+---
+
+### Elastic Observability
+
+**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
+
+**Strengths**:
+- ✅ Powerful search
+- ✅ Mature platform
+- ✅ Good for log-heavy use cases
+
+**Weaknesses**:
+- ❌ Complex to operate
+- ❌ Expensive infrastructure
+- ❌ Resource intensive
+
+**Total Cost** (self-hosted, 100 hosts):
+- Infrastructure: ~$3,000-5,000/month
+- Ops time: High
+- **Total: ~$4,000-7,000/month**
+
+---
+
+### New Relic One
+
+**Components**: Metrics, logs, traces, synthetics
+
+**Strengths**:
+- ✅ Generous free tier (100GB)
+- ✅ User-friendly
+- ✅ Good for startups
+
+**Weaknesses**:
+- ❌ Costs increase quickly after free tier
+- ❌ Vendor lock-in
+
+**Total Cost**:
+- Free: up to 100GB/month
+- Paid: $0.30/GB beyond 100GB
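+
+A quick way to see where the free tier stops helping is to model the ingest bill directly. The sketch below uses only the two figures above (100GB free, $0.30/GB afterwards); the function name is hypothetical and per-user fees are ignored.
+
+```python
+def new_relic_monthly_cost(ingested_gb, free_gb=100, price_per_gb=0.30):
+    """Estimate the monthly data-ingest cost in USD."""
+    billable_gb = max(0, ingested_gb - free_gb)
+    return round(billable_gb * price_per_gb, 2)
+
+
+small_team = new_relic_monthly_cost(80)      # stays inside the free tier
+growing_team = new_relic_monthly_cost(1000)  # 900 billable GB
+```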
+
+---
+
+## Cloud Provider Native
+
+### AWS (CloudWatch + X-Ray)
+
+**Use When**:
+- Primarily on AWS
+- Simple monitoring needs
+- Want minimal setup
+
+**Avoid When**:
+- Multi-cloud environment
+- Need advanced features
+- High log volume (expensive)
+
+**Cost** (example):
+- 100 EC2 instances with basic metrics: ~$150/month
+- 1TB logs: ~$500/month ingestion + storage
+- X-Ray: ~$50/month
+
+---
+
+### GCP (Cloud Monitoring + Cloud Trace)
+
+**Use When**:
+- Primarily on GCP
+- Using GKE
+- Want tight GCP integration
+
+**Avoid When**:
+- Multi-cloud environment
+- Need advanced querying
+
+**Cost** (example):
+- First 150 MiB/month per billing account: Free
+- Additional: $0.2580/MiB (first volume tier)
+
+---
+
+### Azure (Azure Monitor)
+
+**Use When**:
+- Primarily on Azure
+- Using AKS
+- Need Azure integration
+
+**Avoid When**:
+- Multi-cloud
+- Need advanced features
+
+**Cost** (example):
+- First 5GB: Free
+- Additional: $2.76/GB
+
+---
+
+## Decision Matrix
+
+### Choose Prometheus + Grafana If:
+- ✅ Using Kubernetes
+- ✅ Want control and customization
+- ✅ Have ops capacity
+- ✅ Budget-conscious
+- ✅ Need Prometheus ecosystem
+
+### Choose Datadog If:
+- ✅ Want ease of use
+- ✅ Need full observability now
+- ✅ Budget allows ($8k+/month for 100 hosts)
+- ✅ Limited ops team
+- ✅ Need excellent UX
+
+### Choose ELK If:
+- ✅ Heavy log analysis needs
+- ✅ Need powerful search
+- ✅ Have dedicated ops team
+- ✅ Compliance requirements
+- ✅ Willing to invest in infrastructure
+
+### Choose Grafana Stack (LGTM) If:
+- ✅ Want open source full stack
+- ✅ Cost-effective solution
+- ✅ Cloud-native architecture
+- ✅ Already using Prometheus
+- ✅ Have some ops capacity
+
+### Choose New Relic If:
+- ✅ Startup with free tier
+- ✅ APM is priority
+- ✅ Want easy setup
+- ✅ Don't need heavy customization
+
+### Choose Cloud-Provider Native (CloudWatch, etc.) If:
+- ✅ Single cloud provider
+- ✅ Simple needs
+- ✅ Want minimal setup
+- ✅ Low to medium scale
+
+---
+
+## Cost Comparison
+
+**Example: 100 hosts, 1TB logs/month, 1M spans/day**
+
+| Solution | Monthly Cost | Setup | Ops Burden |
+|----------|-------------|--------|------------|
+| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
+| **Grafana Cloud** | $3,000 | Low | Low |
+| **Datadog** | $8,000 | Low | None |
+| **New Relic** | $3,500 | Low | None |
+| **ELK Stack** | $4,000 | High | High |
+| **CloudWatch** | $2,000 | Low | Low |
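+
+To compare these figures programmatically, the sketch below ranks the solutions by monthly cost and expresses each one relative to the cheapest option. The numbers are copied from the table; they are rough estimates, not quotes.
+
+```python
+monthly_cost = {
+    "Prometheus + Loki + Tempo": 1500,
+    "Grafana Cloud": 3000,
+    "Datadog": 8000,
+    "New Relic": 3500,
+    "ELK Stack": 4000,
+    "CloudWatch": 2000,
+}
+
+# Sort from cheapest to most expensive.
+ranked = sorted(monthly_cost.items(), key=lambda item: item[1])
+cheapest = ranked[0][1]
+
+# Cost of each solution as a multiple of the cheapest one.
+relative = {name: round(cost / cheapest, 2) for name, cost in ranked}
+annual = {name: cost * 12 for name, cost in ranked}
+```
+
+Note that the table's dollar figures exclude engineering time, so the self-hosted rows understate total cost for teams with little ops capacity.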
+
+---
+
+## Recommendations by Company Size
+
+### Startup (< 10 engineers)
+**Recommendation**: New Relic or Grafana Cloud
+- Minimal ops burden
+- Good free tiers
+- Easy to get started
+
+### Small Company (10-50 engineers)
+**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
+- Cost-effective
+- Growing ops capacity
+- Flexibility
+
+### Medium Company (50-200 engineers)
+**Recommendation**: Datadog or Grafana Stack
+- Datadog if budget allows
+- Grafana Stack if cost-conscious
+
+### Large Enterprise (200+ engineers)
+**Recommendation**: Build observability platform
+- Mix of tools based on needs
+- Dedicated observability team
+- Custom integrations
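+
+The size brackets above can be encoded as a simple lookup. In this sketch the overlapping boundaries are an assumption: exactly 50 engineers is treated as a medium company and exactly 200 as a large enterprise. The function name `recommend_stack` is hypothetical.
+
+```python
+def recommend_stack(engineers):
+    """Map an engineering headcount to the recommendation listed above."""
+    if engineers < 10:
+        return "New Relic or Grafana Cloud"
+    if engineers < 50:
+        return "Prometheus + Grafana + Loki (self-hosted or cloud)"
+    if engineers < 200:
+        return "Datadog or Grafana Stack"
+    return "Build observability platform"
+
+
+suggestion = recommend_stack(35)
+```
+
+Headcount is only a proxy; budget and available ops capacity, as covered in the decision matrix, should override it when they point elsewhere.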