---
description: Create API monitoring dashboard
shortcut: monitor
---

# Create API Monitoring Dashboard

Build comprehensive monitoring infrastructure with metrics, logs, traces, and alerts for full API observability.
## When to Use This Command
|
|
|
|
Use `/create-monitoring` when you need to:
|
|
- Establish observability for production APIs
|
|
- Track RED metrics (Rate, Errors, Duration) across services
|
|
- Set up real-time alerting for SLO violations
|
|
- Debug performance issues with distributed tracing
|
|
- Create executive dashboards for API health
|
|
- Implement SRE practices with data-driven insights
|
|
|
|
DON'T use this when:
|
|
- Building proof-of-concept applications (use lightweight logging instead)
|
|
- Monitoring non-critical internal tools (basic health checks may suffice)
|
|
- Resources are extremely constrained (consider managed solutions like Datadog first)
|
|
|
|
## Design Decisions
|
|
|
|
This command implements a **Prometheus + Grafana stack** as the primary approach because:
|
|
- Open-source with no vendor lock-in
|
|
- Industry-standard metric format with wide ecosystem support
|
|
- Powerful query language (PromQL) for complex analysis
|
|
- Horizontal scalability via federation and remote storage
|
|
|
|
**Alternative considered: ELK Stack** (Elasticsearch, Logstash, Kibana)
|
|
- Better for log-centric analysis
|
|
- Higher resource requirements
|
|
- More complex operational overhead
|
|
- Recommended when logs are primary data source
|
|
|
|
**Alternative considered: Managed solutions** (Datadog, New Relic)
|
|
- Faster time-to-value
|
|
- Higher ongoing cost
|
|
- Less customization flexibility
|
|
- Recommended for teams without dedicated DevOps
|
|
|
|
## Prerequisites
|
|
|
|
Before running this command:
|
|
1. Docker and Docker Compose installed
|
|
2. API instrumented with metrics endpoints (Prometheus format)
|
|
3. Basic understanding of PromQL query language
|
|
4. Network access for inter-service communication
|
|
5. Sufficient disk space for time-series data (plan for 2-4 weeks retention)
|
|
|
|
## Implementation Process
|
|
|
|
### Step 1: Configure Prometheus
|
|
Set up Prometheus to scrape metrics from your API endpoints with service discovery.
|
|
|
|
### Step 2: Create Grafana Dashboards
|
|
Build visualizations for RED metrics, custom business metrics, and SLO tracking.
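Grafana picks up datasources from provisioning files at startup. The sketch below is one possible `grafana-datasources.yml` matching the mount path used in the Docker Compose file in Example 2; the Jaeger entry is optional and the URLs assume the service names from that compose file.

```yaml
# grafana-datasources.yml - datasource provisioning (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
```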
|
|
|
|
### Step 3: Implement Distributed Tracing
|
|
Integrate Jaeger for end-to-end request tracing across microservices.
|
|
|
|
### Step 4: Configure Alerting
|
|
Set up AlertManager rules for critical thresholds with notification channels (Slack, PagerDuty).
|
|
|
|
### Step 5: Deploy Monitoring Stack
|
|
Deploy complete observability infrastructure with health checks and backup configurations.
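One way to wire in health checks is a Compose `healthcheck` per service. The sketch below probes Prometheus's `/-/healthy` endpoint and assumes the official image's busybox `wget` is available; adapt the pattern for Grafana (`/api/health`) and AlertManager (`/-/healthy`). The `prometheus-data` and `grafana-data` volumes are the state worth including in backups.

```yaml
# docker-compose.yml fragment - health check for Prometheus (sketch)
services:
  prometheus:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
```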
|
|
|
|
## Output Format
|
|
|
|
The command generates:
|
|
- `docker-compose.yml` - Complete monitoring stack configuration
|
|
- `prometheus.yml` - Prometheus scrape configuration
|
|
- `grafana-dashboards/` - Pre-built dashboard JSON files
|
|
- `alerting-rules.yml` - AlertManager rule definitions
|
|
- `jaeger-config.yml` - Distributed tracing configuration
|
|
- `README.md` - Deployment and operation guide
|
|
|
|
## Code Examples
|
|
|
|
### Example 1: Complete Node.js Express API with Comprehensive Monitoring
|
|
|
|
```javascript
|
|
// metrics/instrumentation.js - Full-featured Prometheus instrumentation
|
|
const promClient = require('prom-client');
|
|
const { performance } = require('perf_hooks');
|
|
const os = require('os');
|
|
|
|
class MetricsCollector {
|
|
constructor() {
|
|
// Create separate registries for different metric types
|
|
this.register = new promClient.Registry();
|
|
this.businessRegister = new promClient.Registry();
|
|
|
|
// Add default system metrics
|
|
promClient.collectDefaultMetrics({
|
|
register: this.register,
|
|
prefix: 'api_',
|
|
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
|
|
});
|
|
|
|
// Initialize all metric types
|
|
this.initializeMetrics();
|
|
this.initializeBusinessMetrics();
|
|
this.initializeCustomCollectors();
|
|
|
|
// Start periodic collectors
|
|
this.startPeriodicCollectors();
|
|
}
|
|
|
|
initializeMetrics() {
|
|
// RED Metrics (Rate, Errors, Duration)
|
|
this.httpRequestDuration = new promClient.Histogram({
|
|
name: 'http_request_duration_seconds',
|
|
help: 'Duration of HTTP requests in seconds',
|
|
labelNames: ['method', 'route', 'status_code', 'service', 'environment'],
|
|
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
|
|
});
|
|
|
|
this.httpRequestTotal = new promClient.Counter({
|
|
name: 'http_requests_total',
|
|
help: 'Total number of HTTP requests',
|
|
labelNames: ['method', 'route', 'status_code', 'service', 'environment']
|
|
});
|
|
|
|
this.httpRequestErrors = new promClient.Counter({
|
|
name: 'http_request_errors_total',
|
|
help: 'Total number of HTTP errors',
|
|
labelNames: ['method', 'route', 'error_type', 'service', 'environment']
|
|
});
|
|
|
|
// Database metrics
|
|
this.dbQueryDuration = new promClient.Histogram({
|
|
name: 'db_query_duration_seconds',
|
|
help: 'Database query execution time',
|
|
labelNames: ['operation', 'table', 'database', 'status'],
|
|
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
|
|
});
|
|
|
|
this.dbConnectionPool = new promClient.Gauge({
|
|
name: 'db_connection_pool_size',
|
|
help: 'Database connection pool metrics',
|
|
labelNames: ['state', 'database'] // states: active, idle, total
|
|
});
|
|
|
|
// Cache metrics
|
|
this.cacheHitRate = new promClient.Counter({
|
|
name: 'cache_operations_total',
|
|
help: 'Cache operation counts',
|
|
labelNames: ['operation', 'cache_name', 'status'] // hit, miss, set, delete
|
|
});
|
|
|
|
this.cacheLatency = new promClient.Histogram({
|
|
name: 'cache_operation_duration_seconds',
|
|
help: 'Cache operation latency',
|
|
labelNames: ['operation', 'cache_name'],
|
|
buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
|
|
});
|
|
|
|
// External API metrics
|
|
this.externalApiCalls = new promClient.Histogram({
|
|
name: 'external_api_duration_seconds',
|
|
help: 'External API call duration',
|
|
labelNames: ['service', 'endpoint', 'status_code'],
|
|
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
|
|
});
|
|
|
|
// Circuit breaker metrics
|
|
this.circuitBreakerState = new promClient.Gauge({
|
|
name: 'circuit_breaker_state',
|
|
help: 'Circuit breaker state (0=closed, 1=open, 2=half-open)',
|
|
labelNames: ['service']
|
|
});
|
|
|
|
// Rate limiting metrics
|
|
this.rateLimitHits = new promClient.Counter({
|
|
name: 'rate_limit_hits_total',
|
|
help: 'Number of rate limited requests',
|
|
labelNames: ['limit_type', 'client_type']
|
|
});
|
|
|
|
// WebSocket metrics
|
|
this.activeWebsockets = new promClient.Gauge({
|
|
name: 'websocket_connections_active',
|
|
help: 'Number of active WebSocket connections',
|
|
labelNames: ['namespace', 'room']
|
|
});
|
|
|
|
// Register all metrics
|
|
[
|
|
this.httpRequestDuration, this.httpRequestTotal, this.httpRequestErrors,
|
|
this.dbQueryDuration, this.dbConnectionPool, this.cacheHitRate,
|
|
this.cacheLatency, this.externalApiCalls, this.circuitBreakerState,
|
|
this.rateLimitHits, this.activeWebsockets
|
|
].forEach(metric => this.register.registerMetric(metric));
|
|
}
|
|
|
|
initializeBusinessMetrics() {
|
|
// User activity metrics
|
|
this.activeUsers = new promClient.Gauge({
|
|
name: 'business_active_users',
|
|
help: 'Number of active users in the last 5 minutes',
|
|
labelNames: ['user_type', 'plan']
|
|
});
|
|
|
|
this.userSignups = new promClient.Counter({
|
|
name: 'business_user_signups_total',
|
|
help: 'Total user signups',
|
|
labelNames: ['source', 'plan', 'country']
|
|
});
|
|
|
|
// Transaction metrics
|
|
this.transactionAmount = new promClient.Histogram({
|
|
name: 'business_transaction_amount_dollars',
|
|
help: 'Transaction amounts in dollars',
|
|
labelNames: ['type', 'status', 'payment_method'],
|
|
buckets: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000]
|
|
});
|
|
|
|
this.orderProcessingTime = new promClient.Histogram({
|
|
name: 'business_order_processing_seconds',
|
|
help: 'Time to process orders end-to-end',
|
|
labelNames: ['order_type', 'fulfillment_type'],
|
|
buckets: [10, 30, 60, 180, 300, 600, 1800, 3600]
|
|
});
|
|
|
|
// API usage metrics
|
|
this.apiUsageByClient = new promClient.Counter({
|
|
name: 'business_api_usage_by_client',
|
|
help: 'API usage segmented by client',
|
|
labelNames: ['client_id', 'tier', 'endpoint']
|
|
});
|
|
|
|
this.apiQuotaRemaining = new promClient.Gauge({
|
|
name: 'business_api_quota_remaining',
|
|
help: 'Remaining API quota for clients',
|
|
labelNames: ['client_id', 'tier', 'quota_type']
|
|
});
|
|
|
|
// Revenue metrics
|
|
this.revenueByProduct = new promClient.Counter({
|
|
name: 'business_revenue_by_product_cents',
|
|
help: 'Revenue by product in cents',
|
|
labelNames: ['product_id', 'product_category', 'currency']
|
|
});
|
|
|
|
// Register business metrics
|
|
[
|
|
this.activeUsers, this.userSignups, this.transactionAmount,
|
|
this.orderProcessingTime, this.apiUsageByClient, this.apiQuotaRemaining,
|
|
this.revenueByProduct
|
|
].forEach(metric => this.businessRegister.registerMetric(metric));
|
|
}
|
|
|
|
initializeCustomCollectors() {
|
|
// SLI/SLO metrics
|
|
this.sloCompliance = new promClient.Gauge({
|
|
name: 'slo_compliance_percentage',
|
|
help: 'SLO compliance percentage',
|
|
labelNames: ['slo_name', 'service', 'window']
|
|
});
|
|
|
|
this.errorBudgetRemaining = new promClient.Gauge({
|
|
name: 'error_budget_remaining_percentage',
|
|
help: 'Remaining error budget percentage',
|
|
labelNames: ['service', 'slo_type']
|
|
});
|
|
|
|
this.register.registerMetric(this.sloCompliance);
|
|
this.register.registerMetric(this.errorBudgetRemaining);
|
|
}
|
|
|
|
startPeriodicCollectors() {
|
|
// Update active users every 30 seconds
|
|
setInterval(() => {
|
|
const activeUserCount = this.calculateActiveUsers();
|
|
this.activeUsers.set(
|
|
{ user_type: 'registered', plan: 'free' },
|
|
activeUserCount.free
|
|
);
|
|
this.activeUsers.set(
|
|
{ user_type: 'registered', plan: 'premium' },
|
|
activeUserCount.premium
|
|
);
|
|
}, 30000);
|
|
|
|
// Update SLO compliance every minute
|
|
setInterval(() => {
|
|
this.updateSLOCompliance();
|
|
}, 60000);
|
|
|
|
// Database pool monitoring
|
|
setInterval(() => {
|
|
this.updateDatabasePoolMetrics();
|
|
}, 15000);
|
|
}
|
|
|
|
// Middleware for HTTP metrics
|
|
httpMetricsMiddleware() {
|
|
return (req, res, next) => {
|
|
const start = performance.now();
|
|
const route = req.route?.path || req.path || 'unknown';
|
|
|
|
      // Track in-flight requests with a single, lazily created gauge.
      // Creating a new Gauge on every request would throw
      // "A metric with the name ... has already been registered" on the
      // second request, and the gauge would never appear in this.register.
      if (!this.httpRequestsInFlight) {
        this.httpRequestsInFlight = new promClient.Gauge({
          name: 'http_requests_in_flight',
          help: 'Number of in-flight HTTP requests',
          labelNames: ['method', 'route'],
          registers: [this.register]
        });
      }

      this.httpRequestsInFlight.inc({ method: req.method, route });
|
|
|
|
res.on('finish', () => {
|
|
const duration = (performance.now() - start) / 1000;
|
|
const labels = {
|
|
method: req.method,
|
|
route,
|
|
status_code: res.statusCode,
|
|
service: process.env.SERVICE_NAME || 'api',
|
|
environment: process.env.NODE_ENV || 'development'
|
|
};
|
|
|
|
// Record metrics
|
|
this.httpRequestDuration.observe(labels, duration);
|
|
this.httpRequestTotal.inc(labels);
|
|
|
|
if (res.statusCode >= 400) {
|
|
const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
|
|
this.httpRequestErrors.inc({
|
|
...labels,
|
|
error_type: errorType
|
|
});
|
|
}
|
|
|
|
        this.httpRequestsInFlight.dec({ method: req.method, route });
|
|
|
|
// Log slow requests
|
|
if (duration > 1) {
|
|
console.warn('Slow request detected:', {
|
|
...labels,
|
|
duration,
|
|
user: req.user?.id,
|
|
ip: req.ip
|
|
});
|
|
}
|
|
});
|
|
|
|
next();
|
|
};
|
|
}
|
|
|
|
// Database query instrumentation
|
|
instrumentDatabase(knex) {
|
|
knex.on('query', (query) => {
|
|
query.__startTime = performance.now();
|
|
});
|
|
|
|
knex.on('query-response', (response, query) => {
|
|
const duration = (performance.now() - query.__startTime) / 1000;
|
|
const table = this.extractTableName(query.sql);
|
|
|
|
this.dbQueryDuration.observe({
|
|
operation: query.method || 'select',
|
|
table,
|
|
database: process.env.DB_NAME || 'default',
|
|
status: 'success'
|
|
}, duration);
|
|
});
|
|
|
|
knex.on('query-error', (error, query) => {
|
|
const duration = (performance.now() - query.__startTime) / 1000;
|
|
const table = this.extractTableName(query.sql);
|
|
|
|
this.dbQueryDuration.observe({
|
|
operation: query.method || 'select',
|
|
table,
|
|
database: process.env.DB_NAME || 'default',
|
|
status: 'error'
|
|
}, duration);
|
|
});
|
|
}
|
|
|
|
// Cache instrumentation wrapper
|
|
wrapCache(cache) {
|
|
const wrapper = {};
|
|
const methods = ['get', 'set', 'delete', 'has'];
|
|
|
|
methods.forEach(method => {
|
|
wrapper[method] = async (...args) => {
|
|
const start = performance.now();
|
|
const cacheName = cache.name || 'default';
|
|
|
|
try {
|
|
const result = await cache[method](...args);
|
|
const duration = (performance.now() - start) / 1000;
|
|
|
|
// Record cache metrics
|
|
if (method === 'get') {
|
|
const status = result !== undefined ? 'hit' : 'miss';
|
|
this.cacheHitRate.inc({
|
|
operation: method,
|
|
cache_name: cacheName,
|
|
status
|
|
});
|
|
} else {
|
|
this.cacheHitRate.inc({
|
|
operation: method,
|
|
cache_name: cacheName,
|
|
status: 'success'
|
|
});
|
|
}
|
|
|
|
this.cacheLatency.observe({
|
|
operation: method,
|
|
cache_name: cacheName
|
|
}, duration);
|
|
|
|
return result;
|
|
} catch (error) {
|
|
this.cacheHitRate.inc({
|
|
operation: method,
|
|
cache_name: cacheName,
|
|
status: 'error'
|
|
});
|
|
throw error;
|
|
}
|
|
};
|
|
});
|
|
|
|
return wrapper;
|
|
}
|
|
|
|
// External API call instrumentation
|
|
async trackExternalCall(serviceName, endpoint, callFunc) {
|
|
const start = performance.now();
|
|
|
|
try {
|
|
const result = await callFunc();
|
|
const duration = (performance.now() - start) / 1000;
|
|
|
|
this.externalApiCalls.observe({
|
|
service: serviceName,
|
|
endpoint,
|
|
status_code: result.status || 200
|
|
}, duration);
|
|
|
|
return result;
|
|
} catch (error) {
|
|
const duration = (performance.now() - start) / 1000;
|
|
|
|
this.externalApiCalls.observe({
|
|
service: serviceName,
|
|
endpoint,
|
|
status_code: error.response?.status || 0
|
|
}, duration);
|
|
|
|
throw error;
|
|
}
|
|
}
|
|
|
|
// Circuit breaker monitoring
|
|
updateCircuitBreakerState(service, state) {
|
|
const stateValue = {
|
|
'closed': 0,
|
|
'open': 1,
|
|
'half-open': 2
|
|
}[state] || 0;
|
|
|
|
this.circuitBreakerState.set({ service }, stateValue);
|
|
}
|
|
|
|
// Helper methods
|
|
calculateActiveUsers() {
|
|
// Implementation would query your session store or database
|
|
return {
|
|
free: Math.floor(Math.random() * 1000),
|
|
premium: Math.floor(Math.random() * 100)
|
|
};
|
|
}
|
|
|
|
updateSLOCompliance() {
|
|
// Calculate based on recent metrics
|
|
const availability = 99.95; // Calculate from actual metrics
|
|
const latencyP99 = 250; // Calculate from actual metrics
|
|
|
|
this.sloCompliance.set({
|
|
slo_name: 'availability',
|
|
service: 'api',
|
|
window: '30d'
|
|
}, availability);
|
|
|
|
this.sloCompliance.set({
|
|
slo_name: 'latency_p99',
|
|
service: 'api',
|
|
window: '30d'
|
|
}, latencyP99 < 500 ? 100 : 0);
|
|
|
|
// Update error budget
|
|
const errorBudget = 100 - ((100 - availability) / 0.05) * 100;
|
|
this.errorBudgetRemaining.set({
|
|
service: 'api',
|
|
slo_type: 'availability'
|
|
}, Math.max(0, errorBudget));
|
|
}
|
|
|
|
updateDatabasePoolMetrics() {
|
|
// Get pool stats from your database driver
|
|
const pool = global.dbPool; // Your database pool instance
|
|
if (pool) {
|
|
this.dbConnectionPool.set({
|
|
state: 'active',
|
|
database: 'primary'
|
|
}, pool.numUsed());
|
|
|
|
this.dbConnectionPool.set({
|
|
state: 'idle',
|
|
database: 'primary'
|
|
}, pool.numFree());
|
|
|
|
this.dbConnectionPool.set({
|
|
state: 'total',
|
|
database: 'primary'
|
|
}, pool.numUsed() + pool.numFree());
|
|
}
|
|
}
|
|
|
|
extractTableName(sql) {
|
|
const match = sql.match(/(?:from|into|update)\s+`?(\w+)`?/i);
|
|
return match ? match[1] : 'unknown';
|
|
}
|
|
|
|
// Expose metrics endpoint
|
|
async getMetrics() {
|
|
const baseMetrics = await this.register.metrics();
|
|
const businessMetrics = await this.businessRegister.metrics();
|
|
return baseMetrics + '\n' + businessMetrics;
|
|
}
|
|
}
|
|
|
|
// Express application setup
|
|
const express = require('express');
|
|
const app = express();
|
|
const metricsCollector = new MetricsCollector();
|
|
|
|
// Apply monitoring middleware
|
|
app.use(metricsCollector.httpMetricsMiddleware());
|
|
|
|
// Metrics endpoint
|
|
app.get('/metrics', async (req, res) => {
|
|
res.set('Content-Type', metricsCollector.register.contentType);
|
|
res.end(await metricsCollector.getMetrics());
|
|
});
|
|
|
|
// Example API endpoint with comprehensive tracking
|
|
app.post('/api/orders', async (req, res) => {
|
|
const orderStart = performance.now();
|
|
|
|
try {
|
|
// Track business metrics
|
|
metricsCollector.transactionAmount.observe({
|
|
type: 'purchase',
|
|
status: 'pending',
|
|
payment_method: req.body.paymentMethod
|
|
}, req.body.amount);
|
|
|
|
// Simulate external payment API call
|
|
const paymentResult = await metricsCollector.trackExternalCall(
|
|
'stripe',
|
|
'/charges',
|
|
async () => {
|
|
// Your actual payment API call
|
|
return await stripeClient.charges.create({
|
|
amount: req.body.amount * 100,
|
|
currency: 'usd'
|
|
});
|
|
}
|
|
);
|
|
|
|
// Track order processing time
|
|
const processingTime = (performance.now() - orderStart) / 1000;
|
|
metricsCollector.orderProcessingTime.observe({
|
|
order_type: 'standard',
|
|
fulfillment_type: 'digital'
|
|
}, processingTime);
|
|
|
|
// Track revenue
|
|
metricsCollector.revenueByProduct.inc({
|
|
product_id: req.body.productId,
|
|
product_category: req.body.category,
|
|
currency: 'USD'
|
|
}, req.body.amount * 100);
|
|
|
|
res.json({ success: true, orderId: paymentResult.id });
|
|
} catch (error) {
|
|
res.status(500).json({ error: error.message });
|
|
}
|
|
});
|
|
|
|
module.exports = { app, metricsCollector };
|
|
```
|
|
|
|
### Example 2: Complete Monitoring Stack with Docker Compose
|
|
|
|
```yaml
|
|
# docker-compose.yml
|
|
version: '3.8'
|
|
|
|
services:
|
|
prometheus:
|
|
image: prom/prometheus:v2.45.0
|
|
container_name: prometheus
|
|
volumes:
|
|
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
|
- ./alerting-rules.yml:/etc/prometheus/alerting-rules.yml
|
|
- prometheus-data:/prometheus
|
|
command:
|
|
- '--config.file=/etc/prometheus/prometheus.yml'
|
|
- '--storage.tsdb.path=/prometheus'
|
|
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
|
|
- '--web.console.templates=/usr/share/prometheus/consoles'
|
|
- '--storage.tsdb.retention.time=15d'
|
|
ports:
|
|
- "9090:9090"
|
|
networks:
|
|
- monitoring
|
|
|
|
grafana:
|
|
image: grafana/grafana:10.0.0
|
|
container_name: grafana
|
|
volumes:
|
|
- grafana-data:/var/lib/grafana
|
|
- ./grafana-dashboards:/etc/grafana/provisioning/dashboards
|
|
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
|
|
environment:
|
|
- GF_SECURITY_ADMIN_PASSWORD=admin
|
|
- GF_USERS_ALLOW_SIGN_UP=false
|
|
- GF_SERVER_ROOT_URL=http://localhost:3000
|
|
ports:
|
|
- "3000:3000"
|
|
networks:
|
|
- monitoring
|
|
depends_on:
|
|
- prometheus
|
|
|
|
jaeger:
|
|
image: jaegertracing/all-in-one:1.47
|
|
container_name: jaeger
|
|
environment:
|
|
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
|
|
- COLLECTOR_OTLP_ENABLED=true
|
|
ports:
|
|
- "5775:5775/udp"
|
|
- "6831:6831/udp"
|
|
- "6832:6832/udp"
|
|
- "5778:5778"
|
|
- "16686:16686" # Jaeger UI
|
|
- "14268:14268"
|
|
- "14250:14250"
|
|
- "9411:9411"
|
|
- "4317:4317" # OTLP gRPC
|
|
- "4318:4318" # OTLP HTTP
|
|
networks:
|
|
- monitoring
|
|
|
|
alertmanager:
|
|
image: prom/alertmanager:v0.26.0
|
|
container_name: alertmanager
|
|
volumes:
|
|
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
|
|
command:
|
|
- '--config.file=/etc/alertmanager/alertmanager.yml'
|
|
- '--storage.path=/alertmanager'
|
|
ports:
|
|
- "9093:9093"
|
|
networks:
|
|
- monitoring
|
|
|
|
networks:
|
|
monitoring:
|
|
driver: bridge
|
|
|
|
volumes:
|
|
prometheus-data:
|
|
grafana-data:
|
|
```
|
|
|
|
### Example 3: Advanced Grafana Dashboard Definitions
|
|
|
|
```json
|
|
// grafana-dashboards/api-overview.json
|
|
{
|
|
"dashboard": {
|
|
"id": null,
|
|
"uid": "api-overview",
|
|
"title": "API Performance Overview",
|
|
"tags": ["api", "performance", "sre"],
|
|
"timezone": "browser",
|
|
"schemaVersion": 16,
|
|
"version": 0,
|
|
"refresh": "30s",
|
|
"time": {
|
|
"from": "now-6h",
|
|
"to": "now"
|
|
},
|
|
"templating": {
|
|
"list": [
|
|
{
|
|
"name": "datasource",
|
|
"type": "datasource",
|
|
"query": "prometheus",
|
|
"current": {
|
|
"value": "Prometheus",
|
|
"text": "Prometheus"
|
|
}
|
|
},
|
|
{
|
|
"name": "service",
|
|
"type": "query",
|
|
"datasource": "$datasource",
|
|
"query": "label_values(http_requests_total, service)",
|
|
"multi": true,
|
|
"includeAll": true,
|
|
"current": {
|
|
"value": ["$__all"],
|
|
"text": "All"
|
|
},
|
|
"refresh": 1
|
|
},
|
|
{
|
|
"name": "environment",
|
|
"type": "query",
|
|
"datasource": "$datasource",
|
|
"query": "label_values(http_requests_total, environment)",
|
|
"current": {
|
|
"value": "production",
|
|
"text": "Production"
|
|
}
|
|
}
|
|
]
|
|
},
|
|
"panels": [
|
|
{
|
|
"id": 1,
|
|
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
|
|
"type": "graph",
|
|
"title": "Request Rate (req/s)",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
|
|
"legendFormat": "{{service}}",
|
|
"refId": "A"
|
|
}
|
|
],
|
|
"yaxes": [
|
|
{
|
|
"format": "reqps",
|
|
"label": "Requests per second"
|
|
}
|
|
],
|
|
"lines": true,
|
|
"linewidth": 2,
|
|
"fill": 1,
|
|
"fillGradient": 3,
|
|
"steppedLine": false,
|
|
"tooltip": {
|
|
"shared": true,
|
|
"sort": 0,
|
|
"value_type": "individual"
|
|
},
|
|
"alert": {
|
|
"name": "High Request Rate",
|
|
"conditions": [
|
|
{
|
|
"evaluator": {
|
|
"params": [10000],
|
|
"type": "gt"
|
|
},
|
|
"operator": {
|
|
"type": "and"
|
|
},
|
|
"query": {
|
|
"params": ["A", "5m", "now"]
|
|
},
|
|
"reducer": {
|
|
"type": "avg"
|
|
},
|
|
"type": "query"
|
|
}
|
|
],
|
|
"executionErrorState": "alerting",
|
|
"frequency": "1m",
|
|
"handler": 1,
|
|
"noDataState": "no_data",
|
|
"notifications": [
|
|
{
|
|
"uid": "slack-channel"
|
|
}
|
|
]
|
|
}
|
|
},
|
|
{
|
|
"id": 2,
|
|
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
|
|
"type": "graph",
|
|
"title": "Error Rate (%)",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\",status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
|
|
"legendFormat": "{{service}}",
|
|
"refId": "A"
|
|
}
|
|
],
|
|
"yaxes": [
|
|
{
|
|
"format": "percent",
|
|
"label": "Error Rate",
|
|
"max": 10
|
|
}
|
|
],
|
|
"thresholds": [
|
|
{
|
|
"value": 1,
|
|
"op": "gt",
|
|
"fill": true,
|
|
"line": true,
|
|
"colorMode": "critical"
|
|
}
|
|
],
|
|
"alert": {
|
|
"name": "High Error Rate",
|
|
"conditions": [
|
|
{
|
|
"evaluator": {
|
|
"params": [1],
|
|
"type": "gt"
|
|
},
|
|
"operator": {
|
|
"type": "and"
|
|
},
|
|
"query": {
|
|
"params": ["A", "5m", "now"]
|
|
},
|
|
"reducer": {
|
|
"type": "last"
|
|
},
|
|
"type": "query"
|
|
}
|
|
],
|
|
"executionErrorState": "alerting",
|
|
"frequency": "1m",
|
|
"handler": 1,
|
|
"noDataState": "no_data",
|
|
"notifications": [
|
|
{
|
|
"uid": "pagerduty"
|
|
}
|
|
],
|
|
"message": "Error rate is above 1% for service {{service}}"
|
|
}
|
|
},
|
|
{
|
|
"id": 3,
|
|
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
|
|
"type": "graph",
|
|
"title": "Response Time (p50, p95, p99)",
|
|
"targets": [
|
|
{
|
|
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
|
|
"legendFormat": "p50 {{service}}",
|
|
"refId": "A"
|
|
},
|
|
{
|
|
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
|
|
"legendFormat": "p95 {{service}}",
|
|
"refId": "B"
|
|
},
|
|
{
|
|
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
|
|
"legendFormat": "p99 {{service}}",
|
|
"refId": "C"
|
|
}
|
|
],
|
|
"yaxes": [
|
|
{
|
|
"format": "s",
|
|
"label": "Response Time"
|
|
}
|
|
]
|
|
},
|
|
{
|
|
"id": 4,
|
|
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 8 },
|
|
"type": "stat",
|
|
"title": "Current QPS",
|
|
"targets": [
|
|
{
|
|
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[1m]))",
|
|
"instant": true,
|
|
"refId": "A"
|
|
}
|
|
],
|
|
"format": "reqps",
|
|
"sparkline": {
|
|
"show": true,
|
|
"lineColor": "rgb(31, 120, 193)",
|
|
"fillColor": "rgba(31, 120, 193, 0.18)"
|
|
},
|
|
"thresholds": {
|
|
"mode": "absolute",
|
|
"steps": [
|
|
{ "value": 0, "color": "green" },
|
|
{ "value": 5000, "color": "yellow" },
|
|
{ "value": 10000, "color": "red" }
|
|
]
|
|
}
|
|
},
|
|
{
|
|
"id": 5,
|
|
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 8 },
|
|
"type": "stat",
|
|
"title": "Error Budget Remaining",
|
|
"targets": [
|
|
{
|
|
"expr": "error_budget_remaining_percentage{service=~\"$service\",slo_type=\"availability\"}",
|
|
"instant": true,
|
|
"refId": "A"
|
|
}
|
|
],
|
|
"format": "percent",
|
|
"thresholds": {
|
|
"mode": "absolute",
|
|
"steps": [
|
|
{ "value": 0, "color": "red" },
|
|
{ "value": 25, "color": "orange" },
|
|
{ "value": 50, "color": "yellow" },
|
|
{ "value": 75, "color": "green" }
|
|
]
|
|
}
|
|
},
|
|
{
|
|
"id": 6,
|
|
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 8 },
|
|
"type": "table",
|
|
"title": "Top Slow Endpoints",
|
|
"targets": [
|
|
{
|
|
"expr": "topk(10, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, route)))",
|
|
"format": "table",
|
|
"instant": true,
|
|
"refId": "A"
|
|
}
|
|
],
|
|
"styles": [
|
|
{
|
|
"alias": "Time",
|
|
"dateFormat": "YYYY-MM-DD HH:mm:ss",
|
|
"type": "date"
|
|
},
|
|
{
|
|
"alias": "Duration",
|
|
"colorMode": "cell",
|
|
"colors": ["green", "yellow", "red"],
|
|
"thresholds": [0.5, 1],
|
|
"type": "number",
|
|
"unit": "s"
|
|
}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### Example 4: Production-Ready Alerting Rules
|
|
|
|
```yaml
|
|
# alerting-rules.yml
|
|
groups:
|
|
- name: api_alerts
|
|
interval: 30s
|
|
rules:
|
|
# SLO-based alerts
|
|
- alert: APIHighErrorRate
|
|
expr: |
|
|
(
|
|
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service, environment)
|
|
/
|
|
sum(rate(http_requests_total[5m])) by (service, environment)
|
|
) > 0.01
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
team: api-platform
|
|
annotations:
|
|
summary: "High error rate on {{ $labels.service }}"
|
|
description: "{{ $labels.service }} in {{ $labels.environment }} has error rate of {{ $value | humanizePercentage }} (threshold: 1%)"
|
|
runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
|
|
dashboard_url: "https://grafana.example.com/d/api-overview?var-service={{ $labels.service }}"
|
|
|
|
- alert: APIHighLatency
|
|
expr: |
|
|
histogram_quantile(0.95,
|
|
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
|
|
) > 0.5
|
|
for: 10m
|
|
labels:
|
|
severity: warning
|
|
team: api-platform
|
|
annotations:
|
|
summary: "High latency on {{ $labels.service }}"
|
|
description: "P95 latency for {{ $labels.service }} is {{ $value | humanizeDuration }} (threshold: 500ms)"
|
|
|
|
- alert: APILowAvailability
|
|
expr: |
|
|
up{job="api-services"} == 0
|
|
for: 1m
|
|
labels:
|
|
severity: critical
|
|
team: api-platform
|
|
annotations:
|
|
summary: "API service {{ $labels.instance }} is down"
|
|
description: "{{ $labels.instance }} has been down for more than 1 minute"
|
|
|
|
# Business metrics alerts
|
|
- alert: LowActiveUsers
|
|
expr: |
|
|
business_active_users{plan="premium"} < 10
|
|
for: 30m
|
|
labels:
|
|
severity: warning
|
|
team: product
|
|
annotations:
|
|
summary: "Low number of active premium users"
|
|
description: "Only {{ $value }} premium users active in the last 30 minutes"
|
|
|
|
- alert: HighTransactionFailureRate
|
|
expr: |
|
|
(
|
|
sum(rate(business_transaction_amount_dollars_sum{status="failed"}[5m]))
|
|
/
|
|
sum(rate(business_transaction_amount_dollars_sum[5m]))
|
|
) > 0.05
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
team: payments
|
|
annotations:
|
|
summary: "High transaction failure rate"
|
|
description: "Transaction failure rate is {{ $value | humanizePercentage }} (threshold: 5%)"
|
|
|
|
# Infrastructure alerts
|
|
- alert: DatabaseConnectionPoolExhausted
|
|
expr: |
|
|
(
|
|
db_connection_pool_size{state="active"}
|
|
/
|
|
db_connection_pool_size{state="total"}
|
|
) > 0.9
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
team: database
|
|
annotations:
|
|
summary: "Database connection pool near exhaustion"
|
|
description: "{{ $labels.database }} pool is {{ $value | humanizePercentage }} utilized"
|
|
|
|
- alert: CacheLowHitRate
|
|
expr: |
|
|
(
|
|
sum(rate(cache_operations_total{status="hit"}[5m])) by (cache_name)
|
|
/
|
|
sum(rate(cache_operations_total{operation="get"}[5m])) by (cache_name)
|
|
) < 0.8
|
|
for: 15m
|
|
labels:
|
|
severity: warning
|
|
team: api-platform
|
|
annotations:
|
|
summary: "Low cache hit rate for {{ $labels.cache_name }}"
|
|
description: "Cache hit rate is {{ $value | humanizePercentage }} (expected: >80%)"
|
|
|
|
- alert: CircuitBreakerOpen
|
|
expr: |
|
|
circuit_breaker_state == 1
|
|
for: 1m
|
|
labels:
|
|
severity: warning
|
|
team: api-platform
|
|
annotations:
|
|
summary: "Circuit breaker open for {{ $labels.service }}"
|
|
description: "Circuit breaker for {{ $labels.service }} has been open for more than 1 minute"
|
|
|
|
# SLO burn rate alerts (multi-window approach)
|
|
- alert: SLOBurnRateHigh
|
|
expr: |
|
|
(
|
|
# 5m burn rate > 14.4 (1 hour of error budget in 5 minutes)
|
|
(
|
|
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
|
|
/
|
|
sum(rate(http_requests_total[5m])) by (service)
|
|
) > (1 - 0.999) * 14.4
|
|
) and (
|
|
# 1h burn rate > 1 (confirms it's not a spike)
|
|
(
|
|
sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (service)
|
|
/
|
|
sum(rate(http_requests_total[1h])) by (service)
|
|
) > (1 - 0.999)
|
|
)
|
|
labels:
|
|
severity: critical
|
|
team: api-platform
|
|
alert_type: slo_burn
|
|
annotations:
|
|
summary: "SLO burn rate critically high for {{ $labels.service }}"
|
|
description: "{{ $labels.service }} is burning error budget 14.4x faster than normal"
|
|
|
|
# Resource alerts
|
|
- alert: HighMemoryUsage
|
|
expr: |
|
|
(
|
|
container_memory_usage_bytes{container!="POD",container!=""}
|
|
/
|
|
container_spec_memory_limit_bytes{container!="POD",container!=""}
|
|
) > 0.9
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
team: api-platform
|
|
annotations:
|
|
summary: "High memory usage for {{ $labels.container }}"
|
|
description: "Container {{ $labels.container }} memory usage is {{ $value | humanizePercentage }}"
|
|
|
|
# AlertManager configuration
|
|
# alertmanager.yml
|
|
global:
|
|
resolve_timeout: 5m
|
|
slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
|
|
|
|
route:
|
|
group_by: ['alertname', 'cluster', 'service']
|
|
group_wait: 10s
|
|
group_interval: 10s
|
|
repeat_interval: 1h
|
|
receiver: 'default'
|
|
routes:
|
|
- match:
|
|
severity: critical
|
|
receiver: 'pagerduty-critical'
|
|
continue: true
|
|
- match:
|
|
severity: warning
|
|
receiver: 'slack-warnings'
|
|
- match:
|
|
team: payments
|
|
receiver: 'payments-team'
|
|
|
|
receivers:
|
|
- name: 'default'
|
|
slack_configs:
|
|
- channel: '#alerts'
|
|
title: 'Alert: {{ .GroupLabels.alertname }}'
|
|
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
|
|
|
|
- name: 'pagerduty-critical'
|
|
pagerduty_configs:
|
|
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
|
|
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
|
|
details:
|
|
firing: '{{ .Alerts.Firing | len }}'
|
|
resolved: '{{ .Alerts.Resolved | len }}'
|
|
labels: '{{ .CommonLabels }}'
|
|
|
|
- name: 'slack-warnings'
|
|
slack_configs:
|
|
- channel: '#warnings'
|
|
send_resolved: true
|
|
title: 'Warning: {{ .GroupLabels.alertname }}'
|
|
text: '{{ .CommonAnnotations.description }}'
|
|
actions:
|
|
- type: button
|
|
text: 'View Dashboard'
|
|
url: '{{ .CommonAnnotations.dashboard_url }}'
|
|
- type: button
|
|
text: 'View Runbook'
|
|
url: '{{ .CommonAnnotations.runbook_url }}'
|
|
|
|
- name: 'payments-team'
|
|
email_configs:
|
|
- to: 'payments-team@example.com'
|
|
from: 'alerts@example.com'
|
|
headers:
|
|
Subject: 'Payment Alert: {{ .GroupLabels.alertname }}'
|
|
|
|
inhibit_rules:
|
|
- source_match:
|
|
severity: 'critical'
|
|
target_match:
|
|
severity: 'warning'
|
|
equal: ['alertname', 'service']
|
|
```
|
|
|
|
### Example 5: OpenTelemetry Integration for Distributed Tracing
|
|
|
|
```javascript
|
|
// tracing/setup.js - OpenTelemetry configuration
|
|
const { NodeSDK } = require('@opentelemetry/sdk-node');
|
|
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
|
|
const { Resource } = require('@opentelemetry/resources');
|
|
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
|
|
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
|
|
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
|
|
const {
|
|
ConsoleSpanExporter,
|
|
BatchSpanProcessor,
|
|
SimpleSpanProcessor
|
|
} = require('@opentelemetry/sdk-trace-base');
|
|
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
|
|
|
|
class TracingSetup {
|
|
constructor(serviceName, environment = 'production') {
|
|
this.serviceName = serviceName;
|
|
this.environment = environment;
|
|
this.sdk = null;
|
|
}
|
|
|
|
initialize() {
|
|
// Create resource identifying the service
|
|
const resource = Resource.default().merge(
|
|
new Resource({
|
|
[SemanticResourceAttributes.SERVICE_NAME]: this.serviceName,
|
|
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
|
|
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: this.environment,
|
|
'service.namespace': 'api-platform',
|
|
'service.instance.id': process.env.HOSTNAME || 'unknown',
|
|
'telemetry.sdk.language': 'nodejs',
|
|
})
|
|
);
|
|
|
|
// Configure Jaeger exporter for traces
|
|
const jaegerExporter = new JaegerExporter({
|
|
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
|
|
tags: {
|
|
service: this.serviceName,
|
|
environment: this.environment
|
|
}
|
|
});
|
|
|
|
// Configure Prometheus exporter for metrics
|
|
const prometheusExporter = new PrometheusExporter({
|
|
port: 9464,
|
|
endpoint: '/metrics',
|
|
prefix: 'otel_',
|
|
appendTimestamp: true,
|
|
}, () => {
|
|
console.log('Prometheus metrics server started on port 9464');
|
|
});
|
|
|
|
// Create SDK with auto-instrumentation
|
|
this.sdk = new NodeSDK({
|
|
resource,
|
|
instrumentations: [
|
|
getNodeAutoInstrumentations({
|
|
'@opentelemetry/instrumentation-fs': {
|
|
enabled: false, // Disable fs to reduce noise
|
|
},
|
|
'@opentelemetry/instrumentation-http': {
|
|
requestHook: (span, request) => {
|
|
span.setAttribute('http.request.body', JSON.stringify(request.body));
|
|
span.setAttribute('http.request.user_id', request.user?.id);
|
|
},
|
|
responseHook: (span, response) => {
|
|
span.setAttribute('http.response.size', response.length);
|
|
},
|
|
ignoreIncomingPaths: ['/health', '/metrics', '/favicon.ico'],
|
|
ignoreOutgoingUrls: [(url) => url.includes('prometheus')]
|
|
},
|
|
'@opentelemetry/instrumentation-express': {
|
|
requestHook: (span, request) => {
|
|
span.setAttribute('express.route', request.route?.path);
|
|
span.setAttribute('express.params', JSON.stringify(request.params));
|
|
}
|
|
},
|
|
'@opentelemetry/instrumentation-mysql2': {
|
|
enhancedDatabaseReporting: true,
|
|
},
|
|
'@opentelemetry/instrumentation-redis-4': {
|
|
dbStatementSerializer: (cmdName, cmdArgs) => {
|
|
return `${cmdName} ${cmdArgs.slice(0, 2).join(' ')}`;
|
|
}
|
|
}
|
|
})
|
|
],
|
|
spanProcessor: new BatchSpanProcessor(jaegerExporter, {
|
|
maxQueueSize: 2048,
|
|
maxExportBatchSize: 512,
|
|
scheduledDelayMillis: 5000,
|
|
exportTimeoutMillis: 30000,
|
|
}),
|
|
metricReader: new PeriodicExportingMetricReader({
|
|
exporter: prometheusExporter,
|
|
exportIntervalMillis: 10000,
|
|
}),
|
|
});
|
|
|
|
// Start the SDK
|
|
this.sdk.start()
|
|
.then(() => console.log('Tracing initialized successfully'))
|
|
.catch((error) => console.error('Error initializing tracing', error));
|
|
|
|
// Graceful shutdown
|
|
process.on('SIGTERM', () => {
|
|
this.shutdown();
|
|
});
|
|
}
|
|
|
|
async shutdown() {
|
|
try {
|
|
await this.sdk.shutdown();
|
|
console.log('Tracing terminated successfully');
|
|
} catch (error) {
|
|
console.error('Error terminating tracing', error);
|
|
}
|
|
}
|
|
|
|
// Manual span creation for custom instrumentation
|
|
createSpan(tracer, spanName, fn) {
|
|
return tracer.startActiveSpan(spanName, async (span) => {
|
|
try {
|
|
span.setAttribute('span.kind', 'internal');
|
|
span.setAttribute('custom.span', true);
|
|
|
|
const result = await fn(span);
|
|
|
|
        span.setStatus({ code: 1, message: 'OK' }); // 1 = SpanStatusCode.OK
|
|
return result;
|
|
} catch (error) {
|
|
span.setStatus({ code: 2, message: error.message });
|
|
span.recordException(error);
|
|
throw error;
|
|
} finally {
|
|
span.end();
|
|
}
|
|
});
|
|
}
|
|
}
|
|
|
|
// Usage in application
|
|
const tracing = new TracingSetup('api-gateway', process.env.NODE_ENV);
|
|
tracing.initialize();
|
|
|
|
// Custom instrumentation example
|
|
const { trace } = require('@opentelemetry/api');
|
|
|
|
async function processOrder(orderId) {
|
|
const tracer = trace.getTracer('order-processing', '1.0.0');
|
|
|
|
return tracing.createSpan(tracer, 'processOrder', async (span) => {
|
|
span.setAttribute('order.id', orderId);
|
|
span.addEvent('Order processing started');
|
|
|
|
// Validate order
|
|
await tracing.createSpan(tracer, 'validateOrder', async (childSpan) => {
|
|
childSpan.setAttribute('validation.type', 'schema');
|
|
// Validation logic
|
|
await validateOrderSchema(orderId);
|
|
});
|
|
|
|
// Process payment
|
|
await tracing.createSpan(tracer, 'processPayment', async (childSpan) => {
|
|
childSpan.setAttribute('payment.method', 'stripe');
|
|
// Payment logic
|
|
const result = await processStripePayment(orderId);
|
|
childSpan.setAttribute('payment.status', result.status);
|
|
childSpan.addEvent('Payment processed', {
|
|
'payment.amount': result.amount,
|
|
'payment.currency': result.currency
|
|
});
|
|
});
|
|
|
|
// Send confirmation
|
|
await tracing.createSpan(tracer, 'sendConfirmation', async (childSpan) => {
|
|
childSpan.setAttribute('notification.type', 'email');
|
|
// Email logic
|
|
await sendOrderConfirmation(orderId);
|
|
});
|
|
|
|
span.addEvent('Order processing completed');
|
|
return { success: true, orderId };
|
|
});
|
|
}
|
|
|
|
module.exports = { TracingSetup, tracing };
|
|
```
|
|
|
|
### Example 6: Custom Prometheus Exporters for Complex Metrics
|
|
|
|
```python
|
|
# custom_exporters.py - Python Prometheus exporter for business metrics
|
|
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info, Enum
|
|
from prometheus_client.core import CollectorRegistry
|
|
from prometheus_client import generate_latest
|
|
import time
|
|
import psycopg2
|
|
import redis
|
|
import requests
|
|
from datetime import datetime, timedelta
|
|
import asyncio
|
|
import aiohttp
|
|
|
|
class CustomBusinessExporter:
|
|
def __init__(self, db_config, redis_config, port=9091):
|
|
self.registry = CollectorRegistry()
|
|
self.db_config = db_config
|
|
self.redis_config = redis_config
|
|
self.port = port
|
|
|
|
# Initialize metrics
|
|
self.initialize_metrics()
|
|
|
|
# Connect to data sources
|
|
self.connect_datasources()
|
|
|
|
def initialize_metrics(self):
|
|
# Business KPI metrics
|
|
self.revenue_total = Gauge(
|
|
'business_revenue_total_usd',
|
|
'Total revenue in USD',
|
|
['period', 'product_line', 'region'],
|
|
registry=self.registry
|
|
)
|
|
|
|
self.customer_lifetime_value = Histogram(
|
|
'business_customer_lifetime_value_usd',
|
|
'Customer lifetime value distribution',
|
|
['customer_segment', 'acquisition_channel'],
|
|
buckets=(10, 50, 100, 500, 1000, 5000, 10000, 50000),
|
|
registry=self.registry
|
|
)
|
|
|
|
self.churn_rate = Gauge(
|
|
'business_churn_rate_percentage',
|
|
'Customer churn rate',
|
|
['plan', 'cohort'],
|
|
registry=self.registry
|
|
)
|
|
|
|
self.monthly_recurring_revenue = Gauge(
|
|
'business_mrr_usd',
|
|
'Monthly recurring revenue',
|
|
['plan', 'currency'],
|
|
registry=self.registry
|
|
)
|
|
|
|
self.net_promoter_score = Gauge(
|
|
'business_nps',
|
|
'Net Promoter Score',
|
|
['segment', 'survey_type'],
|
|
registry=self.registry
|
|
)
|
|
|
|
# Operational metrics
|
|
self.data_pipeline_lag = Histogram(
|
|
'data_pipeline_lag_seconds',
|
|
'Data pipeline processing lag',
|
|
['pipeline', 'stage'],
|
|
buckets=(1, 5, 10, 30, 60, 300, 600, 1800, 3600),
|
|
registry=self.registry
|
|
)
|
|
|
|
self.feature_usage = Counter(
|
|
'feature_usage_total',
|
|
'Feature usage counts',
|
|
['feature_name', 'user_tier', 'success'],
|
|
registry=self.registry
|
|
)
|
|
|
|
self.api_quota_usage = Gauge(
|
|
'api_quota_usage_percentage',
|
|
'API quota usage by customer',
|
|
['customer_id', 'tier', 'resource'],
|
|
registry=self.registry
|
|
)
|
|
|
|
# System health indicators
|
|
self.dependency_health = Enum(
|
|
'dependency_health_status',
|
|
'Health status of external dependencies',
|
|
['service', 'dependency'],
|
|
states=['healthy', 'degraded', 'unhealthy'],
|
|
registry=self.registry
|
|
)
|
|
|
|
self.data_quality_score = Gauge(
|
|
'data_quality_score',
|
|
'Data quality score (0-100)',
|
|
['dataset', 'dimension'],
|
|
registry=self.registry
|
|
)
|
|
|
|
def connect_datasources(self):
|
|
# PostgreSQL connection
|
|
self.db_conn = psycopg2.connect(**self.db_config)
|
|
|
|
# Redis connection
|
|
self.redis_client = redis.Redis(**self.redis_config)
|
|
|
|
def collect_business_metrics(self):
|
|
"""Collect business metrics from various data sources"""
|
|
cursor = self.db_conn.cursor()
|
|
|
|
# Revenue metrics
|
|
cursor.execute("""
|
|
SELECT
|
|
DATE_TRUNC('day', created_at) as period,
|
|
product_line,
|
|
region,
|
|
SUM(amount) as total_revenue
|
|
FROM orders
|
|
WHERE status = 'completed'
|
|
AND created_at >= NOW() - INTERVAL '7 days'
|
|
GROUP BY period, product_line, region
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.revenue_total.labels(
|
|
period=row[0].isoformat(),
|
|
product_line=row[1],
|
|
region=row[2]
|
|
).set(row[3])
|
|
|
|
# Customer lifetime value
|
|
cursor.execute("""
|
|
SELECT
|
|
c.segment,
|
|
c.acquisition_channel,
|
|
AVG(o.total_spent) as avg_clv
|
|
FROM customers c
|
|
JOIN (
|
|
SELECT customer_id, SUM(amount) as total_spent
|
|
FROM orders
|
|
WHERE status = 'completed'
|
|
GROUP BY customer_id
|
|
) o ON c.id = o.customer_id
|
|
GROUP BY c.segment, c.acquisition_channel
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.customer_lifetime_value.labels(
|
|
customer_segment=row[0],
|
|
acquisition_channel=row[1]
|
|
).observe(row[2])
|
|
|
|
# MRR calculation
|
|
cursor.execute("""
|
|
SELECT
|
|
plan_name,
|
|
currency,
|
|
SUM(
|
|
CASE
|
|
WHEN billing_period = 'yearly' THEN amount / 12
|
|
ELSE amount
|
|
END
|
|
) as mrr
|
|
FROM subscriptions
|
|
WHERE status = 'active'
|
|
GROUP BY plan_name, currency
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.monthly_recurring_revenue.labels(
|
|
plan=row[0],
|
|
currency=row[1]
|
|
).set(row[2])
|
|
|
|
# Churn rate
|
|
cursor.execute("""
|
|
WITH cohort_data AS (
|
|
SELECT
|
|
plan_name,
|
|
DATE_TRUNC('month', created_at) as cohort,
|
|
COUNT(*) as total_customers,
|
|
COUNT(CASE WHEN status = 'cancelled' THEN 1 END) as churned_customers
|
|
FROM subscriptions
|
|
WHERE created_at >= NOW() - INTERVAL '6 months'
|
|
GROUP BY plan_name, cohort
|
|
)
|
|
SELECT
|
|
plan_name,
|
|
cohort,
|
|
(churned_customers::float / total_customers) * 100 as churn_rate
|
|
FROM cohort_data
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.churn_rate.labels(
|
|
plan=row[0],
|
|
cohort=row[1].isoformat()
|
|
).set(row[2])
|
|
|
|
cursor.close()
|
|
|
|
def collect_operational_metrics(self):
|
|
"""Collect operational metrics from Redis and other sources"""
|
|
|
|
# API quota usage from Redis
|
|
for key in self.redis_client.scan_iter("quota:*"):
|
|
            # keys may be str (decode_responses=True) or bytes depending on client config
            key = key.decode() if isinstance(key, bytes) else key
            parts = key.split(':')
|
|
if len(parts) >= 3:
|
|
customer_id = parts[1]
|
|
resource = parts[2]
|
|
|
|
used = float(self.redis_client.get(key) or 0)
|
|
limit_key = f"quota_limit:{customer_id}:{resource}"
|
|
limit = float(self.redis_client.get(limit_key) or 1000)
|
|
|
|
usage_percentage = (used / limit) * 100 if limit > 0 else 0
|
|
|
|
# Get customer tier from database
|
|
cursor = self.db_conn.cursor()
|
|
cursor.execute(
|
|
"SELECT tier FROM customers WHERE id = %s",
|
|
(customer_id,)
|
|
)
|
|
result = cursor.fetchone()
|
|
tier = result[0] if result else 'unknown'
|
|
cursor.close()
|
|
|
|
self.api_quota_usage.labels(
|
|
customer_id=customer_id,
|
|
tier=tier,
|
|
resource=resource
|
|
).set(usage_percentage)
|
|
|
|
# Data pipeline lag from Redis
|
|
pipeline_stages = ['ingestion', 'processing', 'storage', 'delivery']
|
|
for stage in pipeline_stages:
|
|
lag_key = f"pipeline:lag:{stage}"
|
|
lag_value = self.redis_client.get(lag_key)
|
|
if lag_value:
|
|
self.data_pipeline_lag.labels(
|
|
pipeline='main',
|
|
stage=stage
|
|
).observe(float(lag_value))
|
|
|
|
def check_dependency_health(self):
|
|
"""Check health of external dependencies"""
|
|
dependencies = [
|
|
('payment', 'stripe', 'https://api.stripe.com/health'),
|
|
('email', 'sendgrid', 'https://api.sendgrid.com/health'),
|
|
('storage', 's3', 'https://s3.amazonaws.com/health'),
|
|
('cache', 'redis', 'redis://localhost:6379'),
|
|
('database', 'postgres', self.db_config)
|
|
]
|
|
|
|
for service, dep_name, endpoint in dependencies:
|
|
try:
|
|
if dep_name == 'redis':
|
|
# Check Redis
|
|
self.redis_client.ping()
|
|
status = 'healthy'
|
|
elif dep_name == 'postgres':
|
|
# Check PostgreSQL
|
|
cursor = self.db_conn.cursor()
|
|
cursor.execute("SELECT 1")
|
|
cursor.close()
|
|
status = 'healthy'
|
|
else:
|
|
# Check HTTP endpoints
|
|
response = requests.get(endpoint, timeout=5)
|
|
if response.status_code == 200:
|
|
status = 'healthy'
|
|
elif 200 < response.status_code < 500:
|
|
status = 'degraded'
|
|
else:
|
|
status = 'unhealthy'
|
|
except Exception as e:
|
|
print(f"Health check failed for {dep_name}: {e}")
|
|
status = 'unhealthy'
|
|
|
|
self.dependency_health.labels(
|
|
service=service,
|
|
dependency=dep_name
|
|
).state(status)
|
|
|
|
def calculate_data_quality(self):
|
|
"""Calculate data quality scores"""
|
|
cursor = self.db_conn.cursor()
|
|
|
|
# Completeness score
|
|
cursor.execute("""
|
|
SELECT
|
|
'orders' as dataset,
|
|
(COUNT(*) - COUNT(CASE WHEN customer_email IS NULL THEN 1 END))::float / COUNT(*) * 100 as completeness
|
|
FROM orders
|
|
WHERE created_at >= NOW() - INTERVAL '1 day'
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.data_quality_score.labels(
|
|
dataset=row[0],
|
|
dimension='completeness'
|
|
).set(row[1])
|
|
|
|
# Accuracy score (checking for valid email formats)
|
|
cursor.execute("""
|
|
SELECT
|
|
'customers' as dataset,
|
|
                COUNT(CASE WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' THEN 1 END)::float / COUNT(*) * 100 as accuracy
|
|
FROM customers
|
|
WHERE created_at >= NOW() - INTERVAL '1 day'
|
|
""")
|
|
|
|
for row in cursor.fetchall():
|
|
self.data_quality_score.labels(
|
|
dataset=row[0],
|
|
dimension='accuracy'
|
|
).set(row[1])
|
|
|
|
cursor.close()
|
|
|
|
async def collect_metrics_async(self):
|
|
"""Async collection for improved performance"""
|
|
tasks = [
|
|
self.collect_business_metrics_async(),
|
|
self.collect_operational_metrics_async(),
|
|
self.check_dependency_health_async(),
|
|
self.calculate_data_quality_async()
|
|
]
|
|
|
|
await asyncio.gather(*tasks)
|
|
|
|
def run(self):
|
|
"""Start the exporter"""
|
|
# Start HTTP server for Prometheus to scrape
|
|
start_http_server(self.port, registry=self.registry)
|
|
print(f"Custom exporter started on port {self.port}")
|
|
|
|
# Collect metrics every 30 seconds
|
|
while True:
|
|
try:
|
|
self.collect_business_metrics()
|
|
self.collect_operational_metrics()
|
|
self.check_dependency_health()
|
|
self.calculate_data_quality()
|
|
|
|
print(f"Metrics collected at {datetime.now()}")
|
|
time.sleep(30)
|
|
|
|
except Exception as e:
|
|
print(f"Error collecting metrics: {e}")
|
|
time.sleep(30)
|
|
|
|
# Usage
|
|
if __name__ == "__main__":
|
|
db_config = {
|
|
'host': 'localhost',
|
|
'database': 'production',
|
|
'user': 'metrics_user',
|
|
'password': 'secure_password',
|
|
'port': 5432
|
|
}
|
|
|
|
redis_config = {
|
|
'host': 'localhost',
|
|
'port': 6379,
|
|
'db': 0,
|
|
'decode_responses': True
|
|
}
|
|
|
|
exporter = CustomBusinessExporter(db_config, redis_config)
|
|
exporter.run()
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
| Error | Cause | Solution |
|
|
|-------|-------|----------|
|
|
| "Connection refused to Prometheus" | Prometheus not running or wrong port | Check Docker container status with `docker ps`, verify port mapping |
|
|
| "No data in Grafana dashboard" | Metrics not being scraped | Verify Prometheus targets at `localhost:9090/targets`, check API metrics endpoint |
|
|
| "Too many samples" error | High cardinality labels | Review label usage, avoid user IDs or timestamps as labels |
|
|
| "Out of memory" in Prometheus | Retention too long or too many metrics | Reduce retention time, implement remote storage, or scale vertically |
|
|
| Jaeger traces not appearing | Incorrect sampling rate | Increase sampling rate in tracer configuration |
|
|
| "Context deadline exceeded" | Scrape timeout too short | Increase scrape_timeout in prometheus.yml (default 10s) |
|
|
| "Error reading Prometheus" | Corrupt WAL (write-ahead log) | Delete WAL directory: `rm -rf /prometheus/wal/*` and restart |
|
|
| "Too many open files" | File descriptor limit reached | Increase ulimit: `ulimit -n 65536` or adjust systemd limits |
|
|
| AlertManager not firing | Incorrect routing rules | Validate routing tree with `amtool config routes` |
|
|
| Grafana login loop | Cookie/session issues | Clear browser cookies, check Grafana cookie settings |
|
|
|
|
## Configuration Options
|
|
|
|
**Basic Usage:**
|
|
```bash
|
|
/create-monitoring \
|
|
--stack=prometheus \
|
|
--services=api-gateway,user-service,order-service \
|
|
--environment=production \
|
|
--retention=30d
|
|
```
|
|
|
|
**Available Options:**
|
|
|
|
`--stack <type>` - Monitoring stack to deploy
|
|
- `prometheus` - Prometheus + Grafana + AlertManager (default, open-source)
|
|
- `elastic` - ELK stack (Elasticsearch, Logstash, Kibana) for log-centric
|
|
- `datadog` - Datadog agent configuration (requires API key)
|
|
- `newrelic` - New Relic agent setup (requires license key)
|
|
- `hybrid` - Combination of metrics (Prometheus) and logs (ELK)
|
|
|
|
`--tracing <backend>` - Distributed tracing backend
|
|
- `jaeger` - Jaeger all-in-one (default, recommended for start)
|
|
- `zipkin` - Zipkin server
|
|
- `tempo` - Grafana Tempo (for high-scale)
|
|
- `xray` - AWS X-Ray (for AWS environments)
|
|
- `none` - Skip tracing setup
|
|
|
|
`--retention <duration>` - Metrics retention period
|
|
- Default: `15d` (15 days)
|
|
- Production: `30d` to `90d`
|
|
- With remote storage: `365d` or more
|
|
|
|
`--scrape-interval <duration>` - How often to collect metrics
|
|
- Default: `15s`
|
|
- High-frequency: `5s` (higher resource usage)
|
|
- Low-frequency: `60s` (for stable metrics)
|
|
|
|
`--alerting-channels <channels>` - Where to send alerts
|
|
- `slack` - Slack webhook integration
|
|
- `pagerduty` - PagerDuty integration
|
|
- `email` - SMTP email notifications
|
|
- `webhook` - Custom webhook endpoint
|
|
- `opsgenie` - Atlassian OpsGenie
|
|
|
|
`--dashboard-presets <presets>` - Pre-built dashboards to install (loaded via a provider file; see the sketch after this list)
|
|
- `red-metrics` - Rate, Errors, Duration
|
|
- `four-golden` - Latency, Traffic, Errors, Saturation
|
|
- `business-kpis` - Revenue, Users, Conversion
|
|
- `sre-slos` - SLI/SLO tracking
|
|
- `security` - Security metrics and anomalies
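Preset dashboards are loaded through a Grafana dashboard provider. The sketch below is a possible provider file placed alongside the JSON dashboards in `grafana-dashboards/` (the folder mounted in Example 2); the provider name and folder title are assumptions.

```yaml
# grafana-dashboards/provider.yml - dashboard provisioning (sketch)
apiVersion: 1
providers:
  - name: 'api-monitoring'
    folder: 'API Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```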
|
|
|
|
`--exporters <list>` - Additional exporters to configure
|
|
- `node-exporter` - System/host metrics (see the sketch after this list)
|
|
- `blackbox-exporter` - Probe endpoints
|
|
- `postgres-exporter` - PostgreSQL metrics
|
|
- `redis-exporter` - Redis metrics
|
|
- `custom` - Custom business metrics
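As an example of what `--exporters` adds, the sketch below wires in node-exporter: a Compose service plus a matching scrape job. The image tag and port mapping are assumptions; merge the fragments into the generated `docker-compose.yml` and `prometheus.yml`.

```yaml
# --exporters=node-exporter fragments (sketch)
# docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:v1.6.1
    ports:
      - "9100:9100"
    networks:
      - monitoring

# prometheus.yml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```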
|
|
|
|
`--high-availability` - Enable HA configuration
|
|
- Sets up Prometheus federation (see the sketch after this list)
|
|
- Configures AlertManager clustering
|
|
- Enables Grafana database replication
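A federation sketch for the global Prometheus: it scrapes the `/federate` endpoint of per-region instances for a subset of series. The `match[]` selectors and the downstream hostnames are placeholders.

```yaml
# prometheus.yml on the global instance - federation job (sketch)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="api-services"}'
        - '{__name__=~"api:.*"}'   # aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-west:9090'
```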
|
|
|
|
`--storage <type>` - Long-term storage backend
|
|
- `local` - Local disk (default)
|
|
- `thanos` - Thanos for unlimited retention
|
|
- `cortex` - Cortex for multi-tenant
|
|
- `victoria` - VictoriaMetrics for efficiency
|
|
- `s3` - S3-compatible object storage
|
|
|
|
`--dry-run` - Generate configuration without deploying
|
|
- Creates all config files
|
|
- Validates syntax
|
|
- Shows what would be deployed
|
|
- No actual containers started
|
|
|
|
## Best Practices
|
|
|
|
DO:
|
|
- Start with RED metrics (Rate, Errors, Duration) as your foundation
|
|
- Use histogram buckets that align with your SLO targets
|
|
- Tag metrics with environment, region, version, and service
|
|
- Create runbooks for every alert and link them in annotations
|
|
- Implement meta-monitoring (monitor the monitoring system)
|
|
- Use recording rules for frequently-run expensive queries
|
|
- Set up separate dashboards for different audiences (ops, dev, business)
|
|
- Use exemplars to link metrics to traces for easier debugging
|
|
- Implement gradual rollout of new metrics to avoid cardinality explosion
|
|
- Archive old dashboards before creating new ones
|
|
|
|
DON'T:
|
|
- Add high-cardinality labels like user IDs, session IDs, or UUIDs
|
|
- Create dashboards with 50+ panels (causes browser performance issues)
|
|
- Alert on symptoms without providing actionable runbooks
|
|
- Store raw logs in Prometheus (use log aggregation systems)
|
|
- Ignore alert fatigue (regularly review and tune thresholds)
|
|
- Hardcode datasource UIDs in dashboard JSON
|
|
- Mix metrics from different time ranges in one panel
|
|
- Use regex selectors without limits in production queries
|
|
- Forget to set up backup for Grafana database
|
|
- Skip capacity planning for metrics growth
|
|
|
|
TIPS:
|
|
- Import dashboards from grafana.com marketplace (dashboard IDs)
|
|
- Use Prometheus federation for multi-region deployments
|
|
- Implement progressive alerting: warning (Slack) → critical (PagerDuty)
|
|
- Create team-specific folders in Grafana for organization
|
|
- Use Grafana variables for dynamic, reusable dashboards
|
|
- Set up dashboard playlists for NOC/SOC displays
|
|
- Use annotations to mark deployments and incidents on graphs
|
|
- Implement SLO burn rate alerts instead of static thresholds
|
|
- Create separate Prometheus jobs for different scrape intervals
|
|
- Use remote_write for backup and long-term storage
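A `remote_write` sketch for that last tip: the receiver URL is a placeholder for Thanos Receive, Cortex, VictoriaMetrics, or any compatible endpoint, and the relabeling shows how to drop noisy series before they leave the instance.

```yaml
# prometheus.yml - remote_write for backup / long-term storage (sketch)
remote_write:
  - url: 'https://metrics-receiver.example.com/api/v1/receive'
    queue_config:
      capacity: 20000
      max_shards: 30
      max_samples_per_send: 5000
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
```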
|
|
|
|
## Performance Considerations
|
|
|
|
**Prometheus Resource Planning**
|
|
```
|
|
Memory Required =
|
|
(number_of_time_series * 2KB) + # Active series
|
|
(ingestion_rate * 2 * retention_hours) + # WAL and blocks
|
|
(2GB) # Base overhead
|
|
|
|
CPU Cores Required =
|
|
(ingestion_rate / 100,000) + # Ingestion processing
|
|
(query_rate / 10) + # Query processing
|
|
(1) # Base overhead
|
|
|
|
Disk IOPS Required =
|
|
(ingestion_rate / 1000) + # Write IOPS
|
|
(query_rate * 100) + # Read IOPS
|
|
(100) # Background compaction
|
|
```
|
|
|
|
**Optimization Strategies**

1. **Reduce cardinality**: Audit and remove unnecessary labels
2. **Use recording rules**: Pre-compute expensive queries
3. **Optimize scrape configs**: Use different intervals for different metrics (sketched after this list)
4. **Implement downsampling**: For long-term storage
5. **Horizontal sharding**: Run a separate Prometheus per service or team
6. **Remote storage**: Offload old data to object storage
7. **Query caching**: Use Trickster or Grafana's built-in caching
8. **Metric relabeling**: Drop unwanted metrics at scrape time
9. **Federation**: Aggregate metrics hierarchically
10. **Capacity limits**: Set max_samples_per_send and queue sizes

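A minimal `prometheus.yml` fragment tying together strategies 3, 6, 8, and 10; the job names, targets, and remote endpoint are assumptions for illustration:

```yaml
# prometheus.yml fragment (illustrative; adjust names and endpoints to your environment)
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'api-fast'               # latency-sensitive services scraped frequently
    scrape_interval: 10s
    static_configs:
      - targets: ['api:3000']

  - job_name: 'batch-workers'          # slow-moving metrics scraped less often
    scrape_interval: 60s
    static_configs:
      - targets: ['worker:9100']
    metric_relabel_configs:
      - source_labels: [__name__]      # drop noisy debug metrics at scrape time
        regex: 'batch_debug_.*'
        action: drop

remote_write:
  - url: 'https://thanos-receive.example.com/api/v1/receive'   # hypothetical long-term store
    queue_config:
      max_samples_per_send: 5000       # cap batch size to protect the remote endpoint
```
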
**Scaling Thresholds**

- Under 1M active series: a single Prometheus instance
- 1M - 10M series: Prometheus with remote storage
- 10M - 100M series: sharded Prometheus or Cortex (a federation sketch follows)
- Over 100M series: Thanos or multi-region Cortex

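When sharding, a global Prometheus can federate pre-aggregated series from the per-team shards; a minimal sketch, assuming the shards expose the standard `/federate` endpoint and publish `job:`-prefixed recording rules:

```yaml
# Global Prometheus federating from shards (shard hostnames are assumptions)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true                 # keep the shard's job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'       # pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-shard-a:9090'
          - 'prometheus-shard-b:9090'
```
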
## Security Considerations

**Authentication & Authorization**

```yaml
# prometheus.yml with basic auth
scrape_configs:
  - job_name: 'secured-api'
    basic_auth:
      username: 'prometheus'
      password_file: '/etc/prometheus/password.txt'
    scheme: https
    tls_config:
      ca_file: '/etc/prometheus/ca.crt'
      cert_file: '/etc/prometheus/cert.crt'
      key_file: '/etc/prometheus/key.pem'
      insecure_skip_verify: false
```

**Network Security**

- Deploy the monitoring stack in an isolated subnet
- Use internal load balancers for Prometheus federation
- Implement mTLS between Prometheus and its targets (see the sketch after this list)
- Restrict metrics endpoints to the monitoring CIDR blocks
- Use VPN or private links for cross-region federation

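For the mTLS point, Prometheus and the official exporters accept a web configuration file (passed via `--web.config.file`) that can require client certificates; a minimal sketch, with file paths as assumptions:

```yaml
# web-config.yml for an exporter or for Prometheus itself (pass with --web.config.file)
tls_server_config:
  cert_file: /etc/exporter/server.crt
  key_file: /etc/exporter/server.key
  client_auth_type: RequireAndVerifyClientCert   # reject scrapes without a valid client cert
  client_ca_file: /etc/exporter/prometheus-ca.crt
```

The Prometheus side then presents its client certificate through the `tls_config` block shown in the Authentication example above.
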
**Data Security**

- Encrypt data at rest (filesystem encryption)
- Sanitize metrics to avoid leaking sensitive data
- Implement audit logging for all access
- Regularly scan the monitoring infrastructure for vulnerabilities
- Rotate credentials and certificates regularly

**Compliance Considerations**

- GDPR: Avoid collecting PII in metric labels (see the relabeling sketch below)
- HIPAA: Encrypt all health-related metrics
- PCI DSS: Keep payment metrics in an isolated stack
- SOC 2: Maintain audit trails and access logs

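One hedged way to enforce the GDPR point at scrape time is to drop PII-bearing labels before ingestion; the label names below are assumptions about what an application might accidentally expose:

```yaml
# Scrape-job fragment: strip PII-bearing labels before samples are stored
metric_relabel_configs:
  - regex: 'email|user_id|session_id'   # label names to remove (illustrative)
    action: labeldrop
```
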
## Troubleshooting Guide

**Issue: Prometheus consuming too much memory**

```bash
# 1. Check series counts per metric name
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName' | head -20

# 2. Find the highest-cardinality metrics
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' | jq

# 3. Identify problematic labels (e.g. a userId label with thousands of values)
curl -s http://localhost:9090/api/v1/label/userId/values | jq '.data | length'

# 4. Drop high-cardinality metrics
# Add to the affected scrape job in prometheus.yml:
#   metric_relabel_configs:
#     - source_labels: [__name__]
#       regex: 'problematic_metric_.*'
#       action: drop
```

**Issue: Grafana dashboards loading slowly**

```bash
# 1. Check the Prometheus query log for slow queries
#    (enable it first with `query_log_file: /prometheus/query.log` under `global:`;
#    field names may differ slightly between Prometheus versions)
tail -n 1000 /prometheus/query.log | jq 'select(.stats.timings.evalTotalTime > 1)'

# 2. Analyze slow queries recorded by Grafana query analytics
#    (requires query analytics to be exported somewhere queryable;
#    the table and column names below are illustrative)
#    SELECT dashboard_id, panel_id, AVG(duration) AS avg_duration, query
#    FROM grafana.query_history
#    WHERE duration > 1000
#    GROUP BY dashboard_id, panel_id, query
#    ORDER BY avg_duration DESC;

# 3. Pre-compute the heaviest panel queries with recording rules
# Add to recording_rules.yml:
#   groups:
#     - name: dashboard_queries
#       interval: 30s
#       rules:
#         - record: api:request_rate5m
#           expr: sum(rate(http_requests_total[5m])) by (service)
```

**Issue: Alerts not firing**

```bash
# 1. Check alert state in Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq

# 2. Validate the AlertManager routing tree
docker exec alertmanager amtool config routes --config.file=/etc/alertmanager/alertmanager.yml

# 3. Test alert routing for a specific label set
docker exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --verify.receivers=slack-critical \
  severity=critical service=api

# 4. Check whether the alert is being suppressed by an inhibition rule
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.inhibitedBy != [])'
```

**Issue: Missing traces in Jaeger**

```javascript
// 1. Sample everything while debugging: the 'const' sampler with param 1 keeps all traces
const { initTracer } = require('jaeger-client');

const tracer = initTracer(
  {
    serviceName: 'api-gateway',
    sampler: {
      type: 'const',   // constant-decision sampler
      param: 1,        // 1 = sample every request, 0 = sample nothing
    },
    // 2. Log spans as they are reported so you can confirm they leave the process
    reporter: { logSpans: true },
  },
  { logger: console },
);

// 3. Verify traces are reaching Jaeger via the query API (run from a shell):
//    curl -s 'http://localhost:16686/api/traces?service=api-gateway&limit=5'
```

## Migration Guide

**From CloudWatch to Prometheus:**

```python
# One-off migration example: copy recent CloudWatch datapoints into Prometheus via the
# Pushgateway (illustrative; a real migration would iterate instances and dimensions)
from datetime import datetime, timedelta

import boto3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def migrate_cloudwatch_to_prometheus():
    # Read from CloudWatch
    cw = boto3.client('cloudwatch')
    metrics = cw.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
        StartTime=datetime.now() - timedelta(hours=1),
        EndTime=datetime.now(),
        Period=300,
        Statistics=['Average'],
    )

    # Write to Prometheus via the Pushgateway
    registry = CollectorRegistry()
    g = Gauge('aws_ec2_cpu_utilization', 'EC2 CPU Usage',
              ['instance_id'], registry=registry)

    for datapoint in metrics['Datapoints']:
        g.labels(instance_id='i-1234567890abcdef0').set(datapoint['Average'])

    push_to_gateway('localhost:9091', job='cloudwatch_migration', registry=registry)
```

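For an ongoing bridge rather than a one-off copy, the official `cloudwatch_exporter` can expose the same CloudWatch metrics for Prometheus to scrape; a minimal config sketch, where the region, namespace, and period are assumptions:

```yaml
# cloudwatch_exporter config.yml (illustrative)
region: us-east-1
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
    period_seconds: 300
```
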
**From Datadog to Prometheus:**

1. Export Datadog dashboards as JSON
2. Convert queries using a query translator
3. Import to Grafana with a dashboard converter
4. Map Datadog tags to Prometheus labels
5. Recreate alerts as Prometheus/AlertManager rules (an example conversion is sketched below)

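As a hedged illustration of step 5, a Datadog error-rate monitor on an `api` service might translate to a rule like the following; the metric and label names are assumptions about your Prometheus instrumentation:

```yaml
# alerting-rules.yml fragment (illustrative translation of a Datadog monitor)
groups:
  - name: migrated_from_datadog
    rules:
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate above 5% for 5 minutes"
```
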
## Related Commands

- `/api-load-tester` - Generate test traffic to validate monitoring setup
- `/api-security-scanner` - Security testing with metrics integration
- `/add-rate-limiting` - Rate limiting with metrics exposure
- `/api-contract-generator` - Generate OpenAPI specs with metrics annotations
- `/deployment-pipeline-orchestrator` - CI/CD with monitoring integration
- `/api-versioning-manager` - Version-aware metrics tracking

## Advanced Topics

**Multi-Cluster Monitoring with Thanos:**

```yaml
# thanos-sidecar.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-config
data:
  object-store.yaml: |
    type: S3
    config:
      bucket: metrics-long-term
      endpoint: s3.amazonaws.com
      access_key: ${AWS_ACCESS_KEY}
      secret_key: ${AWS_SECRET_KEY}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-thanos
spec:
  template:
    spec:
      containers:
        - name: prometheus
          args:
            - --storage.tsdb.retention.time=2h
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --web.enable-lifecycle
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.31.0
          args:
            - sidecar
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/object-store.yaml
```

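To query across clusters, a Thanos Query deployment fans out to each sidecar's gRPC endpoint and then serves as Grafana's Prometheus datasource; a minimal sketch, where the `--store` addresses are assumptions about your service names:

```yaml
# thanos-query.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.31.0
          args:
            - query
            - --http-address=0.0.0.0:10902      # Grafana points here as a Prometheus datasource
            - --grpc-address=0.0.0.0:10901
            - --store=prometheus-thanos-0.prometheus-thanos:10901   # sidecar gRPC endpoint
            - --store=thanos-store-gateway:10901                    # historical blocks from S3
```
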
**Service Mesh Observability (Istio):**

```yaml
# Automatic request metrics from Istio sidecars
# (IstioOperator fragment: spec.values.telemetry.v2)
telemetry:
  v2:
    prometheus:
      configOverride:
        inboundSidecar:
          disable_host_header_fallback: false
          metric_expiry_duration: 10m
        outboundSidecar:
          disable_host_header_fallback: false
          metric_expiry_duration: 10m
        gateway:
          disable_host_header_fallback: true
```

## Version History

- v1.0.0 (2024-01): Initial Prometheus + Grafana implementation
- v1.1.0 (2024-03): Added Jaeger tracing integration
- v1.2.0 (2024-05): Thanos long-term storage support
- v1.3.0 (2024-07): OpenTelemetry collector integration
- v1.4.0 (2024-09): Multi-cluster federation support
- v1.5.0 (2024-10): Custom business metrics exporters
- Planned v2.0.0: eBPF-based zero-instrumentation monitoring