| description | shortcut |
|---|---|
| Create API monitoring dashboard | monitor |
Create API Monitoring Dashboard
Build comprehensive monitoring infrastructure with metrics, logs, traces, and alerts for full API observability.
When to Use This Command
Use /create-monitoring when you need to:
- Establish observability for production APIs
- Track RED metrics (Rate, Errors, Duration) across services
- Set up real-time alerting for SLO violations
- Debug performance issues with distributed tracing
- Create executive dashboards for API health
- Implement SRE practices with data-driven insights
DON'T use this when:
- Building proof-of-concept applications (use lightweight logging instead)
- Monitoring non-critical internal tools (basic health checks may suffice)
- Resources are extremely constrained (consider managed solutions like Datadog first)
Design Decisions
This command implements a Prometheus + Grafana stack as the primary approach because:
- Open-source with no vendor lock-in
- Industry-standard metric format with wide ecosystem support
- Powerful query language (PromQL) for complex analysis
- Horizontal scalability via federation and remote storage
Alternative considered: ELK Stack (Elasticsearch, Logstash, Kibana)
- Better for log-centric analysis
- Higher resource requirements
- More complex operational overhead
- Recommended when logs are primary data source
Alternative considered: Managed solutions (Datadog, New Relic)
- Faster time-to-value
- Higher ongoing cost
- Less customization flexibility
- Recommended for teams without dedicated DevOps
Prerequisites
Before running this command:
- Docker and Docker Compose installed
- API instrumented with metrics endpoints (Prometheus format)
- Basic understanding of PromQL query language
- Network access for inter-service communication
- Sufficient disk space for time-series data (plan for 2-4 weeks retention)
Implementation Process
Step 1: Configure Prometheus
Set up Prometheus to scrape metrics from your API endpoints with service discovery.
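The generated prometheus.yml follows this general shape; the job names, ports, and discovery mechanism below are illustrative assumptions rather than the exact output:
# prometheus.yml - minimal scrape configuration (illustrative)
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/alerting-rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  # Statically listed API services exposing /metrics in Prometheus format
  - job_name: 'api-services'
    metrics_path: /metrics
    static_configs:
      - targets: ['api-gateway:8080', 'user-service:8080', 'order-service:8080']
        labels:
          environment: 'production'
  # Optional Docker service discovery: keep only containers labeled for scraping
  - job_name: 'docker-discovered'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep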
Step 2: Create Grafana Dashboards
Build visualizations for RED metrics, custom business metrics, and SLO tracking.
Step 3: Implement Distributed Tracing
Integrate Jaeger for end-to-end request tracing across microservices.
Step 4: Configure Alerting
Set up AlertManager rules for critical thresholds with notification channels (Slack, PagerDuty).
Step 5: Deploy Monitoring Stack
Deploy complete observability infrastructure with health checks and backup configurations.
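The deployment step can also add container health checks to the compose file shown in Example 2 below; a minimal fragment, assuming the busybox wget shipped in the Prometheus and AlertManager images:
# docker-compose.yml fragment - health checks to merge into the stack below (illustrative)
services:
  prometheus:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
  alertmanager:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9093/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3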
Output Format
The command generates:
- docker-compose.yml - Complete monitoring stack configuration
- prometheus.yml - Prometheus scrape configuration
- grafana-dashboards/ - Pre-built dashboard JSON files
- alerting-rules.yml - AlertManager rule definitions
- jaeger-config.yml - Distributed tracing configuration
- README.md - Deployment and operation guide
Code Examples
Example 1: Complete Node.js Express API with Comprehensive Monitoring
// metrics/instrumentation.js - Full-featured Prometheus instrumentation
const promClient = require('prom-client');
const { performance } = require('perf_hooks');
const os = require('os');
class MetricsCollector {
constructor() {
// Create separate registries for different metric types
this.register = new promClient.Registry();
this.businessRegister = new promClient.Registry();
// Add default system metrics
promClient.collectDefaultMetrics({
register: this.register,
prefix: 'api_',
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});
// Initialize all metric types
this.initializeMetrics();
this.initializeBusinessMetrics();
this.initializeCustomCollectors();
// Start periodic collectors
this.startPeriodicCollectors();
}
initializeMetrics() {
// RED Metrics (Rate, Errors, Duration)
this.httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code', 'service', 'environment'],
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
this.httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code', 'service', 'environment']
});
this.httpRequestErrors = new promClient.Counter({
name: 'http_request_errors_total',
help: 'Total number of HTTP errors',
labelNames: ['method', 'route', 'error_type', 'service', 'environment']
});
// Database metrics
this.dbQueryDuration = new promClient.Histogram({
name: 'db_query_duration_seconds',
help: 'Database query execution time',
labelNames: ['operation', 'table', 'database', 'status'],
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});
this.dbConnectionPool = new promClient.Gauge({
name: 'db_connection_pool_size',
help: 'Database connection pool metrics',
labelNames: ['state', 'database'] // states: active, idle, total
});
// Cache metrics
this.cacheHitRate = new promClient.Counter({
name: 'cache_operations_total',
help: 'Cache operation counts',
labelNames: ['operation', 'cache_name', 'status'] // hit, miss, set, delete
});
this.cacheLatency = new promClient.Histogram({
name: 'cache_operation_duration_seconds',
help: 'Cache operation latency',
labelNames: ['operation', 'cache_name'],
buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
});
// External API metrics
this.externalApiCalls = new promClient.Histogram({
name: 'external_api_duration_seconds',
help: 'External API call duration',
labelNames: ['service', 'endpoint', 'status_code'],
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
});
// Circuit breaker metrics
this.circuitBreakerState = new promClient.Gauge({
name: 'circuit_breaker_state',
help: 'Circuit breaker state (0=closed, 1=open, 2=half-open)',
labelNames: ['service']
});
// Rate limiting metrics
this.rateLimitHits = new promClient.Counter({
name: 'rate_limit_hits_total',
help: 'Number of rate limited requests',
labelNames: ['limit_type', 'client_type']
});
    // WebSocket metrics
    this.activeWebsockets = new promClient.Gauge({
      name: 'websocket_connections_active',
      help: 'Number of active WebSocket connections',
      labelNames: ['namespace', 'room']
    });
    // In-flight HTTP requests (created once here so the middleware below does not
    // re-register a metric with the same name on every request)
    this.httpRequestsInFlight = new promClient.Gauge({
      name: 'http_requests_in_flight',
      help: 'Number of in-flight HTTP requests',
      labelNames: ['method', 'route']
    });
    // Register all metrics
    [
      this.httpRequestDuration, this.httpRequestTotal, this.httpRequestErrors,
      this.dbQueryDuration, this.dbConnectionPool, this.cacheHitRate,
      this.cacheLatency, this.externalApiCalls, this.circuitBreakerState,
      this.rateLimitHits, this.activeWebsockets, this.httpRequestsInFlight
    ].forEach(metric => this.register.registerMetric(metric));
}
initializeBusinessMetrics() {
// User activity metrics
this.activeUsers = new promClient.Gauge({
name: 'business_active_users',
help: 'Number of active users in the last 5 minutes',
labelNames: ['user_type', 'plan']
});
this.userSignups = new promClient.Counter({
name: 'business_user_signups_total',
help: 'Total user signups',
labelNames: ['source', 'plan', 'country']
});
// Transaction metrics
this.transactionAmount = new promClient.Histogram({
name: 'business_transaction_amount_dollars',
help: 'Transaction amounts in dollars',
labelNames: ['type', 'status', 'payment_method'],
buckets: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000]
});
this.orderProcessingTime = new promClient.Histogram({
name: 'business_order_processing_seconds',
help: 'Time to process orders end-to-end',
labelNames: ['order_type', 'fulfillment_type'],
buckets: [10, 30, 60, 180, 300, 600, 1800, 3600]
});
// API usage metrics
this.apiUsageByClient = new promClient.Counter({
name: 'business_api_usage_by_client',
help: 'API usage segmented by client',
labelNames: ['client_id', 'tier', 'endpoint']
});
this.apiQuotaRemaining = new promClient.Gauge({
name: 'business_api_quota_remaining',
help: 'Remaining API quota for clients',
labelNames: ['client_id', 'tier', 'quota_type']
});
// Revenue metrics
this.revenueByProduct = new promClient.Counter({
name: 'business_revenue_by_product_cents',
help: 'Revenue by product in cents',
labelNames: ['product_id', 'product_category', 'currency']
});
// Register business metrics
[
this.activeUsers, this.userSignups, this.transactionAmount,
this.orderProcessingTime, this.apiUsageByClient, this.apiQuotaRemaining,
this.revenueByProduct
].forEach(metric => this.businessRegister.registerMetric(metric));
}
initializeCustomCollectors() {
// SLI/SLO metrics
this.sloCompliance = new promClient.Gauge({
name: 'slo_compliance_percentage',
help: 'SLO compliance percentage',
labelNames: ['slo_name', 'service', 'window']
});
this.errorBudgetRemaining = new promClient.Gauge({
name: 'error_budget_remaining_percentage',
help: 'Remaining error budget percentage',
labelNames: ['service', 'slo_type']
});
this.register.registerMetric(this.sloCompliance);
this.register.registerMetric(this.errorBudgetRemaining);
}
startPeriodicCollectors() {
// Update active users every 30 seconds
setInterval(() => {
const activeUserCount = this.calculateActiveUsers();
this.activeUsers.set(
{ user_type: 'registered', plan: 'free' },
activeUserCount.free
);
this.activeUsers.set(
{ user_type: 'registered', plan: 'premium' },
activeUserCount.premium
);
}, 30000);
// Update SLO compliance every minute
setInterval(() => {
this.updateSLOCompliance();
}, 60000);
// Database pool monitoring
setInterval(() => {
this.updateDatabasePoolMetrics();
}, 15000);
}
// Middleware for HTTP metrics
httpMetricsMiddleware() {
return (req, res, next) => {
const start = performance.now();
const route = req.route?.path || req.path || 'unknown';
      // Track in-flight requests using the gauge created in initializeMetrics()
      this.httpRequestsInFlight.inc({ method: req.method, route });
res.on('finish', () => {
const duration = (performance.now() - start) / 1000;
const labels = {
method: req.method,
route,
status_code: res.statusCode,
service: process.env.SERVICE_NAME || 'api',
environment: process.env.NODE_ENV || 'development'
};
// Record metrics
this.httpRequestDuration.observe(labels, duration);
this.httpRequestTotal.inc(labels);
if (res.statusCode >= 400) {
const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
this.httpRequestErrors.inc({
...labels,
error_type: errorType
});
}
        this.httpRequestsInFlight.dec({ method: req.method, route });
// Log slow requests
if (duration > 1) {
console.warn('Slow request detected:', {
...labels,
duration,
user: req.user?.id,
ip: req.ip
});
}
});
next();
};
}
// Database query instrumentation
instrumentDatabase(knex) {
knex.on('query', (query) => {
query.__startTime = performance.now();
});
knex.on('query-response', (response, query) => {
const duration = (performance.now() - query.__startTime) / 1000;
const table = this.extractTableName(query.sql);
this.dbQueryDuration.observe({
operation: query.method || 'select',
table,
database: process.env.DB_NAME || 'default',
status: 'success'
}, duration);
});
knex.on('query-error', (error, query) => {
const duration = (performance.now() - query.__startTime) / 1000;
const table = this.extractTableName(query.sql);
this.dbQueryDuration.observe({
operation: query.method || 'select',
table,
database: process.env.DB_NAME || 'default',
status: 'error'
}, duration);
});
}
// Cache instrumentation wrapper
wrapCache(cache) {
const wrapper = {};
const methods = ['get', 'set', 'delete', 'has'];
methods.forEach(method => {
wrapper[method] = async (...args) => {
const start = performance.now();
const cacheName = cache.name || 'default';
try {
const result = await cache[method](...args);
const duration = (performance.now() - start) / 1000;
// Record cache metrics
if (method === 'get') {
const status = result !== undefined ? 'hit' : 'miss';
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status
});
} else {
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status: 'success'
});
}
this.cacheLatency.observe({
operation: method,
cache_name: cacheName
}, duration);
return result;
} catch (error) {
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status: 'error'
});
throw error;
}
};
});
return wrapper;
}
// External API call instrumentation
async trackExternalCall(serviceName, endpoint, callFunc) {
const start = performance.now();
try {
const result = await callFunc();
const duration = (performance.now() - start) / 1000;
this.externalApiCalls.observe({
service: serviceName,
endpoint,
status_code: result.status || 200
}, duration);
return result;
} catch (error) {
const duration = (performance.now() - start) / 1000;
this.externalApiCalls.observe({
service: serviceName,
endpoint,
status_code: error.response?.status || 0
}, duration);
throw error;
}
}
// Circuit breaker monitoring
updateCircuitBreakerState(service, state) {
const stateValue = {
'closed': 0,
'open': 1,
'half-open': 2
}[state] || 0;
this.circuitBreakerState.set({ service }, stateValue);
}
// Helper methods
calculateActiveUsers() {
// Implementation would query your session store or database
return {
free: Math.floor(Math.random() * 1000),
premium: Math.floor(Math.random() * 100)
};
}
updateSLOCompliance() {
// Calculate based on recent metrics
const availability = 99.95; // Calculate from actual metrics
const latencyP99 = 250; // Calculate from actual metrics
this.sloCompliance.set({
slo_name: 'availability',
service: 'api',
window: '30d'
}, availability);
this.sloCompliance.set({
slo_name: 'latency_p99',
service: 'api',
window: '30d'
}, latencyP99 < 500 ? 100 : 0);
// Update error budget
const errorBudget = 100 - ((100 - availability) / 0.05) * 100;
this.errorBudgetRemaining.set({
service: 'api',
slo_type: 'availability'
}, Math.max(0, errorBudget));
}
updateDatabasePoolMetrics() {
// Get pool stats from your database driver
const pool = global.dbPool; // Your database pool instance
if (pool) {
this.dbConnectionPool.set({
state: 'active',
database: 'primary'
}, pool.numUsed());
this.dbConnectionPool.set({
state: 'idle',
database: 'primary'
}, pool.numFree());
this.dbConnectionPool.set({
state: 'total',
database: 'primary'
}, pool.numUsed() + pool.numFree());
}
}
extractTableName(sql) {
const match = sql.match(/(?:from|into|update)\s+`?(\w+)`?/i);
return match ? match[1] : 'unknown';
}
// Expose metrics endpoint
async getMetrics() {
const baseMetrics = await this.register.metrics();
const businessMetrics = await this.businessRegister.metrics();
return baseMetrics + '\n' + businessMetrics;
}
}
// Express application setup
const express = require('express');
const app = express();
const metricsCollector = new MetricsCollector();
// Apply monitoring middleware
app.use(metricsCollector.httpMetricsMiddleware());
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', metricsCollector.register.contentType);
res.end(await metricsCollector.getMetrics());
});
// Example API endpoint with comprehensive tracking
app.post('/api/orders', async (req, res) => {
const orderStart = performance.now();
try {
// Track business metrics
metricsCollector.transactionAmount.observe({
type: 'purchase',
status: 'pending',
payment_method: req.body.paymentMethod
}, req.body.amount);
// Simulate external payment API call
const paymentResult = await metricsCollector.trackExternalCall(
'stripe',
'/charges',
async () => {
// Your actual payment API call
return await stripeClient.charges.create({
amount: req.body.amount * 100,
currency: 'usd'
});
}
);
// Track order processing time
const processingTime = (performance.now() - orderStart) / 1000;
metricsCollector.orderProcessingTime.observe({
order_type: 'standard',
fulfillment_type: 'digital'
}, processingTime);
// Track revenue
metricsCollector.revenueByProduct.inc({
product_id: req.body.productId,
product_category: req.body.category,
currency: 'USD'
}, req.body.amount * 100);
res.json({ success: true, orderId: paymentResult.id });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
module.exports = { app, metricsCollector };
Example 2: Complete Monitoring Stack with Docker Compose
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.45.0
container_name: prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerting-rules.yml:/etc/prometheus/alerting-rules.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--storage.tsdb.retention.time=15d'
ports:
- "9090:9090"
networks:
- monitoring
grafana:
image: grafana/grafana:10.0.0
container_name: grafana
volumes:
- grafana-data:/var/lib/grafana
- ./grafana-dashboards:/etc/grafana/provisioning/dashboards
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=http://localhost:3000
ports:
- "3000:3000"
networks:
- monitoring
depends_on:
- prometheus
jaeger:
image: jaegertracing/all-in-one:1.47
container_name: jaeger
environment:
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
- COLLECTOR_OTLP_ENABLED=true
ports:
- "5775:5775/udp"
- "6831:6831/udp"
- "6832:6832/udp"
- "5778:5778"
- "16686:16686" # Jaeger UI
- "14268:14268"
- "14250:14250"
- "9411:9411"
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
networks:
- monitoring
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
ports:
- "9093:9093"
networks:
- monitoring
networks:
monitoring:
driver: bridge
volumes:
prometheus-data:
grafana-data:
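The compose file above mounts grafana-datasources.yml into Grafana's provisioning directory but does not show it; a minimal provisioning sketch (the datasource names and the Jaeger entry are assumptions):
# grafana-datasources.yml - Grafana datasource provisioning (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686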
Example 3: Advanced Grafana Dashboard Definitions
// grafana-dashboards/api-overview.json
{
"dashboard": {
"id": null,
"uid": "api-overview",
"title": "API Performance Overview",
"tags": ["api", "performance", "sre"],
"timezone": "browser",
"schemaVersion": 16,
"version": 0,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": {
"value": "Prometheus",
"text": "Prometheus"
}
},
{
"name": "service",
"type": "query",
"datasource": "$datasource",
"query": "label_values(http_requests_total, service)",
"multi": true,
"includeAll": true,
"current": {
"value": ["$__all"],
"text": "All"
},
"refresh": 1
},
{
"name": "environment",
"type": "query",
"datasource": "$datasource",
"query": "label_values(http_requests_total, environment)",
"current": {
"value": "production",
"text": "Production"
}
}
]
},
"panels": [
{
"id": 1,
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
"type": "graph",
"title": "Request Rate (req/s)",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "reqps",
"label": "Requests per second"
}
],
"lines": true,
"linewidth": 2,
"fill": 1,
"fillGradient": 3,
"steppedLine": false,
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"alert": {
"name": "High Request Rate",
"conditions": [
{
"evaluator": {
"params": [10000],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"noDataState": "no_data",
"notifications": [
{
"uid": "slack-channel"
}
]
}
},
{
"id": 2,
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
"type": "graph",
"title": "Error Rate (%)",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\",status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "percent",
"label": "Error Rate",
"max": 10
}
],
"thresholds": [
{
"value": 1,
"op": "gt",
"fill": true,
"line": true,
"colorMode": "critical"
}
],
"alert": {
"name": "High Error Rate",
"conditions": [
{
"evaluator": {
"params": [1],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "last"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"noDataState": "no_data",
"notifications": [
{
"uid": "pagerduty"
}
],
"message": "Error rate is above 1% for service {{service}}"
}
},
{
"id": 3,
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
"type": "graph",
"title": "Response Time (p50, p95, p99)",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p50 {{service}}",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p95 {{service}}",
"refId": "B"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p99 {{service}}",
"refId": "C"
}
],
"yaxes": [
{
"format": "s",
"label": "Response Time"
}
]
},
{
"id": 4,
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 8 },
"type": "stat",
"title": "Current QPS",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[1m]))",
"instant": true,
"refId": "A"
}
],
"format": "reqps",
"sparkline": {
"show": true,
"lineColor": "rgb(31, 120, 193)",
"fillColor": "rgba(31, 120, 193, 0.18)"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 5000, "color": "yellow" },
{ "value": 10000, "color": "red" }
]
}
},
{
"id": 5,
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 8 },
"type": "stat",
"title": "Error Budget Remaining",
"targets": [
{
"expr": "error_budget_remaining_percentage{service=~\"$service\",slo_type=\"availability\"}",
"instant": true,
"refId": "A"
}
],
"format": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "red" },
{ "value": 25, "color": "orange" },
{ "value": 50, "color": "yellow" },
{ "value": 75, "color": "green" }
]
}
},
{
"id": 6,
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 8 },
"type": "table",
"title": "Top Slow Endpoints",
"targets": [
{
"expr": "topk(10, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, route)))",
"format": "table",
"instant": true,
"refId": "A"
}
],
"styles": [
{
"alias": "Time",
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"type": "date"
},
{
"alias": "Duration",
"colorMode": "cell",
"colors": ["green", "yellow", "red"],
"thresholds": [0.5, 1],
"type": "number",
"unit": "s"
}
]
}
]
}
}
Example 4: Production-Ready Alerting Rules
# alerting-rules.yml
groups:
- name: api_alerts
interval: 30s
rules:
# SLO-based alerts
- alert: APIHighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service, environment)
/
sum(rate(http_requests_total[5m])) by (service, environment)
) > 0.01
for: 5m
labels:
severity: critical
team: api-platform
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "{{ $labels.service }} in {{ $labels.environment }} has error rate of {{ $value | humanizePercentage }} (threshold: 1%)"
runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
dashboard_url: "https://grafana.example.com/d/api-overview?var-service={{ $labels.service }}"
- alert: APIHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
) > 0.5
for: 10m
labels:
severity: warning
team: api-platform
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P95 latency for {{ $labels.service }} is {{ $value | humanizeDuration }} (threshold: 500ms)"
- alert: APILowAvailability
expr: |
up{job="api-services"} == 0
for: 1m
labels:
severity: critical
team: api-platform
annotations:
summary: "API service {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
# Business metrics alerts
- alert: LowActiveUsers
expr: |
business_active_users{plan="premium"} < 10
for: 30m
labels:
severity: warning
team: product
annotations:
summary: "Low number of active premium users"
description: "Only {{ $value }} premium users active in the last 30 minutes"
- alert: HighTransactionFailureRate
expr: |
(
sum(rate(business_transaction_amount_dollars_sum{status="failed"}[5m]))
/
sum(rate(business_transaction_amount_dollars_sum[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "High transaction failure rate"
description: "Transaction failure rate is {{ $value | humanizePercentage }} (threshold: 5%)"
# Infrastructure alerts
- alert: DatabaseConnectionPoolExhausted
expr: |
(
db_connection_pool_size{state="active"}
/
db_connection_pool_size{state="total"}
) > 0.9
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "Database connection pool near exhaustion"
description: "{{ $labels.database }} pool is {{ $value | humanizePercentage }} utilized"
- alert: CacheLowHitRate
expr: |
(
sum(rate(cache_operations_total{status="hit"}[5m])) by (cache_name)
/
sum(rate(cache_operations_total{operation="get"}[5m])) by (cache_name)
) < 0.8
for: 15m
labels:
severity: warning
team: api-platform
annotations:
summary: "Low cache hit rate for {{ $labels.cache_name }}"
description: "Cache hit rate is {{ $value | humanizePercentage }} (expected: >80%)"
- alert: CircuitBreakerOpen
expr: |
circuit_breaker_state == 1
for: 1m
labels:
severity: warning
team: api-platform
annotations:
summary: "Circuit breaker open for {{ $labels.service }}"
description: "Circuit breaker for {{ $labels.service }} has been open for more than 1 minute"
# SLO burn rate alerts (multi-window approach)
- alert: SLOBurnRateHigh
expr: |
(
# 5m burn rate > 14.4 (1 hour of error budget in 5 minutes)
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > (1 - 0.999) * 14.4
) and (
# 1h burn rate > 1 (confirms it's not a spike)
(
sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (service)
/
sum(rate(http_requests_total[1h])) by (service)
) > (1 - 0.999)
)
labels:
severity: critical
team: api-platform
alert_type: slo_burn
annotations:
summary: "SLO burn rate critically high for {{ $labels.service }}"
description: "{{ $labels.service }} is burning error budget 14.4x faster than normal"
# Resource alerts
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{container!="POD",container!=""}
/
container_spec_memory_limit_bytes{container!="POD",container!=""}
) > 0.9
for: 5m
labels:
severity: warning
team: api-platform
annotations:
summary: "High memory usage for {{ $labels.container }}"
description: "Container {{ $labels.container }} memory usage is {{ $value | humanizePercentage }}"
# AlertManager configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: warning
receiver: 'slack-warnings'
- match:
team: payments
receiver: 'payments-team'
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
labels: '{{ .CommonLabels }}'
- name: 'slack-warnings'
slack_configs:
- channel: '#warnings'
send_resolved: true
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ .CommonAnnotations.description }}'
actions:
- type: button
text: 'View Dashboard'
url: '{{ .CommonAnnotations.dashboard_url }}'
- type: button
text: 'View Runbook'
url: '{{ .CommonAnnotations.runbook_url }}'
- name: 'payments-team'
email_configs:
- to: 'payments-team@example.com'
from: 'alerts@example.com'
headers:
Subject: 'Payment Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'service']
Example 5: OpenTelemetry Integration for Distributed Tracing
// tracing/setup.js - OpenTelemetry configuration
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
class TracingSetup {
constructor(serviceName, environment = 'production') {
this.serviceName = serviceName;
this.environment = environment;
this.sdk = null;
}
initialize() {
// Create resource identifying the service
const resource = Resource.default().merge(
new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: this.serviceName,
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: this.environment,
'service.namespace': 'api-platform',
'service.instance.id': process.env.HOSTNAME || 'unknown',
'telemetry.sdk.language': 'nodejs',
})
);
// Configure Jaeger exporter for traces
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
tags: {
service: this.serviceName,
environment: this.environment
}
});
// Configure Prometheus exporter for metrics
const prometheusExporter = new PrometheusExporter({
port: 9464,
endpoint: '/metrics',
prefix: 'otel_',
appendTimestamp: true,
}, () => {
console.log('Prometheus metrics server started on port 9464');
});
// Create SDK with auto-instrumentation
this.sdk = new NodeSDK({
resource,
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': {
enabled: false, // Disable fs to reduce noise
},
'@opentelemetry/instrumentation-http': {
requestHook: (span, request) => {
span.setAttribute('http.request.body', JSON.stringify(request.body));
span.setAttribute('http.request.user_id', request.user?.id);
},
responseHook: (span, response) => {
span.setAttribute('http.response.size', response.length);
},
ignoreIncomingPaths: ['/health', '/metrics', '/favicon.ico'],
ignoreOutgoingUrls: [(url) => url.includes('prometheus')]
},
'@opentelemetry/instrumentation-express': {
requestHook: (span, request) => {
span.setAttribute('express.route', request.route?.path);
span.setAttribute('express.params', JSON.stringify(request.params));
}
},
'@opentelemetry/instrumentation-mysql2': {
enhancedDatabaseReporting: true,
},
'@opentelemetry/instrumentation-redis-4': {
dbStatementSerializer: (cmdName, cmdArgs) => {
return `${cmdName} ${cmdArgs.slice(0, 2).join(' ')}`;
}
}
})
],
spanProcessor: new BatchSpanProcessor(jaegerExporter, {
maxQueueSize: 2048,
maxExportBatchSize: 512,
scheduledDelayMillis: 5000,
exportTimeoutMillis: 30000,
}),
    // PrometheusExporter is itself a pull-based MetricReader, so pass it directly
    metricReader: prometheusExporter,
});
// Start the SDK
this.sdk.start()
.then(() => console.log('Tracing initialized successfully'))
.catch((error) => console.error('Error initializing tracing', error));
// Graceful shutdown
process.on('SIGTERM', () => {
this.shutdown();
});
}
async shutdown() {
try {
await this.sdk.shutdown();
console.log('Tracing terminated successfully');
} catch (error) {
console.error('Error terminating tracing', error);
}
}
// Manual span creation for custom instrumentation
createSpan(tracer, spanName, fn) {
return tracer.startActiveSpan(spanName, async (span) => {
try {
span.setAttribute('span.kind', 'internal');
span.setAttribute('custom.span', true);
const result = await fn(span);
        span.setStatus({ code: 1, message: 'OK' }); // 1 = SpanStatusCode.OK
return result;
} catch (error) {
span.setStatus({ code: 2, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
}
// Usage in application
const tracing = new TracingSetup('api-gateway', process.env.NODE_ENV);
tracing.initialize();
// Custom instrumentation example
const { trace } = require('@opentelemetry/api');
async function processOrder(orderId) {
const tracer = trace.getTracer('order-processing', '1.0.0');
return tracing.createSpan(tracer, 'processOrder', async (span) => {
span.setAttribute('order.id', orderId);
span.addEvent('Order processing started');
// Validate order
await tracing.createSpan(tracer, 'validateOrder', async (childSpan) => {
childSpan.setAttribute('validation.type', 'schema');
// Validation logic
await validateOrderSchema(orderId);
});
// Process payment
await tracing.createSpan(tracer, 'processPayment', async (childSpan) => {
childSpan.setAttribute('payment.method', 'stripe');
// Payment logic
const result = await processStripePayment(orderId);
childSpan.setAttribute('payment.status', result.status);
childSpan.addEvent('Payment processed', {
'payment.amount': result.amount,
'payment.currency': result.currency
});
});
// Send confirmation
await tracing.createSpan(tracer, 'sendConfirmation', async (childSpan) => {
childSpan.setAttribute('notification.type', 'email');
// Email logic
await sendOrderConfirmation(orderId);
});
span.addEvent('Order processing completed');
return { success: true, orderId };
});
}
module.exports = { TracingSetup, tracing };
Example 6: Custom Prometheus Exporters for Complex Metrics
# custom_exporters.py - Python Prometheus exporter for business metrics
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Enum
from prometheus_client.core import CollectorRegistry
import asyncio
import time
from datetime import datetime
import psycopg2
import redis
import requests
class CustomBusinessExporter:
def __init__(self, db_config, redis_config, port=9091):
self.registry = CollectorRegistry()
self.db_config = db_config
self.redis_config = redis_config
self.port = port
# Initialize metrics
self.initialize_metrics()
# Connect to data sources
self.connect_datasources()
def initialize_metrics(self):
# Business KPI metrics
self.revenue_total = Gauge(
'business_revenue_total_usd',
'Total revenue in USD',
['period', 'product_line', 'region'],
registry=self.registry
)
self.customer_lifetime_value = Histogram(
'business_customer_lifetime_value_usd',
'Customer lifetime value distribution',
['customer_segment', 'acquisition_channel'],
buckets=(10, 50, 100, 500, 1000, 5000, 10000, 50000),
registry=self.registry
)
self.churn_rate = Gauge(
'business_churn_rate_percentage',
'Customer churn rate',
['plan', 'cohort'],
registry=self.registry
)
self.monthly_recurring_revenue = Gauge(
'business_mrr_usd',
'Monthly recurring revenue',
['plan', 'currency'],
registry=self.registry
)
self.net_promoter_score = Gauge(
'business_nps',
'Net Promoter Score',
['segment', 'survey_type'],
registry=self.registry
)
# Operational metrics
self.data_pipeline_lag = Histogram(
'data_pipeline_lag_seconds',
'Data pipeline processing lag',
['pipeline', 'stage'],
buckets=(1, 5, 10, 30, 60, 300, 600, 1800, 3600),
registry=self.registry
)
self.feature_usage = Counter(
'feature_usage_total',
'Feature usage counts',
['feature_name', 'user_tier', 'success'],
registry=self.registry
)
self.api_quota_usage = Gauge(
'api_quota_usage_percentage',
'API quota usage by customer',
['customer_id', 'tier', 'resource'],
registry=self.registry
)
# System health indicators
self.dependency_health = Enum(
'dependency_health_status',
'Health status of external dependencies',
['service', 'dependency'],
states=['healthy', 'degraded', 'unhealthy'],
registry=self.registry
)
self.data_quality_score = Gauge(
'data_quality_score',
'Data quality score (0-100)',
['dataset', 'dimension'],
registry=self.registry
)
def connect_datasources(self):
# PostgreSQL connection
self.db_conn = psycopg2.connect(**self.db_config)
# Redis connection
self.redis_client = redis.Redis(**self.redis_config)
def collect_business_metrics(self):
"""Collect business metrics from various data sources"""
cursor = self.db_conn.cursor()
# Revenue metrics
cursor.execute("""
SELECT
DATE_TRUNC('day', created_at) as period,
product_line,
region,
SUM(amount) as total_revenue
FROM orders
WHERE status = 'completed'
AND created_at >= NOW() - INTERVAL '7 days'
GROUP BY period, product_line, region
""")
for row in cursor.fetchall():
self.revenue_total.labels(
period=row[0].isoformat(),
product_line=row[1],
region=row[2]
).set(row[3])
# Customer lifetime value
cursor.execute("""
SELECT
c.segment,
c.acquisition_channel,
AVG(o.total_spent) as avg_clv
FROM customers c
JOIN (
SELECT customer_id, SUM(amount) as total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
) o ON c.id = o.customer_id
GROUP BY c.segment, c.acquisition_channel
""")
for row in cursor.fetchall():
self.customer_lifetime_value.labels(
customer_segment=row[0],
acquisition_channel=row[1]
).observe(row[2])
# MRR calculation
cursor.execute("""
SELECT
plan_name,
currency,
SUM(
CASE
WHEN billing_period = 'yearly' THEN amount / 12
ELSE amount
END
) as mrr
FROM subscriptions
WHERE status = 'active'
GROUP BY plan_name, currency
""")
for row in cursor.fetchall():
self.monthly_recurring_revenue.labels(
plan=row[0],
currency=row[1]
).set(row[2])
# Churn rate
cursor.execute("""
WITH cohort_data AS (
SELECT
plan_name,
DATE_TRUNC('month', created_at) as cohort,
COUNT(*) as total_customers,
COUNT(CASE WHEN status = 'cancelled' THEN 1 END) as churned_customers
FROM subscriptions
WHERE created_at >= NOW() - INTERVAL '6 months'
GROUP BY plan_name, cohort
)
SELECT
plan_name,
cohort,
(churned_customers::float / total_customers) * 100 as churn_rate
FROM cohort_data
""")
for row in cursor.fetchall():
self.churn_rate.labels(
plan=row[0],
cohort=row[1].isoformat()
).set(row[2])
cursor.close()
def collect_operational_metrics(self):
"""Collect operational metrics from Redis and other sources"""
# API quota usage from Redis
for key in self.redis_client.scan_iter("quota:*"):
            # Keys are str when decode_responses=True and bytes otherwise; handle both
            parts = (key.decode() if isinstance(key, bytes) else key).split(':')
if len(parts) >= 3:
customer_id = parts[1]
resource = parts[2]
used = float(self.redis_client.get(key) or 0)
limit_key = f"quota_limit:{customer_id}:{resource}"
limit = float(self.redis_client.get(limit_key) or 1000)
usage_percentage = (used / limit) * 100 if limit > 0 else 0
# Get customer tier from database
cursor = self.db_conn.cursor()
cursor.execute(
"SELECT tier FROM customers WHERE id = %s",
(customer_id,)
)
result = cursor.fetchone()
tier = result[0] if result else 'unknown'
cursor.close()
self.api_quota_usage.labels(
customer_id=customer_id,
tier=tier,
resource=resource
).set(usage_percentage)
# Data pipeline lag from Redis
pipeline_stages = ['ingestion', 'processing', 'storage', 'delivery']
for stage in pipeline_stages:
lag_key = f"pipeline:lag:{stage}"
lag_value = self.redis_client.get(lag_key)
if lag_value:
self.data_pipeline_lag.labels(
pipeline='main',
stage=stage
).observe(float(lag_value))
def check_dependency_health(self):
"""Check health of external dependencies"""
dependencies = [
('payment', 'stripe', 'https://api.stripe.com/health'),
('email', 'sendgrid', 'https://api.sendgrid.com/health'),
('storage', 's3', 'https://s3.amazonaws.com/health'),
('cache', 'redis', 'redis://localhost:6379'),
('database', 'postgres', self.db_config)
]
for service, dep_name, endpoint in dependencies:
try:
if dep_name == 'redis':
# Check Redis
self.redis_client.ping()
status = 'healthy'
elif dep_name == 'postgres':
# Check PostgreSQL
cursor = self.db_conn.cursor()
cursor.execute("SELECT 1")
cursor.close()
status = 'healthy'
else:
# Check HTTP endpoints
response = requests.get(endpoint, timeout=5)
if response.status_code == 200:
status = 'healthy'
elif 200 < response.status_code < 500:
status = 'degraded'
else:
status = 'unhealthy'
except Exception as e:
print(f"Health check failed for {dep_name}: {e}")
status = 'unhealthy'
self.dependency_health.labels(
service=service,
dependency=dep_name
).state(status)
def calculate_data_quality(self):
"""Calculate data quality scores"""
cursor = self.db_conn.cursor()
# Completeness score
cursor.execute("""
SELECT
'orders' as dataset,
(COUNT(*) - COUNT(CASE WHEN customer_email IS NULL THEN 1 END))::float / COUNT(*) * 100 as completeness
FROM orders
WHERE created_at >= NOW() - INTERVAL '1 day'
""")
for row in cursor.fetchall():
self.data_quality_score.labels(
dataset=row[0],
dimension='completeness'
).set(row[1])
# Accuracy score (checking for valid email formats)
cursor.execute("""
SELECT
'customers' as dataset,
COUNT(CASE WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$' THEN 1 END)::float / COUNT(*) * 100 as accuracy
FROM customers
WHERE created_at >= NOW() - INTERVAL '1 day'
""")
for row in cursor.fetchall():
self.data_quality_score.labels(
dataset=row[0],
dimension='accuracy'
).set(row[1])
cursor.close()
    async def collect_metrics_async(self):
        """Run the synchronous collectors concurrently in worker threads (Python 3.9+)"""
        await asyncio.gather(
            asyncio.to_thread(self.collect_business_metrics),
            asyncio.to_thread(self.collect_operational_metrics),
            asyncio.to_thread(self.check_dependency_health),
            asyncio.to_thread(self.calculate_data_quality)
        )
def run(self):
"""Start the exporter"""
# Start HTTP server for Prometheus to scrape
start_http_server(self.port, registry=self.registry)
print(f"Custom exporter started on port {self.port}")
# Collect metrics every 30 seconds
while True:
try:
self.collect_business_metrics()
self.collect_operational_metrics()
self.check_dependency_health()
self.calculate_data_quality()
print(f"Metrics collected at {datetime.now()}")
time.sleep(30)
except Exception as e:
print(f"Error collecting metrics: {e}")
time.sleep(30)
# Usage
if __name__ == "__main__":
db_config = {
'host': 'localhost',
'database': 'production',
'user': 'metrics_user',
'password': 'secure_password',
'port': 5432
}
redis_config = {
'host': 'localhost',
'port': 6379,
'db': 0,
'decode_responses': True
}
exporter = CustomBusinessExporter(db_config, redis_config)
exporter.run()
Error Handling
| Error | Cause | Solution |
|---|---|---|
| "Connection refused to Prometheus" | Prometheus not running or wrong port | Check Docker container status with docker ps, verify port mapping |
| "No data in Grafana dashboard" | Metrics not being scraped | Verify Prometheus targets at localhost:9090/targets, check API metrics endpoint |
| "Too many samples" error | High cardinality labels | Review label usage, avoid user IDs or timestamps as labels |
| "Out of memory" in Prometheus | Retention too long or too many metrics | Reduce retention time, implement remote storage, or scale vertically |
| Jaeger traces not appearing | Incorrect sampling rate | Increase sampling rate in tracer configuration |
| "Context deadline exceeded" | Scrape timeout too short | Increase scrape_timeout in prometheus.yml (default 10s) |
| "Error reading Prometheus" | Corrupt WAL (write-ahead log) | Delete WAL directory: rm -rf /prometheus/wal/* and restart |
| "Too many open files" | File descriptor limit reached | Increase ulimit: ulimit -n 65536 or adjust systemd limits |
| AlertManager not firing | Incorrect routing rules | Validate routing tree with amtool config routes |
| Grafana login loop | Cookie/session issues | Clear browser cookies, check Grafana cookie settings |
Configuration Options
Basic Usage:
/create-monitoring \
--stack=prometheus \
--services=api-gateway,user-service,order-service \
--environment=production \
--retention=30d
Available Options:
--stack <type> - Monitoring stack to deploy
- prometheus - Prometheus + Grafana + AlertManager (default, open-source)
- elastic - ELK stack (Elasticsearch, Logstash, Kibana) for log-centric setups
- datadog - Datadog agent configuration (requires API key)
- newrelic - New Relic agent setup (requires license key)
- hybrid - Combination of metrics (Prometheus) and logs (ELK)
--tracing <backend> - Distributed tracing backend
- jaeger - Jaeger all-in-one (default, recommended to start with)
- zipkin - Zipkin server
- tempo - Grafana Tempo (for high scale)
- xray - AWS X-Ray (for AWS environments)
- none - Skip tracing setup
--retention <duration> - Metrics retention period
- Default: 15d (15 days)
- Production: 30d to 90d
- With remote storage: 365d or more
--scrape-interval <duration> - How often to collect metrics
- Default: 15s
- High-frequency: 5s (higher resource usage)
- Low-frequency: 60s (for stable metrics)
--alerting-channels <channels> - Where to send alerts
- slack - Slack webhook integration
- pagerduty - PagerDuty integration
- email - SMTP email notifications
- webhook - Custom webhook endpoint
- opsgenie - Atlassian OpsGenie
--dashboard-presets <presets> - Pre-built dashboards to install
- red-metrics - Rate, Errors, Duration
- four-golden - Latency, Traffic, Errors, Saturation
- business-kpis - Revenue, Users, Conversion
- sre-slos - SLI/SLO tracking
- security - Security metrics and anomalies
--exporters <list> - Additional exporters to configure
- node-exporter - System/host metrics
- blackbox-exporter - Probe endpoints
- postgres-exporter - PostgreSQL metrics
- redis-exporter - Redis metrics
- custom - Custom business metrics
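As an illustration of the blackbox-exporter option above, the generated scrape job follows the standard probe/relabel pattern; the module name, target URL, and exporter address are assumptions:
# prometheus.yml fragment - blackbox-exporter probe job (illustrative)
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115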
--high-availability - Enable HA configuration
- Sets up Prometheus federation (see the federation sketch below)
- Configures AlertManager clustering
- Enables Grafana database replication
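For the federation part of the HA option, a scrape job of roughly this shape is generated; the upstream Prometheus hostnames and match selectors are assumptions:
# prometheus.yml fragment - federate from per-region Prometheus servers (illustrative)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="api-services"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - prometheus-us-east:9090
          - prometheus-eu-west:9090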
--storage <type> - Long-term storage backend
- local - Local disk (default)
- thanos - Thanos for unlimited retention
- cortex - Cortex for multi-tenancy
- victoria - VictoriaMetrics for efficiency
- s3 - S3-compatible object storage
--dry-run - Generate configuration without deploying
- Creates all config files
- Validates syntax
- Shows what would be deployed
- No actual containers started
Best Practices
DO:
- Start with RED metrics (Rate, Errors, Duration) as your foundation
- Use histogram buckets that align with your SLO targets
- Tag metrics with environment, region, version, and service
- Create runbooks for every alert and link them in annotations
- Implement meta-monitoring (monitor the monitoring system); see the sketch after this list
- Use recording rules for frequently-run expensive queries
- Set up separate dashboards for different audiences (ops, dev, business)
- Use exemplars to link metrics to traces for easier debugging
- Implement gradual rollout of new metrics to avoid cardinality explosion
- Archive old dashboards before creating new ones
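For the meta-monitoring item above, a minimal sketch scrapes the monitoring components themselves and alerts when one stops reporting; the job name and targets are assumptions that mirror the Docker Compose example:
# prometheus.yml fragment - scrape the monitoring stack itself (illustrative)
scrape_configs:
  - job_name: 'monitoring-stack'
    static_configs:
      - targets: ['localhost:9090', 'alertmanager:9093', 'grafana:3000']
# alerting-rules.yml fragment
groups:
  - name: meta_monitoring
    rules:
      - alert: MonitoringComponentDown
        expr: up{job="monitoring-stack"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Monitoring component {{ $labels.instance }} is not being scraped"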
DON'T:
- Add high-cardinality labels like user IDs, session IDs, or UUIDs
- Create dashboards with 50+ panels (causes browser performance issues)
- Alert on symptoms without providing actionable runbooks
- Store raw logs in Prometheus (use log aggregation systems)
- Ignore alert fatigue (regularly review and tune thresholds)
- Hardcode datasource UIDs in dashboard JSON
- Mix metrics from different time ranges in one panel
- Use regex selectors without limits in production queries
- Forget to set up backup for Grafana database
- Skip capacity planning for metrics growth
TIPS:
- Import dashboards from grafana.com marketplace (dashboard IDs)
- Use Prometheus federation for multi-region deployments
- Implement progressive alerting: warning (Slack) → critical (PagerDuty)
- Create team-specific folders in Grafana for organization
- Use Grafana variables for dynamic, reusable dashboards
- Set up dashboard playlists for NOC/SOC displays
- Use annotations to mark deployments and incidents on graphs
- Implement SLO burn rate alerts instead of static thresholds
- Create separate Prometheus jobs for different scrape intervals
- Use remote_write for backup and long-term storage
Performance Considerations
Prometheus Resource Planning
Memory Required =
(number_of_time_series * 2KB) + # Active series
(ingestion_rate * 2 * retention_hours) + # WAL and blocks
(2GB) # Base overhead
CPU Cores Required =
(ingestion_rate / 100,000) + # Ingestion processing
(query_rate / 10) + # Query processing
(1) # Base overhead
Disk IOPS Required =
(ingestion_rate / 1000) + # Write IOPS
(query_rate * 100) + # Read IOPS
(100) # Background compaction
Optimization Strategies
- Reduce cardinality: Audit and remove unnecessary labels
- Use recording rules: Pre-compute expensive queries
- Optimize scrape configs: Different intervals for different metrics
- Implement downsampling: For long-term storage
- Horizontal sharding: Separate Prometheus per service/team
- Remote storage: Offload old data to object storage (see the remote_write sketch after this list)
- Query caching: Use Trickster or built-in Grafana caching
- Metric relabeling: Drop unwanted metrics at scrape time
- Federation: Aggregate metrics hierarchically
- Capacity limits: Set max_samples_per_send and queue sizes
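The remote-storage and capacity-limit items above usually reduce to a remote_write block; a minimal sketch with an assumed endpoint and conservative queue settings:
# prometheus.yml fragment - remote_write with queue limits (illustrative)
remote_write:
  - url: https://metrics-store.example.com/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    write_relabel_configs:
      # Drop runtime metrics that are not worth long-term storage
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop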
Scaling Thresholds
- < 1M active series: Single Prometheus instance
- 1M - 10M series: Prometheus with remote storage
- 10M - 100M series: Sharded Prometheus or Cortex
- Over 100M series: Thanos or multi-region Cortex
Security Considerations
Authentication & Authorization
# prometheus.yml with basic auth
scrape_configs:
- job_name: 'secured-api'
basic_auth:
username: 'prometheus'
password_file: '/etc/prometheus/password.txt'
scheme: https
tls_config:
ca_file: '/etc/prometheus/ca.crt'
cert_file: '/etc/prometheus/cert.crt'
key_file: '/etc/prometheus/key.pem'
insecure_skip_verify: false
Network Security
- Deploy monitoring stack in isolated subnet
- Use internal load balancers for Prometheus federation
- Implement mTLS between Prometheus and targets
- Restrict metrics endpoints to monitoring CIDR blocks
- Use VPN or private links for cross-region federation
Data Security
- Encrypt data at rest (filesystem encryption)
- Sanitize metrics to avoid leaking sensitive data
- Implement audit logging for all access
- Regular security scanning of monitoring infrastructure
- Rotate credentials and certificates regularly
Compliance Considerations
- GDPR: Avoid collecting PII in metrics labels (see the relabeling sketch after this list)
- HIPAA: Encrypt all health-related metrics
- PCI DSS: Separate payment metrics into isolated stack
- SOC 2: Maintain audit trails and access logs
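One concrete way to enforce the PII and sanitization guidance above is to strip identifying labels at scrape time; the label names below are illustrative assumptions:
# prometheus.yml fragment - drop PII-bearing labels before ingestion (illustrative)
scrape_configs:
  - job_name: 'api-services'
    static_configs:
      - targets: ['api-gateway:8080']
    metric_relabel_configs:
      - regex: 'user_email|user_id|session_id'
        action: labeldrop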
Troubleshooting Guide
Issue: Prometheus consuming too much memory
# 1. Check current memory usage and series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName' | head -20
# 2. Find high cardinality metrics
curl -g 'http://localhost:9090/api/v1/query?query=count(count+by(__name__)({__name__=~".+"}))' | jq
# 3. Identify problematic labels
curl -s http://localhost:9090/api/v1/label/userId/values | jq '. | length'
# 4. Drop high-cardinality metrics
# Add to prometheus.yml:
metric_relabel_configs:
- source_labels: [__name__]
regex: 'problematic_metric_.*'
action: drop
Issue: Grafana dashboards loading slowly
# 1. Check query performance
curl -s 'http://localhost:9090/api/v1/query_log' | jq '.data[] | select(.duration_seconds > 1)'
# 2. Analyze slow queries in Grafana
SELECT
dashboard_id,
panel_id,
AVG(duration) as avg_duration,
query
FROM grafana.query_history
WHERE duration > 1000
GROUP BY dashboard_id, panel_id, query
ORDER BY avg_duration DESC;
# 3. Optimize with recording rules
# Add to recording_rules.yml:
groups:
- name: dashboard_queries
interval: 30s
rules:
- record: api:request_rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
Issue: Alerts not firing
# 1. Check alert state
curl http://localhost:9090/api/v1/alerts | jq
# 2. Validate AlertManager config
docker exec alertmanager amtool config routes
# 3. Test alert routing
docker exec alertmanager amtool config routes test \
--config.file=/etc/alertmanager/alertmanager.yml \
--verify.receivers=pagerduty-critical \
severity=critical service=api
# 4. Check for inhibition rules
curl http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.inhibitedBy != [])'
Issue: Missing traces in Jaeger
// 1. Verify sampling rate and confirm spans are being reported (jaeger-client)
const { initTracer } = require('jaeger-client');
const tracer = initTracer({
  serviceName: 'api-gateway',
  sampler: {
    type: 'const', // sample everything while debugging
    param: 1,
  },
  reporter: {
    logSpans: true, // 2. Log each reported span via the provided logger
  },
}, { logger: console });
// 3. Query the Jaeger Query API (UI port) for recent traces from the service
curl "http://localhost:16686/api/traces?service=api-gateway"
Migration Guide
From CloudWatch to Prometheus:
# Migration script example
import boto3
from datetime import datetime, timedelta
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
def migrate_cloudwatch_to_prometheus():
# Read from CloudWatch
cw = boto3.client('cloudwatch')
metrics = cw.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
StartTime=datetime.now() - timedelta(hours=1),
EndTime=datetime.now(),
Period=300,
Statistics=['Average']
)
# Write to Prometheus
registry = CollectorRegistry()
g = Gauge('aws_ec2_cpu_utilization', 'EC2 CPU Usage',
['instance_id'], registry=registry)
for datapoint in metrics['Datapoints']:
g.labels(instance_id='i-1234567890abcdef0').set(datapoint['Average'])
push_to_gateway('localhost:9091', job='cloudwatch_migration', registry=registry)
From Datadog to Prometheus:
- Export Datadog dashboards as JSON
- Convert queries using query translator
- Import to Grafana with dashboard converter
- Map Datadog tags to Prometheus labels
- Recreate alerts in AlertManager format
Related Commands
- /api-load-tester - Generate test traffic to validate monitoring setup
- /api-security-scanner - Security testing with metrics integration
- /add-rate-limiting - Rate limiting with metrics exposure
- /api-contract-generator - Generate OpenAPI specs with metrics annotations
- /deployment-pipeline-orchestrator - CI/CD with monitoring integration
- /api-versioning-manager - Version-aware metrics tracking
Advanced Topics
Multi-Cluster Monitoring with Thanos:
# thanos-sidecar.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: thanos-config
data:
object-store.yaml: |
type: S3
config:
bucket: metrics-long-term
endpoint: s3.amazonaws.com
access_key: ${AWS_ACCESS_KEY}
secret_key: ${AWS_SECRET_KEY}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-thanos
spec:
template:
spec:
containers:
- name: prometheus
args:
- --storage.tsdb.retention.time=2h
- --storage.tsdb.min-block-duration=2h
- --storage.tsdb.max-block-duration=2h
- --web.enable-lifecycle
- name: thanos-sidecar
image: quay.io/thanos/thanos:v0.31.0
args:
- sidecar
- --prometheus.url=http://localhost:9090
- --objstore.config-file=/etc/thanos/object-store.yaml
Service Mesh Observability (Istio):
# Automatic metrics from Istio
telemetry:
v2:
prometheus:
providers:
- name: prometheus
configOverride:
inboundSidecar:
disable_host_header_fallback: false
metric_expiry_duration: 10m
outboundSidecar:
disable_host_header_fallback: false
metric_expiry_duration: 10m
gateway:
disable_host_header_fallback: true
Version History
- v1.0.0 (2024-01): Initial Prometheus + Grafana implementation
- v1.1.0 (2024-03): Added Jaeger tracing integration
- v1.2.0 (2024-05): Thanos long-term storage support
- v1.3.0 (2024-07): OpenTelemetry collector integration
- v1.4.0 (2024-09): Multi-cluster federation support
- v1.5.0 (2024-10): Custom business metrics exporters
- Planned v2.0.0: eBPF-based zero-instrumentation monitoring