| description | shortcut |
|---|---|
| Create API monitoring dashboard | monitor |
Create API Monitoring Dashboard
Build comprehensive monitoring infrastructure with metrics, logs, traces, and alerts for full API observability.
When to Use This Command
Use /create-monitoring when you need to:
- Establish observability for production APIs
- Track RED metrics (Rate, Errors, Duration) across services
- Set up real-time alerting for SLO violations
- Debug performance issues with distributed tracing
- Create executive dashboards for API health
- Implement SRE practices with data-driven insights
DON'T use this when:
- Building proof-of-concept applications (use lightweight logging instead)
- Monitoring non-critical internal tools (basic health checks may suffice)
- Resources are extremely constrained (consider managed solutions like Datadog first)
Design Decisions
This command implements a Prometheus + Grafana stack as the primary approach because:
- Open-source with no vendor lock-in
- Industry-standard metric format with wide ecosystem support
- Powerful query language (PromQL) for complex analysis
- Horizontal scalability via federation and remote storage
Alternative considered: ELK Stack (Elasticsearch, Logstash, Kibana)
- Better for log-centric analysis
- Higher resource requirements
- More complex operational overhead
- Recommended when logs are primary data source
Alternative considered: Managed solutions (Datadog, New Relic)
- Faster time-to-value
- Higher ongoing cost
- Less customization flexibility
- Recommended for teams without dedicated DevOps
Prerequisites
Before running this command:
- Docker and Docker Compose installed
- API instrumented with metrics endpoints (Prometheus format)
- Basic understanding of PromQL query language
- Network access for inter-service communication
- Sufficient disk space for time-series data (plan for 2-4 weeks retention)
Implementation Process
Step 1: Configure Prometheus
Set up Prometheus to scrape metrics from your API endpoints with service discovery.
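The generated prometheus.yml follows this general shape; the job names, ports, and discovery mechanism below are illustrative assumptions rather than the exact output:
# prometheus.yml - minimal scrape configuration (illustrative)
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/alerting-rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  # Statically listed API services exposing /metrics in Prometheus format
  - job_name: 'api-services'
    metrics_path: /metrics
    static_configs:
      - targets: ['api-gateway:8080', 'user-service:8080', 'order-service:8080']
        labels:
          environment: 'production'
  # Optional Docker service discovery: keep only containers labeled for scraping
  - job_name: 'docker-discovered'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep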
Step 2: Create Grafana Dashboards
Build visualizations for RED metrics, custom business metrics, and SLO tracking.
Step 3: Implement Distributed Tracing
Integrate Jaeger for end-to-end request tracing across microservices.
Step 4: Configure Alerting
Set up AlertManager rules for critical thresholds with notification channels (Slack, PagerDuty).
Step 5: Deploy Monitoring Stack
Deploy complete observability infrastructure with health checks and backup configurations.
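The deployment step can also add container health checks to the compose file shown in Example 2 below; a minimal fragment, assuming the busybox wget shipped in the Prometheus and AlertManager images:
# docker-compose.yml fragment - health checks to merge into the stack below (illustrative)
services:
  prometheus:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
  alertmanager:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9093/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3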
Output Format
The command generates:
- docker-compose.yml - Complete monitoring stack configuration
- prometheus.yml - Prometheus scrape configuration
- grafana-dashboards/ - Pre-built dashboard JSON files
- alerting-rules.yml - AlertManager rule definitions
- jaeger-config.yml - Distributed tracing configuration
- README.md - Deployment and operation guide
Code Examples
Example 1: Complete Node.js Express API with Comprehensive Monitoring
// metrics/instrumentation.js - Full-featured Prometheus instrumentation
const promClient = require('prom-client');
const { performance } = require('perf_hooks');
const os = require('os');
class MetricsCollector {
constructor() {
// Create separate registries for different metric types
this.register = new promClient.Registry();
this.businessRegister = new promClient.Registry();
// Add default system metrics
promClient.collectDefaultMetrics({
register: this.register,
prefix: 'api_',
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});
// Initialize all metric types
this.initializeMetrics();
this.initializeBusinessMetrics();
this.initializeCustomCollectors();
// Start periodic collectors
this.startPeriodicCollectors();
}
initializeMetrics() {
// RED Metrics (Rate, Errors, Duration)
this.httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code', 'service', 'environment'],
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
this.httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code', 'service', 'environment']
});
this.httpRequestErrors = new promClient.Counter({
name: 'http_request_errors_total',
help: 'Total number of HTTP errors',
labelNames: ['method', 'route', 'error_type', 'service', 'environment']
});
// Database metrics
this.dbQueryDuration = new promClient.Histogram({
name: 'db_query_duration_seconds',
help: 'Database query execution time',
labelNames: ['operation', 'table', 'database', 'status'],
buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});
this.dbConnectionPool = new promClient.Gauge({
name: 'db_connection_pool_size',
help: 'Database connection pool metrics',
labelNames: ['state', 'database'] // states: active, idle, total
});
// Cache metrics
this.cacheHitRate = new promClient.Counter({
name: 'cache_operations_total',
help: 'Cache operation counts',
labelNames: ['operation', 'cache_name', 'status'] // hit, miss, set, delete
});
this.cacheLatency = new promClient.Histogram({
name: 'cache_operation_duration_seconds',
help: 'Cache operation latency',
labelNames: ['operation', 'cache_name'],
buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
});
// External API metrics
this.externalApiCalls = new promClient.Histogram({
name: 'external_api_duration_seconds',
help: 'External API call duration',
labelNames: ['service', 'endpoint', 'status_code'],
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
});
// Circuit breaker metrics
this.circuitBreakerState = new promClient.Gauge({
name: 'circuit_breaker_state',
help: 'Circuit breaker state (0=closed, 1=open, 2=half-open)',
labelNames: ['service']
});
// Rate limiting metrics
this.rateLimitHits = new promClient.Counter({
name: 'rate_limit_hits_total',
help: 'Number of rate limited requests',
labelNames: ['limit_type', 'client_type']
});
    // WebSocket metrics
    this.activeWebsockets = new promClient.Gauge({
      name: 'websocket_connections_active',
      help: 'Number of active WebSocket connections',
      labelNames: ['namespace', 'room']
    });
    // In-flight HTTP requests (created once here so the middleware below does not
    // re-register a metric with the same name on every request)
    this.httpRequestsInFlight = new promClient.Gauge({
      name: 'http_requests_in_flight',
      help: 'Number of in-flight HTTP requests',
      labelNames: ['method', 'route']
    });
    // Register all metrics
    [
      this.httpRequestDuration, this.httpRequestTotal, this.httpRequestErrors,
      this.dbQueryDuration, this.dbConnectionPool, this.cacheHitRate,
      this.cacheLatency, this.externalApiCalls, this.circuitBreakerState,
      this.rateLimitHits, this.activeWebsockets, this.httpRequestsInFlight
    ].forEach(metric => this.register.registerMetric(metric));
}
initializeBusinessMetrics() {
// User activity metrics
this.activeUsers = new promClient.Gauge({
name: 'business_active_users',
help: 'Number of active users in the last 5 minutes',
labelNames: ['user_type', 'plan']
});
this.userSignups = new promClient.Counter({
name: 'business_user_signups_total',
help: 'Total user signups',
labelNames: ['source', 'plan', 'country']
});
// Transaction metrics
this.transactionAmount = new promClient.Histogram({
name: 'business_transaction_amount_dollars',
help: 'Transaction amounts in dollars',
labelNames: ['type', 'status', 'payment_method'],
buckets: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000]
});
this.orderProcessingTime = new promClient.Histogram({
name: 'business_order_processing_seconds',
help: 'Time to process orders end-to-end',
labelNames: ['order_type', 'fulfillment_type'],
buckets: [10, 30, 60, 180, 300, 600, 1800, 3600]
});
// API usage metrics
this.apiUsageByClient = new promClient.Counter({
name: 'business_api_usage_by_client',
help: 'API usage segmented by client',
labelNames: ['client_id', 'tier', 'endpoint']
});
this.apiQuotaRemaining = new promClient.Gauge({
name: 'business_api_quota_remaining',
help: 'Remaining API quota for clients',
labelNames: ['client_id', 'tier', 'quota_type']
});
// Revenue metrics
this.revenueByProduct = new promClient.Counter({
name: 'business_revenue_by_product_cents',
help: 'Revenue by product in cents',
labelNames: ['product_id', 'product_category', 'currency']
});
// Register business metrics
[
this.activeUsers, this.userSignups, this.transactionAmount,
this.orderProcessingTime, this.apiUsageByClient, this.apiQuotaRemaining,
this.revenueByProduct
].forEach(metric => this.businessRegister.registerMetric(metric));
}
initializeCustomCollectors() {
// SLI/SLO metrics
this.sloCompliance = new promClient.Gauge({
name: 'slo_compliance_percentage',
help: 'SLO compliance percentage',
labelNames: ['slo_name', 'service', 'window']
});
this.errorBudgetRemaining = new promClient.Gauge({
name: 'error_budget_remaining_percentage',
help: 'Remaining error budget percentage',
labelNames: ['service', 'slo_type']
});
this.register.registerMetric(this.sloCompliance);
this.register.registerMetric(this.errorBudgetRemaining);
}
startPeriodicCollectors() {
// Update active users every 30 seconds
setInterval(() => {
const activeUserCount = this.calculateActiveUsers();
this.activeUsers.set(
{ user_type: 'registered', plan: 'free' },
activeUserCount.free
);
this.activeUsers.set(
{ user_type: 'registered', plan: 'premium' },
activeUserCount.premium
);
}, 30000);
// Update SLO compliance every minute
setInterval(() => {
this.updateSLOCompliance();
}, 60000);
// Database pool monitoring
setInterval(() => {
this.updateDatabasePoolMetrics();
}, 15000);
}
// Middleware for HTTP metrics
httpMetricsMiddleware() {
return (req, res, next) => {
const start = performance.now();
const route = req.route?.path || req.path || 'unknown';
      // Track in-flight requests using the gauge created in initializeMetrics()
      this.httpRequestsInFlight.inc({ method: req.method, route });
res.on('finish', () => {
const duration = (performance.now() - start) / 1000;
const labels = {
method: req.method,
route,
status_code: res.statusCode,
service: process.env.SERVICE_NAME || 'api',
environment: process.env.NODE_ENV || 'development'
};
// Record metrics
this.httpRequestDuration.observe(labels, duration);
this.httpRequestTotal.inc(labels);
if (res.statusCode >= 400) {
const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
this.httpRequestErrors.inc({
...labels,
error_type: errorType
});
}
        this.httpRequestsInFlight.dec({ method: req.method, route });
// Log slow requests
if (duration > 1) {
console.warn('Slow request detected:', {
...labels,
duration,
user: req.user?.id,
ip: req.ip
});
}
});
next();
};
}
// Database query instrumentation
instrumentDatabase(knex) {
knex.on('query', (query) => {
query.__startTime = performance.now();
});
knex.on('query-response', (response, query) => {
const duration = (performance.now() - query.__startTime) / 1000;
const table = this.extractTableName(query.sql);
this.dbQueryDuration.observe({
operation: query.method || 'select',
table,
database: process.env.DB_NAME || 'default',
status: 'success'
}, duration);
});
knex.on('query-error', (error, query) => {
const duration = (performance.now() - query.__startTime) / 1000;
const table = this.extractTableName(query.sql);
this.dbQueryDuration.observe({
operation: query.method || 'select',
table,
database: process.env.DB_NAME || 'default',
status: 'error'
}, duration);
});
}
// Cache instrumentation wrapper
wrapCache(cache) {
const wrapper = {};
const methods = ['get', 'set', 'delete', 'has'];
methods.forEach(method => {
wrapper[method] = async (...args) => {
const start = performance.now();
const cacheName = cache.name || 'default';
try {
const result = await cache[method](...args);
const duration = (performance.now() - start) / 1000;
// Record cache metrics
if (method === 'get') {
const status = result !== undefined ? 'hit' : 'miss';
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status
});
} else {
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status: 'success'
});
}
this.cacheLatency.observe({
operation: method,
cache_name: cacheName
}, duration);
return result;
} catch (error) {
this.cacheHitRate.inc({
operation: method,
cache_name: cacheName,
status: 'error'
});
throw error;
}
};
});
return wrapper;
}
// External API call instrumentation
async trackExternalCall(serviceName, endpoint, callFunc) {
const start = performance.now();
try {
const result = await callFunc();
const duration = (performance.now() - start) / 1000;
this.externalApiCalls.observe({
service: serviceName,
endpoint,
status_code: result.status || 200
}, duration);
return result;
} catch (error) {
const duration = (performance.now() - start) / 1000;
this.externalApiCalls.observe({
service: serviceName,
endpoint,
status_code: error.response?.status || 0
}, duration);
throw error;
}
}
// Circuit breaker monitoring
updateCircuitBreakerState(service, state) {
const stateValue = {
'closed': 0,
'open': 1,
'half-open': 2
}[state] || 0;
this.circuitBreakerState.set({ service }, stateValue);
}
// Helper methods
calculateActiveUsers() {
// Implementation would query your session store or database
return {
free: Math.floor(Math.random() * 1000),
premium: Math.floor(Math.random() * 100)
};
}
updateSLOCompliance() {
// Calculate based on recent metrics
const availability = 99.95; // Calculate from actual metrics
const latencyP99 = 250; // Calculate from actual metrics
this.sloCompliance.set({
slo_name: 'availability',
service: 'api',
window: '30d'
}, availability);
this.sloCompliance.set({
slo_name: 'latency_p99',
service: 'api',
window: '30d'
}, latencyP99 < 500 ? 100 : 0);
// Update error budget
const errorBudget = 100 - ((100 - availability) / 0.05) * 100;
this.errorBudgetRemaining.set({
service: 'api',
slo_type: 'availability'
}, Math.max(0, errorBudget));
}
updateDatabasePoolMetrics() {
// Get pool stats from your database driver
const pool = global.dbPool; // Your database pool instance
if (pool) {
this.dbConnectionPool.set({
state: 'active',
database: 'primary'
}, pool.numUsed());
this.dbConnectionPool.set({
state: 'idle',
database: 'primary'
}, pool.numFree());
this.dbConnectionPool.set({
state: 'total',
database: 'primary'
}, pool.numUsed() + pool.numFree());
}
}
extractTableName(sql) {
const match = sql.match(/(?:from|into|update)\s+`?(\w+)`?/i);
return match ? match[1] : 'unknown';
}
// Expose metrics endpoint
async getMetrics() {
const baseMetrics = await this.register.metrics();
const businessMetrics = await this.businessRegister.metrics();
return baseMetrics + '\n' + businessMetrics;
}
}
// Express application setup
const express = require('express');
const app = express();
const metricsCollector = new MetricsCollector();
// Apply monitoring middleware
app.use(metricsCollector.httpMetricsMiddleware());
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', metricsCollector.register.contentType);
res.end(await metricsCollector.getMetrics());
});
// Example API endpoint with comprehensive tracking
app.post('/api/orders', async (req, res) => {
const orderStart = performance.now();
try {
// Track business metrics
metricsCollector.transactionAmount.observe({
type: 'purchase',
status: 'pending',
payment_method: req.body.paymentMethod
}, req.body.amount);
// Simulate external payment API call
const paymentResult = await metricsCollector.trackExternalCall(
'stripe',
'/charges',
async () => {
// Your actual payment API call
return await stripeClient.charges.create({
amount: req.body.amount * 100,
currency: 'usd'
});
}
);
// Track order processing time
const processingTime = (performance.now() - orderStart) / 1000;
metricsCollector.orderProcessingTime.observe({
order_type: 'standard',
fulfillment_type: 'digital'
}, processingTime);
// Track revenue
metricsCollector.revenueByProduct.inc({
product_id: req.body.productId,
product_category: req.body.category,
currency: 'USD'
}, req.body.amount * 100);
res.json({ success: true, orderId: paymentResult.id });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
module.exports = { app, metricsCollector };
Example 2: Complete Monitoring Stack with Docker Compose
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.45.0
container_name: prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerting-rules.yml:/etc/prometheus/alerting-rules.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--storage.tsdb.retention.time=15d'
ports:
- "9090:9090"
networks:
- monitoring
grafana:
image: grafana/grafana:10.0.0
container_name: grafana
volumes:
- grafana-data:/var/lib/grafana
- ./grafana-dashboards:/etc/grafana/provisioning/dashboards
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=http://localhost:3000
ports:
- "3000:3000"
networks:
- monitoring
depends_on:
- prometheus
jaeger:
image: jaegertracing/all-in-one:1.47
container_name: jaeger
environment:
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
- COLLECTOR_OTLP_ENABLED=true
ports:
- "5775:5775/udp"
- "6831:6831/udp"
- "6832:6832/udp"
- "5778:5778"
- "16686:16686" # Jaeger UI
- "14268:14268"
- "14250:14250"
- "9411:9411"
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
networks:
- monitoring
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
ports:
- "9093:9093"
networks:
- monitoring
networks:
monitoring:
driver: bridge
volumes:
prometheus-data:
grafana-data:
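The compose file above mounts grafana-datasources.yml into Grafana's provisioning directory but does not show it; a minimal provisioning sketch (the datasource names and the Jaeger entry are assumptions):
# grafana-datasources.yml - Grafana datasource provisioning (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686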
Example 3: Advanced Grafana Dashboard Definitions
// grafana-dashboards/api-overview.json
{
"dashboard": {
"id": null,
"uid": "api-overview",
"title": "API Performance Overview",
"tags": ["api", "performance", "sre"],
"timezone": "browser",
"schemaVersion": 16,
"version": 0,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": {
"value": "Prometheus",
"text": "Prometheus"
}
},
{
"name": "service",
"type": "query",
"datasource": "$datasource",
"query": "label_values(http_requests_total, service)",
"multi": true,
"includeAll": true,
"current": {
"value": ["$__all"],
"text": "All"
},
"refresh": 1
},
{
"name": "environment",
"type": "query",
"datasource": "$datasource",
"query": "label_values(http_requests_total, environment)",
"current": {
"value": "production",
"text": "Production"
}
}
]
},
"panels": [
{
"id": 1,
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
"type": "graph",
"title": "Request Rate (req/s)",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "reqps",
"label": "Requests per second"
}
],
"lines": true,
"linewidth": 2,
"fill": 1,
"fillGradient": 3,
"steppedLine": false,
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"alert": {
"name": "High Request Rate",
"conditions": [
{
"evaluator": {
"params": [10000],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"noDataState": "no_data",
"notifications": [
{
"uid": "slack-channel"
}
]
}
},
{
"id": 2,
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
"type": "graph",
"title": "Error Rate (%)",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\",status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "percent",
"label": "Error Rate",
"max": 10
}
],
"thresholds": [
{
"value": 1,
"op": "gt",
"fill": true,
"line": true,
"colorMode": "critical"
}
],
"alert": {
"name": "High Error Rate",
"conditions": [
{
"evaluator": {
"params": [1],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "last"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"noDataState": "no_data",
"notifications": [
{
"uid": "pagerduty"
}
],
"message": "Error rate is above 1% for service {{service}}"
}
},
{
"id": 3,
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
"type": "graph",
"title": "Response Time (p50, p95, p99)",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p50 {{service}}",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p95 {{service}}",
"refId": "B"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, service))",
"legendFormat": "p99 {{service}}",
"refId": "C"
}
],
"yaxes": [
{
"format": "s",
"label": "Response Time"
}
]
},
{
"id": 4,
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 8 },
"type": "stat",
"title": "Current QPS",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[1m]))",
"instant": true,
"refId": "A"
}
],
"format": "reqps",
"sparkline": {
"show": true,
"lineColor": "rgb(31, 120, 193)",
"fillColor": "rgba(31, 120, 193, 0.18)"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 5000, "color": "yellow" },
{ "value": 10000, "color": "red" }
]
}
},
{
"id": 5,
"gridPos": { "h": 6, "w": 6, "x": 6, "y": 8 },
"type": "stat",
"title": "Error Budget Remaining",
"targets": [
{
"expr": "error_budget_remaining_percentage{service=~\"$service\",slo_type=\"availability\"}",
"instant": true,
"refId": "A"
}
],
"format": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "red" },
{ "value": 25, "color": "orange" },
{ "value": 50, "color": "yellow" },
{ "value": 75, "color": "green" }
]
}
},
{
"id": 6,
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 8 },
"type": "table",
"title": "Top Slow Endpoints",
"targets": [
{
"expr": "topk(10, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le, route)))",
"format": "table",
"instant": true,
"refId": "A"
}
],
"styles": [
{
"alias": "Time",
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"type": "date"
},
{
"alias": "Duration",
"colorMode": "cell",
"colors": ["green", "yellow", "red"],
"thresholds": [0.5, 1],
"type": "number",
"unit": "s"
}
]
}
]
}
}
Example 4: Production-Ready Alerting Rules
# alerting-rules.yml
groups:
- name: api_alerts
interval: 30s
rules:
# SLO-based alerts
- alert: APIHighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service, environment)
/
sum(rate(http_requests_total[5m])) by (service, environment)
) > 0.01
for: 5m
labels:
severity: critical
team: api-platform
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "{{ $labels.service }} in {{ $labels.environment }} has error rate of {{ $value | humanizePercentage }} (threshold: 1%)"
runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
dashboard_url: "https://grafana.example.com/d/api-overview?var-service={{ $labels.service }}"
- alert: APIHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
) > 0.5
for: 10m
labels:
severity: warning
team: api-platform
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P95 latency for {{ $labels.service }} is {{ $value | humanizeDuration }} (threshold: 500ms)"
- alert: APILowAvailability
expr: |
up{job="api-services"} == 0
for: 1m
labels:
severity: critical
team: api-platform
annotations:
summary: "API service {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
# Business metrics alerts
- alert: LowActiveUsers
expr: |
business_active_users{plan="premium"} < 10
for: 30m
labels:
severity: warning
team: product
annotations:
summary: "Low number of active premium users"
description: "Only {{ $value }} premium users active in the last 30 minutes"
- alert: HighTransactionFailureRate
expr: |
(
sum(rate(business_transaction_amount_dollars_sum{status="failed"}[5m]))
/
sum(rate(business_transaction_amount_dollars_sum[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "High transaction failure rate"
description: "Transaction failure rate is {{ $value | humanizePercentage }} (threshold: 5%)"
# Infrastructure alerts
- alert: DatabaseConnectionPoolExhausted
expr: |
(
db_connection_pool_size{state="active"}
/
db_connection_pool_size{state="total"}
) > 0.9
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "Database connection pool near exhaustion"
description: "{{ $labels.database }} pool is {{ $value | humanizePercentage }} utilized"
- alert: CacheLowHitRate
expr: |
(
sum(rate(cache_operations_total{status="hit"}[5m])) by (cache_name)
/
sum(rate(cache_operations_total{operation="get"}[5m])) by (cache_name)
) < 0.8
for: 15m
labels:
severity: warning
team: api-platform
annotations:
summary: "Low cache hit rate for {{ $labels.cache_name }}"
description: "Cache hit rate is {{ $value | humanizePercentage }} (expected: >80%)"
- alert: CircuitBreakerOpen
expr: |
circuit_breaker_state == 1
for: 1m
labels:
severity: warning
team: api-platform
annotations:
summary: "Circuit breaker open for {{ $labels.service }}"
description: "Circuit breaker for {{ $labels.service }} has been open for more than 1 minute"
# SLO burn rate alerts (multi-window approach)
- alert: SLOBurnRateHigh
expr: |
(
# 5m burn rate > 14.4 (1 hour of error budget in 5 minutes)
(
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > (1 - 0.999) * 14.4
) and (
# 1h burn rate > 1 (confirms it's not a spike)
(
sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (service)
/
sum(rate(http_requests_total[1h])) by (service)
) > (1 - 0.999)
)
labels:
severity: critical
team: api-platform
alert_type: slo_burn
annotations:
summary: "SLO burn rate critically high for {{ $labels.service }}"
description: "{{ $labels.service }} is burning error budget 14.4x faster than normal"
# Resource alerts
- alert: HighMemoryUsage
expr: |
(
container_memory_usage_bytes{container!="POD",container!=""}
/
container_spec_memory_limit_bytes{container!="POD",container!=""}
) > 0.9
for: 5m
labels:
severity: warning
team: api-platform
annotations:
summary: "High memory usage for {{ $labels.container }}"
description: "Container {{ $labels.container }} memory usage is {{ $value | humanizePercentage }}"
# AlertManager configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: warning
receiver: 'slack-warnings'
- match:
team: payments
receiver: 'payments-team'
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
labels: '{{ .CommonLabels }}'
- name: 'slack-warnings'
slack_configs:
- channel: '#warnings'
send_resolved: true
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ .CommonAnnotations.description }}'
actions:
- type: button
text: 'View Dashboard'
url: '{{ .CommonAnnotations.dashboard_url }}'
- type: button
text: 'View Runbook'
url: '{{ .CommonAnnotations.runbook_url }}'
- name: 'payments-team'
email_configs:
- to: 'payments-team@example.com'
from: 'alerts@example.com'
headers:
Subject: 'Payment Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'service']
Example 5: OpenTelemetry Integration for Distributed Tracing
// tracing/setup.js - OpenTelemetry configuration
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
class TracingSetup {
constructor(serviceName, environment = 'production') {
this.serviceName = serviceName;
this.environment = environment;
this.sdk = null;
}
initialize() {
// Create resource identifying the service
const resource = Resource.default().merge(
new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: this.serviceName,
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: this.environment,
'service.namespace': 'api-platform',
'service.instance.id': process.env.HOSTNAME || 'unknown',
'telemetry.sdk.language': 'nodejs',
})
);
// Configure Jaeger exporter for traces
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
tags: {
service: this.serviceName,
environment: this.environment
}
});
// Configure Prometheus exporter for metrics
const prometheusExporter = new PrometheusExporter({
port: 9464,
endpoint: '/metrics',
prefix: 'otel_',
appendTimestamp: true,
}, () => {
console.log('Prometheus metrics server started on port 9464');
});
// Create SDK with auto-instrumentation
this.sdk = new NodeSDK({
resource,
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': {
enabled: false, // Disable fs to reduce noise
},
'@opentelemetry/instrumentation-http': {
requestHook: (span, request) => {
span.setAttribute('http.request.body', JSON.stringify(request.body));
span.setAttribute('http.request.user_id', request.user?.id);
},
responseHook: (span, response) => {
span.setAttribute('http.response.size', response.length);
},
ignoreIncomingPaths: ['/health', '/metrics', '/favicon.ico'],
ignoreOutgoingUrls: [(url) => url.includes('prometheus')]
},
'@opentelemetry/instrumentation-express': {
requestHook: (span, request) => {
span.setAttribute('express.route', request.route?.path);
span.setAttribute('express.params', JSON.stringify(request.params));
}
},
'@opentelemetry/instrumentation-mysql2': {
enhancedDatabaseReporting: true,
},
'@opentelemetry/instrumentation-redis-4': {
dbStatementSerializer: (cmdName, cmdArgs) => {
return `${cmdName} ${cmdArgs.slice(0, 2).join(' ')}`;
}
}
})
],
spanProcessor: new BatchSpanProcessor(jaegerExporter, {
maxQueueSize: 2048,
maxExportBatchSize: 512,
scheduledDelayMillis: 5000,
exportTimeoutMillis: 30000,
}),
    // PrometheusExporter is itself a pull-based MetricReader, so pass it directly
    metricReader: prometheusExporter,
});
// Start the SDK
this.sdk.start()
.then(() => console.log('Tracing initialized successfully'))
.catch((error) => console.error('Error initializing tracing', error));
// Graceful shutdown
process.on('SIGTERM', () => {
this.shutdown();
});
}
async shutdown() {
try {
await this.sdk.shutdown();
console.log('Tracing terminated successfully');
} catch (error) {
console.error('Error terminating tracing', error);
}
}
// Manual span creation for custom instrumentation
createSpan(tracer, spanName, fn) {
return tracer.startActiveSpan(spanName, async (span) => {
try {
span.setAttribute('span.kind', 'internal');
span.setAttribute('custom.span', true);
const result = await fn(span);
        span.setStatus({ code: 1, message: 'OK' }); // 1 = SpanStatusCode.OK
return result;
} catch (error) {
span.setStatus({ code: 2, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
}
// Usage in application
const tracing = new TracingSetup('api-gateway', process.env.NODE_ENV);
tracing.initialize();
// Custom instrumentation example
const { trace } = require('@opentelemetry/api');
async function processOrder(orderId) {
const tracer = trace.getTracer('order-processing', '1.0.0');
return tracing.createSpan(tracer, 'processOrder', async (span) => {
span.setAttribute('order.id', orderId);
span.addEvent('Order processing started');
// Validate order
await tracing.createSpan(tracer, 'validateOrder', async (childSpan) => {
childSpan.setAttribute('validation.type', 'schema');
// Validation logic
await validateOrderSchema(orderId);
});
// Process payment
await tracing.createSpan(tracer, 'processPayment', async (childSpan) => {
childSpan.setAttribute('payment.method', 'stripe');
// Payment logic
const result = await processStripePayment(orderId);
childSpan.setAttribute('payment.status', result.status);
childSpan.addEvent('Payment processed', {
'payment.amount': result.amount,
'payment.currency': result.currency
});
});
// Send confirmation
await tracing.createSpan(tracer, 'sendConfirmation', async (childSpan) => {
childSpan.setAttribute('notification.type', 'email');
// Email logic
await sendOrderConfirmation(orderId);
});
span.addEvent('Order processing completed');
return { success: true, orderId };
});
}
module.exports = { TracingSetup, tracing };
Example 6: Custom Prometheus Exporters for Complex Metrics
# custom_exporters.py - Python Prometheus exporter for business metrics
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Enum
from prometheus_client.core import CollectorRegistry
import asyncio
import time
from datetime import datetime
import psycopg2
import redis
import requests
class CustomBusinessExporter:
def __init__(self, db_config, redis_config, port=9091):
self.registry = CollectorRegistry()
self.db_config = db_config
self.redis_config = redis_config
self.port = port
# Initialize metrics
self.initialize_metrics()
# Connect to data sources
self.connect_datasources()
def initialize_metrics(self):
# Business KPI metrics
self.revenue_total = Gauge(
'business_revenue_total_usd',
'Total revenue in USD',
['period', 'product_line', 'region'],
registry=self.registry
)
self.customer_lifetime_value = Histogram(
'business_customer_lifetime_value_usd',
'Customer lifetime value distribution',
['customer_segment', 'acquisition_channel'],
buckets=(10, 50, 100, 500, 1000, 5000, 10000, 50000),
registry=self.registry
)
self.churn_rate = Gauge(
'business_churn_rate_percentage',
'Customer churn rate',
['plan', 'cohort'],
registry=self.registry
)
self.monthly_recurring_revenue = Gauge(
'business_mrr_usd',
'Monthly recurring revenue',
['plan', 'currency'],
registry=self.registry
)
self.net_promoter_score = Gauge(
'business_nps',
'Net Promoter Score',
['segment', 'survey_type'],
registry=self.registry
)
# Operational metrics
self.data_pipeline_lag = Histogram(
'data_pipeline_lag_seconds',
'Data pipeline processing lag',
['pipeline', 'stage'],
buckets=(1, 5, 10, 30, 60, 300, 600, 1800, 3600),
registry=self.registry
)
self.feature_usage = Counter(
'feature_usage_total',
'Feature usage counts',
['feature_name', 'user_tier', 'success'],
registry=self.registry
)
self.api_quota_usage = Gauge(
'api_quota_usage_percentage',
'API quota usage by customer',
['customer_id', 'tier', 'resource'],
registry=self.registry
)
# System health indicators
self.dependency_health = Enum(
'dependency_health_status',
'Health status of external dependencies',
['service', 'dependency'],
states=['healthy', 'degraded', 'unhealthy'],
registry=self.registry
)
self.data_quality_score = Gauge(
'data_quality_score',
'Data quality score (0-100)',
['dataset', 'dimension'],
registry=self.registry
)
def connect_datasources(self):
# PostgreSQL connection
self.db_conn = psycopg2.connect(**self.db_config)
# Redis connection
self.redis_client = redis.Redis(**self.redis_config)
def collect_business_metrics(self):
"""Collect business metrics from various data sources"""
cursor = self.db_conn.cursor()
# Revenue metrics
cursor.execute("""
SELECT
DATE_TRUNC('day', created_at) as period,
product_line,
region,
SUM(amount) as total_revenue
FROM orders
WHERE status = 'completed'
AND created_at >= NOW() - INTERVAL '7 days'
GROUP BY period, product_line, region
""")
for row in cursor.fetchall():
self.revenue_total.labels(
period=row[0].isoformat(),
product_line=row[1],
region=row[2]
).set(row[3])
# Customer lifetime value
cursor.execute("""
SELECT
c.segment,
c.acquisition_channel,
AVG(o.total_spent) as avg_clv
FROM customers c
JOIN (
SELECT customer_id, SUM(amount) as total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
) o ON c.id = o.customer_id
GROUP BY c.segment, c.acquisition_channel
""")
for row in cursor.fetchall():
self.customer_lifetime_value.labels(
customer_segment=row[0],
acquisition_channel=row[1]
).observe(row[2])
# MRR calculation
cursor.execute("""
SELECT
plan_name,
currency,
SUM(
CASE
WHEN billing_period = 'yearly' THEN amount / 12
ELSE amount
END
) as mrr
FROM subscriptions
WHERE status = 'active'
GROUP BY plan_name, currency
""")
for row in cursor.fetchall():
self.monthly_recurring_revenue.labels(
plan=row[0],
currency=row[1]
).set(row[2])
# Churn rate
cursor.execute("""
WITH cohort_data AS (
SELECT
plan_name,
DATE_TRUNC('month', created_at) as cohort,
COUNT(*) as total_customers,
COUNT(CASE WHEN status = 'cancelled' THEN 1 END) as churned_customers
FROM subscriptions
WHERE created_at >= NOW() - INTERVAL '6 months'
GROUP BY plan_name, cohort
)
SELECT
plan_name,
cohort,
(churned_customers::float / total_customers) * 100 as churn_rate
FROM cohort_data
""")
for row in cursor.fetchall():
self.churn_rate.labels(
plan=row[0],
cohort=row[1].isoformat()
).set(row[2])
cursor.close()
def collect_operational_metrics(self):
"""Collect operational metrics from Redis and other sources"""
# API quota usage from Redis
for key in self.redis_client.scan_iter("quota:*"):
            # Keys are str when decode_responses=True and bytes otherwise; handle both
            parts = (key.decode() if isinstance(key, bytes) else key).split(':')
if len(parts) >= 3:
customer_id = parts[1]
resource = parts[2]
used = float(self.redis_client.get(key) or 0)
limit_key = f"quota_limit:{customer_id}:{resource}"
limit = float(self.redis_client.get(limit_key) or 1000)
usage_percentage = (used / limit) * 100 if limit > 0 else 0
# Get customer tier from database
cursor = self.db_conn.cursor()
cursor.execute(
"SELECT tier FROM customers WHERE id = %s",
(customer_id,)
)
result = cursor.fetchone()
tier = result[0] if result else 'unknown'
cursor.close()
self.api_quota_usage.labels(
customer_id=customer_id,
tier=tier,
resource=resource
).set(usage_percentage)
# Data pipeline lag from Redis
pipeline_stages = ['ingestion', 'processing', 'storage', 'delivery']
for stage in pipeline_stages:
lag_key = f"pipeline:lag:{stage}"
lag_value = self.redis_client.get(lag_key)
if lag_value:
self.data_pipeline_lag.labels(
pipeline='main',
stage=stage
).observe(float(lag_value))
def check_dependency_health(self):
"""Check health of external dependencies"""
dependencies = [
('payment', 'stripe', 'https://api.stripe.com/health'),
('email', 'sendgrid', 'https://api.sendgrid.com/health'),
('storage', 's3', 'https://s3.amazonaws.com/health'),
('cache', 'redis', 'redis://localhost:6379'),
('database', 'postgres', self.db_config)
]
for service, dep_name, endpoint in dependencies:
try:
if dep_name == 'redis':
# Check Redis
self.redis_client.ping()
status = 'healthy'
elif dep_name == 'postgres':
# Check PostgreSQL
cursor = self.db_conn.cursor()
cursor.execute("SELECT 1")
cursor.close()
status = 'healthy'
else:
# Check HTTP endpoints
response = requests.get(endpoint, timeout=5)
if response.status_code == 200:
status = 'healthy'
elif 200 < response.status_code < 500:
status = 'degraded'
else:
status = 'unhealthy'
except Exception as e:
print(f"Health check failed for {dep_name}: {e}")
status = 'unhealthy'
self.dependency_health.labels(
service=service,
dependency=dep_name
).state(status)
def calculate_data_quality(self):
"""Calculate data quality scores"""
cursor = self.db_conn.cursor()
# Completeness score
cursor.execute("""
SELECT
'orders' as dataset,
(COUNT(*) - COUNT(CASE WHEN customer_email IS NULL THEN 1 END))::float / COUNT(*) * 100 as completeness
FROM orders
WHERE created_at >= NOW() - INTERVAL '1 day'
""")
for row in cursor.fetchall():
self.data_quality_score.labels(
dataset=row[0],
dimension='completeness'
).set(row[1])
# Accuracy score (checking for valid email formats)
cursor.execute("""
SELECT
'customers' as dataset,
COUNT(CASE WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$' THEN 1 END)::float / COUNT(*) * 100 as accuracy
FROM customers
WHERE created_at >= NOW() - INTERVAL '1 day'
""")
for row in cursor.fetchall():
self.data_quality_score.labels(
dataset=row[0],
dimension='accuracy'
).set(row[1])
cursor.close()
    async def collect_metrics_async(self):
        """Run the synchronous collectors concurrently in worker threads (Python 3.9+)"""
        await asyncio.gather(
            asyncio.to_thread(self.collect_business_metrics),
            asyncio.to_thread(self.collect_operational_metrics),
            asyncio.to_thread(self.check_dependency_health),
            asyncio.to_thread(self.calculate_data_quality)
        )
def run(self):
"""Start the exporter"""
# Start HTTP server for Prometheus to scrape
start_http_server(self.port, registry=self.registry)
print(f"Custom exporter started on port {self.port}")
# Collect metrics every 30 seconds
while True:
try:
self.collect_business_metrics()
self.collect_operational_metrics()
self.check_dependency_health()
self.calculate_data_quality()
print(f"Metrics collected at {datetime.now()}")
time.sleep(30)
except Exception as e:
print(f"Error collecting metrics: {e}")
time.sleep(30)
# Usage
if __name__ == "__main__":
db_config = {
'host': 'localhost',
'database': 'production',
'user': 'metrics_user',
'password': 'secure_password',
'port': 5432
}
redis_config = {
'host': 'localhost',
'port': 6379,
'db': 0,
'decode_responses': True
}
exporter = CustomBusinessExporter(db_config, redis_config)
exporter.run()
Error Handling
| Error | Cause | Solution |
|---|---|---|
| "Connection refused to Prometheus" | Prometheus not running or wrong port | Check Docker container status with docker ps, verify port mapping |
| "No data in Grafana dashboard" | Metrics not being scraped | Verify Prometheus targets at localhost:9090/targets, check API metrics endpoint |
| "Too many samples" error | High cardinality labels | Review label usage, avoid user IDs or timestamps as labels |
| "Out of memory" in Prometheus | Retention too long or too many metrics | Reduce retention time, implement remote storage, or scale vertically |
| Jaeger traces not appearing | Incorrect sampling rate | Increase sampling rate in tracer configuration |
| "Context deadline exceeded" | Scrape timeout too short | Increase scrape_timeout in prometheus.yml (default 10s) |
| "Error reading Prometheus" | Corrupt WAL (write-ahead log) | Delete WAL directory: rm -rf /prometheus/wal/* and restart |
| "Too many open files" | File descriptor limit reached | Increase ulimit: ulimit -n 65536 or adjust systemd limits |
| AlertManager not firing | Incorrect routing rules | Validate routing tree with amtool config routes |
| Grafana login loop | Cookie/session issues | Clear browser cookies, check Grafana cookie settings |
Configuration Options
Basic Usage:
/create-monitoring \
--stack=prometheus \
--services=api-gateway,user-service,order-service \
--environment=production \
--retention=30d
Available Options:
--stack <type> - Monitoring stack to deploy
- prometheus - Prometheus + Grafana + AlertManager (default, open-source)
- elastic - ELK stack (Elasticsearch, Logstash, Kibana) for log-centric setups
- datadog - Datadog agent configuration (requires API key)
- newrelic - New Relic agent setup (requires license key)
- hybrid - Combination of metrics (Prometheus) and logs (ELK)
--tracing <backend> - Distributed tracing backend
- jaeger - Jaeger all-in-one (default, recommended to start with)
- zipkin - Zipkin server
- tempo - Grafana Tempo (for high scale)
- xray - AWS X-Ray (for AWS environments)
- none - Skip tracing setup
--retention <duration> - Metrics retention period
- Default: 15d (15 days)
- Production: 30d to 90d
- With remote storage: 365d or more
--scrape-interval <duration> - How often to collect metrics
- Default: 15s
- High-frequency: 5s (higher resource usage)
- Low-frequency: 60s (for stable metrics)
--alerting-channels <channels> - Where to send alerts
- slack - Slack webhook integration
- pagerduty - PagerDuty integration
- email - SMTP email notifications
- webhook - Custom webhook endpoint
- opsgenie - Atlassian OpsGenie
--dashboard-presets <presets> - Pre-built dashboards to install
- red-metrics - Rate, Errors, Duration
- four-golden - Latency, Traffic, Errors, Saturation
- business-kpis - Revenue, Users, Conversion
- sre-slos - SLI/SLO tracking
- security - Security metrics and anomalies
--exporters <list> - Additional exporters to configure
- node-exporter - System/host metrics
- blackbox-exporter - Probe endpoints
- postgres-exporter - PostgreSQL metrics
- redis-exporter - Redis metrics
- custom - Custom business metrics
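As an illustration of the blackbox-exporter option above, the generated scrape job follows the standard probe/relabel pattern; the module name, target URL, and exporter address are assumptions:
# prometheus.yml fragment - blackbox-exporter probe job (illustrative)
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115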
--high-availability - Enable HA configuration
- Sets up Prometheus federation (see the federation sketch below)
- Configures AlertManager clustering
- Enables Grafana database replication
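For the federation part of the HA option, a scrape job of roughly this shape is generated; the upstream Prometheus hostnames and match selectors are assumptions:
# prometheus.yml fragment - federate from per-region Prometheus servers (illustrative)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="api-services"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - prometheus-us-east:9090
          - prometheus-eu-west:9090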
--storage <type> - Long-term storage backend
- local - Local disk (default)
- thanos - Thanos for unlimited retention
- cortex - Cortex for multi-tenancy
- victoria - VictoriaMetrics for efficiency
- s3 - S3-compatible object storage
--dry-run - Generate configuration without deploying
- Creates all config files
- Validates syntax
- Shows what would be deployed
- No actual containers started
Best Practices
DO:
- Start with RED metrics (Rate, Errors, Duration) as your foundation
- Use histogram buckets that align with your SLO targets
- Tag metrics with environment, region, version, and service
- Create runbooks for every alert and link them in annotations
- Implement meta-monitoring (monitor the monitoring system); see the sketch after this list
- Use recording rules for frequently-run expensive queries
- Set up separate dashboards for different audiences (ops, dev, business)
- Use exemplars to link metrics to traces for easier debugging
- Implement gradual rollout of new metrics to avoid cardinality explosion
- Archive old dashboards before creating new ones
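For the meta-monitoring item above, a minimal sketch scrapes the monitoring components themselves and alerts when one stops reporting; the job name and targets are assumptions that mirror the Docker Compose example:
# prometheus.yml fragment - scrape the monitoring stack itself (illustrative)
scrape_configs:
  - job_name: 'monitoring-stack'
    static_configs:
      - targets: ['localhost:9090', 'alertmanager:9093', 'grafana:3000']
# alerting-rules.yml fragment
groups:
  - name: meta_monitoring
    rules:
      - alert: MonitoringComponentDown
        expr: up{job="monitoring-stack"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Monitoring component {{ $labels.instance }} is not being scraped"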
DON'T:
- Add high-cardinality labels like user IDs, session IDs, or UUIDs
- Create dashboards with 50+ panels (causes browser performance issues)
- Alert on symptoms without providing actionable runbooks
- Store raw logs in Prometheus (use log aggregation systems)
- Ignore alert fatigue (regularly review and tune thresholds)
- Hardcode datasource UIDs in dashboard JSON
- Mix metrics from different time ranges in one panel
- Use regex selectors without limits in production queries
- Forget to set up backup for Grafana database
- Skip capacity planning for metrics growth
TIPS:
- Import dashboards from grafana.com marketplace (dashboard IDs)
- Use Prometheus federation for multi-region deployments
- Implement progressive alerting: warning (Slack) → critical (PagerDuty)
- Create team-specific folders in Grafana for organization
- Use Grafana variables for dynamic, reusable dashboards
- Set up dashboard playlists for NOC/SOC displays
- Use annotations to mark deployments and incidents on graphs
- Implement SLO burn rate alerts instead of static thresholds
- Create separate Prometheus jobs for different scrape intervals
- Use remote_write for backup and long-term storage
Performance Considerations
Prometheus Resource Planning
Memory Required =
(number_of_time_series * 2KB) + # Active series
(ingestion_rate * 2 * retention_hours) + # WAL and blocks
(2GB) # Base overhead
CPU Cores Required =
(ingestion_rate / 100,000) + # Ingestion processing
(query_rate / 10) + # Query processing
(1) # Base overhead
Disk IOPS Required =
(ingestion_rate / 1000) + # Write IOPS
(query_rate * 100) + # Read IOPS
(100) # Background compaction
Optimization Strategies
- Reduce cardinality: Audit and remove unnecessary labels
- Use recording rules: Pre-compute expensive queries
- Optimize scrape configs: Different intervals for different metrics
- Implement downsampling: For long-term storage
- Horizontal sharding: Separate Prometheus per service/team
- Remote storage: Offload old data to object storage (see the remote_write sketch after this list)
- Query caching: Use Trickster or built-in Grafana caching
- Metric relabeling: Drop unwanted metrics at scrape time
- Federation: Aggregate metrics hierarchically
- Capacity limits: Set max_samples_per_send and queue sizes
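The remote-storage and capacity-limit items above usually reduce to a remote_write block; a minimal sketch with an assumed endpoint and conservative queue settings:
# prometheus.yml fragment - remote_write with queue limits (illustrative)
remote_write:
  - url: https://metrics-store.example.com/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    write_relabel_configs:
      # Drop runtime metrics that are not worth long-term storage
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop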
Scaling Thresholds
- < 1M active series: Single Prometheus instance
- 1M - 10M series: Prometheus with remote storage
- 10M - 100M series: Sharded Prometheus or Cortex
- Over 100M series: Thanos or multi-region Cortex
Security Considerations
Authentication & Authorization
# prometheus.yml with basic auth
scrape_configs:
- job_name: 'secured-api'
basic_auth:
username: 'prometheus'
password_file: '/etc/prometheus/password.txt'
scheme: https
tls_config:
ca_file: '/etc/prometheus/ca.crt'
cert_file: '/etc/prometheus/cert.crt'
key_file: '/etc/prometheus/key.pem'
insecure_skip_verify: false
Network Security
- Deploy monitoring stack in isolated subnet
- Use internal load balancers for Prometheus federation
- Implement mTLS between Prometheus and targets
- Restrict metrics endpoints to monitoring CIDR blocks
- Use VPN or private links for cross-region federation
Data Security
- Encrypt data at rest (filesystem encryption)
- Sanitize metrics to avoid leaking sensitive data
- Implement audit logging for all access
- Regular security scanning of monitoring infrastructure
- Rotate credentials and certificates regularly
Compliance Considerations
- GDPR: Avoid collecting PII in metrics labels (see the relabeling sketch after this list)
- HIPAA: Encrypt all health-related metrics
- PCI DSS: Separate payment metrics into isolated stack
- SOC 2: Maintain audit trails and access logs
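One concrete way to enforce the PII and sanitization guidance above is to strip identifying labels at scrape time; the label names below are illustrative assumptions:
# prometheus.yml fragment - drop PII-bearing labels before ingestion (illustrative)
scrape_configs:
  - job_name: 'api-services'
    static_configs:
      - targets: ['api-gateway:8080']
    metric_relabel_configs:
      - regex: 'user_email|user_id|session_id'
        action: labeldrop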
Troubleshooting Guide
Issue: Prometheus consuming too much memory
# 1. Check current memory usage and series count
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName' | head -20
# 2. Find high cardinality metrics
curl -g 'http://localhost:9090/api/v1/query?query=count(count+by(__name__)({__name__=~".+"}))' | jq
# 3. Identify problematic labels
curl -s http://localhost:9090/api/v1/label/userId/values | jq '. | length'
# 4. Drop high-cardinality metrics
# Add to prometheus.yml:
metric_relabel_configs:
- source_labels: [__name__]
regex: 'problematic_metric_.*'
action: drop
Issue: Grafana dashboards loading slowly
# 1. Check query performance
curl -s 'http://localhost:9090/api/v1/query_log' | jq '.data[] | select(.duration_seconds > 1)'
# 2. Analyze slow queries in Grafana
SELECT
dashboard_id,
panel_id,
AVG(duration) as avg_duration,
query
FROM grafana.query_history
WHERE duration > 1000
GROUP BY dashboard_id, panel_id, query
ORDER BY avg_duration DESC;
# 3. Optimize with recording rules
# Add to recording_rules.yml:
groups:
- name: dashboard_queries
interval: 30s
rules:
- record: api:request_rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
Issue: Alerts not firing
# 1. Check alert state
curl http://localhost:9090/api/v1/alerts | jq
# 2. Validate AlertManager config
docker exec alertmanager amtool config routes
# 3. Test alert routing
docker exec alertmanager amtool config routes test \
--config.file=/etc/alertmanager/alertmanager.yml \
--verify.receivers=pagerduty-critical \
severity=critical service=api
# 4. Check for inhibition rules
curl http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.inhibitedBy != [])'
Issue: Missing traces in Jaeger
// 1. Verify sampling rate and confirm spans are being reported (jaeger-client)
const { initTracer } = require('jaeger-client');
const tracer = initTracer({
  serviceName: 'api-gateway',
  sampler: {
    type: 'const', // sample everything while debugging
    param: 1,
  },
  reporter: {
    logSpans: true, // 2. Log each reported span via the provided logger
  },
}, { logger: console });
// 3. Query the Jaeger Query API (UI port) for recent traces from the service
curl "http://localhost:16686/api/traces?service=api-gateway"
Migration Guide
From CloudWatch to Prometheus:
# Migration script example
import boto3
from datetime import datetime, timedelta
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
def migrate_cloudwatch_to_prometheus():
# Read from CloudWatch
cw = boto3.client('cloudwatch')
metrics = cw.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
StartTime=datetime.now() - timedelta(hours=1),
EndTime=datetime.now(),
Period=300,
Statistics=['Average']
)
# Write to Prometheus
registry = CollectorRegistry()
g = Gauge('aws_ec2_cpu_utilization', 'EC2 CPU Usage',
['instance_id'], registry=registry)
for datapoint in metrics['Datapoints']:
g.labels(instance_id='i-1234567890abcdef0').set(datapoint['Average'])
push_to_gateway('localhost:9091', job='cloudwatch_migration', registry=registry)
From Datadog to Prometheus:
- Export Datadog dashboards as JSON
- Convert queries using query translator
- Import to Grafana with dashboard converter
- Map Datadog tags to Prometheus labels
- Recreate alerts in AlertManager format
Related Commands
- /api-load-tester - Generate test traffic to validate monitoring setup
- /api-security-scanner - Security testing with metrics integration
- /add-rate-limiting - Rate limiting with metrics exposure
- /api-contract-generator - Generate OpenAPI specs with metrics annotations
- /deployment-pipeline-orchestrator - CI/CD with monitoring integration
- /api-versioning-manager - Version-aware metrics tracking
Advanced Topics
Multi-Cluster Monitoring with Thanos:
# thanos-sidecar.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: thanos-config
data:
object-store.yaml: |
type: S3
config:
bucket: metrics-long-term
endpoint: s3.amazonaws.com
access_key: ${AWS_ACCESS_KEY}
secret_key: ${AWS_SECRET_KEY}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-thanos
spec:
template:
spec:
containers:
- name: prometheus
args:
- --storage.tsdb.retention.time=2h
- --storage.tsdb.min-block-duration=2h
- --storage.tsdb.max-block-duration=2h
- --web.enable-lifecycle
- name: thanos-sidecar
image: quay.io/thanos/thanos:v0.31.0
args:
- sidecar
- --prometheus.url=http://localhost:9090
- --objstore.config-file=/etc/thanos/object-store.yaml
Service Mesh Observability (Istio):
# Automatic metrics from Istio
telemetry:
v2:
prometheus:
providers:
- name: prometheus
configOverride:
inboundSidecar:
disable_host_header_fallback: false
metric_expiry_duration: 10m
outboundSidecar:
disable_host_header_fallback: false
metric_expiry_duration: 10m
gateway:
disable_host_header_fallback: true
Version History
- v1.0.0 (2024-01): Initial Prometheus + Grafana implementation
- v1.1.0 (2024-03): Added Jaeger tracing integration
- v1.2.0 (2024-05): Thanos long-term storage support
- v1.3.0 (2024-07): OpenTelemetry collector integration
- v1.4.0 (2024-09): Multi-cluster federation support
- v1.5.0 (2024-10): Custom business metrics exporters
- Planned v2.0.0: eBPF-based zero-instrumentation monitoring