
Serverless Observability Best Practices

Comprehensive observability patterns for serverless applications based on AWS best practices.

Table of Contents

  • Three Pillars of Observability
  • Metrics
  • Logging
  • Tracing
  • Unified Observability
  • Alerting
  • Dashboard Best Practices
  • Monitoring Serverless Architectures
  • OpenTelemetry Integration
  • Best Practices Summary

Three Pillars of Observability

Metrics

Numeric data measured at intervals (time series)

  • Request rate, error rate, duration
  • CPU%, memory%, disk%
  • Custom business metrics
  • Service Level Indicators (SLIs)

Logs

Timestamped records of discrete events

  • Application events and errors
  • State transformations
  • Debugging information
  • Audit trails

Traces

A single request's journey across services

  • Request flow through distributed system
  • Service dependencies
  • Latency breakdown
  • Error propagation

Metrics

CloudWatch Metrics for Lambda

Out-of-the-box metrics (automatically available):

- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)

CDK Configuration:

import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
});

// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
  metric: fn.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
});

new cloudwatch.Alarm(this, 'DurationAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
});

Custom Metrics

Use CloudWatch Embedded Metric Format (EMF):

export const handler = async (event: any) => {
  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    // Emit custom metrics
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName', 'Operation']],
          Metrics: [
            { Name: 'ProcessingTime', Unit: 'Milliseconds' },
            { Name: 'OrderValue', Unit: 'None' },
          ],
        }],
      },
      ServiceName: 'OrderService',
      Operation: 'ProcessOrder',
      ProcessingTime: Date.now() - startTime,
      OrderValue: result.amount,
    }));

    return result;
  } catch (error) {
    // Emit error metric
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName']],
          Metrics: [{ Name: 'Errors', Unit: 'Count' }],
        }],
      },
      ServiceName: 'OrderService',
      Errors: 1,
    }));

    throw error;
  }
};

Using Lambda Powertools:

import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({
  namespace: 'MyApp',
  serviceName: 'OrderService',
});

export const handler = async (event: any) => {
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
    metrics.addMetric('OrderValue', MetricUnits.None, result.amount);

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    throw error;
  } finally {
    metrics.publishStoredMetrics();
  }
};

Logging

Structured Logging

Use JSON format for logs:

// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
  const startTime = Date.now();

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    orderId: event.orderId,
    customerId: event.customerId,
    timestamp: new Date().toISOString(),
    requestId: context.awsRequestId,
  }));

  try {
    const result = await processOrder(event);

    console.log(JSON.stringify({
      level: 'INFO',
      message: 'Order processed successfully',
      orderId: event.orderId,
      duration: Date.now() - startTime,
      timestamp: new Date().toISOString(),
    }));

    return result;
  } catch (error) {
    console.error(JSON.stringify({
      level: 'ERROR',
      message: 'Order processing failed',
      orderId: event.orderId,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
      },
      timestamp: new Date().toISOString(),
    }));

    throw error;
  }
};

// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);

Using Lambda Powertools Logger:

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: 'INFO',
});

export const handler = async (event: any, context: Context) => {
  logger.addContext(context);

  logger.info('Processing order', {
    orderId: event.orderId,
    customerId: event.customerId,
  });

  try {
    const result = await processOrder(event);

    logger.info('Order processed', {
      orderId: event.orderId,
      amount: result.amount,
    });

    return result;
  } catch (error) {
    logger.error('Order processing failed', {
      orderId: event.orderId,
      error,
    });

    throw error;
  }
};

Log Levels

Use appropriate log levels:

  • ERROR: Errors requiring immediate attention
  • WARN: Warnings or recoverable errors
  • INFO: Important business events
  • DEBUG: Detailed debugging information (disable in production)

const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: process.env.LOG_LEVEL || 'INFO',
});

logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });

Log Insights Queries

Common CloudWatch Logs Insights queries:

# Find errors in last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc

# Calculate p99 latency
stats pct(duration, 99) by serviceName

# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50

# Track specific customer requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
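
Frequently used queries can also be shipped with the stack so they appear pre-built in the Logs Insights console. A minimal sketch using the QueryDefinition construct from aws-cdk-lib/aws-logs (the query name is illustrative; fn is the Lambda function from the earlier CDK examples):

import * as logs from 'aws-cdk-lib/aws-logs';

// Deploy the "recent errors" query alongside the service
new logs.QueryDefinition(this, 'RecentErrorsQuery', {
  queryDefinitionName: 'order-service/recent-errors',
  logGroups: [fn.logGroup],
  queryString: new logs.QueryString({
    fields: ['@timestamp', '@message', 'level', 'error.message'],
    filterStatements: ['level = "ERROR"'],
    sort: '@timestamp desc',
    limit: 100,
  }),
});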

Tracing

Enable X-Ray Tracing

Configure X-Ray for Lambda:

const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});

// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
  deployOptions: {
    tracingEnabled: true,
  },
});

// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
  definition,
  tracingEnabled: true,
});

Instrument application code:

import { captureAWSv3Client } from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));

// Custom segments
import AWSXRay from 'aws-xray-sdk-core';

export const handler = async (event: any) => {
  const segment = AWSXRay.getSegment();

  // Custom subsegment
  const subsegment = segment.addNewSubsegment('ProcessOrder');

  try {
    // Add annotations (indexed for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('customerId', event.customerId);

    // Add metadata (not indexed, detailed info)
    subsegment.addMetadata('orderDetails', event);

    const result = await processOrder(event);

    subsegment.addAnnotation('status', 'success');
    subsegment.close();

    return result;
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    throw error;
  }
};

Using Lambda Powertools Tracer:

import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDB } from '@aws-sdk/client-dynamodb';

const tracer = new Tracer({ serviceName: 'OrderService' });

// Wrap the SDK client once (outside the handler) so every call is traced
const dynamodb = tracer.captureAWSv3Client(new DynamoDB({}));

export const handler = async (event: any) => {
  const result = await dynamodb.getItem({
    TableName: process.env.TABLE_NAME,
    Key: { orderId: { S: event.orderId } },
  });

  // Custom annotation (indexed) and metadata (not indexed)
  tracer.putAnnotation('orderId', event.orderId);
  tracer.putMetadata('orderDetails', event);

  return result;
};

Service Map

Visualize service dependencies with X-Ray:

  • Shows service-to-service communication
  • Identifies latency bottlenecks
  • Highlights error rates between services
  • Tracks downstream dependencies

Distributed Tracing Best Practices

  1. Enable tracing everywhere: Lambda, API Gateway, Step Functions
  2. Use annotations for filtering: Indexed fields for queries
  3. Use metadata for details: Non-indexed detailed information
  4. Sample appropriately: 100% for low traffic, sampled for high traffic (see the sampling-rule sketch below)
  5. Correlate with logs: Include trace ID in log entries
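
For point 4, sampling rates are configured with X-Ray sampling rules rather than in handler code. A hedged sketch using the low-level CfnSamplingRule construct from aws-cdk-lib/aws-xray (rule name, rate, and match values are illustrative):

import * as xray from 'aws-cdk-lib/aws-xray';

// Always trace at least 1 request/second, then sample 5% of the rest
new xray.CfnSamplingRule(this, 'OrderServiceSampling', {
  samplingRule: {
    ruleName: 'order-service',
    priority: 100,
    reservoirSize: 1,
    fixedRate: 0.05,
    serviceName: 'OrderService',
    serviceType: '*',
    host: '*',
    httpMethod: '*',
    urlPath: '*',
    resourceArn: '*',
    version: 1,
  },
});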

Unified Observability

Correlation Between Pillars

Include trace ID in logs:

export const handler = async (event: any, context: Context) => {
  const traceId = process.env._X_AMZN_TRACE_ID;

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    traceId,
    requestId: context.awsRequestId,
    orderId: event.orderId,
  }));
};

CloudWatch ServiceLens

Unified view of traces and metrics:

  • Automatically correlates X-Ray traces with CloudWatch metrics
  • Shows service map with metrics overlay
  • Identifies performance and availability issues
  • Provides end-to-end request view

Lambda Powertools Integration

All three pillars in one:

import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });

export const handler = async (event: any, context: Context) => {
  // Automatically adds trace context to logs
  logger.addContext(context);

  logger.info('Processing order', { orderId: event.orderId });

  // Add trace annotations
  tracer.putAnnotation('orderId', event.orderId);

  // Add metrics
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);

    logger.info('Order processed', { orderId: event.orderId });

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    logger.error('Processing failed', { orderId: event.orderId, error });
    throw error;
  } finally {
    metrics.publishStoredMetrics();
  }
};

Alerting

Effective Alerting Strategy

Alert on what matters:

  • Critical: Customer-impacting issues (errors, high latency)
  • Warning: Approaching thresholds (80% capacity)
  • Info: Trends and anomalies (cost spikes)

Alarm fatigue prevention:

  • Tune thresholds based on actual patterns
  • Use composite alarms to reduce noise
  • Set appropriate evaluation periods
  • Include clear remediation steps (see the notification sketch below)
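
To make an alarm actionable, attach a notification action and put the runbook link in the description. A minimal sketch, assuming an SNS topic for alerts (topic name and runbook URL are illustrative):

import * as sns from 'aws-cdk-lib/aws-sns';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';

const alertTopic = new sns.Topic(this, 'AlertTopic');

const orderErrorAlarm = new cloudwatch.Alarm(this, 'OrderErrorAlarm', {
  metric: fn.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'Order errors > 10 in 5 minutes. Runbook: https://wiki.example.com/runbooks/order-errors',
});

// Notify on alarm, and again when the alarm recovers
orderErrorAlarm.addAlarmAction(new cwActions.SnsAction(alertTopic));
orderErrorAlarm.addOkAction(new cwActions.SnsAction(alertTopic));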

CloudWatch Alarms

Common alarm patterns:

// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
  metric: new cloudwatch.MathExpression({
    expression: 'errors / invocations * 100',
    usingMetrics: {
      errors: fn.metricErrors({ statistic: 'Sum' }),
      invocations: fn.metricInvocations({ statistic: 'Sum' }),
    },
  }),
  threshold: 1, // 1% error rate
  evaluationPeriods: 2,
  alarmDescription: 'Error rate exceeded 1%',
});

// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
  alarmDescription: 'p99 latency exceeded 1 second',
});

// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
  metric: fn.metric('ConcurrentExecutions', {
    statistic: 'Maximum',
  }),
  threshold: 800, // 80% of 1000 default limit
  evaluationPeriods: 1,
  alarmDescription: 'Approaching concurrency limit',
});

Composite Alarms

Reduce alert noise:

const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
  metric: fn.metricErrors(),
  threshold: 10,
  evaluationPeriods: 1,
});

const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
  metric: fn.metricThrottles(),
  threshold: 5,
  evaluationPeriods: 1,
});

const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
  metric: fn.metricDuration({ statistic: 'p99' }),
  threshold: 2000,
  evaluationPeriods: 2,
});

// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
  compositeAlarmName: 'order-service-health',
  alarmRule: cloudwatch.AlarmRule.anyOf(
    errorAlarm,
    throttleAlarm,
    latencyAlarm
  ),
  alarmDescription: 'Overall service health degraded',
});

Dashboard Best Practices

Service Dashboard Layout

Recommended sections:

  1. Overview:

    • Total invocations
    • Error rate percentage
    • P50, P95, P99 latency
    • Availability percentage
  2. Resource Utilization:

    • Concurrent executions
    • Memory utilization
    • Duration distribution
    • Throttles
  3. Business Metrics:

    • Orders processed
    • Revenue per minute
    • Customer activity
    • Feature usage
  4. Errors and Alerts:

    • Error count by type
    • Active alarms
    • DLQ message count
    • Failed transactions

CloudWatch Dashboard CDK

const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
  dashboardName: 'order-service',
});

dashboard.addWidgets(
  // Row 1: Overview
  new cloudwatch.GraphWidget({
    title: 'Invocations',
    left: [fn.metricInvocations()],
  }),
  new cloudwatch.SingleValueWidget({
    title: 'Error Rate',
    metrics: [
      new cloudwatch.MathExpression({
        expression: 'errors / invocations * 100',
        usingMetrics: {
          errors: fn.metricErrors({ statistic: 'Sum' }),
          invocations: fn.metricInvocations({ statistic: 'Sum' }),
        },
      }),
    ],
  }),
  new cloudwatch.GraphWidget({
    title: 'Latency (p50, p95, p99)',
    left: [
      fn.metricDuration({ statistic: 'p50', label: 'p50' }),
      fn.metricDuration({ statistic: 'p95', label: 'p95' }),
      fn.metricDuration({ statistic: 'p99', label: 'p99' }),
    ],
  })
);

// Row 2: Errors
dashboard.addWidgets(
  new cloudwatch.LogQueryWidget({
    title: 'Recent Errors',
    logGroupNames: [fn.logGroup.logGroupName],
    queryLines: [
      'fields @timestamp, @message',
      'filter level = "ERROR"',
      'sort @timestamp desc',
      'limit 20',
    ],
  })
);
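
The business metrics published earlier with Powertools can sit on the same dashboard. A sketch, assuming the Powertools configuration shown above (namespace MyApp, which Powertools emits with a service dimension set to the serviceName):

// Row 3: Business metrics from the Powertools examples
dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'Order Value',
    left: [
      new cloudwatch.Metric({
        namespace: 'MyApp',
        metricName: 'OrderValue',
        dimensionsMap: { service: 'OrderService' },
        statistic: 'Sum',
        period: Duration.minutes(5),
      }),
    ],
  })
);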

Monitoring Serverless Architectures

End-to-End Monitoring

Monitor the entire flow:

API Gateway → Lambda → DynamoDB → EventBridge → Lambda
     ↓           ↓          ↓            ↓           ↓
  Metrics    Traces     Metrics      Metrics     Logs

Key metrics per service:

Service          Key Metrics
API Gateway      Count, 4XXError, 5XXError, Latency, CacheHitCount
Lambda           Invocations, Errors, Duration, Throttles, ConcurrentExecutions
DynamoDB         ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors
SQS              NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage
EventBridge      Invocations, FailedInvocations, TriggeredRules
Step Functions   ExecutionsStarted, ExecutionsFailed, ExecutionTime
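
Most of these metrics have CDK helpers, so alarms for the rest of the flow look like the Lambda alarms above. A sketch, assuming api is a RestApi and queue is an SQS queue defined elsewhere in the stack:

// API Gateway: server-side errors
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
  metric: api.metricServerError({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'API Gateway 5XX errors detected',
});

// SQS: consumers falling behind
new cloudwatch.Alarm(this, 'QueueAgeAlarm', {
  metric: queue.metricApproximateAgeOfOldestMessage({ statistic: 'Maximum' }),
  threshold: 300, // seconds
  evaluationPeriods: 3,
  alarmDescription: 'Oldest SQS message is older than 5 minutes',
});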

Synthetic Monitoring

Use CloudWatch Synthetics for API monitoring:

import { Canary, Code, Runtime, Schedule, Test } from '@aws-cdk/aws-synthetics-alpha';
import { Duration } from 'aws-cdk-lib';

new Canary(this, 'ApiCanary', {
  canaryName: 'api-health-check',
  schedule: Schedule.rate(Duration.minutes(5)),
  test: Test.custom({
    code: Code.fromInline(`
      const synthetics = require('Synthetics');

      const apiCanaryBlueprint = async function () {
        await synthetics.executeHttpStep(
          'Verify API',
          { hostname: 'api.example.com', path: '/health', method: 'GET', protocol: 'https:', port: 443 },
          async function (res) {
            // Fail the step (and the canary) on any non-2xx status
            if (res.statusCode < 200 || res.statusCode > 299) {
              throw new Error('Health check failed: ' + res.statusCode);
            }
          }
        );
      };

      exports.handler = async () => {
        return await apiCanaryBlueprint();
      };
    `),
    handler: 'index.handler',
  }),
  runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
});

OpenTelemetry Integration

Amazon Distro for OpenTelemetry (ADOT)

Use ADOT for vendor-neutral observability:

// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
  this,
  'AdotLayer',
  `arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);

new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  layers: [adotLayer],
  tracing: lambda.Tracing.ACTIVE,
  environment: {
    AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
    OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
  },
});

Benefits of ADOT:

  • Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
  • Automatic instrumentation
  • Consistent format across services
  • Export to multiple backends

Best Practices Summary

Metrics

  • Use CloudWatch Embedded Metric Format (EMF)
  • Track business metrics, not just technical metrics
  • Set alarms on error rate, latency, and throughput
  • Use p99 for latency, not average
  • Create dashboards for key services

Logging

  • Use structured JSON logging
  • Include correlation IDs (request ID, trace ID)
  • Use appropriate log levels
  • Never log sensitive data (PII, secrets)
  • Use CloudWatch Logs Insights for analysis

Tracing

  • Enable X-Ray tracing on all services
  • Instrument AWS SDK calls
  • Add custom annotations for business context
  • Use service map to understand dependencies
  • Correlate traces with logs and metrics

Alerting

  • Alert on customer-impacting issues
  • Tune thresholds to reduce false positives
  • Use composite alarms to reduce noise
  • Include clear remediation steps
  • Escalate critical alarms appropriately

Tools

  • Use Lambda Powertools for unified observability
  • Use CloudWatch ServiceLens for service view
  • Use Synthetics for proactive monitoring
  • Consider ADOT for vendor-neutral observability