# Serverless Observability Best Practices
Comprehensive observability patterns for serverless applications based on AWS best practices.
## Table of Contents
- [Three Pillars of Observability](#three-pillars-of-observability)
- [Metrics](#metrics)
- [Logging](#logging)
- [Tracing](#tracing)
- [Unified Observability](#unified-observability)
- [Alerting](#alerting)
- [Dashboard Best Practices](#dashboard-best-practices)
- [Monitoring Serverless Architectures](#monitoring-serverless-architectures)
- [OpenTelemetry Integration](#opentelemetry-integration)
- [Best Practices Summary](#best-practices-summary)
## Three Pillars of Observability
### Metrics
**Numeric data measured at intervals (time series)**
- Request rate, error rate, duration
- CPU%, memory%, disk%
- Custom business metrics
- Service Level Indicators (SLIs)
### Logs
**Timestamped records of discrete events**
- Application events and errors
- State transformations
- Debugging information
- Audit trails
### Traces
**A single request's journey across services**
- Request flow through distributed system
- Service dependencies
- Latency breakdown
- Error propagation
## Metrics
### CloudWatch Metrics for Lambda
**Out-of-the-box metrics** (automatically available):
```
- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)
```
**CDK Configuration**:
```typescript
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
});
// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
metric: fn.metricErrors({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
});
new cloudwatch.Alarm(this, 'DurationAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
});
```
### Custom Metrics
**Use CloudWatch Embedded Metric Format (EMF)**:
```typescript
export const handler = async (event: any) => {
const startTime = Date.now();
try {
const result = await processOrder(event);
// Emit custom metrics
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName', 'Operation']],
Metrics: [
{ Name: 'ProcessingTime', Unit: 'Milliseconds' },
{ Name: 'OrderValue', Unit: 'None' },
],
}],
},
ServiceName: 'OrderService',
Operation: 'ProcessOrder',
ProcessingTime: Date.now() - startTime,
OrderValue: result.amount,
}));
return result;
} catch (error) {
// Emit error metric
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName']],
Metrics: [{ Name: 'Errors', Unit: 'Count' }],
}],
},
ServiceName: 'OrderService',
Errors: 1,
}));
throw error;
}
};
```
**Using Lambda Powertools**:
```typescript
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const metrics = new Metrics({
namespace: 'MyApp',
serviceName: 'OrderService',
});
export const handler = async (event: any) => {
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
metrics.addMetric('OrderValue', MetricUnits.None, result.amount);
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
```
## Logging
### Structured Logging
**Use JSON format for logs**:
```typescript
// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
const startTime = Date.now();
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
orderId: event.orderId,
customerId: event.customerId,
timestamp: new Date().toISOString(),
requestId: context.awsRequestId,
}));
try {
const result = await processOrder(event);
console.log(JSON.stringify({
level: 'INFO',
message: 'Order processed successfully',
orderId: event.orderId,
duration: Date.now() - startTime,
timestamp: new Date().toISOString(),
}));
return result;
} catch (error) {
console.error(JSON.stringify({
level: 'ERROR',
message: 'Order processing failed',
orderId: event.orderId,
error: {
name: error.name,
message: error.message,
stack: error.stack,
},
timestamp: new Date().toISOString(),
}));
throw error;
}
};
// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);
```
**Using Lambda Powertools Logger**:
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
const logger = new Logger({
serviceName: 'OrderService',
logLevel: 'INFO',
});
export const handler = async (event: any, context: Context) => {
logger.addContext(context);
logger.info('Processing order', {
orderId: event.orderId,
customerId: event.customerId,
});
try {
const result = await processOrder(event);
logger.info('Order processed', {
orderId: event.orderId,
amount: result.amount,
});
return result;
} catch (error) {
logger.error('Order processing failed', {
orderId: event.orderId,
error,
});
throw error;
}
};
```
### Log Levels
**Use appropriate log levels**:
- **ERROR**: Errors requiring immediate attention
- **WARN**: Warnings or recoverable errors
- **INFO**: Important business events
- **DEBUG**: Detailed debugging information (disable in production)
```typescript
const logger = new Logger({
serviceName: 'OrderService',
logLevel: process.env.LOG_LEVEL || 'INFO',
});
logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });
```
### Log Insights Queries
**Common CloudWatch Logs Insights queries**:
```
# Find errors in last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc
# Calculate p99 latency
stats percentile(duration, 99) by serviceName
# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50
# Track specific customer requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
```
## Tracing
### Enable X-Ray Tracing
**Configure X-Ray for Lambda**:
```typescript
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});
// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
deployOptions: {
tracingEnabled: true,
},
});
// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
definition,
tracingEnabled: true,
});
```
**Instrument application code**:
```typescript
import { captureAWSv3Client } from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));
// Custom segments
import AWSXRay from 'aws-xray-sdk-core';
export const handler = async (event: any) => {
const segment = AWSXRay.getSegment();
// Custom subsegment
const subsegment = segment.addNewSubsegment('ProcessOrder');
try {
// Add annotations (indexed for filtering)
subsegment.addAnnotation('orderId', event.orderId);
subsegment.addAnnotation('customerId', event.customerId);
// Add metadata (not indexed, detailed info)
subsegment.addMetadata('orderDetails', event);
const result = await processOrder(event);
subsegment.addAnnotation('status', 'success');
subsegment.close();
return result;
} catch (error) {
subsegment.addError(error);
subsegment.close();
throw error;
}
};
```
**Using Lambda Powertools Tracer**:
```typescript
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDB } from '@aws-sdk/client-dynamodb';

const tracer = new Tracer({ serviceName: 'OrderService' });

// Wrap the SDK client once at init so every call is traced as a subsegment
const dynamodb = tracer.captureAWSv3Client(new DynamoDB({}));

export const handler = async (event: any) => {
  const result = await dynamodb.getItem({
    TableName: process.env.TABLE_NAME!,
    Key: { orderId: { S: event.orderId } },
  });

  // Custom annotation (indexed) and metadata (not indexed)
  tracer.putAnnotation('orderId', event.orderId);
  tracer.putMetadata('orderDetails', event);

  return result;
};
```
### Service Map
**Visualize service dependencies** with X-Ray:
- Shows service-to-service communication
- Identifies latency bottlenecks
- Highlights error rates between services
- Tracks downstream dependencies
### Distributed Tracing Best Practices
1. **Enable tracing everywhere**: Lambda, API Gateway, Step Functions
2. **Use annotations for filtering**: Indexed fields for queries (see the filter expression examples after this list)
3. **Use metadata for details**: Non-indexed detailed information
4. **Sample appropriately**: 100% for low traffic, sampled for high traffic
5. **Correlate with logs**: Include trace ID in log entries
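**Querying traces by annotation** (illustrative only): annotations such as `orderId` and `customerId` from the handlers above can be used in X-Ray filter expressions, either in the console trace search or via the `GetTraceSummaries` API. Each entry below is a separate expression:
```
# Traces for a specific order (annotations are indexed)
annotation.orderId = "order-123"
# Traces for one customer that took longer than 1 second end to end
annotation.customerId = "customer-123" AND responsetime > 1
```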
## Unified Observability
### Correlation Between Pillars
**Include trace ID in logs**:
```typescript
export const handler = async (event: any, context: Context) => {
const traceId = process.env._X_AMZN_TRACE_ID;
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
traceId,
requestId: context.awsRequestId,
orderId: event.orderId,
}));
};
```
### CloudWatch ServiceLens
**Unified view of traces and metrics**:
- Automatically correlates X-Ray traces with CloudWatch metrics
- Shows service map with metrics overlay
- Identifies performance and availability issues
- Provides end-to-end request view
### Lambda Powertools Integration
**All three pillars in one**:
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });
export const handler = async (event: any, context: Context) => {
// Automatically adds trace context to logs
logger.addContext(context);
logger.info('Processing order', { orderId: event.orderId });
// Add trace annotations
tracer.putAnnotation('orderId', event.orderId);
// Add metrics
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);
logger.info('Order processed', { orderId: event.orderId });
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
logger.error('Processing failed', { orderId: event.orderId, error });
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
```
## Alerting
### Effective Alerting Strategy
**Alert on what matters**:
- **Critical**: Customer-impacting issues (errors, high latency)
- **Warning**: Approaching thresholds (80% capacity)
- **Info**: Trends and anomalies (cost spikes)
**Alarm fatigue prevention**:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Set appropriate evaluation periods
- Include clear remediation steps
### CloudWatch Alarms
**Common alarm patterns**:
```typescript
// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
metric: new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
threshold: 1, // 1% error rate
evaluationPeriods: 2,
alarmDescription: 'Error rate exceeded 1%',
});
// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
alarmDescription: 'p99 latency exceeded 1 second',
});
// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
metric: fn.metricConcurrentExecutions({
statistic: 'Maximum',
}),
threshold: 800, // 80% of 1000 default limit
evaluationPeriods: 1,
alarmDescription: 'Approaching concurrency limit',
});
```
### Composite Alarms
**Reduce alert noise**:
```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
metric: fn.metricErrors(),
threshold: 10,
evaluationPeriods: 1,
});
const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
metric: fn.metricThrottles(),
threshold: 5,
evaluationPeriods: 1,
});
const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
metric: fn.metricDuration({ statistic: 'p99' }),
threshold: 2000,
evaluationPeriods: 2,
});
// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'order-service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
throttleAlarm,
latencyAlarm
),
alarmDescription: 'Overall service health degraded',
});
```
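**Route alarms to people**: a minimal sketch of notifying an SNS topic when the composite alarm above fires, assuming it is captured in a `compositeAlarm` variable (the topic name and email address are placeholders):
```typescript
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Topic that fans out to the on-call channel
const alertTopic = new sns.Topic(this, 'AlertTopic', {
  topicName: 'order-service-alerts',
});
alertTopic.addSubscription(
  new subscriptions.EmailSubscription('oncall@example.com')
);

// Notify when the alarm fires and again when it recovers
compositeAlarm.addAlarmAction(new cw_actions.SnsAction(alertTopic));
compositeAlarm.addOkAction(new cw_actions.SnsAction(alertTopic));
```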
## Dashboard Best Practices
### Service Dashboard Layout
**Recommended sections**:
1. **Overview**:
- Total invocations
- Error rate percentage
- P50, P95, P99 latency
- Availability percentage
2. **Resource Utilization**:
- Concurrent executions
- Memory utilization
- Duration distribution
- Throttles
3. **Business Metrics**:
- Orders processed
- Revenue per minute
- Customer activity
- Feature usage
4. **Errors and Alerts**:
- Error count by type
- Active alarms
- DLQ message count
- Failed transactions
### CloudWatch Dashboard CDK
```typescript
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
dashboardName: 'order-service',
});
dashboard.addWidgets(
// Row 1: Overview
new cloudwatch.GraphWidget({
title: 'Invocations',
left: [fn.metricInvocations()],
}),
new cloudwatch.SingleValueWidget({
title: 'Error Rate',
metrics: [
new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
],
}),
new cloudwatch.GraphWidget({
title: 'Latency (p50, p95, p99)',
left: [
fn.metricDuration({ statistic: 'p50', label: 'p50' }),
fn.metricDuration({ statistic: 'p95', label: 'p95' }),
fn.metricDuration({ statistic: 'p99', label: 'p99' }),
],
})
);
// Row 2: Errors
dashboard.addWidgets(
new cloudwatch.LogQueryWidget({
title: 'Recent Errors',
logGroupNames: [fn.logGroup.logGroupName],
queryLines: [
'fields @timestamp, @message',
'filter level = "ERROR"',
'sort @timestamp desc',
'limit 20',
],
})
);
```
## Monitoring Serverless Architectures
### End-to-End Monitoring
**Monitor the entire flow**:
```
API Gateway → Lambda → DynamoDB → EventBridge → Lambda
     ↓          ↓          ↓           ↓           ↓
  Metrics    Traces    Metrics     Metrics       Logs
```
**Key metrics per service**:
| Service | Key Metrics |
|---------|-------------|
| API Gateway | Count, 4XXError, 5XXError, Latency, CacheHitCount |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors |
| SQS | NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage |
| EventBridge | Invocations, FailedInvocations, TriggeredRules |
| Step Functions | ExecutionsStarted, ExecutionsFailed, ExecutionTime |
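**Alarming beyond Lambda** (a hedged sketch): the same CDK alarm pattern applies to the other services in the table. The example assumes `queue` is an `sqs.Queue` and `api` is an `apigateway.RestApi` defined elsewhere in the stack:
```typescript
// Oldest message has waited more than 5 minutes (backlog building up)
new cloudwatch.Alarm(this, 'QueueBacklogAlarm', {
  metric: queue.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 300, // seconds
  evaluationPeriods: 1,
  alarmDescription: 'Messages are aging in the queue',
});

// API Gateway is returning 5XX responses
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
  metric: api.metricServerError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'API is returning server errors',
});
```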
### Synthetic Monitoring
**Use CloudWatch Synthetics for API monitoring**:
```typescript
import { Canary, Code, Runtime, Schedule, Test } from '@aws-cdk/aws-synthetics-alpha';
new Canary(this, 'ApiCanary', {
canaryName: 'api-health-check',
schedule: Schedule.rate(Duration.minutes(5)),
test: Test.custom({
code: Code.fromInline(`
const synthetics = require('Synthetics');
const apiCanaryBlueprint = async function () {
const response = await synthetics.executeHttpStep('Verify API', {
url: 'https://api.example.com/health',
method: 'GET',
});
return response.statusCode === 200 ? 'success' : 'failure';
};
exports.handler = async () => {
return await apiCanaryBlueprint();
};
`),
handler: 'index.handler',
}),
runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
});
```
## OpenTelemetry Integration
### Amazon Distro for OpenTelemetry (ADOT)
**Use ADOT for vendor-neutral observability**:
```typescript
// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
this,
'AdotLayer',
`arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);
new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
layers: [adotLayer],
tracing: lambda.Tracing.ACTIVE,
environment: {
AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
},
});
```
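**Collector configuration**: the `OPENTELEMETRY_COLLECTOR_CONFIG_FILE` variable above points at a config file packaged with the function. A minimal sketch of such a `collector.yaml`, receiving OTLP data and exporting traces to X-Ray; treat it as an illustrative starting point, since receivers and exporters depend on the chosen backend:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  awsxray:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
```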
**Benefits of ADOT**:
- Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
- Automatic instrumentation
- Consistent format across services
- Export to multiple backends
## Best Practices Summary
### Metrics
- ✅ Use CloudWatch Embedded Metric Format (EMF)
- ✅ Track business metrics, not just technical metrics
- ✅ Set alarms on error rate, latency, and throughput
- ✅ Use p99 for latency, not average
- ✅ Create dashboards for key services
### Logging
- ✅ Use structured JSON logging
- ✅ Include correlation IDs (request ID, trace ID)
- ✅ Use appropriate log levels
- ✅ Never log sensitive data (PII, secrets)
- ✅ Use CloudWatch Logs Insights for analysis
### Tracing
- ✅ Enable X-Ray tracing on all services
- ✅ Instrument AWS SDK calls
- ✅ Add custom annotations for business context
- ✅ Use service map to understand dependencies
- ✅ Correlate traces with logs and metrics
### Alerting
- ✅ Alert on customer-impacting issues
- ✅ Tune thresholds to reduce false positives
- ✅ Use composite alarms to reduce noise
- ✅ Include clear remediation steps
- ✅ Escalate critical alarms appropriately
### Tools
- ✅ Use Lambda Powertools for unified observability
- ✅ Use CloudWatch ServiceLens for service view
- ✅ Use Synthetics for proactive monitoring
- ✅ Consider ADOT for vendor-neutral observability