# Serverless Observability Best Practices

Comprehensive observability patterns for serverless applications, based on AWS best practices.

## Table of Contents

- [Three Pillars of Observability](#three-pillars-of-observability)
- [Metrics](#metrics)
- [Logging](#logging)
- [Tracing](#tracing)
- [Unified Observability](#unified-observability)
- [Alerting](#alerting)
- [Dashboard Best Practices](#dashboard-best-practices)
- [Monitoring Serverless Architectures](#monitoring-serverless-architectures)
- [OpenTelemetry Integration](#opentelemetry-integration)
- [Best Practices Summary](#best-practices-summary)

## Three Pillars of Observability

### Metrics

**Numeric data measured at intervals (time series)**

- Request rate, error rate, duration
- CPU%, memory%, disk%
- Custom business metrics
- Service Level Indicators (SLIs)

### Logs

**Timestamped records of discrete events**

- Application events and errors
- State transitions
- Debugging information
- Audit trails

### Traces

**A single request's journey across services**

- Request flow through a distributed system
- Service dependencies
- Latency breakdown
- Error propagation

## Metrics

### CloudWatch Metrics for Lambda

**Out-of-the-box metrics** (automatically available):

- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)

**CDK Configuration**:

```typescript
const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
});

// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
  metric: fn.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
});

new cloudwatch.Alarm(this, 'DurationAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
});
```

### Custom Metrics

**Use CloudWatch Embedded Metric Format (EMF)**:

```typescript
export const handler = async (event: any) => {
  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    // Emit custom metrics
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName', 'Operation']],
          Metrics: [
            { Name: 'ProcessingTime', Unit: 'Milliseconds' },
            { Name: 'OrderValue', Unit: 'None' },
          ],
        }],
      },
      ServiceName: 'OrderService',
      Operation: 'ProcessOrder',
      ProcessingTime: Date.now() - startTime,
      OrderValue: result.amount,
    }));

    return result;
  } catch (error) {
    // Emit error metric (Timestamp is required by the EMF specification)
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName']],
          Metrics: [{ Name: 'Errors', Unit: 'Count' }],
        }],
      },
      ServiceName: 'OrderService',
      Errors: 1,
    }));

    throw error;
  }
};
```

**Using Lambda Powertools**:

```typescript
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({
  namespace: 'MyApp',
  serviceName: 'OrderService',
});

export const handler = async (event: any) => {
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
    metrics.addMetric('OrderValue', MetricUnits.None, result.amount);

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    throw error;
  } finally {
    // Flush buffered metrics as a single EMF log entry
    metrics.publishStoredMetrics();
  }
};
```

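If you use Middy, Powertools also ships a `logMetrics` middleware that flushes metrics automatically, even when the handler throws. A minimal sketch, assuming Powertools v1-style exports, the `@middy/core` package, and the `processOrder` function from the examples above:

```typescript
import middy from '@middy/core';
import { Metrics, MetricUnits, logMetrics } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });

const lambdaHandler = async (event: any) => {
  // Metrics added here are published by the middleware after the handler returns or throws
  metrics.addMetric('Invocation', MetricUnits.Count, 1);
  return processOrder(event);
};

// captureColdStartMetric emits a ColdStart metric on the first invocation of each environment
export const handler = middy(lambdaHandler).use(
  logMetrics(metrics, { captureColdStartMetric: true })
);
```
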
## Logging

### Structured Logging

**Use JSON format for logs**:

```typescript
// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
  const startTime = Date.now();

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    orderId: event.orderId,
    customerId: event.customerId,
    timestamp: new Date().toISOString(),
    requestId: context.awsRequestId,
  }));

  try {
    const result = await processOrder(event);

    console.log(JSON.stringify({
      level: 'INFO',
      message: 'Order processed successfully',
      orderId: event.orderId,
      duration: Date.now() - startTime,
      timestamp: new Date().toISOString(),
    }));

    return result;
  } catch (error) {
    console.error(JSON.stringify({
      level: 'ERROR',
      message: 'Order processing failed',
      orderId: event.orderId,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
      },
      timestamp: new Date().toISOString(),
    }));

    throw error;
  }
};

// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);
```

**Using Lambda Powertools Logger**:

```typescript
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: 'INFO',
});

export const handler = async (event: any, context: Context) => {
  // Adds Lambda context (request ID, function name, cold start) to every log entry
  logger.addContext(context);

  logger.info('Processing order', {
    orderId: event.orderId,
    customerId: event.customerId,
  });

  try {
    const result = await processOrder(event);

    logger.info('Order processed', {
      orderId: event.orderId,
      amount: result.amount,
    });

    return result;
  } catch (error) {
    logger.error('Order processing failed', {
      orderId: event.orderId,
      error,
    });

    throw error;
  }
};
```

### Log Levels

**Use appropriate log levels**:

- **ERROR**: Errors requiring immediate attention
- **WARN**: Warnings or recoverable errors
- **INFO**: Important business events
- **DEBUG**: Detailed debugging information (disable in production)

```typescript
const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: process.env.LOG_LEVEL || 'INFO',
});

logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });
```

### Log Insights Queries

**Common CloudWatch Logs Insights queries**:

```
# Find errors in the last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc

# Calculate p99 latency
stats percentile(duration, 99) by serviceName

# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50

# Track a specific customer's requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
```

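The same queries can be run programmatically, for example from a scheduled audit Lambda. A minimal sketch using `StartQueryCommand` and `GetQueryResultsCommand` from `@aws-sdk/client-cloudwatch-logs`; the log group name is an illustrative placeholder:

```typescript
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

// Run a Logs Insights query over the last hour and poll until it completes
export async function findRecentErrors() {
  const nowSeconds = Math.floor(Date.now() / 1000);

  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: '/aws/lambda/order-service', // placeholder log group
    startTime: nowSeconds - 3600,
    endTime: nowSeconds,
    queryString:
      'fields @timestamp, @message | filter level = "ERROR" | sort @timestamp desc | limit 100',
  }));

  while (true) {
    const { status, results } = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (status === 'Complete') return results ?? [];
    if (status === 'Failed' || status === 'Cancelled') throw new Error(`Query ${status}`);
    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait before polling again
  }
}
```
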
## Tracing

### Enable X-Ray Tracing

**Configure X-Ray for Lambda**:

```typescript
const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});

// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
  deployOptions: {
    tracingEnabled: true,
  },
});

// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
  definition,
  tracingEnabled: true,
});
```

**Instrument application code**:

```typescript
import AWSXRay from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Wrap AWS SDK clients so their calls appear as subsegments
const client = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

export const handler = async (event: any) => {
  const segment = AWSXRay.getSegment();

  // Custom subsegment
  const subsegment = segment.addNewSubsegment('ProcessOrder');

  try {
    // Add annotations (indexed for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('customerId', event.customerId);

    // Add metadata (not indexed, detailed info)
    subsegment.addMetadata('orderDetails', event);

    const result = await processOrder(event);

    subsegment.addAnnotation('status', 'success');
    subsegment.close();

    return result;
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    throw error;
  }
};
```

**Using Lambda Powertools Tracer**:

```typescript
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';

const tracer = new Tracer({ serviceName: 'OrderService' });

// Calls made through the wrapped client are captured and traced automatically
const dynamodb = tracer.captureAWSv3Client(new DynamoDBClient({}));

export const handler = async (event: any) => {
  const result = await dynamodb.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { orderId: { S: event.orderId } },
  }));

  // Custom annotation and metadata
  tracer.putAnnotation('orderId', event.orderId);
  tracer.putMetadata('orderDetails', event);

  return result;
};
```

### Service Map

**Visualize service dependencies** with X-Ray:

- Shows service-to-service communication
- Identifies latency bottlenecks
- Highlights error rates between services
- Tracks downstream dependencies

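The data behind the console service map is also available from the X-Ray API, which is useful for custom reports or alerts. A minimal sketch, assuming the `@aws-sdk/client-xray` package; the one-hour window and the logging behavior are illustrative:

```typescript
import { XRayClient, GetServiceGraphCommand } from '@aws-sdk/client-xray';

const xray = new XRayClient({});

// Fetch the service graph for the last hour and log any node that reported errors
export async function reportUnhealthyEdges() {
  const now = new Date();

  const { Services } = await xray.send(new GetServiceGraphCommand({
    StartTime: new Date(now.getTime() - 60 * 60 * 1000),
    EndTime: now,
  }));

  for (const service of Services ?? []) {
    const errorCount = service.SummaryStatistics?.ErrorStatistics?.TotalCount ?? 0;
    if (errorCount > 0) {
      console.log(`${service.Name}: ${errorCount} errors in the last hour`);
    }
  }
}
```
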
### Distributed Tracing Best Practices

1. **Enable tracing everywhere**: Lambda, API Gateway, Step Functions
2. **Use annotations for filtering**: Indexed fields for queries
3. **Use metadata for details**: Non-indexed detailed information
4. **Sample appropriately**: 100% for low traffic, sampled for high traffic
5. **Correlate with logs**: Include trace ID in log entries

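To make item 4 concrete, sampling can be managed centrally with an X-Ray sampling rule. A minimal CDK sketch using the L1 `CfnSamplingRule` construct; the rule name, rate, and matching values are illustrative assumptions:

```typescript
import * as xray from 'aws-cdk-lib/aws-xray';

// Sample 5% of requests hitting the order service, with a small guaranteed
// reservoir per second, instead of tracing 100% of high-volume traffic.
new xray.CfnSamplingRule(this, 'OrderServiceSampling', {
  samplingRule: {
    ruleName: 'order-service-5pct',
    priority: 100,
    fixedRate: 0.05,    // sample 5% of requests beyond the reservoir
    reservoirSize: 1,   // always trace at least 1 request per second
    serviceName: 'OrderService',
    serviceType: '*',
    host: '*',
    httpMethod: '*',
    urlPath: '*',
    resourceArn: '*',
    version: 1,
  },
});
```
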
## Unified Observability

### Correlation Between Pillars

**Include trace ID in logs**:

```typescript
export const handler = async (event: any, context: Context) => {
  const traceId = process.env._X_AMZN_TRACE_ID;

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    traceId,
    requestId: context.awsRequestId,
    orderId: event.orderId,
  }));
};
```

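The `_X_AMZN_TRACE_ID` value is a semicolon-delimited header (`Root=...;Parent=...;Sampled=...`). If you only want the root trace ID for correlation, a small helper can extract it; a sketch based on that documented header format:

```typescript
// Extracts the root trace ID from a trace header such as
// "Root=1-63441c4a-abcdef012345678912345678;Parent=1234abcd;Sampled=1"
export function extractRootTraceId(traceHeader?: string): string | undefined {
  if (!traceHeader) {
    return undefined;
  }
  const rootPart = traceHeader
    .split(';')
    .map((part) => part.trim())
    .find((part) => part.startsWith('Root='));
  return rootPart?.substring('Root='.length);
}

// Usage: include only the root trace ID in each structured log entry
const rootTraceId = extractRootTraceId(process.env._X_AMZN_TRACE_ID);
console.log(JSON.stringify({ level: 'INFO', message: 'Processing order', traceId: rootTraceId }));
```
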
### CloudWatch ServiceLens

**Unified view of traces and metrics**:

- Automatically correlates X-Ray traces with CloudWatch metrics
- Shows service map with metrics overlay
- Identifies performance and availability issues
- Provides end-to-end request view

### Lambda Powertools Integration

**All three pillars in one**:

```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });

export const handler = async (event: any, context: Context) => {
  // Automatically adds Lambda context to logs
  logger.addContext(context);

  logger.info('Processing order', { orderId: event.orderId });

  // Add trace annotations
  tracer.putAnnotation('orderId', event.orderId);

  // Add metrics
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);

    logger.info('Order processed', { orderId: event.orderId });

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    logger.error('Processing failed', { orderId: event.orderId, error });
    throw error;
  } finally {
    metrics.publishStoredMetrics();
  }
};
```

## Alerting

### Effective Alerting Strategy

**Alert on what matters**:

- **Critical**: Customer-impacting issues (errors, high latency)
- **Warning**: Approaching thresholds (80% capacity)
- **Info**: Trends and anomalies (cost spikes)

**Alarm fatigue prevention**:

- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Set appropriate evaluation periods
- Include clear remediation steps

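Alarms only reduce fatigue if they reach the right audience with enough context to act. One common pattern is to fan critical alarms out through a dedicated SNS topic and put the runbook link in the alarm description. A minimal CDK sketch, assuming the `fn` Lambda function from the earlier snippets; the topic name, threshold, and runbook URL are illustrative:

```typescript
import * as sns from 'aws-cdk-lib/aws-sns';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Dedicated topic for critical, customer-impacting alarms
// (warning-level alarms would go to a separate, lower-urgency topic)
const criticalTopic = new sns.Topic(this, 'CriticalAlerts');

const criticalErrorAlarm = new cloudwatch.Alarm(this, 'CriticalErrorAlarm', {
  metric: fn.metricErrors({ statistic: 'Sum' }),
  threshold: 10,
  evaluationPeriods: 1,
  // Put the remediation link where responders will see it
  alarmDescription: 'Error spike in OrderService. Runbook: https://wiki.example.com/runbooks/order-errors',
});

// Notify when the alarm fires and again when it recovers
criticalErrorAlarm.addAlarmAction(new cw_actions.SnsAction(criticalTopic));
criticalErrorAlarm.addOkAction(new cw_actions.SnsAction(criticalTopic));
```
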
### CloudWatch Alarms

**Common alarm patterns**:

```typescript
// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
  metric: new cloudwatch.MathExpression({
    expression: 'errors / invocations * 100',
    usingMetrics: {
      errors: fn.metricErrors({ statistic: 'Sum' }),
      invocations: fn.metricInvocations({ statistic: 'Sum' }),
    },
  }),
  threshold: 1, // 1% error rate
  evaluationPeriods: 2,
  alarmDescription: 'Error rate exceeded 1%',
});

// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
  alarmDescription: 'p99 latency exceeded 1 second',
});

// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
  metric: fn.metricConcurrentExecutions({
    statistic: 'Maximum',
  }),
  threshold: 800, // 80% of 1000 default limit
  evaluationPeriods: 1,
  alarmDescription: 'Approaching concurrency limit',
});
```

### Composite Alarms

**Reduce alert noise**:

```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
  metric: fn.metricErrors(),
  threshold: 10,
  evaluationPeriods: 1,
});

const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
  metric: fn.metricThrottles(),
  threshold: 5,
  evaluationPeriods: 1,
});

const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
  metric: fn.metricDuration({ statistic: 'p99' }),
  threshold: 2000,
  evaluationPeriods: 2,
});

// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
  compositeAlarmName: 'order-service-health',
  alarmRule: cloudwatch.AlarmRule.anyOf(
    errorAlarm,
    throttleAlarm,
    latencyAlarm
  ),
  alarmDescription: 'Overall service health degraded',
});
```

## Dashboard Best Practices

### Service Dashboard Layout

**Recommended sections**:

1. **Overview**:
   - Total invocations
   - Error rate percentage
   - P50, P95, P99 latency
   - Availability percentage

2. **Resource Utilization**:
   - Concurrent executions
   - Memory utilization
   - Duration distribution
   - Throttles

3. **Business Metrics**:
   - Orders processed
   - Revenue per minute
   - Customer activity
   - Feature usage

4. **Errors and Alerts**:
   - Error count by type
   - Active alarms
   - DLQ message count
   - Failed transactions

### CloudWatch Dashboard CDK

```typescript
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
  dashboardName: 'order-service',
});

// Row 1: Overview
dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'Invocations',
    left: [fn.metricInvocations()],
  }),
  new cloudwatch.SingleValueWidget({
    title: 'Error Rate',
    metrics: [
      new cloudwatch.MathExpression({
        expression: 'errors / invocations * 100',
        usingMetrics: {
          errors: fn.metricErrors({ statistic: 'Sum' }),
          invocations: fn.metricInvocations({ statistic: 'Sum' }),
        },
      }),
    ],
  }),
  new cloudwatch.GraphWidget({
    title: 'Latency (p50, p95, p99)',
    left: [
      fn.metricDuration({ statistic: 'p50', label: 'p50' }),
      fn.metricDuration({ statistic: 'p95', label: 'p95' }),
      fn.metricDuration({ statistic: 'p99', label: 'p99' }),
    ],
  })
);

// Row 2: Errors
dashboard.addWidgets(
  new cloudwatch.LogQueryWidget({
    title: 'Recent Errors',
    logGroupNames: [fn.logGroup.logGroupName],
    queryLines: [
      'fields @timestamp, @message',
      'filter level = "ERROR"',
      'sort @timestamp desc',
      'limit 20',
    ],
  })
);
```

## Monitoring Serverless Architectures

### End-to-End Monitoring

**Monitor the entire flow**:

```
API Gateway → Lambda → DynamoDB → EventBridge → Lambda
     ↓           ↓          ↓           ↓           ↓
  Metrics     Traces     Metrics     Metrics      Logs
```

**Key metrics per service**:

| Service | Key Metrics |
|---------|-------------|
| API Gateway | Count, 4XXError, 5XXError, Latency, CacheHitCount |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors |
| SQS | NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage |
| EventBridge | Invocations, FailedInvocations, TriggeredRules |
| Step Functions | ExecutionsStarted, ExecutionsFailed, ExecutionTime |

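Most of these metrics can be alarmed on directly from CDK. For example, the age of the oldest message in a dead-letter queue is an early signal that failed events are piling up unprocessed. A short sketch that creates a DLQ and alarms on that metric; names and thresholds are illustrative:

```typescript
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Duration } from 'aws-cdk-lib';

const dlq = new sqs.Queue(this, 'OrderDlq', {
  retentionPeriod: Duration.days(14),
});

// Alarm when the oldest message in the DLQ has been waiting more than 15 minutes,
// which usually means failed events are not being redriven or investigated
new cloudwatch.Alarm(this, 'DlqAgeAlarm', {
  metric: dlq.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 15 * 60, // seconds
  evaluationPeriods: 1,
  alarmDescription: 'Messages are accumulating in the order DLQ',
});
```
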
### Synthetic Monitoring

**Use CloudWatch Synthetics for API monitoring**:

```typescript
import { Canary, Test, Code, Runtime, Schedule } from '@aws-cdk/aws-synthetics-alpha';

new Canary(this, 'ApiCanary', {
  canaryName: 'api-health-check',
  schedule: Schedule.rate(Duration.minutes(5)),
  runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
  test: Test.custom({
    code: Code.fromInline(`
      const synthetics = require('Synthetics');

      const apiCanaryBlueprint = async function () {
        // Fail the step (and the canary) if the health check does not return HTTP 200
        await synthetics.executeHttpStep('Verify API', {
          hostname: 'api.example.com',
          path: '/health',
          method: 'GET',
          protocol: 'https:',
        }, async (response) => {
          if (response.statusCode !== 200) {
            throw new Error('Health check failed with status ' + response.statusCode);
          }
        });
      };

      exports.handler = async () => {
        return await apiCanaryBlueprint();
      };
    `),
    handler: 'index.handler',
  }),
});
```

## OpenTelemetry Integration

### Amazon Distro for OpenTelemetry (ADOT)

**Use ADOT for vendor-neutral observability**:

```typescript
// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
  this,
  'AdotLayer',
  `arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);

new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  layers: [adotLayer],
  tracing: lambda.Tracing.ACTIVE,
  environment: {
    AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
    OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
  },
});
```

**Benefits of ADOT**:

- Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
- Automatic instrumentation
- Consistent format across services
- Export to multiple backends

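With the ADOT layer handling export, application code can stay on the vendor-neutral OpenTelemetry API. A minimal sketch, assuming the `@opentelemetry/api` package and the `processOrder` business function from earlier examples; the span and attribute names are illustrative:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// The tracer provider is supplied by the ADOT layer's auto-instrumentation,
// so this code depends only on the vendor-neutral API
const tracer = trace.getTracer('order-service');

export async function processOrderTraced(order: { orderId: string }) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.orderId);
      const result = await processOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
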
## Best Practices Summary

### Metrics

- ✅ Use CloudWatch Embedded Metric Format (EMF)
- ✅ Track business metrics, not just technical metrics
- ✅ Set alarms on error rate, latency, and throughput
- ✅ Use p99 for latency, not average
- ✅ Create dashboards for key services

### Logging

- ✅ Use structured JSON logging
- ✅ Include correlation IDs (request ID, trace ID)
- ✅ Use appropriate log levels
- ✅ Never log sensitive data (PII, secrets)
- ✅ Use CloudWatch Logs Insights for analysis

### Tracing

- ✅ Enable X-Ray tracing on all services
- ✅ Instrument AWS SDK calls
- ✅ Add custom annotations for business context
- ✅ Use service map to understand dependencies
- ✅ Correlate traces with logs and metrics

### Alerting

- ✅ Alert on customer-impacting issues
- ✅ Tune thresholds to reduce false positives
- ✅ Use composite alarms to reduce noise
- ✅ Include clear remediation steps
- ✅ Escalate critical alarms appropriately

### Tools

- ✅ Use Lambda Powertools for unified observability
- ✅ Use CloudWatch ServiceLens for service view
- ✅ Use Synthetics for proactive monitoring
- ✅ Consider ADOT for vendor-neutral observability