Serverless Observability Best Practices
Comprehensive observability patterns for serverless applications based on AWS best practices.
Three Pillars of Observability
Metrics
Numeric data measured at intervals (time series)
- Request rate, error rate, duration
- CPU%, memory%, disk%
- Custom business metrics
- Service Level Indicators (SLIs)
Logs
Timestamped records of discrete events
- Application events and errors
- State transformations
- Debugging information
- Audit trails
Traces
A single request's journey across services
- Request flow through distributed system
- Service dependencies
- Latency breakdown
- Error propagation
Metrics
CloudWatch Metrics for Lambda
Out-of-the-box metrics (automatically available):
- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)
CDK Configuration:
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
});
// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
metric: fn.metricErrors({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
});
new cloudwatch.Alarm(this, 'DurationAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
});
Custom Metrics
Use CloudWatch Embedded Metric Format (EMF):
export const handler = async (event: any) => {
const startTime = Date.now();
try {
const result = await processOrder(event);
// Emit custom metrics
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName', 'Operation']],
Metrics: [
{ Name: 'ProcessingTime', Unit: 'Milliseconds' },
{ Name: 'OrderValue', Unit: 'None' },
],
}],
},
ServiceName: 'OrderService',
Operation: 'ProcessOrder',
ProcessingTime: Date.now() - startTime,
OrderValue: result.amount,
}));
return result;
} catch (error) {
// Emit error metric
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName']],
Metrics: [{ Name: 'Errors', Unit: 'Count' }],
}],
},
ServiceName: 'OrderService',
Errors: 1,
}));
throw error;
}
};
Using Lambda Powertools:
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const metrics = new Metrics({
namespace: 'MyApp',
serviceName: 'OrderService',
});
export const handler = async (event: any) => {
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
metrics.addMetric('OrderValue', MetricUnits.None, result.amount);
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
Logging
Structured Logging
Use JSON format for logs:
// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
const startTime = Date.now();
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
orderId: event.orderId,
customerId: event.customerId,
timestamp: new Date().toISOString(),
requestId: context.awsRequestId,
}));
try {
const result = await processOrder(event);
console.log(JSON.stringify({
level: 'INFO',
message: 'Order processed successfully',
orderId: event.orderId,
duration: Date.now() - startTime,
timestamp: new Date().toISOString(),
}));
return result;
} catch (error) {
console.error(JSON.stringify({
level: 'ERROR',
message: 'Order processing failed',
orderId: event.orderId,
error: {
name: error.name,
message: error.message,
stack: error.stack,
},
timestamp: new Date().toISOString(),
}));
throw error;
}
};
// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);
Using Lambda Powertools Logger:
import { Logger } from '@aws-lambda-powertools/logger';
const logger = new Logger({
serviceName: 'OrderService',
logLevel: 'INFO',
});
export const handler = async (event: any, context: Context) => {
logger.addContext(context);
logger.info('Processing order', {
orderId: event.orderId,
customerId: event.customerId,
});
try {
const result = await processOrder(event);
logger.info('Order processed', {
orderId: event.orderId,
amount: result.amount,
});
return result;
} catch (error) {
logger.error('Order processing failed', {
orderId: event.orderId,
error,
});
throw error;
}
};
Log Levels
Use appropriate log levels:
- ERROR: Errors requiring immediate attention
- WARN: Warnings or recoverable errors
- INFO: Important business events
- DEBUG: Detailed debugging information (disable in production)
const logger = new Logger({
serviceName: 'OrderService',
logLevel: process.env.LOG_LEVEL || 'INFO',
});
logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });
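In CDK, the log level can then be set per environment through the function's environment variables, so DEBUG logging stays off in production. A minimal sketch (the variable name matches the snippet above):
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  environment: {
    LOG_LEVEL: 'INFO', // e.g. 'DEBUG' in dev, 'INFO' or 'WARN' in production
  },
});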
Log Insights Queries
Common CloudWatch Logs Insights queries:
# Find errors in last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc
# Calculate p99 latency
stats percentile(duration, 99) by serviceName
# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50
# Track specific customer requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
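Frequently used queries can also be saved as Logs Insights query definitions so they are one click away during an incident. A CDK sketch (the name and log group are illustrative, reusing the fn construct from earlier snippets):
import * as logs from 'aws-cdk-lib/aws-logs';
new logs.QueryDefinition(this, 'RecentErrorsQuery', {
  queryDefinitionName: 'order-service/recent-errors',
  logGroups: [fn.logGroup],
  queryString: new logs.QueryString({
    fields: ['@timestamp', '@message', 'level', 'error.message'],
    filter: 'level = "ERROR"',
    sort: '@timestamp desc',
    limit: 100,
  }),
});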
Tracing
Enable X-Ray Tracing
Configure X-Ray for Lambda:
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});
// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
deployOptions: {
tracingEnabled: true,
},
});
// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
definition,
tracingEnabled: true,
});
Instrument application code:
import { captureAWSv3Client } from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));
// Custom segments
import AWSXRay from 'aws-xray-sdk-core';
export const handler = async (event: any) => {
const segment = AWSXRay.getSegment();
// Custom subsegment
const subsegment = segment.addNewSubsegment('ProcessOrder');
try {
// Add annotations (indexed for filtering)
subsegment.addAnnotation('orderId', event.orderId);
subsegment.addAnnotation('customerId', event.customerId);
// Add metadata (not indexed, detailed info)
subsegment.addMetadata('orderDetails', event);
const result = await processOrder(event);
subsegment.addAnnotation('status', 'success');
subsegment.close();
return result;
} catch (error) {
subsegment.addError(error);
subsegment.close();
throw error;
}
};
Using Lambda Powertools Tracer:
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';
const tracer = new Tracer({ serviceName: 'OrderService' });
// Wrap the SDK client once at initialization so every call is traced as a subsegment
const dynamodb = tracer.captureAWSv3Client(new DynamoDBClient({}));
export const handler = async (event: any) => {
  const result = await dynamodb.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { orderId: { S: event.orderId } },
  }));
  // Custom annotation (indexed for filtering)
  tracer.putAnnotation('orderId', event.orderId);
  // Custom metadata (not indexed, detailed info)
  tracer.putMetadata('orderDetails', event);
  return result;
};
Service Map
Visualize service dependencies with X-Ray:
- Shows service-to-service communication
- Identifies latency bottlenecks
- Highlights error rates between services
- Tracks downstream dependencies
Distributed Tracing Best Practices
- Enable tracing everywhere: Lambda, API Gateway, Step Functions
- Use annotations for filtering: Indexed fields for queries
- Use metadata for details: Non-indexed detailed information
- Sample appropriately: 100% for low traffic, sampled for high traffic (see the sampling rule sketch after this list)
- Correlate with logs: Include trace ID in log entries
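The sampling guidance above can be encoded as an X-Ray sampling rule. A minimal CDK sketch using the L1 CfnSamplingRule construct; the rule name, rate, and matching values are illustrative assumptions:
import * as xray from 'aws-cdk-lib/aws-xray';
new xray.CfnSamplingRule(this, 'OrderServiceSampling', {
  samplingRule: {
    ruleName: 'OrderServiceSampling',
    priority: 100,
    reservoirSize: 1, // always trace at least one request per second
    fixedRate: 0.05, // then sample 5% of remaining traffic
    serviceName: 'OrderService',
    serviceType: '*',
    host: '*',
    httpMethod: '*',
    urlPath: '*',
    resourceArn: '*',
    version: 1,
  },
});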
Unified Observability
Correlation Between Pillars
Include trace ID in logs:
export const handler = async (event: any, context: Context) => {
const traceId = process.env._X_AMZN_TRACE_ID;
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
traceId,
requestId: context.awsRequestId,
orderId: event.orderId,
}));
};
CloudWatch ServiceLens
Unified view of traces and metrics:
- Automatically correlates X-Ray traces with CloudWatch metrics
- Shows service map with metrics overlay
- Identifies performance and availability issues
- Provides end-to-end request view
Lambda Powertools Integration
All three pillars in one:
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });
export const handler = async (event: any, context: Context) => {
// Automatically adds trace context to logs
logger.addContext(context);
logger.info('Processing order', { orderId: event.orderId });
// Add trace annotations
tracer.putAnnotation('orderId', event.orderId);
// Add metrics
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);
logger.info('Order processed', { orderId: event.orderId });
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
logger.error('Processing failed', { orderId: event.orderId, error });
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
Alerting
Effective Alerting Strategy
Alert on what matters:
- Critical: Customer-impacting issues (errors, high latency)
- Warning: Approaching thresholds (80% capacity)
- Info: Trends and anomalies (cost spikes)
Alarm fatigue prevention:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Set appropriate evaluation periods
- Include clear remediation steps (for example a runbook link, as in the sketch below)
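Alarms only help if they reach someone. As a sketch (the topic, email address, and runbook URL are placeholders), alarm actions can publish to an SNS topic, with the remediation hint carried in the alarm description:
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
const alertTopic = new sns.Topic(this, 'AlertTopic');
alertTopic.addSubscription(new subscriptions.EmailSubscription('oncall@example.com'));
const criticalErrorAlarm = new cloudwatch.Alarm(this, 'CriticalErrorAlarm', {
  metric: fn.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'Order errors > 10 in 5 minutes. Runbook: https://wiki.example.com/runbooks/orders',
});
criticalErrorAlarm.addAlarmAction(new actions.SnsAction(alertTopic));
criticalErrorAlarm.addOkAction(new actions.SnsAction(alertTopic));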
CloudWatch Alarms
Common alarm patterns:
// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
metric: new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
threshold: 1, // 1% error rate
evaluationPeriods: 2,
alarmDescription: 'Error rate exceeded 1%',
});
// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
alarmDescription: 'p99 latency exceeded 1 second',
});
// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
metric: fn.metricConcurrentExecutions({
statistic: 'Maximum',
}),
threshold: 800, // 80% of 1000 default limit
evaluationPeriods: 1,
alarmDescription: 'Approaching concurrency limit',
});
Composite Alarms
Reduce alert noise:
const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
metric: fn.metricErrors(),
threshold: 10,
evaluationPeriods: 1,
});
const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
metric: fn.metricThrottles(),
threshold: 5,
evaluationPeriods: 1,
});
const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
metric: fn.metricDuration({ statistic: 'p99' }),
threshold: 2000,
evaluationPeriods: 2,
});
// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'order-service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
throttleAlarm,
latencyAlarm
),
alarmDescription: 'Overall service health degraded',
});
Dashboard Best Practices
Service Dashboard Layout
Recommended sections:
- Overview:
  - Total invocations
  - Error rate percentage
  - P50, P95, P99 latency
  - Availability percentage
- Resource Utilization:
  - Concurrent executions
  - Memory utilization
  - Duration distribution
  - Throttles
- Business Metrics:
  - Orders processed
  - Revenue per minute
  - Customer activity
  - Feature usage
- Errors and Alerts:
  - Error count by type
  - Active alarms
  - DLQ message count
  - Failed transactions
CloudWatch Dashboard CDK
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
dashboardName: 'order-service',
});
dashboard.addWidgets(
// Row 1: Overview
new cloudwatch.GraphWidget({
title: 'Invocations',
left: [fn.metricInvocations()],
}),
new cloudwatch.SingleValueWidget({
title: 'Error Rate',
metrics: [
new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
],
}),
new cloudwatch.GraphWidget({
title: 'Latency (p50, p95, p99)',
left: [
fn.metricDuration({ statistic: 'p50', label: 'p50' }),
fn.metricDuration({ statistic: 'p95', label: 'p95' }),
fn.metricDuration({ statistic: 'p99', label: 'p99' }),
],
})
);
// Row 2: Errors
dashboard.addWidgets(
new cloudwatch.LogQueryWidget({
title: 'Recent Errors',
logGroupNames: [fn.logGroup.logGroupName],
queryLines: [
'fields @timestamp, @message',
'filter level = "ERROR"',
'sort @timestamp desc',
'limit 20',
],
})
);
Monitoring Serverless Architectures
End-to-End Monitoring
Monitor the entire flow:
API Gateway → Lambda → DynamoDB → EventBridge → Lambda
Each hop in the chain emits its own metrics, logs, and traces; monitor all of them, not just the entry point.
Key metrics per service:
| Service | Key Metrics |
|---|---|
| API Gateway | Count, 4XXError, 5XXError, Latency, CacheHitCount |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors |
| SQS | NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage |
| EventBridge | Invocations, FailedInvocations, TriggeredRules |
| Step Functions | ExecutionsStarted, ExecutionsFailed, ExecutionTime |
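Most of these metrics are exposed as convenience methods on the corresponding CDK constructs, so the alarm patterns shown for Lambda apply across the whole flow. A sketch, assuming existing table (DynamoDB) and dlq (SQS dead-letter queue) constructs in the same stack:
// DynamoDB: requests rejected with client-side (HTTP 400) errors
new cloudwatch.Alarm(this, 'DynamoUserErrorsAlarm', {
  metric: table.metricUserErrors({ period: Duration.minutes(5) }),
  threshold: 5,
  evaluationPeriods: 1,
});
// SQS DLQ: messages sitting unprocessed (age is reported in seconds)
new cloudwatch.Alarm(this, 'DlqAgeAlarm', {
  metric: dlq.metricApproximateAgeOfOldestMessage({ period: Duration.minutes(5) }),
  threshold: 300, // 5 minutes
  evaluationPeriods: 1,
});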
Synthetic Monitoring
Use CloudWatch Synthetics for API monitoring:
import { Canary, Test, Code, Schedule, Runtime } from '@aws-cdk/aws-synthetics-alpha';
new Canary(this, 'ApiCanary', {
  canaryName: 'api-health-check',
  schedule: Schedule.rate(Duration.minutes(5)),
  test: Test.custom({
    code: Code.fromInline(`
      const synthetics = require('Synthetics');
      const apiCanaryBlueprint = async function () {
        // The step (and the canary run) fails if the request does not succeed
        await synthetics.executeHttpStep('Verify API', {
          hostname: 'api.example.com',
          path: '/health',
          method: 'GET',
          port: 443,
          protocol: 'https:',
        });
      };
      exports.handler = async () => {
        return await apiCanaryBlueprint();
      };
    `),
    handler: 'index.handler',
  }),
  runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
});
OpenTelemetry Integration
Amazon Distro for OpenTelemetry (ADOT)
Use ADOT for vendor-neutral observability:
// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
this,
'AdotLayer',
`arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);
new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
layers: [adotLayer],
tracing: lambda.Tracing.ACTIVE,
environment: {
AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
},
});
Benefits of ADOT:
- Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
- Automatic instrumentation
- Consistent format across services
- Export to multiple backends
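With the ADOT layer in place, application code can create spans through the vendor-neutral @opentelemetry/api package instead of the X-Ray SDK. A minimal sketch (processOrder and the attribute names are illustrative):
import { trace, SpanStatusCode } from '@opentelemetry/api';
const otelTracer = trace.getTracer('order-service');
export const handler = async (event: any) => {
  // startActiveSpan runs the callback inside a new span and returns its result
  return otelTracer.startActiveSpan('ProcessOrder', async (span) => {
    try {
      span.setAttribute('order.id', event.orderId);
      return await processOrder(event);
    } catch (error: any) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
};
Because only the API package is imported, the same instrumentation can be exported to X-Ray or any other OTLP-compatible backend by changing the collector configuration rather than the code.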
Best Practices Summary
Metrics
- ✅ Use CloudWatch Embedded Metric Format (EMF)
- ✅ Track business metrics, not just technical metrics
- ✅ Set alarms on error rate, latency, and throughput
- ✅ Use p99 for latency, not average
- ✅ Create dashboards for key services
Logging
- ✅ Use structured JSON logging
- ✅ Include correlation IDs (request ID, trace ID)
- ✅ Use appropriate log levels
- ✅ Never log sensitive data (PII, secrets)
- ✅ Use CloudWatch Logs Insights for analysis
Tracing
- ✅ Enable X-Ray tracing on all services
- ✅ Instrument AWS SDK calls
- ✅ Add custom annotations for business context
- ✅ Use service map to understand dependencies
- ✅ Correlate traces with logs and metrics
Alerting
- ✅ Alert on customer-impacting issues
- ✅ Tune thresholds to reduce false positives
- ✅ Use composite alarms to reduce noise
- ✅ Include clear remediation steps
- ✅ Escalate critical alarms appropriately
Tools
- ✅ Use Lambda Powertools for unified observability
- ✅ Use CloudWatch ServiceLens for service view
- ✅ Use Synthetics for proactive monitoring
- ✅ Consider ADOT for vendor-neutral observability