Initial commit

2025-11-29 18:25:52 +08:00
commit e840c60a06
19 changed files with 6852 additions and 0 deletions
--- a/commands/lambda-observability.md
+++ b/commands/lambda-observability.md
@@ -0,0 +1,575 @@
+---
+description: Set up advanced observability for Rust Lambda with OpenTelemetry, X-Ray, and structured logging
+---
+
+You are helping the user implement comprehensive observability for their Rust Lambda functions.
+
+## Your Task
+
+Guide the user through setting up production-grade observability including distributed tracing, metrics, and structured logging.
+
+## Observability Stack Options
+
+### Option 1: AWS X-Ray (Native AWS Solution)
+
+**Best for**:
+- AWS-native monitoring
+- Quick setup
+- CloudWatch integration
+- Basic distributed tracing needs
+
+#### Enable X-Ray in Lambda
+
+**Via cargo-lambda:**
+```bash
+cargo lambda deploy --tracing active
+```
+
+**Via SAM template:**
+```yaml
+Resources:
+  MyFunction:
+    Type: AWS::Serverless::Function
+    Properties:
+      Tracing: Active  # Enable X-Ray
+```
+
+**Via Terraform:**
+```hcl
+resource "aws_lambda_function" "function" {
+  # ... other config ...
+
+  tracing_config {
+    mode = "Active"
+  }
+}
+```
+
+#### X-Ray with xray-lite
+
+Add to `Cargo.toml`:
+```toml
+[dependencies]
+xray-lite = "0.1"
+xray-lite-aws-sdk = "0.1"  # AWS SDK integration imported in the example
+aws-config = "1"
+aws-sdk-dynamodb = "1"  # or other AWS services
+```
+
+Basic usage:
+```rust
+use aws_config::BehaviorVersion;
+use aws_sdk_dynamodb::types::AttributeValue;
+use lambda_runtime::{run, service_fn, Error, LambdaEvent};
+use xray_lite::{DaemonClient, SubsegmentContext};
+use xray_lite_aws_sdk::ContextExt as _;
+
+// NOTE: the xray-lite calls below follow the crate's README; confirm the names
+// against the version you pin.
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    // Lambda creates the parent segment; the function only adds subsegments.
+    // The trace ID changes per invocation, so build the context inside the handler.
+    let xray_client = DaemonClient::from_lambda_env()?;
+    let xray_context = SubsegmentContext::from_lambda_env(xray_client)?;
+
+    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
+    let dynamodb = aws_sdk_dynamodb::Client::new(&config);
+
+    // Attach an interceptor so this DynamoDB call is recorded as a subsegment
+    let result = dynamodb
+        .get_item()
+        .table_name("MyTable")
+        .key("id", AttributeValue::S("123".to_string()))
+        .customize()
+        .interceptor(xray_context.intercept_operation("DynamoDB", "GetItem"))
+        .send()
+        .await?;
+
+    Ok(Response { data: result })
+}
+```
+
+### Option 2: OpenTelemetry (Vendor-Neutral)
+
+**Best for**:
+- Multi-vendor monitoring
+- Portability across platforms
+- Advanced telemetry needs
+- Custom metrics and traces
+
+#### Setup OpenTelemetry
+
+Add to `Cargo.toml`:
+```toml
+[dependencies]
+lambda_runtime = "0.13"
+lambda-otel-lite = "0.1"  # Lightweight OpenTelemetry for Lambda
+opentelemetry = "0.22"
+opentelemetry-otlp = "0.15"
+opentelemetry_sdk = "0.22"
+tracing = "0.1"
+tracing-opentelemetry = "0.23"
+tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
+```
+
+#### Basic OpenTelemetry Setup
+
+```rust
+use lambda_otel_lite::{init_telemetry, HttpTracerProviderBuilder};
+use lambda_runtime::{run, service_fn, Error, LambdaEvent};
+use opentelemetry::trace::TracerProvider;
+use tracing::{info, instrument};
+use tracing_subscriber::layer::SubscriberExt;
+
+#[tokio::main]
+async fn main() -> Result<(), Error> {
+    // Initialize OpenTelemetry
+    let tracer_provider = HttpTracerProviderBuilder::default()
+        .with_default_text_map_propagator()
+        .with_stdout_client()  // For testing, use OTLP for production
+        .build()?;
+
+    let tracer = tracer_provider.tracer("my-rust-lambda");
+
+    // Setup tracing subscriber
+    let telemetry_layer = tracing_opentelemetry::layer()
+        .with_tracer(tracer);
+
+    let subscriber = tracing_subscriber::registry()
+        .with(tracing_subscriber::EnvFilter::from_default_env())
+        .with(tracing_subscriber::fmt::layer())
+        .with(telemetry_layer);
+
+    tracing::subscriber::set_global_default(subscriber)?;
+
+    run(service_fn(function_handler)).await
+}
+
+#[instrument(skip(event))]
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    info!(request_id = %event.context.request_id, "Processing request");
+
+    let result = process_data(&event.payload).await?;
+
+    Ok(Response { result })
+}
+
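+// `#[instrument]` records each argument via `Debug`, so `Request` must implement it
+#[instrument]
+async fn process_data(request: &Request) -> Result<Data, Error> {
+    info!("Processing data");
+
+    // Your processing logic
+    // Events logged in this function are attached to the `process_data` span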
+
+    Ok(Data::new())
+}
+```
+
+#### OpenTelemetry with OTLP Exporter
+
+For production, export to an observability backend over OTLP:
+
+```rust
+use std::collections::HashMap;
+
+use lambda_otel_lite::HttpTracerProviderBuilder;
+use opentelemetry_otlp::WithExportConfig;
+
+let tracer_provider = HttpTracerProviderBuilder::default()
+    .with_stdout_client()
+    .enable_otlp(
+        opentelemetry_otlp::new_exporter()
+            .http()
+            .with_endpoint("https://your-collector:4318")
+            // `with_headers` takes a `HashMap<String, String>`
+            .with_headers(HashMap::from([(
+                "api-key".to_string(),
+                "your-key".to_string(),
+            )])),
+    )?
+    .build()?;
+```
+
+### Option 3: Datadog Integration
+
+**Best for**:
+- Datadog users
+- Comprehensive APM
+- Log aggregation
+- Custom metrics
+
+Add the Datadog Lambda Extension layer and configure JSON logging:
+
+```rust
+use lambda_runtime::{run, service_fn, Error, LambdaEvent};
+use tracing::{info, instrument};
+use tracing_subscriber::EnvFilter;
+
+#[tokio::main]
+async fn main() -> Result<(), Error> {
+    // JSON format for Datadog log parsing
+    tracing_subscriber::fmt()
+        .json()
+        .with_env_filter(EnvFilter::from_default_env())
+        .with_target(false)
+        .with_current_span(false)
+        .init();
+
+    run(service_fn(function_handler)).await
+}
+
+#[instrument(
+    skip(event),
+    fields(
+        request_id = %event.context.request_id,
+        user_id = %event.payload.user_id,
+    )
+)]
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    info!("Processing user request");
+
+    // Logs emitted during this call carry the span fields declared above;
+    // the extension forwards them to Datadog
+    let result = fetch_user_data(&event.payload.user_id).await?;
+
+    Ok(Response { result })
+}
+```
+
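+Deploy with the Datadog extension layer (replace `<VERSION>` with a published layer version number; `latest` is not a valid layer version):
+```bash
+cargo lambda deploy \
+  --layer "arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:<VERSION>" \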
+  --env-var DD_API_KEY=your-api-key \
+  --env-var DD_SITE=datadoghq.com \
+  --env-var DD_SERVICE=my-rust-service \
+  --env-var DD_ENV=production
+```
+
+## Structured Logging Best Practices
+
+### Using tracing with Spans
+
+```rust
+use tracing::{debug, error, info, info_span, instrument, Instrument};
+
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    let span = info_span!(
+        "process_request",
+        request_id = %event.context.request_id,
+        user_id = %event.payload.user_id,
+    );
+
+    // Attach the span with `.instrument()`: a guard from `span.enter()` must not
+    // be held across an `.await` in async code.
+    async move {
+        info!("Starting request processing");
+
+        match process_user(&event.payload.user_id).await {
+            Ok(user) => {
+                info!(user_name = %user.name, "User processed successfully");
+                Ok(Response { user })
+            }
+            Err(e) => {
+                error!(error = %e, "Failed to process user");
+                Err(e)
+            }
+        }
+    }
+    .instrument(span)
+    .await
+}
+
+// `user_id` is recorded automatically because it is a function argument
+#[instrument]
+async fn process_user(user_id: &str) -> Result<User, Error> {
+    debug!("Fetching user from database");
+
+    let user = fetch_from_db(user_id).await?;
+
+    info!(email = %user.email, "User fetched");
+
+    Ok(user)
+}
+```
+
+### JSON Structured Logging
+
+```rust
+use lambda_runtime::{run, service_fn, Error};
+use tracing_subscriber::EnvFilter;
+
+#[tokio::main]
+async fn main() -> Result<(), Error> {
+    // JSON output for CloudWatch Logs Insights
+    tracing_subscriber::fmt()
+        .json()
+        .flatten_event(true)  // Put `message` and event fields at the top level
+        .with_env_filter(EnvFilter::from_default_env())
+        .with_current_span(true)
+        .with_span_list(true)
+        .with_target(false)
+        .without_time()  // CloudWatch adds timestamp
+        .init();
+
+    run(service_fn(function_handler)).await
+}
+
+// Each log line is a JSON object shaped like this (span fields are nested under "span"):
+// {"level":"INFO","message":"Processing request","span":{"name":"process_request","request_id":"abc123","user_id":"user456"},"spans":[...]}
+```
+
+### Custom Metrics with OpenTelemetry
+
+```rust
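+use std::sync::OnceLock;
+
+use opentelemetry::metrics::{Counter, Histogram};
+use opentelemetry::{global, KeyValue};
+
+struct Metrics {
+    request_counter: Counter<u64>,
+    duration_histogram: Histogram<f64>,
+}
+
+// Create the instruments once per execution environment and reuse them.
+// Instrument names are illustrative. `.init()` matches opentelemetry 0.22;
+// newer releases rename it to `.build()`.
+fn metrics() -> &'static Metrics {
+    static METRICS: OnceLock<Metrics> = OnceLock::new();
+    METRICS.get_or_init(|| {
+        let meter = global::meter("my-lambda");
+        Metrics {
+            request_counter: meter.u64_counter("requests").init(),
+            duration_histogram: meter.f64_histogram("request_duration_seconds").init(),
+        }
+    })
+}
+
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    let metrics = metrics();
+    let start = std::time::Instant::now();
+
+    // Increment counter
+    metrics.request_counter.add(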
+        1,
+        &[
+            KeyValue::new("function", "my-lambda"),
+            KeyValue::new("region", "us-east-1"),
+        ],
+    );
+
+    let result = process_request(&event.payload).await?;
+
+    // Record duration
+    let duration = start.elapsed().as_secs_f64();
+    metrics.duration_histogram.record(
+        duration,
+        &[KeyValue::new("status", "success")],
+    );
+
+    Ok(result)
+}
+```
+
+## CloudWatch Logs Insights Queries
+
+With structured logging, you can query efficiently:
+
+```
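+# These queries assume `flatten_event(true)`; otherwise event fields live under `fields.` (e.g. `fields.message`)
+# Find errors for a specific user (tracing emits upper-case levels; span fields are nested under `span`)
+fields @timestamp, message, error
+| filter span.user_id = "user456"
+| filter level = "ERROR"
+| sort @timestamp desc
+
+# Calculate p95 latency
+fields duration_ms
+| stats pct(duration_ms, 95) as p95_latency by bin(5m)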
+
+# Count requests by status
+fields @timestamp
+| filter message = "Request completed"
+| stats count() by status
+```
+
+## Distributed Tracing Pattern
+
+For microservices calling each other:
+
+```rust
+use opentelemetry::global;
+use opentelemetry::trace::{SpanKind, TraceContextExt, Tracer};
+use opentelemetry_http::{HeaderExtractor, HeaderInjector};
+
+async fn function_handler(event: LambdaEvent<ApiGatewayRequest>) -> Result<Response, Error> {
+    let tracer = global::tracer("my-service");
+
+    // Extract trace context from the incoming request. A propagator must be
+    // registered at startup with `global::set_text_map_propagator`, or this is a no-op.
+    let parent_cx = global::get_text_map_propagator(|propagator| {
+        propagator.extract(&HeaderExtractor(&event.payload.headers))
+    });
+
+    // Create span with parent context
+    let span = tracer
+        .span_builder("handle_request")
+        .with_kind(SpanKind::Server)
+        .start_with_context(&tracer, &parent_cx);
+
+    let cx = opentelemetry::Context::current_with_span(span);
+
+    // Inject the trace context (the `traceparent` header) into the outgoing request
+    let mut headers = reqwest::header::HeaderMap::new();
+    global::get_text_map_propagator(|propagator| {
+        propagator.inject_context(&cx, &mut HeaderInjector(&mut headers));
+    });
+
+    // Call downstream service with trace context
+    let client = reqwest::Client::new();
+    let response = client
+        .get("https://downstream-service.com/api")
+        .headers(headers)
+        .send()
+        .await?;
+
+    Ok(Response { data: response.text().await? })
+}
+```
+
+## AWS ADOT Lambda Layer
+
+The ADOT layer provides an OpenTelemetry collector only. Rust has no auto-instrumentation, so the function must create and export spans itself:
+
+```bash
+# Add the ADOT collector layer. AWS_LAMBDA_EXEC_WRAPPER is omitted because
+# wrapper scripts are not supported on OS-only runtimes such as provided.al2023.
+cargo lambda deploy \
+  --layer arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-collector-arm64-ver-0-90-1:1 \
+  --env-var OPENTELEMETRY_COLLECTOR_CONFIG_FILE=/var/task/collector.yaml
+```
+
+## Cold Start Monitoring
+
+Track cold start vs warm start:
+
+```rust
+use std::sync::atomic::{AtomicBool, Ordering};
+
+use tracing::info;
+
+static COLD_START: AtomicBool = AtomicBool::new(true);
+
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    let is_cold_start = COLD_START.swap(false, Ordering::Relaxed);
+
+    info!(
+        cold_start = is_cold_start,
+        "Lambda invocation"
+    );
+
+    // Process request...
+
+    Ok(Response {})
+}
+```
+
+## Error Tracking
+
+### Capturing Error Context
+
+```rust
+use tracing::{error, info};
+use thiserror::Error;
+
+#[derive(Error, Debug)]
+enum LambdaError {
+    #[error("Database error: {0}")]
+    Database(#[from] sqlx::Error),
+
+    #[error("External API error: {status}, {message}")]
+    ExternalApi { status: u16, message: String },
+}
+
+async fn function_handler(event: LambdaEvent<Request>) -> Result<Response, Error> {
+    match process_request(&event.payload).await {
+        Ok(result) => {
+            info!("Request processed successfully");
+            Ok(Response { result })
+        }
+        Err(e) => {
+            error!(
+                error = %e,
+                // Type name (e.g. `my_crate::LambdaError`), not the variant; needs Rust 1.76+
+                error_type = std::any::type_name_of_val(&e),
+                request_id = %event.context.request_id,
+                "Request failed"
+            );
+
+            // Optionally send to error tracking service
+            send_to_sentry(&e, &event.context).await;
+
+            Err(e.into())
+        }
+    }
+}
+```
+
+## Performance Monitoring
+
+### Measure Operation Duration
+
+```rust
+use std::time::Instant;
+use tracing::{info, instrument};
+
+#[instrument]
+async fn expensive_operation() -> Result<Data, Error> {
+    let start = Instant::now();
+
+    let result = do_work().await?;
+
+    let duration = start.elapsed();
+    info!(duration_ms = duration.as_millis(), "Operation completed");
+
+    Ok(result)
+}
+```
+
+### Automatic Instrumentation
+
+```rust
+use tracing::instrument;
+
+// Creates a span around each call; entry/exit is only logged if the subscriber enables span events
+#[instrument(
+    skip(db),  // Don't log entire db object
+    fields(
+        user_id = %user_id,
+        operation = "fetch_user"
+    ),
+    err  // Log errors automatically
+)]
+async fn fetch_user(db: &Database, user_id: &str) -> Result<User, Error> {
+    db.get_user(user_id).await
+}
+```
+
+## Observability Checklist
+
+- [ ] Enable X-Ray or OpenTelemetry tracing
+- [ ] Use structured logging (JSON format)
+- [ ] Add span instrumentation to key functions
+- [ ] Track cold vs warm starts
+- [ ] Monitor error rates and types
+- [ ] Measure operation durations
+- [ ] Set up CloudWatch Logs Insights queries
+- [ ] Configure alerts for errors and latency
+- [ ] Track custom business metrics
+- [ ] Propagate trace context across services
+- [ ] Set appropriate log retention
+- [ ] Use log levels correctly (debug, info, warn, error)
+
+## Recommended Stack
+
+**For AWS-only**:
+- X-Ray for tracing
+- CloudWatch Logs with structured JSON
+- CloudWatch Insights for queries
+- xray-lite for Rust integration
+
+**For multi-cloud/vendor-neutral**:
+- OpenTelemetry for tracing
+- OTLP exporter to your backend
+- lambda-otel-lite for Lambda optimization
+- tracing crate for structured logging
+
+**For Datadog users**:
+- Datadog Lambda Extension
+- DD_TRACE_ENABLED for automatic tracing
+- JSON structured logging
+- Custom metrics via DogStatsD
+
+## Dependencies
+
+```toml
+[dependencies]
+# Basic tracing
+tracing = { version = "0.1", features = ["log"] }
+tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
+
+# X-Ray
+xray-lite = "0.1"
+xray-lite-aws-sdk = "0.1"
+
+# OpenTelemetry
+lambda-otel-lite = "0.1"
+opentelemetry = "0.22"
+opentelemetry-otlp = "0.15"
+opentelemetry_sdk = "0.22"
+tracing-opentelemetry = "0.23"
+
+# AWS SDK (for tracing AWS calls)
+aws-config = "1"
+aws-sdk-dynamodb = "1"  # or other services
+```
+
+Guide the user through setting up observability appropriate for their needs and monitoring backend.