2025-11-30 09:08:44 +08:00

AWS Cost & Operations Patterns

Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.

Cost Optimization Patterns

Pattern 1: Cost Estimation Before Deployment

When: Before deploying any new infrastructure

MCP Server: AWS Pricing MCP

Steps:

  1. List all resources to be deployed
  2. Query pricing for each resource type
  3. Calculate monthly costs based on expected usage
  4. Compare pricing across regions
  5. Document cost estimates in architecture docs

Example:

Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
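
The estimate for the Lambda example above can be worked out with a short calculation. This is a minimal sketch: the two rates are assumed on-demand x86 prices for us-east-1 and should be checked against current pricing (for instance via the AWS Pricing MCP), and the free tier is ignored. The function name is hypothetical.

```python
# Assumed on-demand rates (x86, us-east-1). Verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20    # USD, assumption
PRICE_PER_GB_SECOND = 0.0000166667   # USD, assumption


def estimate_lambda_monthly_cost(invocations, avg_duration_s, memory_mb):
    """Return (request_cost, compute_cost, total) in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    # Compute is billed in GB-seconds: duration multiplied by configured memory.
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    return request_cost, compute_cost, request_cost + compute_cost


request_cost, compute_cost, total = estimate_lambda_monthly_cost(1_000_000, 3, 512)
print(f"requests=${request_cost:.2f} compute=${compute_cost:.2f} total=${total:.2f}")
```

Under these assumed rates the compute charge dominates the request charge, so reducing duration or memory is the main lever for this workload.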

Pattern 2: Monthly Cost Review

When: First week of every month

MCP Servers: Cost Explorer MCP, Billing and Cost Management MCP

Steps:

  1. Review total spending vs. budget
  2. Analyze cost by service (top 5 services)
  3. Identify cost anomalies (for example, a > 20% month-over-month increase)
  4. Review cost by environment (dev/staging/prod)
  5. Check cost allocation tag coverage
  6. Generate cost optimization recommendations

Key Metrics:

  • Month-over-month cost change
  • Cost per environment
  • Cost per application/project
  • Untagged resource costs
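
The anomaly check in step 3 can be sketched as a comparison of per-service totals for two months. The input dictionaries and the figures are hypothetical; in practice they would come from Cost Explorer data grouped by service.

```python
def find_cost_anomalies(previous, current, threshold_pct=20.0):
    """Return {service: pct_change} for services whose cost rose more than threshold_pct."""
    anomalies = {}
    for service, cost in current.items():
        baseline = previous.get(service, 0.0)
        if baseline <= 0:
            continue  # no baseline for new services; review those separately
        change_pct = (cost - baseline) / baseline * 100
        if change_pct > threshold_pct:
            anomalies[service] = round(change_pct, 1)
    return anomalies


# Hypothetical monthly totals in USD
october = {"AmazonEC2": 1200.0, "AmazonS3": 300.0, "AWSLambda": 80.0}
november = {"AmazonEC2": 1250.0, "AmazonS3": 420.0, "AWSLambda": 82.0}
print(find_cost_anomalies(october, november))
```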

Pattern 3: Right-Sizing Resources

When: Quarterly or when utilization alerts trigger

MCP Servers: CloudWatch MCP, Cost Explorer MCP

Steps:

  1. Query CloudWatch for resource utilization metrics
  2. Identify over-provisioned resources (< 40% utilization)
  3. Identify under-provisioned resources (> 80% utilization)
  4. Calculate potential savings from right-sizing
  5. Plan and execute right-sizing changes
  6. Monitor post-change performance

Common Right-Sizing Scenarios:

  • EC2 instances with low CPU utilization
  • RDS instances with excess capacity
  • DynamoDB tables with low read/write usage
  • Lambda functions with excessive memory allocation
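
Steps 2 and 3 amount to bucketing each resource by its observed utilization. A minimal sketch, assuming the samples are average utilization percentages already retrieved from CloudWatch (the function name and sample data are hypothetical):

```python
from statistics import mean


def classify_utilization(samples, low=40.0, high=80.0):
    """Classify a resource from its utilization samples (percentages)."""
    average = mean(samples)
    if average < low:
        return "over-provisioned"
    if average > high:
        return "under-provisioned"
    return "right-sized"


# Hypothetical CPU utilization samples per instance
fleet = {
    "i-web-1": [12.0, 18.5, 15.2],
    "i-batch-1": [88.0, 91.3, 84.7],
    "i-api-1": [55.0, 61.2, 58.4],
}
for instance_id, samples in fleet.items():
    print(instance_id, classify_utilization(samples))
```

Averages hide short peaks, so it is worth checking maximum or p99 utilization as well before downsizing.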

Pattern 4: Unused Resource Cleanup

When: Monthly or triggered by cost anomalies

MCP Servers: Cost Explorer MCP, CloudTrail MCP

Steps:

  1. Identify resources with zero usage
  2. Query CloudTrail for last access time
  3. Tag resources for deletion review
  4. Notify resource owners
  5. Delete confirmed unused resources
  6. Track cost savings

Common Unused Resources:

  • Unattached EBS volumes
  • Old EBS snapshots
  • Idle Load Balancers
  • Unused Elastic IPs
  • Old, unused AMIs (and the snapshots that back them)
  • Stopped EC2 instances (long-term)
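
As one illustration of step 1, unattached EBS volumes can be picked out of a volume listing. This sketch assumes the input has the shape of the `Volumes` list returned by the EC2 `describe_volumes` call; the 30-day minimum age is an arbitrary assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days=30, now=None):
    """Return IDs of volumes that are unattached and older than min_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        volume["VolumeId"]
        for volume in volumes
        # An unattached volume reports state "available" and has no attachments.
        if volume["State"] == "available"
        and not volume.get("Attachments")
        and volume["CreateTime"] <= cutoff
    ]
```

The result is a candidate list only; tagging for review and notifying owners (steps 3 and 4) should still happen before anything is deleted.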

Monitoring Patterns

Pattern 1: Critical Service Monitoring

Applies to: All production services

MCP Server: CloudWatch MCP

Metrics to Monitor:

  • Availability: Service uptime, health checks
  • Performance: Latency, response time
  • Errors: Error rate, failed requests
  • Saturation: CPU, memory, disk, network utilization

Alarm Thresholds (adjust based on SLAs):

  • Error rate: > 1% for 2 consecutive periods
  • Latency: p99 > 1 second for 5 minutes
  • CPU: > 80% for 10 minutes
  • Memory: > 85% for 5 minutes
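
Thresholds such as "> 1% for 2 consecutive periods" only fire when every recent datapoint breaches. The rule can be sketched as follows (a simplified illustration, not CloudWatch's full alarm evaluation, which also handles missing data):

```python
def breaches_for_consecutive_periods(datapoints, threshold, periods):
    """True if the most recent `periods` datapoints all exceed `threshold`."""
    if len(datapoints) < periods:
        return False  # not enough data to decide
    return all(value > threshold for value in datapoints[-periods:])


# Hypothetical error-rate datapoints (percent), oldest first
print(breaches_for_consecutive_periods([0.4, 1.3, 1.8], threshold=1.0, periods=2))
print(breaches_for_consecutive_periods([1.5, 0.6, 1.2], threshold=1.0, periods=2))
```

Requiring consecutive breaches trades a little detection delay for fewer alarms on one-off spikes.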

Pattern 2: Lambda Function Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)

Recommended Alarms:

  • Error rate > 1%
  • Duration > 80% of timeout
  • Throttles > 0
  • ConcurrentExecutions > 80% of reserved concurrency (or of the account concurrency limit if none is reserved)
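
The "Duration > 80% of timeout" alarm needs a unit conversion, because Lambda timeouts are configured in seconds while the Duration metric is reported in milliseconds. The sketch below builds keyword arguments in the shape accepted by CloudWatch's `PutMetricAlarm`; the alarm name, period, and helper name are assumptions.

```python
def lambda_duration_alarm(function_name, timeout_seconds, fraction=0.8):
    """Build PutMetricAlarm parameters for p99 Duration nearing the timeout."""
    return {
        "AlarmName": f"{function_name}-duration-near-timeout",  # hypothetical naming
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": "p99",
        "Period": 300,  # seconds, assumption
        "EvaluationPeriods": 1,
        # Timeout is in seconds; the Duration metric is in milliseconds.
        "Threshold": timeout_seconds * 1000 * fraction,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = lambda_duration_alarm("orders-handler", timeout_seconds=30)
print(params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.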

Pattern 3: API Gateway Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount

Recommended Alarms:

  • 5XX error rate > 0.5%
  • 4XX error rate > 5%
  • Latency p99 > 2 seconds
  • Integration latency spike
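
API Gateway publishes error counts rather than a ready-made percentage, so an error-rate alarm is usually expressed with metric math. A sketch of the `PutMetricAlarm` parameters for the 5XX rule, assuming a REST API identified by the `ApiName` dimension; the alarm name and evaluation settings are assumptions.

```python
def api_5xx_rate_alarm(api_name, threshold_pct=0.5, period=300):
    """Build PutMetricAlarm parameters alarming on 5XX errors as a % of requests."""

    def metric(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ApiName", "Value": api_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs to the expression only
        }

    return {
        "AlarmName": f"{api_name}-5xx-rate",  # hypothetical naming
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 2,  # assumption
        "Threshold": threshold_pct,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            metric("errors", "5XXError"),
            metric("requests", "Count"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5XX error rate (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
    }


alarm = api_5xx_rate_alarm("orders-api")
print(alarm["Metrics"][-1]["Expression"])
```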

Pattern 4: Database Monitoring

MCP Server: CloudWatch MCP

RDS Metrics:

- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace

DynamoDB Metrics:

- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests

Recommended Alarms:

  • RDS CPU > 80% for 10 minutes
  • RDS connections > 80% of max
  • RDS free storage < 10 GB
  • DynamoDB throttled requests > 0
  • DynamoDB user errors spike
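
When translating the RDS rules into alarm thresholds, note that FreeStorageSpace is reported in bytes and that the connection limit depends on the instance configuration. A small sketch, treating "10 GB" as 10 GiB (an assumption) and taking the connection limit as an input; the helper name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def rds_alarm_thresholds(max_connections, min_free_storage_gib=10,
                         connection_fraction=0.8, cpu_pct=80.0):
    """Return alarm thresholds keyed by RDS CloudWatch metric name."""
    return {
        "CPUUtilization": cpu_pct,                                     # percent
        "DatabaseConnections": max_connections * connection_fraction,  # count
        "FreeStorageSpace": min_free_storage_gib * GIB,                # bytes
    }


print(rds_alarm_thresholds(max_connections=200))
```

The CPU and connection alarms compare with "greater than", while the storage alarm compares with "less than".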

Observability Patterns

Pattern 1: Distributed Tracing Setup

MCP Server: CloudWatch Application Signals MCP

Components:

  1. Service Map: Visualize service dependencies
  2. Traces: Track requests across services
  3. Metrics: Monitor latency and errors per service
  4. SLOs: Define and track service level objectives

Implementation:

  • Enable X-Ray tracing on Lambda functions
  • Add X-Ray SDK to application code
  • Configure sampling rules
  • Create CloudWatch ServiceLens dashboards

Pattern 2: Log Aggregation and Analysis

MCP Server: CloudWatch MCP

Log Strategy:

  1. Centralize Logs: Send all application logs to CloudWatch Logs
  2. Structure Logs: Use JSON format for structured logging
  3. Log Insights: Use CloudWatch Logs Insights for queries
  4. Retention: Set appropriate retention periods
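
Structured logging (item 2) can be done with the standard library alone. The sketch below emits one JSON object per log line to stdout, which Lambda and most container log drivers forward to CloudWatch Logs. The field names `error_type`, `duration`, and `service_name` and the `fields` convention are assumptions, chosen so that Logs Insights can query them directly.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra structured fields are passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name="orders-service"):
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.error(
    "payment failed",
    extra={"fields": {"error_type": "Timeout", "duration": 3021, "service_name": "orders"}},
)
```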

Example Log Insights Queries:

# Find errors (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc

# Calculate p99 latency
stats pct(duration, 99) as p99_duration by service_name

Pattern 3: Custom Metrics

MCP Server: CloudWatch MCP

When to Use Custom Metrics:

  • Business-specific KPIs (orders/minute, revenue/hour)
  • Application-specific metrics (cache hit rate, queue depth)
  • Performance metrics not provided by AWS

Best Practices:

  • Use consistent namespace: CompanyName/ApplicationName
  • Include relevant dimensions (environment, region, version)
  • Publish metrics at appropriate intervals
  • Use metric filters for log-derived metrics
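
The namespace and dimension practices above can be sketched as a builder for a CloudWatch `PutMetricData` request. The company, application, metric, and dimension values are hypothetical.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Build one MetricData entry with consistently ordered dimensions."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def build_put_metric_data(company, application, data):
    """Wrap metric data in a CompanyName/ApplicationName namespace."""
    return {"Namespace": f"{company}/{application}", "MetricData": list(data)}


payload = build_put_metric_data(
    "ExampleCorp",
    "Checkout",
    [build_metric_datum("OrdersPlaced", 42, "Count",
                        {"Environment": "prod", "Region": "us-east-1", "Version": "1.4.2"})],
)
print(payload["Namespace"], payload["MetricData"][0]["MetricName"])
```

With boto3 available, the payload could be sent as `boto3.client("cloudwatch").put_metric_data(**payload)`. Each distinct combination of dimension values is stored as a separate metric, so high-cardinality dimensions (such as a request ID) are best avoided.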

Security and Audit Patterns

Pattern 1: API Activity Auditing

MCP Server: CloudTrail MCP

Regular Audit Queries:

# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours

# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days

# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure

# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
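
The same checks can be applied to CloudTrail records after they have been retrieved. This sketch assumes each record is a dictionary in the CloudTrail event JSON shape; the IAM event set lists only the three names given above and would need extending, and the "admin" substring match is a rough heuristic.

```python
IAM_CHANGE_EVENTS = {"CreateUser", "DeleteUser", "AttachUserPolicy"}  # extend as needed


def is_iam_change(event):
    return event.get("eventName") in IAM_CHANGE_EVENTS


def is_failed_console_login(event):
    # Failed sign-ins are recorded as ConsoleLogin events whose
    # responseElements.ConsoleLogin value is "Failure".
    response = event.get("responseElements") or {}
    return event.get("eventName") == "ConsoleLogin" and response.get("ConsoleLogin") == "Failure"


def is_privileged(event):
    identity = event.get("userIdentity") or {}
    arn = identity.get("arn", "").lower()
    return identity.get("type") == "Root" or "admin" in arn  # heuristic


def audit(events):
    """Group events into the audit categories used above."""
    return {
        "iam_changes": [e for e in events if is_iam_change(e)],
        "failed_logins": [e for e in events if is_failed_console_login(e)],
        "privileged": [e for e in events if is_privileged(e)],
    }
```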

Audit Schedule:

  • Daily: Review privileged user actions
  • Weekly: Audit IAM changes and security group modifications
  • Monthly: Comprehensive security review

Pattern 2: Security Posture Assessment

MCP Server: Well-Architected Security Assessment Tool MCP

Assessment Areas:

  1. Identity and Access Management

    • Least privilege implementation
    • MFA enforcement
    • Role-based access control
    • Service control policies
  2. Detective Controls

    • CloudTrail enabled in all regions
    • GuardDuty findings review
    • Config rule compliance
    • Security Hub findings
  3. Infrastructure Protection

    • VPC security groups review
    • Network ACLs configuration
    • AWS WAF rules
    • Security group ingress rules
  4. Data Protection

    • Encryption at rest (S3, EBS, RDS)
    • Encryption in transit (TLS/SSL)
    • KMS key usage and rotation
    • Secrets Manager utilization
  5. Incident Response

    • IR playbooks documented
    • Automated response procedures
    • Contact information current
    • Regular IR drills

Assessment Frequency:

  • Quarterly: Full Well-Architected review
  • Monthly: High-priority findings review
  • Weekly: Critical security findings

Pattern 3: Compliance Monitoring

MCP Servers: CloudTrail MCP, CloudWatch MCP

Compliance Requirements:

  • Data residency (ensure data stays in approved regions)
  • Access logging (all access logged and retained)
  • Encryption requirements (data encrypted at rest and in transit)
  • Change management (all changes tracked in CloudTrail)

Compliance Dashboards:

  • Encryption coverage by service
  • CloudTrail logging status
  • Failed login attempts
  • Privileged access usage
  • Non-compliant resources

Troubleshooting Workflows

Workflow 1: High Lambda Error Rate

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

  1. Query CloudWatch for Lambda error metrics
  2. Check error logs in CloudWatch Logs
  3. Identify error patterns (timeout, memory, permission)
  4. Check Lambda configuration (memory, timeout, permissions)
  5. Review recent code deployments
  6. Check downstream service health
  7. Implement fix and monitor
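
Step 3 can be approximated by matching log messages against a few known signatures. The patterns below are heuristics based on commonly seen Lambda and AWS SDK messages, not an exhaustive or official list, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Heuristic signatures; adjust to the runtimes and SDKs in use.
ERROR_PATTERNS = {
    "timeout": re.compile(r"Task timed out after"),
    "memory": re.compile(r"Runtime\.OutOfMemory|signal: killed|out of memory", re.IGNORECASE),
    "permission": re.compile(r"AccessDenied|not authorized to perform", re.IGNORECASE),
}


def classify_error(message):
    """Return the first matching category, or 'other'."""
    for category, pattern in ERROR_PATTERNS.items():
        if pattern.search(message):
            return category
    return "other"


def summarize_errors(messages):
    """Count log messages per error category."""
    return Counter(classify_error(message) for message in messages)


sample = [
    "2025-11-30T01:02:03Z 1a2b Task timed out after 3.00 seconds",
    "Runtime exited with error: signal: killed",
    "AccessDeniedException: User is not authorized to perform: dynamodb:PutItem",
    "KeyError: 'order_id'",
]
print(summarize_errors(sample))
```

The dominant category points to which part of step 4 to check first: timeout, memory, or permissions.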

Workflow 2: Increased Latency

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

  1. Identify latency spike in CloudWatch metrics
  2. Check service map for slow dependencies
  3. Query distributed traces for slow requests
  4. Check database query performance
  5. Review API Gateway integration latency
  6. Check Lambda cold starts
  7. Identify bottleneck and optimize
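
For step 6, cold starts show up in Lambda's REPORT log lines, which include an "Init Duration" field only when the execution environment was initialized for that invocation. A sketch that derives a cold-start rate from such lines (the sample lines are made up):

```python
import re

INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Summarize cold starts from Lambda REPORT log lines."""
    invocations = 0
    init_durations = []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        invocations += 1
        match = INIT_DURATION.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    cold_starts = len(init_durations)
    return {
        "invocations": invocations,
        "cold_starts": cold_starts,
        "cold_start_rate": cold_starts / invocations if invocations else 0.0,
        "avg_init_ms": sum(init_durations) / cold_starts if cold_starts else 0.0,
    }


lines = [
    "REPORT RequestId: a1\tDuration: 120.50 ms\tBilled Duration: 121 ms\tMemory Size: 512 MB\tMax Memory Used: 90 MB\tInit Duration: 350.10 ms",
    "REPORT RequestId: a2\tDuration: 18.20 ms\tBilled Duration: 19 ms\tMemory Size: 512 MB\tMax Memory Used: 91 MB",
    "REPORT RequestId: a3\tDuration: 17.90 ms\tBilled Duration: 18 ms\tMemory Size: 512 MB\tMax Memory Used: 91 MB",
]
print(cold_start_stats(lines))
```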

Workflow 3: Cost Spike Investigation

MCP Servers: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP

Steps:

  1. Use Cost Explorer to identify service causing spike
  2. Check CloudWatch metrics for usage increase
  3. Review CloudTrail for recent resource creation
  4. Identify root cause (misconfiguration, runaway process, attack)
  5. Implement cost controls (budgets, alarms, service quotas)
  6. Clean up unnecessary resources

Workflow 4: Security Incident Response

MCP Servers: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP

Steps:

  1. Identify security event in GuardDuty or CloudWatch
  2. Query CloudTrail for related API activity
  3. Determine scope and impact
  4. Isolate affected resources
  5. Revoke compromised credentials
  6. Implement remediation
  7. Conduct post-incident review
  8. Update security controls

Summary

  • Cost Optimization: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
  • Monitoring: Set up comprehensive CloudWatch alarms for all critical services
  • Observability: Implement distributed tracing and structured logging
  • Security: Run regular CloudTrail audits and Well-Architected assessments
  • Proactive: Don't wait for incidents - monitor and optimize continuously