zhongwei/gh-zxkane-aws-skills-aws-cost-ops

Zhongwei Li 0364d81c91 Initial commit

2025-11-30 09:08:44 +08:00

AWS Cost & Operations Patterns

Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.

Cost Optimization Patterns
Monitoring Patterns
Observability Patterns
Security and Audit Patterns
Troubleshooting Workflows

Cost Optimization Patterns

Pattern 1: Cost Estimation Before Deployment

When: Before deploying any new infrastructure

MCP Server: AWS Pricing MCP

Steps:

List all resources to be deployed
Query pricing for each resource type
Calculate monthly costs based on expected usage
Compare pricing across regions
Document cost estimates in architecture docs

Example:

Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
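
A rough way to turn the inputs above into a number is sketched below. The two price constants are assumptions (commonly published on-demand x86 rates for us-east-1) and should be checked against the AWS Pricing MCP before use; the free tier is ignored, and the function name is hypothetical.

```python
# Assumed on-demand prices for us-east-1; verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20    # USD, assumption
PRICE_PER_GB_SECOND = 0.0000166667   # USD, assumption


def estimate_lambda_monthly_cost(invocations, avg_duration_s, memory_mb):
    """Return the estimated monthly cost in USD, ignoring the free tier."""
    # Lambda bills compute in GB-seconds: duration times allocated memory.
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    request_cost = (invocations / 1_000_000) * PRICE_PER_MILLION_REQUESTS
    return round(compute_cost + request_cost, 2)


cost = estimate_lambda_monthly_cost(1_000_000, 3, 512)
print(f"Estimated cost: ${cost:.2f}/month")
```

Under these assumed prices, almost all of the cost comes from compute time rather than requests, which is why duration and memory are the first things to tune.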

Pattern 2: Monthly Cost Review

When: First week of every month

MCP Servers: Cost Explorer MCP, Billing and Cost Management MCP

Steps:

Review total spending vs. budget
Analyze cost by service (top 5 services)
Identify cost anomalies (>20% month-over-month increase)
Review cost by environment (dev/staging/prod)
Check cost allocation tag coverage
Generate cost optimization recommendations

Key Metrics:

Month-over-month cost change
Cost per environment
Cost per application/project
Untagged resource costs
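
The anomaly check in the steps above can be sketched as a small helper. The cost figures are hypothetical, and in practice the two dictionaries would be filled from Cost Explorer results grouped by service.

```python
ANOMALY_THRESHOLD_PCT = 20.0  # the ">20% increase" rule from the steps above


def month_over_month_change(previous, current):
    """Percent change from previous to current; None when previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100


def find_cost_anomalies(previous_by_service, current_by_service):
    """Return {service: pct_change} for services that grew past the threshold."""
    anomalies = {}
    for service, current in current_by_service.items():
        previous = previous_by_service.get(service, 0)
        change = month_over_month_change(previous, current)
        if change is not None and change > ANOMALY_THRESHOLD_PCT:
            anomalies[service] = round(change, 1)
    return anomalies


# Hypothetical monthly costs in USD, for illustration only.
last_month = {"AmazonEC2": 1000.0, "AmazonS3": 200.0, "AWSLambda": 50.0}
this_month = {"AmazonEC2": 1100.0, "AmazonS3": 300.0, "AWSLambda": 45.0}
anomalies = find_cost_anomalies(last_month, this_month)
print(anomalies)
```

Services with no spend in the previous month are skipped here because a percent change is undefined for them; they may deserve a separate "new service" report.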

Pattern 3: Right-Sizing Resources

When: Quarterly or when utilization alerts trigger

MCP Servers: CloudWatch MCP, Cost Explorer MCP

Steps:

Query CloudWatch for resource utilization metrics
Identify over-provisioned resources (< 40% utilization)
Identify under-provisioned resources (> 80% utilization)
Calculate potential savings from right-sizing
Plan and execute right-sizing changes
Monitor post-change performance

Common Right-Sizing Scenarios:

EC2 instances with low CPU utilization
RDS instances with excess capacity
DynamoDB tables with low read/write usage
Lambda functions with excessive memory allocation
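
A minimal sketch of the classification step, using the 40% and 80% thresholds from the steps above. The resource names and utilization values are hypothetical; real values would come from CloudWatch averages over the review period.

```python
OVER_PROVISIONED_BELOW = 40.0   # percent, from the steps above
UNDER_PROVISIONED_ABOVE = 80.0  # percent, from the steps above


def classify_utilization(avg_utilization_pct):
    """Label a resource by its average utilization over the review period."""
    if avg_utilization_pct < OVER_PROVISIONED_BELOW:
        return "over-provisioned"
    if avg_utilization_pct > UNDER_PROVISIONED_ABOVE:
        return "under-provisioned"
    return "right-sized"


# Hypothetical average CPU utilization per resource.
utilization = {"web-1": 12.5, "db-1": 91.0, "worker-1": 55.0}
report = {name: classify_utilization(pct) for name, pct in utilization.items()}
print(report)
```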

Pattern 4: Unused Resource Cleanup

When: Monthly or triggered by cost anomalies

MCP Servers: Cost Explorer MCP, CloudTrail MCP

Steps:

Identify resources with zero usage
Query CloudTrail for last access time
Tag resources for deletion review
Notify resource owners
Delete confirmed unused resources
Track cost savings

Common Unused Resources:

Unattached EBS volumes
Old EBS snapshots
Idle Load Balancers
Unused Elastic IPs
Old AMIs and their backing snapshots
Stopped EC2 instances (long-term)
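
As one example of the identification step, unattached EBS volumes can be picked out of an EC2 `describe_volumes` response, where a volume in the `available` state is not attached to any instance. The sketch below works on a response-shaped dictionary; the sample data and the function name are hypothetical.

```python
def find_unattached_volumes(describe_volumes_response):
    """Return (volume_ids, total_size_gib) for volumes in the 'available' state."""
    unattached = [
        volume
        for volume in describe_volumes_response.get("Volumes", [])
        if volume.get("State") == "available"
    ]
    volume_ids = [volume["VolumeId"] for volume in unattached]
    total_size_gib = sum(volume.get("Size", 0) for volume in unattached)
    return volume_ids, total_size_gib


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    "Volumes": [
        {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100},
        {"VolumeId": "vol-0bbb", "State": "available", "Size": 50},
        {"VolumeId": "vol-0ccc", "State": "available", "Size": 20},
    ]
}
volume_ids, total_size_gib = find_unattached_volumes(sample_response)
print(volume_ids, total_size_gib)
```

The resulting list is a starting point for the tag-and-notify steps, not a deletion list; owners should still confirm before anything is removed.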

Monitoring Patterns

Pattern 1: Critical Service Monitoring

When: All production services

MCP Server: CloudWatch MCP

Metrics to Monitor:

Availability: Service uptime, health checks
Performance: Latency, response time
Errors: Error rate, failed requests
Saturation: CPU, memory, disk, network utilization

Alarm Thresholds (adjust based on SLAs):

Error rate: > 1% for 2 consecutive periods
Latency: p99 > 1 second for 5 minutes
CPU: > 80% for 10 minutes
Memory: > 85% for 5 minutes
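
The CPU threshold above could be expressed as CloudWatch alarm parameters roughly as follows. This sketch assumes an EC2 instance as the monitored resource; the alarm name format and the helper function are hypothetical.

```python
def cpu_alarm_params(instance_id, threshold_pct=80.0, minutes=10, period_s=300):
    """Build keyword arguments for a CloudWatch alarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_s,
        # "for 10 minutes" becomes consecutive evaluation periods.
        "EvaluationPeriods": (minutes * 60) // period_s,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


params = cpu_alarm_params("i-0123456789abcdef0")
print(params["EvaluationPeriods"])
```

With boto3 installed and credentials configured, the dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`.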

Pattern 2: Lambda Function Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)

Recommended Alarms:

Error rate > 1%
Duration > 80% of timeout
Throttles > 0
ConcurrentExecutions > 80% of reserved concurrency
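
Lambda publishes Errors and Invocations as counts, so an error-rate alarm needs metric math. The sketch below builds parameters for such an alarm, plus a helper for the "80% of timeout" duration threshold; the alarm name, function names, and the choice of a single 5-minute evaluation period are assumptions.

```python
def lambda_error_rate_alarm_params(function_name, threshold_pct=1.0, period_s=300):
    """Build alarm arguments for error rate = 100 * Errors / Invocations."""
    dimensions = [{"Name": "FunctionName", "Value": function_name}]

    def metric_query(query_id, metric_name):
        # Input series for the expression; not returned as alarm data.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period_s,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"lambda-error-rate-{function_name}",  # hypothetical name
        "Metrics": [
            metric_query("errors", "Errors"),
            metric_query("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def duration_threshold_ms(timeout_s, fraction=0.8):
    """Duration is reported in milliseconds; alarm at a fraction of the timeout."""
    return timeout_s * 1000 * fraction


alarm = lambda_error_rate_alarm_params("orders-handler")
print(duration_threshold_ms(30))
```

The parameters are shaped for `put_metric_alarm` in boto3's CloudWatch client; `notBreaching` keeps an idle function from alarming when there are no invocations to divide by.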

Pattern 3: API Gateway Monitoring

MCP Server: CloudWatch MCP

Key Metrics:

- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount

Recommended Alarms:

5XX error rate > 0.5%
4XX error rate > 5%
Latency p99 > 2 seconds
Integration latency spike

Pattern 4: Database Monitoring

MCP Server: CloudWatch MCP

RDS Metrics:

- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace

DynamoDB Metrics:

- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests

Recommended Alarms:

RDS CPU > 80% for 10 minutes
RDS connections > 80% of max
RDS free storage < 10 GB
DynamoDB throttled requests > 0
DynamoDB user errors spike
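
One detail that is easy to miss when wiring up the RDS alarms above: FreeStorageSpace is reported in bytes, so a "10 GB" threshold has to be converted. The sketch below checks a snapshot of values against the recommended thresholds; it treats 10 GB as 10 GiB, which is an assumption, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB; "10 GB" is read as 10 GiB here (assumption)


def rds_breaches(cpu_pct, connections, max_connections, free_storage_bytes):
    """Return the names of the recommended RDS thresholds that are breached."""
    breaches = []
    if cpu_pct > 80:
        breaches.append("cpu")
    if connections > 0.8 * max_connections:
        breaches.append("connections")
    if free_storage_bytes < 10 * GIB:
        breaches.append("free_storage")
    return breaches


# Hypothetical metric snapshot for one instance.
result = rds_breaches(
    cpu_pct=85, connections=90, max_connections=100, free_storage_bytes=5 * GIB
)
print(result)
```

The maximum connection count depends on the engine and instance class, so it is passed in rather than hard-coded.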

Observability Patterns

Pattern 1: Distributed Tracing Setup

MCP Server: CloudWatch Application Signals MCP

Components:

Service Map: Visualize service dependencies
Traces: Track requests across services
Metrics: Monitor latency and errors per service
SLOs: Define and track service level objectives

Implementation:

Enable X-Ray tracing on Lambda functions
Add X-Ray SDK to application code
Configure sampling rules
Create ServiceLens dashboards
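
The sampling-rule step could look something like this. The field names follow the X-Ray CreateSamplingRule API as best understood here and should be checked against the current API reference; the rule name, service name, rate, and reservoir size are hypothetical values.

```python
def build_sampling_rule(rule_name, service_name, fixed_rate=0.05, reservoir_size=1):
    """Build an X-Ray sampling rule that matches all requests to one service."""
    if not 0.0 <= fixed_rate <= 1.0:
        raise ValueError("fixed_rate must be between 0 and 1")
    return {
        "RuleName": rule_name,
        "ResourceARN": "*",
        "Priority": 100,                  # lower numbers are evaluated first
        "FixedRate": fixed_rate,          # share sampled once the reservoir is used
        "ReservoirSize": reservoir_size,  # requests per second always sampled
        "ServiceName": service_name,
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "Version": 1,
    }


rule = build_sampling_rule("orders-api-default", "orders-api")
print(rule["FixedRate"])
```

With boto3, the dictionary would be passed as `boto3.client("xray").create_sampling_rule(SamplingRule=rule)`. A small reservoir plus a low fixed rate keeps tracing cost bounded while still capturing some traffic every second.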

Pattern 2: Log Aggregation and Analysis

MCP Server: CloudWatch MCP

Log Strategy:

Centralize Logs: Send all application logs to CloudWatch Logs
Structure Logs: Use JSON format for structured logging
Log Insights: Use CloudWatch Logs Insights for queries
Retention: Set appropriate retention periods

Example Log Insights Queries:

# Find errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc

# Calculate p99 latency
stats pct(duration, 99) by service_name
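
The queries above only work if log events carry fields such as error_type, duration, and service_name. A minimal sketch of structured JSON logging with the standard library follows; the field set mirrors the queries, while the logger name and sample values are hypothetical.

```python
import json
import logging

STRUCTURED_FIELDS = ("service_name", "error_type", "duration")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in STRUCTURED_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "payment failed",
    extra={"service_name": "orders", "error_type": "Timeout", "duration": 1250},
)
```

CloudWatch Logs Insights discovers fields in JSON log events automatically, so fields logged this way can be used directly in `stats` and `filter` clauses.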

Pattern 3: Custom Metrics

MCP Server: CloudWatch MCP

When to Use Custom Metrics:

Business-specific KPIs (orders/minute, revenue/hour)
Application-specific metrics (cache hit rate, queue depth)
Performance metrics not provided by AWS

Best Practices:

Use consistent namespace: CompanyName/ApplicationName
Include relevant dimensions (environment, region, version)
Publish metrics at appropriate intervals
Use metric filters for log-derived metrics

Security and Audit Patterns

Pattern 1: API Activity Auditing

MCP Server: CloudTrail MCP

Regular Audit Queries:

# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours

# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days

# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure

# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
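
When events have already been fetched, the same audit checks can be applied in code. The sketch below filters CloudTrail-shaped records; note that a failed console sign-in is reported under `responseElements.ConsoleLogin` in the event record. The sample events and helper names are hypothetical.

```python
IAM_CHANGE_EVENTS = {"CreateUser", "DeleteUser", "AttachUserPolicy"}


def is_iam_change(event):
    """True for the IAM change events listed in the audit query."""
    return event.get("eventName") in IAM_CHANGE_EVENTS


def is_failed_console_login(event):
    """True when a ConsoleLogin event reports a failed sign-in."""
    response = event.get("responseElements") or {}
    return (
        event.get("eventName") == "ConsoleLogin"
        and response.get("ConsoleLogin") == "Failure"
    )


def is_root_activity(event):
    """True when the caller is the account root user."""
    return (event.get("userIdentity") or {}).get("type") == "Root"


# Hypothetical records, trimmed to the fields used here.
events = [
    {"eventName": "CreateUser", "userIdentity": {"type": "IAMUser"}},
    {
        "eventName": "ConsoleLogin",
        "userIdentity": {"type": "Root"},
        "responseElements": {"ConsoleLogin": "Failure"},
    },
    {"eventName": "DescribeInstances", "userIdentity": {"type": "AssumedRole"}},
]
iam_changes = [e for e in events if is_iam_change(e)]
failed_logins = [e for e in events if is_failed_console_login(e)]
root_events = [e for e in events if is_root_activity(e)]
print(len(iam_changes), len(failed_logins), len(root_events))
```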

Audit Schedule:

Daily: Review privileged user actions
Weekly: Audit IAM changes and security group modifications
Monthly: Comprehensive security review

Pattern 2: Security Posture Assessment

MCP Server: Well-Architected Security Assessment Tool MCP

Assessment Areas:

Identity and Access Management
- Least privilege implementation
- MFA enforcement
- Role-based access control
- Service control policies
Detective Controls
- CloudTrail enabled in all regions
- GuardDuty findings review
- Config rule compliance
- Security Hub findings
Infrastructure Protection
- VPC security groups review
- Network ACLs configuration
- AWS WAF rules
- Security group ingress rules
Data Protection
- Encryption at rest (S3, EBS, RDS)
- Encryption in transit (TLS/SSL)
- KMS key usage and rotation
- Secrets Manager utilization
Incident Response
- IR playbooks documented
- Automated response procedures
- Contact information current
- Regular IR drills

Assessment Frequency:

Quarterly: Full Well-Architected review
Monthly: High-priority findings review
Weekly: Critical security findings

Pattern 3: Compliance Monitoring

MCP Servers: CloudTrail MCP, CloudWatch MCP

Compliance Requirements:

Data residency (ensure data stays in approved regions)
Access logging (all access logged and retained)
Encryption requirements (data encrypted at rest and in transit)
Change management (all changes tracked in CloudTrail)

Compliance Dashboards:

Encryption coverage by service
CloudTrail logging status
Failed login attempts
Privileged access usage
Non-compliant resources

Troubleshooting Workflows

Workflow 1: High Lambda Error Rate

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

Query CloudWatch for Lambda error metrics
Check error logs in CloudWatch Logs
Identify error patterns (timeout, memory, permission)
Check Lambda configuration (memory, timeout, permissions)
Review recent code deployments
Check downstream service health
Implement fix and monitor

Workflow 2: Increased Latency

MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP

Steps:

Identify latency spike in CloudWatch metrics
Check service map for slow dependencies
Query distributed traces for slow requests
Check database query performance
Review API Gateway integration latency
Check Lambda cold starts
Identify bottleneck and optimize

Workflow 3: Cost Spike Investigation

MCP Servers: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP

Steps:

Use Cost Explorer to identify service causing spike
Check CloudWatch metrics for usage increase
Review CloudTrail for recent resource creation
Identify root cause (misconfiguration, runaway process, attack)
Implement cost controls (budgets, alarms, service quotas)
Clean up unnecessary resources

Workflow 4: Security Incident Response

MCP Servers: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP

Steps:

Identify security event in GuardDuty or CloudWatch
Query CloudTrail for related API activity
Determine scope and impact
Isolate affected resources
Revoke compromised credentials
Implement remediation
Conduct post-incident review
Update security controls

Summary

Cost Optimization: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
Monitoring: Set up comprehensive CloudWatch alarms for all critical services
Observability: Implement distributed tracing and structured logging
Security: Regular CloudTrail audits and Well-Architected assessments
Proactive: Don't wait for incidents - monitor and optimize continuously
