# Infrastructure Optimization Operation

You are executing the **infrastructure** operation to optimize infrastructure scaling, CDN configuration, resource allocation, deployment, and cost efficiency.

## Parameters

**Received**: `$ARGUMENTS` (after removing 'infrastructure' operation name)

Expected format: `target:"scaling|cdn|resources|deployment|costs|all" [environment:"prod|staging|dev"] [provider:"aws|azure|gcp|vercel"] [budget_constraint:"true|false"]`

**Parameter definitions**:
- `target` (required): What to optimize - `scaling`, `cdn`, `resources`, `deployment`, `costs`, or `all`
- `environment` (optional): Target environment (default: production)
- `provider` (optional): Cloud provider (auto-detected if not specified)
- `budget_constraint` (optional): Prioritize cost reduction (default: false)

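As an illustration, two hypothetical invocations (the command prefix is assumed from this file's location under `commands/optimize/`; only the argument syntax comes from the format above):

```bash
# Hypothetical invocations of this operation (prefix assumed, argument syntax from above)
# optimize infrastructure target:"costs" environment:"prod" budget_constraint:"true"
# optimize infrastructure target:"scaling" provider:"aws"
```
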
## Workflow

### 1. Detect Infrastructure Provider

```bash
# Check for cloud provider configuration
ls -la .aws/ .azure/ .gcp/ vercel.json netlify.toml 2>/dev/null

# Check for container orchestration
kubectl config current-context 2>/dev/null
docker-compose version 2>/dev/null

# Check for IaC tools
ls -la terraform/ *.tf serverless.yml cloudformation/ 2>/dev/null
```

### 2. Analyze Current Infrastructure

**Resource Utilization (Kubernetes)**:
```bash
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Check resource requests vs limits
kubectl get pods -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
```

**Resource Utilization (AWS EC2)**:
```bash
# CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-10-07T00:00:00Z \
  --end-time 2025-10-14T00:00:00Z \
  --period 3600 \
  --statistics Average
```

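A single `kubectl top` snapshot can be misleading for sizing decisions. Where useful, sample over a period instead; a minimal sketch, assuming metrics-server is installed and a `production` namespace exists:

```bash
# Sample pod usage every 5 minutes for an hour to build a rough utilization profile
for i in $(seq 1 12); do
  kubectl top pods -n production --no-headers >> pod-usage-samples.txt
  sleep 300
done
sort pod-usage-samples.txt > pod-usage-sorted.txt
```
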
### 3. Scaling Optimization

#### 3.1. Horizontal Pod Autoscaling (Kubernetes)

```yaml
# BEFORE (fixed 3 replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3  # Fixed count, wastes resources at low traffic
  template:
    spec:
      containers:
      - name: api
        image: api:v1.0.0
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"

# AFTER (horizontal pod autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2   # Minimum for high availability
  maxReplicas: 10  # Scale up under load
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Target 70% CPU
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
      - type: Percent
        value: 100       # Double pods at a time
        periodSeconds: 15

# Result:
# - Off-peak: 2 pods (save 33% of resources vs the fixed 3 replicas)
# - Peak: up to 10 pods (more than 3x the fixed capacity)
# - Cost savings: ~40% while maintaining performance
```

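After applying the autoscaler, confirm it can actually read metrics (this requires metrics-server) and watch it react during a load test; for example:

```bash
# Verify the HPA sees current utilization and inspect scaling events
kubectl get hpa api-server-hpa
kubectl describe hpa api-server-hpa

# Watch replica counts change during a load test
kubectl get hpa api-server-hpa --watch
```
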
#### 3.2. Vertical Pod Autoscaling

```yaml
# Automatically adjust resource requests/limits
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Automatically apply recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        memory: "256Mi"
        cpu: "100m"
      maxAllowed:
        memory: "2Gi"
        cpu: "2000m"
      controlledResources: ["cpu", "memory"]
```

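Note that running VPA in `Auto` mode alongside the HPA above on the same CPU/memory metrics can cause the two controllers to work against each other; a common pattern is to start with `updateMode: "Off"` and review the recommendations first (assumes the VPA components are installed in the cluster):

```bash
# Review VPA recommendations before enabling automatic updates
kubectl describe vpa api-server-vpa
```
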
#### 3.3. AWS Auto Scaling Groups

```json
{
  "AutoScalingGroupName": "api-server-asg",
  "MinSize": 2,
  "MaxSize": 10,
  "DesiredCapacity": 2,
  "DefaultCooldown": 300,
  "HealthCheckType": "ELB",
  "HealthCheckGracePeriod": 180,
  "TargetGroupARNs": ["arn:aws:elasticloadbalancing:..."],
  "TargetTrackingScalingPolicies": [
    {
      "PolicyName": "target-tracking-cpu",
      "TargetValue": 70.0,
      "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization"
      }
    }
  ]
}
```

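The JSON above is a consolidated view of the desired state; with the AWS CLI, the target-tracking policy is typically attached in a separate call after the group exists, roughly like this (group and policy names reuse the values above):

```bash
# Attach a target-tracking scaling policy to an existing Auto Scaling group
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name api-server-asg \
  --policy-name target-tracking-cpu \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
    "TargetValue": 70.0
  }'
```
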
### 4. CDN Optimization

#### 4.1. CloudFront Configuration (AWS)

```json
{
  "DistributionConfig": {
    "CallerReference": "api-cdn-2025",
    "Comment": "Optimized CDN for static assets",
    "Enabled": true,
    "PriceClass": "PriceClass_100",
    "Origins": [
      {
        "Id": "S3-static-assets",
        "DomainName": "static-assets.s3.amazonaws.com",
        "S3OriginConfig": {
          "OriginAccessIdentity": "origin-access-identity/cloudfront/..."
        }
      }
    ],
    "DefaultCacheBehavior": {
      "TargetOriginId": "S3-static-assets",
      "ViewerProtocolPolicy": "redirect-to-https",
      "Compress": true,
      "MinTTL": 0,
      "DefaultTTL": 86400,
      "MaxTTL": 31536000,
      "ForwardedValues": {
        "QueryString": false,
        "Cookies": { "Forward": "none" }
      }
    },
    "CacheBehaviors": [
      {
        "PathPattern": "*.js",
        "TargetOriginId": "S3-static-assets",
        "Compress": true,
        "MinTTL": 31536000,
        "CachePolicyId": "immutable-assets"
      },
      {
        "PathPattern": "*.css",
        "TargetOriginId": "S3-static-assets",
        "Compress": true,
        "MinTTL": 31536000
      }
    ]
  }
}
```

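If assets are not fingerprinted (hashed filenames), long TTLs require explicit invalidation after deploys; a minimal example with a placeholder distribution ID:

```bash
# Invalidate cached paths after deploying non-hashed assets
aws cloudfront create-invalidation \
  --distribution-id E1234EXAMPLE \
  --paths "/index.html" "/assets/*"
```
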
**Cache Headers**:
```javascript
// Express server - set appropriate cache headers
app.use('/static', express.static('public', {
  maxAge: '1y',    // Immutable assets with hash in filename
  immutable: true
}));

app.use('/api', (req, res, next) => {
  res.set('Cache-Control', 'no-cache');  // API responses
  next();
});

// HTML pages - short cache with revalidation
app.get('/', (req, res) => {
  res.set('Cache-Control', 'public, max-age=300, must-revalidate');
  res.sendFile('index.html');
});
```

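To confirm the headers actually reach clients through the CDN, a quick check (domain and asset names are placeholders):

```bash
# Check cache headers on a hashed static asset and an API route
curl -sI https://example.com/static/app.3f2a1b.js | grep -i cache-control
curl -sI https://example.com/api/health | grep -i cache-control
```
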
#### 4.2. Image Optimization with CDN

```nginx
# Nginx configuration for image caching and WebP negotiation
location ~* \.(jpg|jpeg|png|gif|webp)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";

    # Note: these formats are already compressed, so gzip adds CPU cost
    # for little gain and is intentionally left off here

    # Serve WebP if the browser supports it
    set $webp_suffix "";
    if ($http_accept ~* "webp") {
        set $webp_suffix ".webp";
    }
    add_header Vary Accept;  # Cache WebP and non-WebP variants separately
    try_files $uri$webp_suffix $uri =404;
}
```

### 5. Resource Right-Sizing

#### 5.1. Analyze Resource Usage Patterns

```bash
# Kubernetes - per-container resource usage snapshot
kubectl top pods --containers --namespace production | awk '{
  if (NR>1) {
    split($3, cpu, "m"); split($4, mem, "Mi");
    print $1, $2, cpu[1], mem[1]
  }
}' > resource-usage.txt

# Analyze patterns
# If CPU is consistently <30% of the request → reduce the CPU request
# If memory is consistently <50% of the request → reduce the memory request
```

**Optimization Example**:
```yaml
# BEFORE (over-provisioned)
resources:
  requests:
    memory: "2Gi"    # Usage: 600Mi (30%)
    cpu: "1000m"     # Usage: 200m (20%)
  limits:
    memory: "4Gi"
    cpu: "2000m"

# AFTER (right-sized)
resources:
  requests:
    memory: "768Mi"  # 600Mi + 28% headroom
    cpu: "300m"      # 200m + 50% headroom
  limits:
    memory: "1.5Gi"  # 2x request
    cpu: "600m"      # 2x request

# Savings: 70% of the CPU request, 62% of the memory request
# Cost impact: roughly 60% reduction per pod
```

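After applying tighter requests and limits, watch for OOM kills and rising restart counts before rolling the change out broadly; for example:

```bash
# Look for OOMKilled containers and climbing restart counts after right-sizing
kubectl describe pods -n production | grep -i -B 3 "OOMKilled"
kubectl get pods -n production \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
```
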
#### 5.2. Reserved Instances / Savings Plans

**AWS Reserved Instances**:
```bash
# Analyze instance usage patterns
aws ce get-reservation-utilization \
  --time-period Start=2024-10-01,End=2025-10-01 \
  --granularity MONTHLY

# Recommendation: convert steadily-used instances to Reserved Instances
# Example savings (us-east-1, Linux, ~730 hours/month):
# - On-Demand t3.large: $0.0832/hour ≈ $61/month
# - Reserved t3.large (1 year, no upfront): ~$0.052/hour ≈ $38/month
# - Savings: ~37% (≈ $23/month per instance)
```

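Cost Explorer can also generate purchase and right-sizing recommendations directly, which is a useful cross-check on the manual math above (requires Cost Explorer to be enabled on the account):

```bash
# Ask Cost Explorer for RI purchase and right-sizing recommendations
aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute"
aws ce get-rightsizing-recommendation --service "AmazonEC2"
```
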
### 6. Deployment Optimization

#### 6.1. Container Image Optimization

```dockerfile
# BEFORE (large image: 1.2GB)
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]

# AFTER (optimized image: 180MB)
# Multi-stage build
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci                 # Full install: the build step needs devDependencies
COPY . .
RUN npm run build
RUN npm prune --omit=dev   # Drop devDependencies before copying node_modules

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./

# Create non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
USER nodejs

EXPOSE 3000
CMD ["node", "dist/main.js"]

# Image size: 1.2GB → 180MB (85% smaller)
# Security: non-root user, minimal attack surface
```

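To verify the size reduction locally (image tags are placeholders):

```bash
# Build and compare image sizes
docker build -t api:optimized .
docker images api
docker history api:optimized   # per-layer sizes
```
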
#### 6.2. Blue-Green Deployment

```yaml
# Kubernetes Blue-Green deployment
# Green (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
      - name: api
        image: api:v2.0.0

---
# Service - switch traffic by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
    version: green  # Change from 'blue' to 'green' to switch traffic
  ports:
  - port: 80
    targetPort: 3000

# Zero-downtime deployment
# Instant rollback by changing the selector back to 'blue'
```

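The cutover itself can be scripted; one way, reusing the Deployment and Service names above, is to wait for green to become ready and then patch the Service selector:

```bash
# Wait for the green Deployment, then switch traffic to it
kubectl rollout status deployment/api-green
kubectl patch service api-service \
  -p '{"spec":{"selector":{"app":"api","version":"green"}}}'

# Rollback is the same patch with "blue"
```
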
### 7. Cost Optimization

#### 7.1. Spot Instances for Non-Critical Workloads

```yaml
# Kubernetes - use spot instances for batch jobs
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT  # EKS-managed label; spot labels vary by provider
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: processor
        image: data-processor:v1.0.0
      restartPolicy: OnFailure  # Required for Jobs

# Savings: 70-90% cost reduction for spot vs on-demand
# Trade-off: may be interrupted (acceptable for batch jobs)
```

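The node label used above assumes an EKS-managed node group; the exact spot-capacity label (and taint) depends on the provider and node-group tooling. To confirm which nodes the job can land on:

```bash
# List spot-capacity nodes (adjust the label for your provider)
kubectl get nodes -l eks.amazonaws.com/capacityType=SPOT
```
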
#### 7.2. Storage Optimization

```bash
# S3 Lifecycle Policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket static-assets \
  --lifecycle-configuration '{
    "Rules": [
      {
        "Id": "archive-old-logs",
        "Status": "Enabled",
        "Filter": { "Prefix": "logs/" },
        "Transitions": [
          {
            "Days": 30,
            "StorageClass": "STANDARD_IA"
          },
          {
            "Days": 90,
            "StorageClass": "GLACIER"
          }
        ],
        "Expiration": { "Days": 365 }
      }
    ]
  }'

# Cost impact:
# - Standard: $0.023/GB/month
# - Standard-IA: $0.0125/GB/month (46% cheaper)
# - Glacier: $0.004/GB/month (83% cheaper)
```

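To confirm the policy is in place:

```bash
# Verify the lifecycle rules on the bucket
aws s3api get-bucket-lifecycle-configuration --bucket static-assets
```
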
#### 7.3. Database Instance Right-Sizing

```sql
-- Analyze actual database usage
SELECT
  datname,
  pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Check connection usage
SELECT count(*) AS connections,
       max_conn,
       max_conn - count(*) AS available
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') mc
GROUP BY max_conn;

-- Recommendation: if the instance consistently uses <30% of connections and <50% of storage,
-- consider downsizing from db.r5.xlarge to db.r5.large
-- Savings: ~50% cost reduction
```

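If `psql` access is available, the buffer cache hit ratio is another quick signal; a consistently high ratio together with low CPU supports downsizing (the connection string is a placeholder):

```bash
# Buffer cache hit ratio per database (values near 99%+ with low CPU suggest headroom)
psql "$DATABASE_URL" -c "SELECT datname, round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct FROM pg_stat_database ORDER BY cache_hit_pct;"
```
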
### 8. Monitoring and Alerting

**CloudWatch Alarms (AWS)**:
```json
{
  "AlarmName": "high-cpu-utilization",
  "ComparisonOperator": "GreaterThanThreshold",
  "EvaluationPeriods": 2,
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Period": 300,
  "Statistic": "Average",
  "Threshold": 80.0,
  "ActionsEnabled": true,
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-team"]
}
```

**Prometheus Alerts (Kubernetes)**:
```yaml
groups:
- name: infrastructure
  rules:
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
```

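The alarm JSON above corresponds to a single CLI call, roughly as follows (the SNS topic ARN is a placeholder):

```bash
# Create the CPU alarm from the CLI
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-utilization \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-team
```
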
## Output Format

```markdown
# Infrastructure Optimization Report: [Environment]

**Optimization Date**: [Date]
**Provider**: [AWS/Azure/GCP/Hybrid]
**Environment**: [production/staging]
**Target**: [scaling/cdn/resources/costs/all]

## Executive Summary

[Summary of infrastructure state and optimizations]

## Baseline Metrics

### Resource Utilization
- **CPU**: 68% average across nodes
- **Memory**: 72% average
- **Network**: 45% utilization
- **Storage**: 60% utilization

### Cost Breakdown (Monthly)
- **Compute**: $4,500 (EC2 instances)
- **Database**: $1,200 (RDS)
- **Storage**: $800 (S3, EBS)
- **Network**: $600 (Data transfer, CloudFront)
- **Total**: $7,100/month

### Scaling Configuration
- **Auto Scaling**: Fixed 5 instances (no scaling)
- **Pod Count**: Fixed 15 pods
- **Resource Allocation**: Static (no HPA/VPA)

## Optimizations Implemented

### 1. Horizontal Pod Autoscaling

**Before**: Fixed 15 pods
**After**: 8-25 pods based on load

**Impact**:
- Off-peak: 8 pods (47% reduction)
- Peak: 25 pods (67% more capacity)
- Cost savings: $1,350/month (30%)

### 2. Resource Right-Sizing

**Optimized 12 deployments**:
- Average CPU reduction: 55%
- Average memory reduction: 48%
- Cost impact: $945/month savings

### 3. CDN Configuration

**Implemented**:
- CloudFront for static assets
- Cache-Control headers optimized
- Compression enabled

**Impact**:
- Origin requests: 85% reduction
- TTFB: 750ms → 120ms (84% faster)
- Bandwidth costs: $240/month savings

### 4. Reserved Instances

**Converted**:
- 3 x t3.large on-demand → Reserved
- Commitment: 1 year, no upfront

**Savings**: ~$69/month (37% per instance)

### 5. Storage Lifecycle Policies

**Implemented**:
- Logs: Standard → Standard-IA (30d) → Glacier (90d)
- Backups: Glacier after 30 days
- Old assets: Glacier after 180 days

**Savings**: $285/month

## Results Summary

### Cost Optimization

| Category | Before | After | Savings |
|----------|--------|-------|---------|
| Compute | $4,500 | $2,518 | $1,982 (44%) |
| Database | $1,200 | $720 | $480 (40%) |
| Storage | $800 | $515 | $285 (36%) |
| Network | $600 | $360 | $240 (40%) |
| **Total** | **$7,100** | **$4,113** | **$2,987 (42%)** |

**Annual Savings**: $35,844

### Performance Improvements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Average Response Time | 285ms | 125ms | 56% faster |
| TTFB (with CDN) | 750ms | 120ms | 84% faster |
| Resource Utilization | 68% | 75% | Better efficiency |
| Auto-scaling Response | N/A | 30s | Handles traffic spikes |

### Scalability Improvements

- **Traffic Capacity**: ~1.7x peak capacity (25 pods vs 15 fixed)
- **Scaling Response Time**: 30 seconds to scale up
- **Cost Efficiency**: Pay for what you use

## Trade-offs and Considerations

**Auto-scaling**:
- **Benefit**: 42% cost reduction, ~1.7x peak capacity
- **Trade-off**: 30s delay for cold starts
- **Mitigation**: Min 8 pods for baseline capacity

**Reserved Instances**:
- **Benefit**: 37% savings per instance
- **Trade-off**: 1-year commitment
- **Risk**: Low (steady baseline load confirmed)

**CDN Caching**:
- **Benefit**: 84% faster TTFB, 85% fewer origin requests
- **Trade-off**: Cache invalidation complexity
- **Mitigation**: Short TTL for dynamic content

## Monitoring Recommendations

1. **Cost Tracking**:
   - Daily cost reports
   - Budget alerts at 80%, 100%
   - Tag-based cost allocation

2. **Performance Monitoring**:
   - CloudWatch dashboards
   - Prometheus + Grafana
   - APM for application metrics

3. **Auto-scaling Health**:
   - HPA metrics (scale events)
   - Resource utilization trends
   - Alert on frequent scaling

## Next Steps

1. Evaluate spot instances for batch workloads (potential 70% savings)
2. Implement multi-region deployment for better global performance
3. Consider serverless for low-traffic endpoints
4. Review database read replicas for read-heavy workloads
```