Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:51:09 +08:00
commit 9d4643f587
14 changed files with 4713 additions and 0 deletions


@@ -0,0 +1,740 @@
# FinOps Governance Framework
Organizational practices, processes, and governance for AWS cost optimization.
## Table of Contents
1. [FinOps Principles](#finops-principles)
2. [Cost Allocation & Tagging](#cost-allocation--tagging)
3. [Budget Management](#budget-management)
4. [Monthly Review Process](#monthly-review-process)
5. [Roles & Responsibilities](#roles--responsibilities)
6. [Chargeback & Showback](#chargeback--showback)
7. [Policy & Governance](#policy--governance)
8. [Metrics & KPIs](#metrics--kpis)
9. [Getting Started Checklist](#getting-started-checklist)
10. [Resources](#resources)
---
## FinOps Principles
### The FinOps Framework
FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.
**Core Principles:**
1. **Teams Need to Collaborate**
   - Engineering makes technical decisions
   - Finance provides visibility and reporting
   - Business sets priorities and budgets
   - All three work on cost optimization together
2. **Everyone Takes Ownership**
   - Engineers see the cost impact of their decisions
   - Teams have cost budgets and accountability
   - Cost is an efficiency metric, not just a finance concern
3. **Decisions Driven by Business Value**
   - Speed, quality, and cost trade-offs
   - Investment vs optimization decisions
   - ROI-based prioritization
4. **Take Advantage of Variable Cost Model**
   - Scale resources up and down as needed
   - Use different pricing models strategically
   - Optimize for actual usage patterns
5. **Centralized Team Drives FinOps**
   - Central FinOps team sets standards and enables product teams
   - Product teams carry out the optimization work
   - Best practices and tools are shared across teams
### FinOps Maturity Model
**Crawl Phase (Getting Started)**
- Basic cost visibility
- Manual reporting
- Ad-hoc optimization
- Initial tagging strategy
- Basic budget alerts
**Walk Phase (Improving)**
- Automated cost reporting
- Regular optimization reviews
- Systematic tagging enforcement
- Team cost allocation
- Reserved Instance planning
- Monthly optimization meetings
**Run Phase (Optimized)**
- Real-time cost visibility
- Automated optimization
- Cost-aware engineering culture
- Predictive forecasting
- Automated guardrails
- FinOps integrated in SDLC
---
## Cost Allocation & Tagging
### Tagging Strategy
**Required Tags (Enforce via Policy)**
```yaml
Required Tags:
  Environment:
    values: [prod, staging, dev, test]
    purpose: Separate production from non-production costs
  Owner:
    values: [email or team name]
    purpose: Contact for resource questions
  Project:
    values: [project code]
    purpose: Track project spending
  CostCenter:
    values: [department code]
    purpose: Chargeback allocation
  Application:
    values: [app name]
    purpose: Application-level cost tracking
```
**Optional but Recommended Tags**
```yaml
Optional Tags:
  ExpirationDate:
    format: YYYY-MM-DD
    purpose: Auto-cleanup scheduling
  DataClassification:
    values: [public, internal, confidential, restricted]
    purpose: Security and compliance
  BackupRequired:
    values: [true, false]
    purpose: Backup policy enforcement
  Criticality:
    values: [critical, high, medium, low]
    purpose: Priority and SLA determination
```
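The tag schema above can be checked in code before a resource is created or during an audit. The following is a minimal sketch, assuming tags arrive as a plain `dict` of key to value; the names `REQUIRED_TAGS` and `find_tag_violations` are illustrative and not part of any AWS API.

```python
from typing import Dict, List, Optional

# Required tag keys from the strategy above. None means any non-empty value
# is accepted; a list restricts the tag to exactly those values.
REQUIRED_TAGS: Dict[str, Optional[List[str]]] = {
    "Environment": ["prod", "staging", "dev", "test"],
    "Owner": None,
    "Project": None,
    "CostCenter": None,
    "Application": None,
}


def find_tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a human-readable list of problems with a resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        # Tag keys are case-sensitive, so "environment" does not satisfy "Environment".
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing required tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    print(find_tag_violations({"Environment": "production", "Owner": "platform-team"}))
```

An empty list means the resource satisfies the required-tag policy; anything else can feed an alert or a ticket.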
### Tag Enforcement
**Using AWS Organizations Service Control Policies (SCP)**
```json
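{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2CreationWithInvalidEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"]
        }
      }
    },
    {
      "Sid": "DenyEC2CreationWithoutOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Owner": "true"}
      }
    },
    {
      "Sid": "DenyEC2CreationWithoutProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Project": "true"}
      }
    }
  ]
}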
```
**Using AWS Config Rules**
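- **required-tags**: Flag resources of supported types that are missing required tags (detective: it reports violations, it does not block creation)
- **ec2-instance-no-public-ip**: Flag EC2 instances that have a public IP address
- Custom Lambda-based rules for complex logic, such as allowing public IPs only on specifically tagged instances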
**Tag Compliance Monitoring**
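```bash
# Example: check tag compliance
# Run weekly to find resources that currently have no tags.
# Note: get-resources only returns resources that are, or previously were,
# tagged; resources that were never tagged need AWS Config or Tag Editor.
aws resourcegroupstaggingapi get-resources \
  --query 'ResourceTagMappingList[?length(Tags) == `0`]' \
  --output table
# Or use Tag Editor in the AWS Console
```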
### Cost Allocation Tags
**Activating Cost Allocation Tags**
1. Go to AWS Billing → Cost Allocation Tags
2. Select user-defined tags to activate
3. Allow up to 24 hours for activated tags to appear in Cost Explorer
4. Tags apply only to charges incurred after activation; earlier charges are not tagged retroactively by default
**Best Practices**
- Activate tags as soon as they are applied to resources, before you rely on them for reporting
- Use consistent naming (e.g., `Environment` not `Env` or `environment`)
- Document tag values in wiki/runbook
- Review and update tag strategy quarterly
---
## Budget Management
### AWS Budgets Setup
**Budget Types**
1. **Cost Budget**: Track spending against threshold
2. **Usage Budget**: Track service usage (e.g., EC2 hours)
3. **Savings Plans Budget**: Track Savings Plans utilization or coverage
4. **Reservation Budget**: Track RI utilization or coverage
**Recommended Budgets**
**1. Overall Monthly Budget**
```yaml
Budget Name: Company-Wide-Monthly-Budget
Amount: $50,000/month
Alerts:
- 50% actual: Email CFO, FinOps team
- 80% actual: Email CFO, CTO, FinOps team
- 100% forecasted: Email CFO, CTO, all team leads
- 100% actual: Email everyone + Slack alert
```
**2. Per-Environment Budgets**
```yaml
Budget Name: Production-Environment-Budget
Amount: $30,000/month
Filter: Environment=prod
Alerts:
- 80% actual: Email engineering leads
- 100% forecasted: Email CTO + FinOps
Budget Name: Dev-Environment-Budget
Amount: $5,000/month
Filter: Environment=dev
Alerts:
- 100% actual: Email dev team leads
- 120% actual: Automated shutdown (if possible)
```
**3. Per-Team Budgets**
```yaml
Budget Name: Team-Platform-Budget
Amount: $15,000/month
Filter: Owner=platform-team
Alerts:
- 90% actual: Email platform team
- 100% forecasted: Email platform team + manager
```
**4. Per-Project Budgets**
```yaml
Budget Name: Project-Phoenix-Budget
Amount: $8,000/month
Filter: Project=phoenix
Alerts:
- 75% actual: Email project owner
- 100% actual: Email project owner + sponsor
```
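The alert ladders above mix two bases: actual spend and forecasted spend. The sketch below shows how such a ladder could be evaluated, assuming an alert fires once spend reaches (`>=`) the threshold; the `Alert` class and `triggered_alerts` function are hypothetical helpers for illustration, not the AWS Budgets API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    percent: float  # threshold as a percentage of the budget amount
    basis: str      # "actual" or "forecasted"
    notify: str     # who gets notified


def triggered_alerts(amount: float, actual: float, forecast: float,
                     alerts: List[Alert]) -> List[Alert]:
    """Return the alerts whose threshold has been reached."""
    fired = []
    for alert in alerts:
        # Pick the spend figure that matches the alert's basis.
        spend = actual if alert.basis == "actual" else forecast
        if spend >= amount * alert.percent / 100:
            fired.append(alert)
    return fired


# The company-wide ladder from the first budget above.
COMPANY_WIDE = [
    Alert(50, "actual", "CFO, FinOps team"),
    Alert(80, "actual", "CFO, CTO, FinOps team"),
    Alert(100, "forecasted", "CFO, CTO, all team leads"),
    Alert(100, "actual", "everyone + Slack"),
]

if __name__ == "__main__":
    for alert in triggered_alerts(50_000, 41_000, 52_000, COMPANY_WIDE):
        print(f"{alert.percent:.0f}% {alert.basis}: notify {alert.notify}")
```

With $41,000 spent and $52,000 forecast against a $50,000 budget, the two lower actual-spend alerts and the forecast alert fire, while the 100% actual alert does not.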
### Budget Alert Actions
**Automated Responses to Budget Alerts**
```python
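# Lambda function subscribed to the Budget alert SNS topic.
# Assumption: the SNS message body is JSON with `budgetName` and `threshold`
# (percent) keys; the native AWS Budgets notification text needs its own parsing.
import json


def lambda_handler(event, context):
    # SNS delivers the payload as a string under Records[].Sns.Message
    alert = json.loads(event['Records'][0]['Sns']['Message'])
    budget_name = alert['budgetName']
    threshold = float(alert['threshold'])
    if threshold >= 100:
        # Stop non-production instances
        stop_dev_instances()
        # Send Slack alert
        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")
        # Create JIRA ticket
        create_cost_investigation_ticket()
    elif threshold >= 80:
        # Send warning
        send_slack_alert(f"⚠️ Budget {budget_name} at {threshold:.0f}%")


# Placeholder helpers: replace with real EC2, Slack, and JIRA integrations
def stop_dev_instances(): pass
def send_slack_alert(message): print(message)
def create_cost_investigation_ticket(): pass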
```
---
## Monthly Review Process
### FinOps Monthly Cadence
**Week 1: Data Collection**
- Export Cost & Usage Reports
- Run cost optimization scripts
- Gather CloudWatch metrics
- Compile anomaly reports
**Week 2: Analysis**
- Identify cost trends
- Find optimization opportunities
- Compare to previous months
- Analyze tag compliance
**Week 3: Team Review Meetings**
- Present findings to engineering teams
- Discuss optimization opportunities
- Assign action items
- Review upcoming projects
**Week 4: Executive Reporting**
- Create executive summary
- Present cost trends to leadership
- Report on optimization wins
- Forecast next quarter
### Monthly Review Meeting Agenda
**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager
**Agenda (1 hour)**
1. **Previous Month Recap (10 min)**
   - Total spend vs budget
   - Top 5 services by cost
   - Month-over-month comparison
   - Budget variance explanation
2. **Cost Anomalies (10 min)**
   - Unusual spending spikes
   - Root cause analysis
   - Prevention measures
3. **Optimization Opportunities (15 min)**
   - Unused resources found
   - Rightsizing recommendations
   - Reserved Instance opportunities
   - Estimated savings
4. **Team Cost Breakdown (10 min)**
   - Per-team spending
   - Top spenders
   - Tag compliance status
5. **Upcoming Changes (10 min)**
   - New projects launching
   - Expected cost impact
   - Budget adjustments needed
6. **Action Items Review (5 min)**
   - Follow-up on previous items
   - Assign new action items
   - Set deadlines
**Deliverable**: Monthly FinOps Report (template provided)
### Monthly Report Template
```markdown
# AWS Cost Report - [Month Year]
## Executive Summary
- Total spend: $XX,XXX
- vs Budget: X% (under/over)
- vs Last month: +/-X%
- Optimization savings: $X,XXX
## Cost Breakdown
| Service | Cost | % of Total | MoM Change |
|---------|------|-----------|-----------|
| EC2 | $XX | XX% | +/-X% |
| RDS | $XX | XX% | +/-X% |
## Optimization Actions Taken
1. Migrated 20 instances to Graviton (saved $X/month)
2. Purchased Reserved Instances (saved $X/month)
3. Deleted unused resources (saved $X/month)
## Recommendations for Next Month
1. Right-size 15 oversized instances (potential $X/month savings)
2. Implement S3 lifecycle policies (potential $X/month savings)
## Action Items
- [ ] [Owner] Task description (Deadline)
```
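The Cost Breakdown table in the template can be generated from two months of per-service totals. This is a rough sketch, assuming the totals are already available as dictionaries (for example, exported from Cost Explorer); `cost_breakdown_rows` is an illustrative name.

```python
from typing import Dict, List


def cost_breakdown_rows(current: Dict[str, float],
                        previous: Dict[str, float]) -> List[str]:
    """Build Markdown rows: service, cost, % of total, month-over-month change."""
    total = sum(current.values())
    rows = []
    # Largest services first, matching the "top services by cost" view.
    for service, cost in sorted(current.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * cost / total if total else 0.0
        prior = previous.get(service)
        if prior:
            mom = f"{100.0 * (cost - prior) / prior:+.1f}%"
        else:
            mom = "n/a"  # new service, or no spend last month
        rows.append(f"| {service} | ${cost:,.0f} | {share:.1f}% | {mom} |")
    return rows


if __name__ == "__main__":
    this_month = {"EC2": 6000.0, "RDS": 4000.0}
    last_month = {"EC2": 5000.0}
    print("\n".join(cost_breakdown_rows(this_month, last_month)))
```

A service with no spend in the previous month is reported as `n/a` rather than as an infinite increase.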
---
## Roles & Responsibilities
### FinOps Team Structure
**FinOps Lead**
- Owns overall cloud financial management
- Reports to CFO and CTO
- Sets FinOps strategy and goals
- Manages budget process
**Cloud Cost Analyst**
- Analyzes spending trends
- Generates reports and dashboards
- Identifies optimization opportunities
- Runs monthly review process
**Cloud Architect (FinOps focus)**
- Advises on cost-optimized architectures
- Implements cost optimization tools
- Trains engineers on FinOps practices
- Reviews architectural designs for cost impact
### Engineering Team Responsibilities
**Engineering Manager**
- Owns team budget
- Reviews monthly cost reports
- Prioritizes optimization work
- Ensures tagging compliance
**Engineers**
- Tag all resources they create
- Consider cost in design decisions
- Implement optimization recommendations
- Delete unused resources
**Platform/SRE Team**
- Implements cost optimization tooling
- Automates cost monitoring
- Provides cost visibility dashboards
- Enforces tagging policies
---
## Chargeback & Showback
### Showback (Visibility Only)
**Purpose**: Show teams their costs without charging them
**Goal**: Raise cost awareness
**Implementation**:
- Monthly cost reports per team
- Dashboard showing team spending
- Highlight cost trends
- No budget enforcement
**Best for**: Organizations new to FinOps
### Chargeback (Financial Accountability)
**Purpose**: Allocate costs back to business units
**Goal**: Financial accountability
**Implementation**:
- Tag-based cost allocation
- Transfer costs between cost centers
- Teams have hard budgets
- Overspending requires justification
**Best for**: Mature FinOps organizations
### Hybrid Model (Recommended)
**Shared Costs**: Charged to central IT
- VPC resources
- Security tools
- Monitoring infrastructure
- Shared services
**Team Costs**: Charged to teams
- Compute resources (EC2, Lambda)
- Databases
- Storage
- Application-specific services
**Implementation**:
```
Total AWS Bill: $100,000
  Shared Costs (30%): $30,000
    → Charged to IT/Platform budget
  Team Costs (70%): $70,000
    → Allocated by tags:
      - Team A (Project=alpha): $20,000
      - Team B (Project=beta): $25,000
      - Team C (Project=gamma): $15,000
      - Untagged (alert!): $10,000 → Needs investigation
```
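The split above can be reproduced with a small allocation routine. The sketch below assumes each billing line item has already been reduced to a cost, an optional `Project` tag value, and a flag marking shared infrastructure; the `LineItem` shape and `allocate` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (cost in dollars, Project tag value or None if untagged, shared-infrastructure flag)
LineItem = Tuple[float, Optional[str], bool]


def allocate(line_items: List[LineItem]) -> Dict[str, float]:
    """Sum costs into shared, per-project, and untagged buckets."""
    buckets: Dict[str, float] = defaultdict(float)
    for cost, project, is_shared in line_items:
        if is_shared:
            buckets["shared (IT/Platform)"] += cost
        elif project:
            buckets[f"Project={project}"] += cost
        else:
            buckets["untagged"] += cost
    return dict(buckets)


# The example bill from above.
items: List[LineItem] = [
    (30_000, None, True),
    (20_000, "alpha", False),
    (25_000, "beta", False),
    (15_000, "gamma", False),
    (10_000, None, False),
]
totals = allocate(items)
untagged_share = 100.0 * totals["untagged"] / sum(totals.values())

if __name__ == "__main__":
    for bucket, cost in totals.items():
        print(f"{bucket}: ${cost:,.0f}")
    print(f"untagged share of bill: {untagged_share:.0f}%")
```

Tracking the untagged share over time gives a direct measure of how much spend still cannot be charged back.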
---
## Policy & Governance
### Cost Governance Policies
**1. Resource Creation Policies**
```yaml
- Policy: All resources must be tagged
  Enforcement: Service Control Policy (SCP)
  Exception process: Request via FinOps team
- Policy: Dev/test resources must auto-stop nights/weekends
  Enforcement: AWS Instance Scheduler
  Exception process: Tag with NoAutoStop=true (requires approval)
- Policy: S3 buckets must have lifecycle policies
  Enforcement: AWS Config rule
  Exception process: Document justification in bucket tags
```
**2. Approval Workflows**
```yaml
# Spending thresholds requiring approval
< $1,000/month:
- Auto-approved
- Must be tagged
$1,000 - $5,000/month:
- Engineering manager approval
- Documented in JIRA
$5,000 - $20,000/month:
- Director approval
- Budget impact assessment
- FinOps team review
> $20,000/month:
- VP approval
- Business case required
- Quarterly review checkpoint
```
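The thresholds above translate into a simple lookup. This is a minimal sketch; the tiers as written leave the exact boundaries ambiguous, so it assumes that amounts of exactly $5,000 and $20,000 fall into the lower tier. The function name and return labels are illustrative.

```python
def required_approval(monthly_cost: float) -> str:
    """Map an estimated monthly cost in USD to the approval tier it needs."""
    if monthly_cost < 0:
        raise ValueError("monthly_cost must be non-negative")
    if monthly_cost < 1_000:
        return "auto-approved"
    if monthly_cost <= 5_000:       # assumption: $5,000 stays with the manager
        return "engineering manager"
    if monthly_cost <= 20_000:      # assumption: $20,000 stays with the director
        return "director"
    return "vp"


if __name__ == "__main__":
    for estimate in (250, 1_000, 7_500, 45_000):
        print(f"${estimate:,}/month -> {required_approval(estimate)}")
```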
**3. Reserved Instance / Savings Plans Policy**
```yaml
Policy: All commitments require FinOps review
Process:
  1. Team identifies workload suitable for commitment
  2. Submit request to FinOps with:
     - Resource details
     - Usage history (30+ days)
     - Business justification
  3. FinOps analyzes and recommends
  4. Finance approves commitment
  5. FinOps purchases and tracks utilization
```
### Automation & Guardrails
**Automated Actions**
```yaml
# Non-production resource scheduling
Schedule: Instance Scheduler
  - Stop all dev/test EC2/RDS instances at 7pm weekdays
  - Stop all dev/test instances all weekend
  - Start at 7am weekdays
  - Exception tag: NoAutoStop=true

# Untagged resource alerts
Trigger: AWS Config rule violation
Action:
  - Send Slack alert to team
  - Create JIRA ticket
  - Escalate if not tagged in 48 hours

# Old snapshot cleanup
Schedule: Weekly Lambda function
Action:
  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
  - Notify teams of deletions
  - Estimate savings

# Budget breach response
Trigger: Budget > 100%
Action:
  - Email alerts to stakeholders
  - Create incident ticket
  - Stop non-production resources (optional)
```
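The scheduling and snapshot rules above reduce to two small decisions. The sketch below assumes timestamps are already in the team's local time zone and that tags are a plain `dict`; `should_be_running` and `snapshot_expired` are hypothetical helpers, not part of Instance Scheduler or any AWS SDK.

```python
from datetime import datetime
from typing import Dict


def should_be_running(now: datetime, tags: Dict[str, str]) -> bool:
    """Decide whether an instance should be up at local time `now`."""
    if tags.get("NoAutoStop") == "true":
        return True                 # approved exception, never stopped
    if tags.get("Environment") not in ("dev", "test"):
        return True                 # the schedule covers non-production only
    if now.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        return False
    return 7 <= now.hour < 19       # up from 7am, stopped at 7pm


def snapshot_expired(created: datetime, now: datetime,
                     tags: Dict[str, str], max_age_days: int = 90) -> bool:
    """True if the snapshot is older than the limit and not marked to keep."""
    if tags.get("KeepForever") == "true":
        return False
    return (now - created).days > max_age_days


if __name__ == "__main__":
    saturday_noon = datetime(2025, 1, 4, 12, 0)
    print(should_be_running(saturday_noon, {"Environment": "dev"}))
    print(snapshot_expired(datetime(2025, 1, 1), datetime(2025, 4, 2), {}))
```

Keeping the decision logic free of AWS calls makes it easy to unit-test before wiring it to a scheduled Lambda.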
---
## Metrics & KPIs
### Key FinOps Metrics
**1. Cost Metrics**
```yaml
Total Monthly Cloud Spend:
  Target: Within budget
  Trend: Track month-over-month
Cost per Customer:
  Calculation: Total AWS Cost / Active Customers
  Target: Decreasing over time
Cost per Transaction:
  Calculation: Total AWS Cost / Transactions Processed
  Target: Optimize for efficiency
Unit Economics:
  Calculation: Revenue per Customer - Cost per Customer
  Target: Positive and growing
```
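The calculations above are simple ratios, shown here as a small sketch. The figures in the usage lines are made up for illustration, and the function names are not from any library.

```python
def cost_per_unit(total_aws_cost: float, units: int) -> float:
    """Total AWS cost divided by a unit count (customers or transactions)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_aws_cost / units


def unit_margin(revenue_per_customer: float, cost_per_customer: float) -> float:
    """Unit economics: revenue per customer minus cost per customer."""
    return revenue_per_customer - cost_per_customer


if __name__ == "__main__":
    # Illustrative numbers only.
    per_customer = cost_per_unit(50_000, 2_000)
    per_transaction = cost_per_unit(50_000, 1_000_000)
    print(f"cost per customer:    ${per_customer:.2f}")
    print(f"cost per transaction: ${per_transaction:.4f}")
    print(f"unit margin:          ${unit_margin(40.0, per_customer):.2f}")
```

Because both ratios share the same numerator, a rising bill can still mean improving efficiency if customers or transactions grow faster than spend.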
**2. Efficiency Metrics**
```yaml
Compute Utilization:
  Metric: Average CPU utilization
  Target: 40-60% (room for burst, not over-provisioned)
Storage Utilization:
  Metric: % of S3 in cost-optimized tiers
  Target: >60% in IA or Glacier tiers
Reserved Instance Coverage:
  Metric: % of eligible usage covered by RIs/SPs (rather than billed On-Demand)
  Target: >70% for stable workloads
RI/SP Utilization:
  Metric: % of purchased RIs/SPs actually used
  Target: >90%
```
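Coverage and utilization are easy to confuse: coverage asks how much of the usage is covered by commitments, while utilization asks how much of the commitments is used. A minimal sketch of both, assuming usage is measured in instance-hours and using illustrative function names:

```python
def coverage_pct(covered_hours: float, on_demand_hours: float) -> float:
    """Share of eligible usage that ran under an RI or Savings Plan."""
    total = covered_hours + on_demand_hours
    return 100.0 * covered_hours / total if total else 0.0


def utilization_pct(used_hours: float, purchased_hours: float) -> float:
    """Share of purchased commitment hours that were actually consumed."""
    if purchased_hours <= 0:
        raise ValueError("purchased_hours must be positive")
    return 100.0 * used_hours / purchased_hours


if __name__ == "__main__":
    # Illustrative numbers: 700 covered hours, 300 On-Demand hours,
    # and 920 of 1,000 purchased commitment hours consumed.
    print(f"coverage:    {coverage_pct(700, 300):.0f}%")
    print(f"utilization: {utilization_pct(920, 1000):.0f}%")
```

Low coverage with high utilization suggests room to buy more commitments; high coverage with low utilization suggests over-commitment.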
**3. Operational Metrics**
```yaml
Tag Compliance:
  Metric: % of resources with required tags
  Target: >95%
Budget Variance:
  Metric: Actual vs Budget %
  Target: ±5%
Optimization Savings:
  Metric: $ saved per month from optimizations
  Target: Growing
Mean Time to Optimize (MTTO):
  Metric: Days from finding opportunity to implementing
  Target: <30 days
```
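Budget variance and MTTO can be computed from data most teams already track. The sketch below assumes a signed variance where positive means over budget, and MTTO as the mean number of days between the date an opportunity was found and the date it was implemented; the function names are illustrative.

```python
from datetime import date
from typing import List, Tuple


def budget_variance_pct(actual: float, budget: float) -> float:
    """Signed variance as a percentage of budget; positive means over budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return 100.0 * (actual - budget) / budget


def within_variance_target(actual: float, budget: float, tolerance: float = 5.0) -> bool:
    """True if the variance is inside the ±5% target band (by default)."""
    return abs(budget_variance_pct(actual, budget)) <= tolerance


def mean_time_to_optimize(items: List[Tuple[date, date]]) -> float:
    """Mean days between finding an opportunity and implementing it."""
    if not items:
        raise ValueError("no completed optimizations")
    return sum((done - found).days for found, done in items) / len(items)


if __name__ == "__main__":
    print(f"variance: {budget_variance_pct(52_000, 50_000):+.1f}%")
    completed = [(date(2025, 1, 1), date(2025, 1, 21)),
                 (date(2025, 2, 1), date(2025, 2, 11))]
    print(f"MTTO: {mean_time_to_optimize(completed):.1f} days")
```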
**4. Organizational Metrics**
```yaml
FinOps Engagement:
  Metric: % of teams attending monthly reviews
  Target: 100%
Cost Awareness:
  Survey: Do engineers know their team's monthly cost?
  Target: >80% aware
Optimization Velocity:
  Metric: # optimization tasks completed per quarter
  Target: Growing trend
```
### Dashboard Requirements
**Executive Dashboard (Monthly)**
- Total spend vs budget
- Spend by service (top 10)
- Month-over-month trend
- Forecast for next quarter
- Optimization savings achieved
**Engineering Dashboard (Real-time)**
- Per-team costs (daily)
- Cost anomaly alerts
- Untagged resources count
- Budget utilization %
- Top cost drivers
**FinOps Dashboard (Daily)**
- Detailed service costs
- Tag compliance metrics
- RI/SP utilization
- Rightsizing opportunities
- Unused resource counts
---
## Getting Started Checklist
### Phase 1: Foundation (Month 1)
- [ ] Enable Cost Explorer
- [ ] Set up AWS Budgets
- [ ] Define tagging strategy
- [ ] Activate cost allocation tags
- [ ] Set up Cost and Usage Reports (CUR)
- [ ] Create basic cost dashboard
### Phase 2: Visibility (Months 2-3)
- [ ] Implement tagging enforcement
- [ ] Run first optimization scripts
- [ ] Set up monthly review meeting
- [ ] Create team cost reports
- [ ] Assign team cost owners
- [ ] Document FinOps processes
### Phase 3: Optimization (Months 4-6)
- [ ] Implement automated resource scheduling
- [ ] Purchase first Reserved Instances
- [ ] Set up cost anomaly detection
- [ ] Automate reporting
- [ ] Train engineering teams
- [ ] Implement showback/chargeback
### Phase 4: Culture (Ongoing)
- [ ] Cost metrics in engineering KPIs
- [ ] Cost review in architecture reviews
- [ ] Regular optimization sprints
- [ ] FinOps champions in each team
- [ ] Cost-aware development practices
- [ ] Continuous improvement
---
## Resources
**AWS Native Tools**
- AWS Cost Explorer
- AWS Budgets
- AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor
- AWS Cost & Usage Reports
**Third-Party Tools**
- CloudHealth (VMware)
- Cloudability (Apptio)
- Kubecost (Kubernetes cost monitoring)
- Spot.io (Cost optimization platform)
**FinOps Foundation**
- https://www.finops.org
- FinOps Certified Practitioner certification
- FinOps community and best practices