# FinOps Governance Framework

Organizational practices, processes, and governance for AWS cost optimization.

## Table of Contents

1. [FinOps Principles](#finops-principles)
2. [Cost Allocation & Tagging](#cost-allocation--tagging)
3. [Budget Management](#budget-management)
4. [Monthly Review Process](#monthly-review-process)
5. [Roles & Responsibilities](#roles--responsibilities)
6. [Chargeback & Showback](#chargeback--showback)
7. [Policy & Governance](#policy--governance)
8. [Metrics & KPIs](#metrics--kpis)

---

## FinOps Principles

### The FinOps Framework

FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.

**Core Principles:**

1. **Teams Need to Collaborate**
   - Engineering makes technical decisions
   - Finance provides visibility and reporting
   - Business sets priorities and budgets
   - Cross-functional cost optimization

2. **Everyone Takes Ownership**
   - Engineers see cost impact of their decisions
   - Teams have cost budgets and accountability
   - Cost is an efficiency metric, not just a finance concern

3. **Decisions Driven by Business Value**
   - Speed, quality, and cost trade-offs
   - Investment vs optimization decisions
   - ROI-based prioritization

4. **Take Advantage of Variable Cost Model**
   - Scale resources up and down as needed
   - Use different pricing models strategically
   - Optimize for actual usage patterns

5. **Centralized Team Drives FinOps**
   - Central FinOps team acts as an enabler
   - Distributed execution by product teams
   - Share best practices and tools

### FinOps Maturity Model

**Crawl Phase (Getting Started)**
- Basic cost visibility
- Manual reporting
- Ad-hoc optimization
- Initial tagging strategy
- Basic budget alerts

**Walk Phase (Improving)**
- Automated cost reporting
- Regular optimization reviews
- Systematic tagging enforcement
- Team cost allocation
- Reserved Instance planning
- Monthly optimization meetings

**Run Phase (Optimized)**
- Real-time cost visibility
- Automated optimization
- Cost-aware engineering culture
- Predictive forecasting
- Automated guardrails
- FinOps integrated into the SDLC

---

## Cost Allocation & Tagging

### Tagging Strategy

**Required Tags (Enforce via Policy)**

```yaml
Required Tags:
  Environment:
    values: [prod, staging, dev, test]
    purpose: Separate production from non-production costs

  Owner:
    values: [email or team name]
    purpose: Contact for resource questions

  Project:
    values: [project code]
    purpose: Track project spending

  CostCenter:
    values: [department code]
    purpose: Chargeback allocation

  Application:
    values: [app name]
    purpose: Application-level cost tracking
```

**Optional but Recommended Tags**

```yaml
Optional Tags:
  ExpirationDate:
    format: YYYY-MM-DD
    purpose: Auto-cleanup scheduling

  DataClassification:
    values: [public, internal, confidential, restricted]
    purpose: Security and compliance

  BackupRequired:
    values: [true, false]
    purpose: Backup policy enforcement

  Criticality:
    values: [critical, high, medium, low]
    purpose: Priority and SLA determination
```

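### Tag Enforcement

**Using AWS Organizations Service Control Policies (SCP)**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2CreationWithInvalidEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"]
        }
      }
    },
    {
      "Sid": "DenyEC2CreationWithoutOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Owner": "true"}
      }
    },
    {
      "Sid": "DenyEC2CreationWithoutProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Project": "true"}
      }
    }
  ]
}
```

Each tag gets its own statement because multiple keys under one condition operator are combined with AND. A single combined condition would deny a launch only when every tag is missing or invalid.

**Using AWS Config Rules**

- **required-tags**: Flag resources that are missing required tags (detective control, limited to the resource types the rule supports)
- **ec2-instance-no-public-ip**: Flag EC2 instances with a public IP; tag-based exceptions need custom logic
- Custom Lambda-based rules for complex logic

**Tag Compliance Monitoring**

```bash
# Example: Check tag compliance
# Run weekly to find untagged resources
# Note: this API only returns resources that are, or previously were, tagged

aws resourcegroupstaggingapi get-resources \
  --query 'ResourceTagMappingList[?length(Tags) == `0`]' \
  --output table

# Or use Tag Editor in AWS Console
```
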
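The required-tag rules from the tagging strategy can also be checked offline, for example against an exported resource inventory. The following is a minimal sketch rather than a definitive implementation: `REQUIRED_TAGS`, `find_tag_violations`, and `compliance_rate` are illustrative names, and the sample values (such as a `cc-100` cost center) are hypothetical. It assumes each resource's tags are already available as a plain dictionary.

```python
from typing import Dict, List, Optional, Set

# Required tag keys from the tagging strategy.
# A set restricts the allowed values; None accepts any non-empty value.
REQUIRED_TAGS: Dict[str, Optional[Set[str]]] = {
    "Environment": {"prod", "staging", "dev", "test"},
    "Owner": None,
    "Project": None,
    "CostCenter": None,
    "Application": None,
}


def find_tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a sorted list of problems found in one resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        # Tag keys are case-sensitive, so "environment" does not satisfy "Environment".
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return sorted(problems)


def compliance_rate(resources: List[Dict[str, str]]) -> float:
    """Percentage of resources that pass every check (100.0 for an empty list)."""
    if not resources:
        return 100.0
    compliant = sum(1 for tags in resources if not find_tag_violations(tags))
    return 100.0 * compliant / len(resources)
```

The compliance rate produced this way lines up with the Tag Compliance metric tracked elsewhere in this framework, so the same function could feed a weekly report.
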
### Cost Allocation Tags

**Activating Cost Allocation Tags**

1. Go to AWS Billing → Cost Allocation Tags
2. Select user-defined tags to activate
3. Allow up to 24 hours for activated tags to appear in Cost Explorer
4. Note that tags apply only to charges incurred after activation

**Best Practices**

- Activate tags as soon as they are applied, so cost data is captured from the start
- Use consistent naming (e.g., `Environment` not `Env` or `environment`)
- Document tag values in wiki/runbook
- Review and update tag strategy quarterly

---

## Budget Management

### AWS Budgets Setup

**Budget Types**

1. **Cost Budget**: Track spending against threshold
2. **Usage Budget**: Track service usage (e.g., EC2 hours)
3. **Savings Plans Budget**: Track commitment utilization
4. **Reservation Budget**: Track RI utilization

**Recommended Budgets**

**1. Overall Monthly Budget**
```yaml
Budget Name: Company-Wide-Monthly-Budget
Amount: $50,000/month
Alerts:
  - 50% actual: Email CFO, FinOps team
  - 80% actual: Email CFO, CTO, FinOps team
  - 100% forecasted: Email CFO, CTO, all team leads
  - 100% actual: Email everyone + Slack alert
```

**2. Per-Environment Budgets**
```yaml
Budget Name: Production-Environment-Budget
Amount: $30,000/month
Filter: Environment=prod
Alerts:
  - 80% actual: Email engineering leads
  - 100% forecasted: Email CTO + FinOps

Budget Name: Dev-Environment-Budget
Amount: $5,000/month
Filter: Environment=dev
Alerts:
  - 100% actual: Email dev team leads
  - 120% actual: Automated shutdown (if possible)
```

**3. Per-Team Budgets**
```yaml
Budget Name: Team-Platform-Budget
Amount: $15,000/month
Filter: Owner=platform-team
Alerts:
  - 90% actual: Email platform team
  - 100% forecasted: Email platform team + manager
```

**4. Per-Project Budgets**
```yaml
Budget Name: Project-Phoenix-Budget
Amount: $8,000/month
Filter: Project=phoenix
Alerts:
  - 75% actual: Email project owner
  - 100% actual: Email project owner + sponsor
```

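To make the alert thresholds above concrete, the sketch below evaluates which alerts would fire for a given month. It is an illustration of the arithmetic only, not how AWS Budgets works internally: `AlertRule`, `triggered_alerts`, and `COMPANY_WIDE_RULES` are hypothetical names, and treating a threshold as reached at "greater than or equal" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertRule:
    percent: float  # threshold as a percentage of the budget amount
    basis: str      # "actual" or "forecasted"
    notify: str     # free-text recipient list


def triggered_alerts(amount: float, actual: float, forecasted: float,
                     rules: List[AlertRule]) -> List[AlertRule]:
    """Return the rules whose threshold has been reached, in the given order."""
    if amount <= 0:
        raise ValueError("budget amount must be positive")
    spend = {"actual": actual, "forecasted": forecasted}
    fired = []
    for rule in rules:
        used_percent = 100.0 * spend[rule.basis] / amount
        if used_percent >= rule.percent:  # assumption: the boundary counts as reached
            fired.append(rule)
    return fired


# Alert rules mirroring the Company-Wide-Monthly-Budget example.
COMPANY_WIDE_RULES = [
    AlertRule(50, "actual", "CFO, FinOps team"),
    AlertRule(80, "actual", "CFO, CTO, FinOps team"),
    AlertRule(100, "forecasted", "CFO, CTO, all team leads"),
    AlertRule(100, "actual", "everyone + Slack alert"),
]
```

For example, with a $50,000 budget, $41,000 of actual spend, and a $52,000 forecast, the 50% and 80% actual rules and the 100% forecasted rule fire, while the 100% actual rule does not.
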
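### Budget Alert Actions

**Automated Responses to Budget Alerts**

```python
# Lambda function triggered by Budget alert SNS topic
# Assumption: the SNS message body is JSON with `budgetName` and `threshold`
# (percent). Adapt the parsing to the payload your alerts actually deliver.
import json


def stop_dev_instances():
    """Placeholder: stop non-production instances."""


def send_slack_alert(message):
    """Placeholder: post the message to Slack."""
    print(message)


def create_cost_investigation_ticket():
    """Placeholder: open a JIRA ticket for the overspend."""


def lambda_handler(event, context):
    # Parse budget alert out of the SNS envelope
    alert = json.loads(event["Records"][0]["Sns"]["Message"])
    budget_name = alert["budgetName"]
    threshold = float(alert["threshold"])

    if threshold >= 100:
        # Stop non-production instances, alert Slack, and open a ticket
        stop_dev_instances()
        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")
        create_cost_investigation_ticket()
    elif threshold >= 80:
        # Send warning
        send_slack_alert(f"⚠️ Budget {budget_name} at {threshold:.0f}%")
```

---
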
## Monthly Review Process

### FinOps Monthly Cadence

**Week 1: Data Collection**
- Export Cost & Usage Reports
- Run cost optimization scripts
- Gather CloudWatch metrics
- Compile anomaly reports

**Week 2: Analysis**
- Identify cost trends
- Find optimization opportunities
- Compare to previous months
- Analyze tag compliance

**Week 3: Team Review Meetings**
- Present findings to engineering teams
- Discuss optimization opportunities
- Assign action items
- Review upcoming projects

**Week 4: Executive Reporting**
- Create executive summary
- Present cost trends to leadership
- Report on optimization wins
- Forecast next quarter

### Monthly Review Meeting Agenda

**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager

**Agenda (1 hour)**

1. **Previous Month Recap (10 min)**
   - Total spend vs budget
   - Top 5 services by cost
   - Month-over-month comparison
   - Budget variance explanation

2. **Cost Anomalies (10 min)**
   - Unusual spending spikes
   - Root cause analysis
   - Prevention measures

3. **Optimization Opportunities (15 min)**
   - Unused resources found
   - Rightsizing recommendations
   - Reserved Instance opportunities
   - Estimated savings

4. **Team Cost Breakdown (10 min)**
   - Per-team spending
   - Top spenders
   - Tag compliance status

5. **Upcoming Changes (10 min)**
   - New projects launching
   - Expected cost impact
   - Budget adjustments needed

6. **Action Items Review (5 min)**
   - Follow-up on previous items
   - Assign new action items
   - Set deadlines

**Deliverable**: Monthly FinOps Report (template provided)

### Monthly Report Template

```markdown
# AWS Cost Report - [Month Year]

## Executive Summary
- Total spend: $XX,XXX
- vs Budget: X% (under/over)
- vs Last month: +/-X%
- Optimization savings: $X,XXX

## Cost Breakdown
| Service | Cost | % of Total | MoM Change |
|---------|------|-----------|-----------|
| EC2 | $XX | XX% | +/-X% |
| RDS | $XX | XX% | +/-X% |

## Optimization Actions Taken
1. Migrated 20 instances to Graviton (saved $X/month)
2. Purchased Reserved Instances (saved $X/month)
3. Deleted unused resources (saved $X/month)

## Recommendations for Next Month
1. Right-size 15 oversized instances (potential $X/month savings)
2. Implement S3 lifecycle policies (potential $X/month savings)

## Action Items
- [ ] [Owner] Task description (Deadline)
```

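The Cost Breakdown table in the template can be filled in from two months of per-service totals. The sketch below shows the share-of-total and month-over-month arithmetic under simple assumptions: `cost_breakdown_rows` is a hypothetical helper, costs arrive as plain dictionaries keyed by service name, and a service with no prior-month cost is reported as `n/a` instead of a percentage.

```python
from typing import Dict, List


def cost_breakdown_rows(current: Dict[str, float],
                        previous: Dict[str, float]) -> List[str]:
    """Build Markdown table rows: service, cost, share of total, MoM change."""
    total = sum(current.values())
    rows = [
        "| Service | Cost | % of Total | MoM Change |",
        "|---------|------|-----------|-----------|",
    ]
    # Largest services first, so the top cost drivers lead the table.
    for service, cost in sorted(current.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * cost / total if total else 0.0
        prior = previous.get(service)
        if prior:
            mom = f"{100.0 * (cost - prior) / prior:+.1f}%"
        else:
            # New service or zero baseline: a percentage change is undefined.
            mom = "n/a"
        rows.append(f"| {service} | ${cost:,.0f} | {share:.1f}% | {mom} |")
    return rows
```
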
---

## Roles & Responsibilities

### FinOps Team Structure

**FinOps Lead**
- Owns overall cloud financial management
- Reports to CFO and CTO
- Sets FinOps strategy and goals
- Manages budget process

**Cloud Cost Analyst**
- Analyzes spending trends
- Generates reports and dashboards
- Identifies optimization opportunities
- Runs monthly review process

**Cloud Architect (FinOps focus)**
- Advises on cost-optimized architectures
- Implements cost optimization tools
- Trains engineers on FinOps practices
- Reviews architectural designs for cost impact

### Engineering Team Responsibilities

**Engineering Manager**
- Owns team budget
- Reviews monthly cost reports
- Prioritizes optimization work
- Ensures tagging compliance

**Engineers**
- Tag all resources they create
- Consider cost in design decisions
- Implement optimization recommendations
- Delete unused resources

**Platform/SRE Team**
- Implements cost optimization tooling
- Automates cost monitoring
- Provides cost visibility dashboards
- Enforces tagging policies

---

## Chargeback & Showback

### Showback (Visibility Only)

**Purpose**: Show teams their costs without charging them
**Goal**: Raise cost awareness

**Implementation**:
- Monthly cost reports per team
- Dashboard showing team spending
- Highlight cost trends
- No budget enforcement

**Best for**: Organizations new to FinOps

### Chargeback (Financial Accountability)

**Purpose**: Allocate costs back to business units
**Goal**: Financial accountability

**Implementation**:
- Tag-based cost allocation
- Transfer costs between cost centers
- Teams have hard budgets
- Overspending requires justification

**Best for**: Mature FinOps organizations

### Hybrid Model (Recommended)

**Shared Costs**: Charged to central IT
- VPC resources
- Security tools
- Monitoring infrastructure
- Shared services

**Team Costs**: Charged to teams
- Compute resources (EC2, Lambda)
- Databases
- Storage
- Application-specific services

**Implementation**:
```
Total AWS Bill: $100,000

Shared Costs (30%): $30,000
  → Charged to IT/Platform budget

Team Costs (70%): $70,000
  → Allocated by tags:
    - Team A (Project=alpha): $20,000
    - Team B (Project=beta): $25,000
    - Team C (Project=gamma): $15,000
    - Untagged (alert!): $10,000 → Needs investigation
```

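The split above can be reproduced with a small allocation routine. This is a sketch under stated assumptions rather than a prescribed implementation: `LineItem`, `allocate_costs`, and `untagged_share` are illustrative names, and the `is_shared` flag stands in for whatever rule an organization uses to mark shared infrastructure (for example, a dedicated account or tag). Tracking the untagged share separately makes the "needs investigation" bucket easy to alert on.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (cost in dollars, Project tag or None if untagged, is_shared flag)
LineItem = Tuple[float, Optional[str], bool]


def allocate_costs(items: Iterable[LineItem]) -> Dict[str, float]:
    """Sum costs into 'shared', 'untagged', or one bucket per Project tag."""
    buckets: Dict[str, float] = defaultdict(float)
    for cost, project, is_shared in items:
        if is_shared:
            buckets["shared"] += cost          # charged to the IT/Platform budget
        elif project:
            buckets[f"Project={project}"] += cost
        else:
            buckets["untagged"] += cost        # needs investigation
    return dict(buckets)


def untagged_share(buckets: Dict[str, float]) -> float:
    """Untagged spend as a percentage of all non-shared (team) spend."""
    team_total = sum(v for k, v in buckets.items() if k != "shared")
    if not team_total:
        return 0.0
    return 100.0 * buckets.get("untagged", 0.0) / team_total
```
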
---

## Policy & Governance

### Cost Governance Policies

**1. Resource Creation Policies**

```yaml
- Policy: All resources must be tagged
  Enforcement: Service Control Policy (SCP)
  Exception process: Request via FinOps team

- Policy: Dev/test resources must auto-stop nights/weekends
  Enforcement: AWS Instance Scheduler
  Exception process: Tag with NoAutoStop=true (requires approval)

- Policy: S3 buckets must have lifecycle policies
  Enforcement: AWS Config rule
  Exception process: Document justification in bucket tags
```

**2. Approval Workflows**

```yaml
# Spending thresholds requiring approval

"< $1,000/month":
  - Auto-approved
  - Must be tagged

"$1,000 - $5,000/month":
  - Engineering manager approval
  - Documented in JIRA

"$5,000 - $20,000/month":
  - Director approval
  - Budget impact assessment
  - FinOps team review

"> $20,000/month":
  - VP approval
  - Business case required
  - Quarterly review checkpoint
```

**3. Reserved Instance / Savings Plans Policy**

```yaml
Policy: All commitments require FinOps review

Process:
  1. Team identifies workload suitable for commitment
  2. Submit request to FinOps with:
     - Resource details
     - Usage history (30+ days)
     - Business justification
  3. FinOps analyzes and recommends
  4. Finance approves commitment
  5. FinOps purchases and tracks utilization
```

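The spending thresholds in the approval workflow above can be encoded as a simple lookup, which is handy for pre-checking requests in a ticketing or CI step. This is a minimal sketch: `required_approval` is a hypothetical function name, and because the tiers in the policy share the $5,000 boundary, the code assumes that exactly $5,000 falls into the stricter Director tier while exactly $20,000 stays with the Director (the VP tier starts above $20,000).

```python
from typing import List, Tuple


def required_approval(monthly_cost: float) -> Tuple[str, List[str]]:
    """Return (approver, requirements) for an estimated monthly cost in dollars."""
    if monthly_cost < 0:
        raise ValueError("monthly cost cannot be negative")
    if monthly_cost < 1_000:
        return "Auto-approved", ["Must be tagged"]
    if monthly_cost < 5_000:
        return "Engineering manager", ["Documented in JIRA"]
    if monthly_cost <= 20_000:
        # Assumption: the shared $5,000 boundary goes to the stricter tier.
        return "Director", ["Budget impact assessment", "FinOps team review"]
    return "VP", ["Business case required", "Quarterly review checkpoint"]
```

For instance, `required_approval(7_500)` returns the Director tier together with its two review requirements. Whichever way the boundary cases are resolved, documenting the choice next to the policy avoids disputes later.
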
### Automation & Guardrails

**Automated Actions**

```yaml
# Non-production resource scheduling
Schedule: Instance Scheduler
  - Stop all dev/test EC2/RDS instances at 7pm weekdays
  - Stop all dev/test instances all weekend
  - Start at 7am weekdays
  - Exception tag: NoAutoStop=true

# Untagged resource alerts
Trigger: AWS Config rule violation
Action:
  - Send Slack alert to team
  - Create JIRA ticket
  - Escalate if not tagged in 48 hours

# Old snapshot cleanup
Schedule: Weekly Lambda function
Action:
  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
  - Notify teams of deletions
  - Estimate savings

# Budget breach response
Trigger: Budget > 100%
Action:
  - Email alerts to stakeholders
  - Create incident ticket
  - Stop non-production resources (optional)
```

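The non-production schedule above reduces to one decision per instance: should it be running right now? The sketch below captures that rule in isolation. It is not how AWS Instance Scheduler is configured; `should_be_running` is a hypothetical helper, `now` is assumed to already be in the team's local time zone, and instances outside dev/test are assumed to be left untouched.

```python
from datetime import datetime
from typing import Dict

START_HOUR = 7   # 7am: start of the weekday running window
STOP_HOUR = 19   # 7pm: end of the weekday running window


def should_be_running(tags: Dict[str, str], now: datetime) -> bool:
    """Decide whether an instance should be up at local time `now`."""
    # Instances carrying the exception tag are never stopped automatically.
    if tags.get("NoAutoStop", "").lower() == "true":
        return True
    # Only dev/test resources are scheduled; everything else stays as it is.
    if tags.get("Environment") not in {"dev", "test"}:
        return True
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and START_HOUR <= now.hour < STOP_HOUR
```
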
---

## Metrics & KPIs

### Key FinOps Metrics

**1. Cost Metrics**
```yaml
Total Monthly Cloud Spend:
  Target: Within budget
  Trend: Track month-over-month

Cost per Customer:
  Calculation: Total AWS Cost / Active Customers
  Target: Decreasing over time

Cost per Transaction:
  Calculation: Total AWS Cost / Transactions Processed
  Target: Optimize for efficiency

Unit Economics:
  Calculation: Revenue per Customer - Cost per Customer
  Target: Positive and growing
```

**2. Efficiency Metrics**
```yaml
Compute Utilization:
  Metric: Average CPU utilization
  Target: 40-60% (room for burst, not over-provisioned)

Storage Utilization:
  Metric: "% of S3 storage in cost-optimized tiers"
  Target: ">60% in IA or Glacier tiers"

Reserved Instance Coverage:
  Metric: "% of eligible usage covered by RIs/SPs (the rest runs On-Demand)"
  Target: ">70% for stable workloads"

RI/SP Utilization:
  Metric: "% of RIs/SPs actually used"
  Target: ">90%"
```

**3. Operational Metrics**
```yaml
Tag Compliance:
  Metric: "% of resources with required tags"
  Target: ">95%"

Budget Variance:
  Metric: Actual vs Budget %
  Target: ±5%

Optimization Savings:
  Metric: $ saved per month from optimizations
  Target: Growing

Mean Time to Optimize (MTTO):
  Metric: Days from finding opportunity to implementing
  Target: <30 days
```

**4. Organizational Metrics**
```yaml
FinOps Engagement:
  Metric: "% of teams attending monthly reviews"
  Target: 100%

Cost Awareness:
  Survey: Do engineers know their team's monthly cost?
  Target: ">80% aware"

Optimization Velocity:
  Metric: "# optimization tasks completed per quarter"
  Target: Growing trend
```

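Several of the metrics above are simple ratios, and writing them down as code removes ambiguity about numerators and denominators. The following sketch is illustrative only: the function names are hypothetical, the inputs are assumed to come from billing exports or business systems, and the 5% default tolerance mirrors the Budget Variance target.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage; the denominator must be positive."""
    if whole <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * part / whole


def cost_per_customer(total_aws_cost: float, active_customers: int) -> float:
    """Total AWS Cost / Active Customers."""
    if active_customers <= 0:
        raise ValueError("active_customers must be positive")
    return total_aws_cost / active_customers


def unit_economics(revenue_per_customer: float, cost_per_cust: float) -> float:
    """Revenue per Customer - Cost per Customer (positive is healthy)."""
    return revenue_per_customer - cost_per_cust


def budget_variance_pct(actual: float, budget: float) -> float:
    """Signed variance in percent; positive means over budget."""
    return percent(actual - budget, budget)


def within_variance_target(actual: float, budget: float,
                           tolerance_pct: float = 5.0) -> bool:
    """True when actual spend is within ±tolerance_pct of the budget."""
    return abs(budget_variance_pct(actual, budget)) <= tolerance_pct


def commitment_coverage_pct(covered_usage: float, eligible_usage: float) -> float:
    """Share of eligible usage covered by RIs/SPs."""
    return percent(covered_usage, eligible_usage)
```

As a usage example, a $50,000 bill spread over 2,000 active customers gives a cost per customer of $25, and revenue of $40 per customer then yields unit economics of $15.
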
### Dashboard Requirements

**Executive Dashboard (Monthly)**
- Total spend vs budget
- Spend by service (top 10)
- Month-over-month trend
- Forecast for next quarter
- Optimization savings achieved

**Engineering Dashboard (Real-time)**
- Per-team costs (daily)
- Cost anomaly alerts
- Untagged resources count
- Budget utilization %
- Top cost drivers

**FinOps Dashboard (Daily)**
- Detailed service costs
- Tag compliance metrics
- RI/SP utilization
- Rightsizing opportunities
- Unused resource counts

---

## Getting Started Checklist

### Phase 1: Foundation (Month 1)
- [ ] Enable Cost Explorer
- [ ] Set up AWS Budgets
- [ ] Define tagging strategy
- [ ] Activate cost allocation tags
- [ ] Set up Cost and Usage Reports (CUR)
- [ ] Create basic cost dashboard

### Phase 2: Visibility (Months 2-3)
- [ ] Implement tagging enforcement
- [ ] Run first optimization scripts
- [ ] Set up monthly review meeting
- [ ] Create team cost reports
- [ ] Assign team cost owners
- [ ] Document FinOps processes

### Phase 3: Optimization (Months 4-6)
- [ ] Implement automated resource scheduling
- [ ] Purchase first Reserved Instances
- [ ] Set up cost anomaly detection
- [ ] Automate reporting
- [ ] Train engineering teams
- [ ] Implement showback/chargeback

### Phase 4: Culture (Ongoing)
- [ ] Cost metrics in engineering KPIs
- [ ] Cost review in architecture reviews
- [ ] Regular optimization sprints
- [ ] FinOps champions in each team
- [ ] Cost-aware development practices
- [ ] Continuous improvement

---

## Resources

**AWS Native Tools**
- AWS Cost Explorer
- AWS Budgets
- AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor
- AWS Cost & Usage Reports

**Third-Party Tools**
- CloudHealth (VMware)
- Cloudability (Apptio)
- Kubecost (Kubernetes cost monitoring)
- Spot.io (Cost optimization platform)

**FinOps Foundation**
- https://www.finops.org
- FinOps Certified Practitioner certification
- FinOps community and best practices