Initial commit
skills/aks-automatic-2025.md
## 🚨 CRITICAL GUIDELINES

### Windows File Path Requirements

**MANDATORY: Always Use Backslashes on Windows for File Paths**

When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).

**Examples:**
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`

This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
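
As a rough illustration of the rule above, a path written with forward slashes can be normalized before it is passed as `file_path`. This is only a sketch: the function name `to_windows_path` is hypothetical and not part of any tool, and it assumes a bash shell is available (for example Git Bash on Windows).

```bash
# Hypothetical helper (not part of any tool): convert forward slashes
# in a path to backslashes.
to_windows_path() {
  # tr maps every "/" to "\"; '\\' is tr's escape for a single backslash.
  printf '%s\n' "$1" | tr '/' '\\'
}
```

Calling `to_windows_path 'D:/repos/project/file.tsx'` prints the path in the form the guideline marks as correct.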

### Documentation Guidelines

**NEVER create new documentation files unless explicitly requested by the user.**

- **Priority**: Update existing README.md files rather than creating new documentation
- **Repository cleanliness**: Keep repository root clean - only README.md unless user requests otherwise
- **Style**: Documentation should be concise, direct, and professional - avoid AI-generated tone
- **User preference**: Only create additional .md files when user specifically asks for documentation

---

# AKS Automatic - 2025 GA Features

Complete knowledge base for Azure Kubernetes Service Automatic mode (GA October 2025).

## Overview

AKS Automatic is a fully-managed Kubernetes offering that eliminates operational overhead through intelligent automation and built-in best practices.

## Key Features (GA October 2025)

### 1. Zero Operational Overhead
- Fully-managed control plane and worker nodes
- Automatic OS patching and security updates
- Built-in monitoring and diagnostics
- Integrated security and compliance

### 2. Karpenter Integration
- Dynamic node provisioning based on real-time demand
- Intelligent bin-packing for cost optimization
- Automatic node consolidation and deprovisioning
- Support for multiple node pools and instance types

### 3. Auto-Scaling (Enabled by Default)
- **Horizontal Pod Autoscaler (HPA)**: Scale pods based on CPU/memory
- **Vertical Pod Autoscaler (VPA)**: Adjust pod resource requests/limits
- **KEDA**: Event-driven autoscaling for external triggers

### 4. Enhanced Security
- Microsoft Entra ID integration for authentication
- Azure RBAC for Kubernetes authorization
- Network policies enabled by default
- Automatic security patches
- Workload identity for pod-level authentication

### 5. Advanced Networking
- Azure CNI Overlay for efficient IP usage
- Cilium dataplane for high-performance networking
- Network policies for microsegmentation
- Private clusters supported

### 6. New Billing Model (Effective October 19, 2025)
- Hosted control plane fee: **$0.16/cluster/hour**
- Compute charges based on actual node usage
- No separate cluster management fee
- Cost savings from Karpenter optimization

### 7. Node Operating System
- Ubuntu 22.04 for Kubernetes < 1.34
- Ubuntu 24.04 for Kubernetes >= 1.34
- Automatic OS upgrades with node image channel
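
The version cut-over in the list above can be expressed as a small lookup. This is a minimal sketch, assuming the version is given as `major.minor` or `major.minor.patch`; the function name `node_os_for_k8s` is hypothetical and exists only for illustration.

```bash
# Hypothetical helper: print the node OS that the list above associates
# with a given Kubernetes version.
node_os_for_k8s() {
  local version=$1
  local major=${version%%.*}   # text before the first dot
  local rest=${version#*.}     # text after the first dot
  local minor=${rest%%.*}      # minor version, with any patch level dropped
  if (( major > 1 || (major == 1 && minor >= 34) )); then
    echo "Ubuntu 24.04"
  else
    echo "Ubuntu 22.04"
  fi
}
```

For example, `node_os_for_k8s 1.33.2` selects the Ubuntu 22.04 branch and `node_os_for_k8s 1.34` the Ubuntu 24.04 branch.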

## Creating AKS Automatic Cluster

### Basic Creation

```bash
az aks create \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --sku automatic \
  --kubernetes-version 1.34 \
  --location eastus
```

### Production-Ready Configuration

```bash
# Flags are collected in a bash array so that each group can carry a comment.
# (A comment placed between backslash-continued lines would end the command early.)
aks_args=(
  --resource-group MyRG
  --name MyAKSAutomatic
  --location eastus
  --sku automatic
  --tier standard

  # Kubernetes version
  --kubernetes-version 1.34

  # Karpenter (node auto-provisioning) is on by default in automatic mode,
  # so no extra flag is passed for it here

  # Networking
  --network-plugin azure
  --network-plugin-mode overlay
  --network-dataplane cilium
  --service-cidr 10.0.0.0/16
  --dns-service-ip 10.0.0.10
  --load-balancer-sku standard

  # Use custom VNet (optional)
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/MyVNet/subnets/AKSSubnet"

  # Availability zones
  --zones 1 2 3

  # Authentication and authorization
  --enable-managed-identity
  --enable-aad
  --enable-azure-rbac
  --aad-admin-group-object-ids "<group-object-id>"

  # Auto-upgrade
  --auto-upgrade-channel stable
  --node-os-upgrade-channel NodeImage

  # Security
  --enable-defender
  --enable-workload-identity
  --enable-oidc-issuer

  # Monitoring
  --enable-addons monitoring
  --workspace-resource-id "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.OperationalInsights/workspaces/MyWorkspace"

  # Tags
  --tags Environment=Production ManagedBy=AKSAutomatic
)

az aks create "${aks_args[@]}"
```

### With Azure Policy Add-on

```bash
az aks create \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --sku automatic \
  --enable-addons azure-policy \
  --kubernetes-version 1.34
```

## Karpenter Configuration

AKS Automatic uses Karpenter for intelligent node provisioning. Customize node provisioning with AKSNodeClass and NodePool CRDs.

### Default AKSNodeClass

```yaml
apiVersion: karpenter.azure.com/v1alpha1
kind: AKSNodeClass
metadata:
  name: default
spec:
  # OS Image - Ubuntu 24.04 for K8s 1.34+
  osImage:
    sku: Ubuntu
    version: "24.04"

  # VM Series
  vmSeries:
    - Standard_D
    - Standard_E

  # Max pods per node
  maxPodsPerNode: 110

  # Security
  securityProfile:
    sshAccess: Disabled
    securityType: Standard
```

### Custom NodePool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      # Node labels
      labels:
        workload-type: general
    spec:
      # Constraints
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.azure.com/agentpool
          operator: In
          values: ["general"]

      # Taints (optional)
      taints:
        - key: "dedicated"
          value: "general"
          effect: "NoSchedule"

      # NodeClass reference
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default

      # Node lifetime (in karpenter.sh/v1 this lives under template.spec)
      expireAfter: 720h  # 30 days

  # Limits
  limits:
    cpu: "1000"
    memory: 4000Gi

  # Disruption budget
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
    budgets:
      # duration is only valid together with schedule, so it is omitted here
      - nodes: "10%"
```
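
To get a feel for what a percentage budget such as `nodes: "10%"` allows, the arithmetic can be sketched as follows. This assumes the rule described in the upstream Karpenter documentation (the percentage is rounded up, then nodes already being deleted or not ready are subtracted); the function name `allowed_disruptions` is hypothetical, and the behavior on AKS should be verified rather than taken from this sketch.

```bash
# Sketch only: estimate how many nodes a percentage budget lets Karpenter
# disrupt at once.
# Assumption: allowed = ceil(total * percent / 100) - deleting - not_ready
allowed_disruptions() {
  local total=$1 percent=$2 deleting=${3:-0} not_ready=${4:-0}
  # Integer ceiling of total * percent / 100
  local budget=$(( (total * percent + 99) / 100 ))
  local allowed=$(( budget - deleting - not_ready ))
  if (( allowed < 0 )); then
    allowed=0
  fi
  echo "$allowed"
}
```

With 25 nodes, a 10% budget, and one node already being deleted, `allowed_disruptions 25 10 1 0` yields 2 under these assumptions.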

### GPU NodePool for AI Workloads

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: gpu
        gpu-type: nvidia-v100
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["Standard_NC6s_v3", "Standard_NC12s_v3", "Standard_NC24s_v3"]

      taints:
        - key: "nvidia.com/gpu"
          value: "true"
          effect: "NoSchedule"

      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: gpu-nodeclass

  limits:
    cpu: "200"
    memory: 800Gi
    nvidia.com/gpu: "16"

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
```
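
Because the GPU NodePool above taints its nodes, a workload only lands there if it tolerates `nvidia.com/gpu=true:NoSchedule`. The sketch below prints a Pod manifest that does so and selects the `workload-type: gpu` label. The function name `render_gpu_pod`, the pod name, and the image are hypothetical placeholders, not values from this document.

```bash
# Hypothetical helper: print a Pod manifest that tolerates the GPU taint and
# targets the GPU NodePool's label. Arguments: pod name, container image.
render_gpu_pod() {
  local name=$1 image=$2
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  nodeSelector:
    workload-type: gpu
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: ${name}
      image: ${image}
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

The output can be reviewed first and then piped to `kubectl apply -f -`. Requesting `nvidia.com/gpu` in `resources.limits` is what lets the scheduler and Karpenter treat the pod as a GPU workload.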

## Autoscaling with HPA, VPA, and KEDA

### Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 15
```
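
The replica count the HPA aims for follows the standard Kubernetes formula `desired = ceil(currentReplicas * currentUtilization / targetUtilization)`, clamped to `minReplicas` and `maxReplicas`. The sketch below reproduces only that core calculation for a single metric; it deliberately ignores the tolerance band, the stabilization windows, and the scaling policies configured above. The function name `hpa_desired_replicas` is hypothetical.

```bash
# Sketch only: core HPA replica calculation for one utilization metric.
# Arguments: current replicas, current utilization %, target utilization %,
#            minReplicas, maxReplicas
hpa_desired_replicas() {
  local current=$1 utilization=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * utilization / target
  local desired=$(( (current * utilization + target - 1) / target ))
  if (( desired < min )); then
    desired=$min
  fi
  if (( desired > max )); then
    desired=$max
  fi
  echo "$desired"
}
```

For instance, 4 replicas running at 90% CPU against the 70% target above give `hpa_desired_replicas 4 90 70 2 50`, which evaluates to 6. When several metrics are configured, the HPA computes a value per metric and uses the largest.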

### Vertical Pod Autoscaler (VPA)

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"  # Auto, Recreate, Initial, Off
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
```

### KEDA ScaledObject (Event-Driven)

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-queue-scaler
spec:
  scaleTargetRef:
    name: myapp
  minReplicaCount: 0  # Scale to zero
  maxReplicaCount: 100
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    # Azure Service Bus Queue
    - type: azure-servicebus
      metadata:
        queueName: myqueue
        namespace: myservicebus
        messageCount: "5"
      authenticationRef:
        name: azure-servicebus-auth

    # Azure Storage Queue
    - type: azure-queue
      metadata:
        queueName: myqueue
        queueLength: "10"
        accountName: mystorageaccount
      authenticationRef:
        name: azure-storage-auth

    # Prometheus metrics
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: http_requests_per_second
        threshold: "100"
        query: sum(rate(http_requests_total[2m]))
```
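
For a queue trigger, the target value (`messageCount` or `queueLength`) is roughly the number of messages each replica should handle, so the replica count tends toward `ceil(queue length / target)` within the configured bounds. The sketch below is an approximation for a single trigger and ignores `pollingInterval`, `cooldownPeriod`, and the HPA behavior KEDA delegates to; the function name `keda_queue_replicas` is hypothetical.

```bash
# Sketch only: approximate replica count for one queue-based KEDA trigger.
# Arguments: queue length, target messages per replica,
#            minReplicaCount, maxReplicaCount
keda_queue_replicas() {
  local queue_length=$1 per_replica=$2 min=$3 max=$4
  # Integer ceiling of queue_length / per_replica
  local desired=$(( (queue_length + per_replica - 1) / per_replica ))
  if (( desired < min )); then
    desired=$min
  fi
  if (( desired > max )); then
    desired=$max
  fi
  echo "$desired"
}
```

With the Service Bus trigger above (`messageCount: "5"`), 42 queued messages give `keda_queue_replicas 42 5 0 100`, which evaluates to 9, and an empty queue gives 0, which is the scale-to-zero case.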

## Workload Identity (Replaces AAD Pod Identity)

### Setup

```bash
# Workload identity is enabled by default in AKS Automatic

# Create managed identity
az identity create \
  --name myapp-identity \
  --resource-group MyRG

# Get identity details
export IDENTITY_CLIENT_ID=$(az identity show -g MyRG -n myapp-identity --query clientId -o tsv)
export IDENTITY_OBJECT_ID=$(az identity show -g MyRG -n myapp-identity --query principalId -o tsv)

# Assign role to identity (the scope is quoted so the shell does not treat < and > as redirections)
az role assignment create \
  --assignee "$IDENTITY_OBJECT_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.Storage/storageAccounts/mystorage"

# Create federated credential
export AKS_OIDC_ISSUER=$(az aks show -g MyRG -n MyAKSAutomatic --query oidcIssuerProfile.issuerUrl -o tsv)

az identity federated-credential create \
  --name myapp-federated-credential \
  --identity-name myapp-identity \
  --resource-group MyRG \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject system:serviceaccount:default:myapp-sa
```
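
The `--subject` value has to match the service account exactly, in the form `system:serviceaccount:<namespace>:<service-account-name>`; a mismatch is a common reason token exchange fails. A small helper, shown here as a sketch with the hypothetical name `federated_subject`, keeps the string consistent with the names used in the manifests.

```bash
# Hypothetical helper: build the federated credential subject for a
# Kubernetes service account. Arguments: namespace, service account name.
federated_subject() {
  local namespace=$1 service_account=$2
  printf 'system:serviceaccount:%s:%s\n' "$namespace" "$service_account"
}
```

It could then be used as `--subject "$(federated_subject default myapp-sa)"` in the command above.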

### Kubernetes Resources

```yaml
# Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: default
  annotations:
    azure.workload.identity/client-id: "<IDENTITY_CLIENT_ID>"

---
# Deployment using workload identity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        azure.workload.identity/use: "true"  # Enable workload identity
    spec:
      serviceAccountName: myapp-sa
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:latest
          # The workload identity webhook normally injects the env vars and the
          # projected token volume below; they are spelled out here for reference.
          env:
            - name: AZURE_CLIENT_ID
              value: "<IDENTITY_CLIENT_ID>"
            - name: AZURE_TENANT_ID
              value: "<TENANT_ID>"
            - name: AZURE_FEDERATED_TOKEN_FILE
              value: /var/run/secrets/azure/tokens/azure-identity-token
          volumeMounts:
            - name: azure-identity-token
              mountPath: /var/run/secrets/azure/tokens
              readOnly: true
      volumes:
        - name: azure-identity-token
          projected:
            sources:
              - serviceAccountToken:
                  path: azure-identity-token
                  expirationSeconds: 3600
                  audience: api://AzureADTokenExchange
```

## Monitoring and Observability

### Enable Container Insights

```bash
# Already enabled with --enable-addons monitoring
# Query logs using Azure Monitor

# Get cluster logs (placeholder quoted so the shell does not treat < and > as redirections)
az monitor log-analytics query \
  --workspace "<workspace-id>" \
  --analytics-query "KubePodInventory | where ClusterName == 'MyAKSAutomatic' | take 10" \
  --output table

# Get Karpenter logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
```

### Prometheus and Grafana

```bash
# Enable managed Prometheus
az aks update \
  --resource-group MyRG \
  --name MyAKSAutomatic \
  --enable-azure-monitor-metrics

# Access Grafana dashboards through Azure Portal
```

## Cost Optimization

### Billing Model (October 2025)
- **Control plane**: $0.16/hour per cluster
- **Compute**: Pay for actual node usage
- **Karpenter**: Automatic bin-packing and consolidation
- **Scale-to-zero**: Possible with KEDA and Karpenter
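
To put the control plane fee in context, the hourly rate can be turned into a figure for any period. This is a sketch that covers only the $0.16/hour fee quoted above and excludes compute and any other charges; the function name `control_plane_cost` is hypothetical, and the 730-hour month in the example is a common averaging assumption rather than a billing rule.

```bash
# Sketch only: control plane fee in dollars for a number of cluster-hours,
# using the $0.16/hour rate quoted above. Works in integer cents because
# bash has no floating-point arithmetic.
control_plane_cost() {
  local hours=$1
  local cents=$(( 16 * hours ))   # $0.16 = 16 cents per cluster-hour
  printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}
```

`control_plane_cost 730` gives 116.80, so one cluster costs roughly $117 per month in control plane fees before any nodes are counted.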

### Cost-Saving Tips

1. **Use Spot Instances for Non-Critical Workloads**
   ```yaml
   - key: karpenter.sh/capacity-type
     operator: In
     values: ["spot"]
   ```

2. **Configure Aggressive Consolidation**
   ```yaml
   disruption:
     consolidationPolicy: WhenEmptyOrUnderutilized
     consolidateAfter: 30s
   ```

3. **Implement Pod Disruption Budgets**
   ```yaml
   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: myapp-pdb
   spec:
     minAvailable: 1
     selector:
       matchLabels:
         app: myapp
   ```

4. **Use VPA for Right-Sizing**
   - VPA automatically adjusts resource requests based on actual usage

## Migration from Standard AKS to Automatic

AKS Automatic is a new cluster mode - in-place migration is not supported. Follow these steps:

1. **Create new AKS Automatic cluster**
2. **Install workloads in new cluster**
3. **Validate functionality**
4. **Switch traffic** (DNS, load balancer)
5. **Decommission old cluster**

## Best Practices

✓ Use AKS Automatic for new production clusters
✓ Enable workload identity for pod authentication
✓ Configure custom NodePools for specific workload types
✓ Implement HPA, VPA, and KEDA for comprehensive scaling
✓ Use spot instances for batch and fault-tolerant workloads
✓ Enable Container Insights and Managed Prometheus
✓ Configure Pod Disruption Budgets for critical apps
✓ Use network policies for microsegmentation
✓ Enable Azure Policy add-on for compliance
✓ Implement GitOps with Flux or Argo CD

## Troubleshooting

### Check Karpenter Status
```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=100
kubectl get nodepools
kubectl get nodeclaims
```

### View Node Provisioning Events
```bash
kubectl get events --field-selector involvedObject.kind=NodePool -A
```

### Debug Workload Identity Issues
```bash
# Check service account annotation
kubectl get sa myapp-sa -o yaml

# Check pod labels (placeholder quoted so the shell does not treat < and > as redirections)
kubectl get pod "<pod-name>" -o yaml | grep azure.workload.identity

# Check federated credential
az identity federated-credential show \
  --identity-name myapp-identity \
  --resource-group MyRG \
  --name myapp-federated-credential
```

## References

- [AKS Automatic Documentation](https://learn.microsoft.com/en-us/azure/aks/automatic)
- [Karpenter on Azure](https://karpenter.sh)
- [Workload Identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview)
- [AKS Release Notes](https://github.com/Azure/AKS/releases)

AKS Automatic represents the future of managed Kubernetes on Azure - zero operational overhead with maximum automation!