🚨 CRITICAL GUIDELINES
Windows File Path Requirements
MANDATORY: Always Use Backslashes on Windows for File Paths
When using Edit or Write tools on Windows, you MUST use backslashes (\) in file paths, NOT forward slashes (/).
Examples:
- ❌ WRONG: D:/repos/project/file.tsx
- ✅ CORRECT: D:\repos\project\file.tsx
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
Documentation Guidelines
NEVER create new documentation files unless explicitly requested by the user.
- Priority: Update existing README.md files rather than creating new documentation
- Repository cleanliness: Keep repository root clean - only README.md unless user requests otherwise
- Style: Documentation should be concise, direct, and professional - avoid AI-generated tone
- User preference: Only create additional .md files when user specifically asks for documentation
Azure Container Apps GPU Support - 2025 Features
Complete knowledge base for Azure Container Apps with GPU support, serverless capabilities, and Dapr integration (2025 GA features).
Overview
Azure Container Apps is a serverless container platform with native GPU support, Dapr integration, and scale-to-zero capabilities for cost-efficient AI/ML workloads.
Key 2025 Features (Build Announcements)
1. Serverless GPU (GA)
- Automatic scaling: Scale GPU workloads based on demand
- Scale-to-zero: Pay only when GPU is actively used
- Per-second billing: Granular cost control
- Optimized cold start: Fast initialization for AI models
- Reduced operational overhead: No infrastructure management
2. Dedicated GPU (GA)
- Consistent performance: Dedicated GPU resources
- Simplified AI deployment: Easy model hosting
- Long-running workloads: Ideal for training and continuous inference
- Multiple GPU types: NVIDIA A100, T4, and more
3. Dynamic Sessions with GPU (Early Access)
- Sandboxed execution: Run untrusted AI-generated code
- Hyper-V isolation: Enhanced security
- GPU-powered Python interpreter: Handle compute-intensive AI workloads
- Scale at runtime: Dynamic resource allocation
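Dynamic sessions are managed through session pools. Below is a rough sketch using the `az containerapp sessionpool` commands for the built-in Python code interpreter; GPU-backed pools are early access, so the pool name is a placeholder and flag names or profile options may differ from what is shown here.

```bash
# Create a session pool for the built-in Python code interpreter.
# GPU-backed pools are early access; these flags reflect the standard
# (non-GPU) session pool CLI and may change.
az containerapp sessionpool create \
  --name my-session-pool \
  --resource-group MyRG \
  --location eastus \
  --container-type PythonLTS \
  --max-sessions 20
```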
4. Foundry Models Integration
- Deploy AI models directly: During container app creation
- Ready-to-use models: Pre-configured inference endpoints
- Azure AI Foundry: Seamless integration
5. Workflow with Durable Task Scheduler (Preview)
- Long-running workflows: Reliable orchestration
- State management: Automatic persistence
- Event-driven: Trigger workflows from events
6. Native Azure Functions Support
- Functions runtime: Run Azure Functions in Container Apps
- Consistent development: Same code, serverless execution
- Event triggers: All Functions triggers supported
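A hedged sketch of deploying a Functions container image with the native hosting mode. The `--kind functionapp` flag follows the announced native Functions support, and the image name is a placeholder; verify the exact syntax against current CLI documentation.

```bash
# Run an Azure Functions container image natively on Container Apps.
# --kind functionapp is the announced native Functions hosting mode;
# the image name below is a placeholder.
az containerapp create \
  --name my-functions-app \
  --resource-group MyRG \
  --environment myenv \
  --image myregistry.azurecr.io/my-functions:latest \
  --kind functionapp \
  --ingress external \
  --target-port 80
```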
7. Dapr Integration (GA)
- Service discovery: Built-in DNS-based discovery
- State management: Distributed state stores
- Pub/sub messaging: Reliable messaging patterns
- Service invocation: Resilient service-to-service calls
- Observability: Integrated tracing and metrics
Creating Container Apps with GPU
Basic Container App with Serverless GPU
# Create Container Apps environment
az containerapp env create \
--name myenv \
--resource-group MyRG \
--location eastus \
--logs-workspace-id <workspace-id> \
--logs-workspace-key <workspace-key>
# Create Container App with GPU
az containerapp create \
--name myapp-gpu \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--ingress external \
--target-port 8080
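Once the app is created, you can look up its public FQDN and exercise the endpoint. This is a minimal check; `/health` is an assumed application route, so replace it with whatever your container actually serves.

```bash
# Look up the app's public FQDN and send a test request.
FQDN=$(az containerapp show \
  --name myapp-gpu \
  --resource-group MyRG \
  --query properties.configuration.ingress.fqdn -o tsv)

# /health is an assumed application route; adjust to your app's endpoint.
curl -sf "https://${FQDN}/health"
```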
Production-Ready Container App with GPU
# Note: bash does not allow comment lines inside a backslash-continued
# command, so all options are passed as one continuous command.
az containerapp create \
--name myapp-gpu-prod \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--registry-server myregistry.azurecr.io \
--registry-identity system \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 20 \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 10 \
--ingress external \
--target-port 8080 \
--transport http2 \
--env-vars "AZURE_CLIENT_ID=secretref:client-id" \
--enable-dapr \
--dapr-app-id myapp \
--dapr-app-port 8080 \
--dapr-app-protocol http \
--system-assigned
Container Apps Environment Configuration
Environment with Zone Redundancy
az containerapp env create \
--name myenv-prod \
--resource-group MyRG \
--location eastus \
--logs-workspace-id <workspace-id> \
--logs-workspace-key <workspace-key> \
--zone-redundant true \
--enable-workload-profiles true
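Zone redundancy generally requires the environment to run on a custom virtual network with an infrastructure subnet. A minimal sketch, assuming an existing VNet; the VNet and subnet names are illustrative placeholders.

```bash
# Zone redundancy requires an infrastructure subnet from a custom VNet.
# The VNet/subnet names below are placeholders.
SUBNET_ID=$(az network vnet subnet show \
  --resource-group MyRG \
  --vnet-name my-vnet \
  --name infrastructure-subnet \
  --query id -o tsv)

az containerapp env create \
  --name myenv-prod \
  --resource-group MyRG \
  --location eastus \
  --infrastructure-subnet-resource-id "$SUBNET_ID" \
  --zone-redundant
```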
Workload Profiles (Dedicated GPU)
# Create environment with workload profiles
az containerapp env create \
--name myenv-gpu \
--resource-group MyRG \
--location eastus \
--enable-workload-profiles true
# Add GPU workload profile
az containerapp env workload-profile add \
--name myenv-gpu \
--resource-group MyRG \
--workload-profile-name gpu-profile \
--workload-profile-type GPU-A100 \
--min-nodes 0 \
--max-nodes 10
# Create container app with GPU profile
az containerapp create \
--name myapp-dedicated-gpu \
--resource-group MyRG \
--environment myenv-gpu \
--workload-profile-name gpu-profile \
--image myregistry.azurecr.io/training-job:latest \
--cpu 8 \
--memory 16Gi \
--min-replicas 1 \
--max-replicas 5
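To confirm the GPU profile was added to the environment (and see its current node limits), list the environment's workload profiles:

```bash
# Verify the GPU workload profile exists on the environment.
az containerapp env workload-profile list \
  --name myenv-gpu \
  --resource-group MyRG \
  --output table
```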
GPU Scaling Rules
Custom Prometheus Scaling
az containerapp create \
--name myapp-gpu-prometheus \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/ai-model:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--scale-rule-name gpu-utilization \
--scale-rule-type custom \
--scale-rule-custom-type prometheus \
--scale-rule-metadata \
serverAddress=http://prometheus.monitoring.svc.cluster.local:9090 \
metricName=gpu_utilization \
threshold=80 \
query="avg(nvidia_gpu_utilization{app='myapp'})"
Queue-Based Scaling (Azure Service Bus)
az containerapp create \
--name myapp-queue-processor \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/batch-processor:latest \
--cpu 4 \
--memory 8Gi \
--gpu-type nvidia-t4 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 50 \
--scale-rule-name queue-scaling \
--scale-rule-type azure-servicebus \
--scale-rule-metadata \
queueName=ai-jobs \
namespace=myservicebus \
messageCount=5 \
--scale-rule-auth connection=servicebus-connection
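The `--scale-rule-auth connection=servicebus-connection` setting assumes a secret named `servicebus-connection` exists on the app. A sketch of populating it from the Service Bus namespace; the authorization rule name shown is the default `RootManageSharedAccessKey`, so adjust as needed.

```bash
# Fetch a Service Bus connection string and store it as the app secret
# referenced by the scale rule's auth setting.
SB_CONNECTION=$(az servicebus namespace authorization-rule keys list \
  --resource-group MyRG \
  --namespace-name myservicebus \
  --name RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)

az containerapp secret set \
  --name myapp-queue-processor \
  --resource-group MyRG \
  --secrets servicebus-connection="$SB_CONNECTION"
```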
Dapr Integration
Enable Dapr on Container App
az containerapp create \
--name myapp-dapr \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--enable-dapr \
--dapr-app-id myapp \
--dapr-app-port 8080 \
--dapr-app-protocol http \
--dapr-http-max-request-size 4 \
--dapr-http-read-buffer-size 4 \
--dapr-log-level info \
--dapr-enable-api-logging true
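With Dapr enabled, the sidecar is reachable from inside the container over localhost on Dapr's default HTTP port (3500). For example, the app can check sidecar health or invoke another Dapr-enabled app by its app-id; the target app-id and method here mirror the Python example further below.

```bash
# From inside the container: call the Dapr sidecar's HTTP API.
# Check sidecar health.
curl -s http://localhost:3500/v1.0/healthz

# Invoke a method on another Dapr-enabled container app by its app-id.
curl -s -X POST http://localhost:3500/v1.0/invoke/other-service/method/process \
  -H "Content-Type: application/json" \
  -d '{"input": "data"}'
```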
Dapr State Store (Azure Cosmos DB)
# component.yaml - Dapr state store component (Container Apps component schema)
componentType: state.azure.cosmosdb
version: v1
metadata:
- name: url
  value: "https://mycosmosdb.documents.azure.com:443/"
- name: database
  value: "mydb"
- name: collection
  value: "state"
- name: masterKey
  secretRef: cosmosdb-key
secrets:
- name: cosmosdb-key
  value: "<cosmosdb-primary-key>"
# Create the component
az containerapp env dapr-component set \
--name myenv \
--resource-group MyRG \
--dapr-component-name statestore \
--yaml component.yaml
Dapr Pub/Sub (Azure Service Bus)
# pubsub.yaml - Dapr pub/sub component (Container Apps component schema)
componentType: pubsub.azure.servicebus.topics
version: v1
metadata:
- name: connectionString
  secretRef: servicebus-connection
- name: consumerID
  value: "myapp"
secrets:
- name: servicebus-connection
  value: "<service-bus-connection-string>"
Service-to-Service Invocation
# Python example using the Dapr SDK
from dapr.clients import DaprClient

with DaprClient() as client:
    # Invoke another service
    response = client.invoke_method(
        app_id='other-service',
        method_name='process',
        data='{"input": "data"}'
    )

    # Save state
    client.save_state(
        store_name='statestore',
        key='mykey',
        value='myvalue'
    )

    # Publish message
    client.publish_event(
        pubsub_name='pubsub',
        topic_name='orders',
        data='{"orderId": "123"}'
    )
AI Model Deployment Patterns
OpenAI-Compatible Endpoint
# Dockerfile for vLLM model serving
FROM vllm/vllm-openai:latest
# Exec-form CMD does not perform shell variable expansion, so the serving
# options are passed as literal values (adjust them when building the image).
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
"--gpu-memory-utilization", "0.9", \
"--max-model-len", "4096", \
"--port", "8080"]
# Deploy vLLM model
az containerapp create \
--name llama-inference \
--resource-group MyRG \
--environment myenv \
--image vllm/vllm-openai:latest \
--cpu 8 \
--memory 32Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 1 \
--max-replicas 5 \
--target-port 8080 \
--ingress external \
--env-vars \
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
GPU_MEMORY_UTILIZATION="0.9" \
HF_TOKEN=secretref:huggingface-token
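Once the deployment is ready, the service exposes vLLM's OpenAI-compatible API, so it can be exercised with a plain HTTP call. A minimal sketch; the model name must match what the server was started with.

```bash
# Query the OpenAI-compatible chat completions endpoint exposed by vLLM.
FQDN=$(az containerapp show \
  --name llama-inference \
  --resource-group MyRG \
  --query properties.configuration.ingress.fqdn -o tsv)

curl -s "https://${FQDN}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```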
Stable Diffusion Image Generation
az containerapp create \
--name stable-diffusion \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/stable-diffusion:latest \
--cpu 4 \
--memory 16Gi \
--gpu-type nvidia-a100 \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 10 \
--target-port 7860 \
--ingress external \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 1
Batch Processing Job
az containerapp job create \
--name batch-training-job \
--resource-group MyRG \
--environment myenv \
--trigger-type Manual \
--image myregistry.azurecr.io/training:latest \
--cpu 8 \
--memory 32Gi \
--gpu-type nvidia-a100 \
--gpu-count 2 \
--parallelism 1 \
--replica-timeout 7200 \
--replica-retry-limit 3 \
--env-vars \
DATASET_URL="https://mystorage.blob.core.windows.net/datasets/train.csv" \
MODEL_OUTPUT="https://mystorage.blob.core.windows.net/models/" \
EPOCHS="100"
# Execute job
az containerapp job start \
--name batch-training-job \
--resource-group MyRG
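To track the started run, you can list the job's execution history; each `start` produces an execution record with its status.

```bash
# List executions of the training job and their status.
az containerapp job execution list \
  --name batch-training-job \
  --resource-group MyRG \
  --output table
```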
Monitoring and Observability
Application Insights Integration
az containerapp create \
--name myapp-monitored \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--env-vars \
APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-connection
Query Logs
# Stream logs
az containerapp logs show \
--name myapp-gpu \
--resource-group MyRG \
--follow
# Query with Log Analytics
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "ContainerAppConsoleLogs_CL | where ContainerAppName_s == 'myapp-gpu' | take 100"
Metrics and Alerts
# Create a metric alert (this example alerts on request volume)
az monitor metrics alert create \
--name high-request-volume \
--resource-group MyRG \
--scopes $(az containerapp show -g MyRG -n myapp-gpu --query id -o tsv) \
--condition "avg Requests > 100" \
--window-size 5m \
--evaluation-frequency 1m \
--action <action-group-id>
Security Best Practices
Managed Identity
# Create with system-assigned identity
az containerapp create \
--name myapp-identity \
--resource-group MyRG \
--environment myenv \
--system-assigned \
--image myregistry.azurecr.io/myapp:latest
# Get identity principal ID
IDENTITY_ID=$(az containerapp show -g MyRG -n myapp-identity --query identity.principalId -o tsv)
# Assign role to access Key Vault
az role assignment create \
--assignee $IDENTITY_ID \
--role "Key Vault Secrets User" \
--scope /subscriptions/<sub-id>/resourceGroups/MyRG/providers/Microsoft.KeyVault/vaults/mykeyvault
# Use user-assigned identity
az identity create --name myapp-identity --resource-group MyRG
IDENTITY_RESOURCE_ID=$(az identity show -g MyRG -n myapp-identity --query id -o tsv)
az containerapp create \
--name myapp-user-identity \
--resource-group MyRG \
--environment myenv \
--user-assigned $IDENTITY_RESOURCE_ID \
--image myregistry.azurecr.io/myapp:latest
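Because earlier examples pull from a private registry with `--registry-identity system`, the app's identity also needs pull permission on that registry. A sketch granting AcrPull, reusing the `IDENTITY_ID` variable from above; the registry name is the example one used throughout this document.

```bash
# Allow the app's system-assigned identity to pull images from ACR.
ACR_ID=$(az acr show --name myregistry --resource-group MyRG --query id -o tsv)

az role assignment create \
  --assignee $IDENTITY_ID \
  --role "AcrPull" \
  --scope $ACR_ID
```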
Secret Management
# Add secrets
az containerapp secret set \
--name myapp-gpu \
--resource-group MyRG \
--secrets \
huggingface-token="<token>" \
api-key="<key>"
# Reference secrets in environment variables
az containerapp update \
--name myapp-gpu \
--resource-group MyRG \
--set-env-vars \
HF_TOKEN=secretref:huggingface-token \
API_KEY=secretref:api-key
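Secrets can also reference Azure Key Vault directly instead of storing the raw value on the app. A hedged sketch using the `keyvaultref:`/`identityref:` secret syntax; the vault URL is a placeholder, and the app's managed identity needs the Key Vault Secrets User role shown earlier.

```bash
# Reference a Key Vault secret instead of storing the raw value.
# The Key Vault secret URI below is a placeholder.
az containerapp secret set \
  --name myapp-gpu \
  --resource-group MyRG \
  --secrets "huggingface-token=keyvaultref:https://mykeyvault.vault.azure.net/secrets/hf-token,identityref:system"
```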
Cost Optimization
Scale-to-Zero Configuration
az containerapp create \
--name myapp-scale-zero \
--resource-group MyRG \
--environment myenv \
--image myregistry.azurecr.io/myapp:latest \
--min-replicas 0 \
--max-replicas 10 \
--scale-rule-name http-scaling \
--scale-rule-type http \
--scale-rule-http-concurrency 10
Cost savings: Pay only when requests are being processed. GPU costs are per-second when active.
Right-Sizing Resources
# Start with minimal resources
--cpu 2 --memory 4Gi --gpu-count 1
# Monitor and adjust based on actual usage
az monitor metrics list \
--resource $(az containerapp show -g MyRG -n myapp-gpu --query id -o tsv) \
--metric "CpuPercentage,MemoryPercentage"
Use Spot/Preemptible GPUs (Future Feature)
When available, spot instances can substantially reduce GPU costs for non-critical, interruption-tolerant workloads.
Troubleshooting
Check Revision Status
az containerapp revision list \
--name myapp-gpu \
--resource-group MyRG \
--output table
View Revision Details
az containerapp revision show \
--name myapp-gpu \
--revision <revision-name> \
--resource-group MyRG
Restart Container App
# Restart the active revision (get its name from 'az containerapp revision list')
az containerapp revision restart \
--name myapp-gpu \
--revision <revision-name> \
--resource-group MyRG
GPU Not Available
If GPU is not provisioning:
- Check region availability: Not all regions support GPU
- Verify quota: Request quota increase if needed
- Check workload profile: Ensure GPU workload profile is created
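To check the first item above, you can query which workload profiles (including GPU types) a region supports; the command below reflects current CLI behavior and may vary by CLI version.

```bash
# List the workload profile types supported in a region.
az containerapp env workload-profile list-supported \
  --location eastus \
  --output table
```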
Best Practices
✓ Use scale-to-zero for intermittent workloads
✓ Implement health probes (liveness and readiness)
✓ Use managed identities for authentication
✓ Store secrets in Azure Key Vault
✓ Enable Dapr for microservices patterns
✓ Configure appropriate scaling rules
✓ Monitor GPU utilization and adjust resources
✓ Use Container Apps jobs for batch processing
✓ Implement retry logic for transient failures
✓ Use Application Insights for observability
Azure Container Apps with GPU support provides a serverless platform well suited to AI/ML workloads, combining scale-to-zero economics with dedicated GPU options.