GPU Acceleration on Modal
Quick Start
Run functions on GPUs with the gpu parameter:
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App(image=image)

@app.function(gpu="A100")
def run():
    import torch
    assert torch.cuda.is_available()
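One way to trigger the function remotely is the modal run CLI (the filename gpu_example.py is illustrative):

modal run gpu_example.py::run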
Available GPU Types
Modal supports the following GPUs:
- T4 - Entry-level GPU
- L4 - Balanced performance and cost
- A10 - Up to 4 GPUs, 96 GB total
- A100 - 40GB or 80GB variants
- A100-40GB - Specific 40GB variant
- A100-80GB - Specific 80GB variant
- L40S - 48 GB, excellent for inference
- H100 / H100! - Top-tier Hopper architecture
- H200 - Improved Hopper with more memory
- B200 - Latest Blackwell architecture
See https://modal.com/pricing for pricing.
GPU Count
Request multiple GPUs per container with :n syntax:
@app.function(gpu="H100:8")
def run_llama_405b():
# 8 H100 GPUs available
...
Supported counts:
- B200, H200, H100, A100, L4, T4, L40S: up to 8 GPUs (up to 1,536 GB)
- A10: up to 4 GPUs (up to 96 GB)
Note: Requesting >2 GPUs may result in longer wait times.
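Inside the container, the allocation can be verified at runtime, for example (the function name is illustrative):

@app.function(gpu="H100:8")
def check_gpus():
    import torch
    # All requested devices are visible to this container
    assert torch.cuda.device_count() == 8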
GPU Selection Guide
For Inference (Recommended): Start with L40S
- Excellent cost/performance
- 48 GB memory
- Good for LLaMA, Stable Diffusion, etc.
For Training: Consider H100 or A100
- High compute throughput
- Large memory for batch processing
For Memory-Bound Tasks: H200 or A100-80GB
- More memory capacity
- Better for large models
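As a concrete starting point for inference, here is a minimal sketch of an L40S Function; the model choice, function name, and transformers dependency are illustrative assumptions, not part of the guide above:

@app.function(gpu="L40S", image=modal.Image.debian_slim().pip_install("torch", "transformers"))
def generate(prompt: str) -> str:
    from transformers import pipeline
    # Load a small text-generation model onto GPU 0 (illustrative model)
    pipe = pipeline("text-generation", model="gpt2", device=0)
    return pipe(prompt, max_new_tokens=50)[0]["generated_text"]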
B200 GPUs
NVIDIA's flagship Blackwell chip:
@app.function(gpu="B200:8")
def run_deepseek():
# Most powerful option
...
H200 and H100 GPUs
Hopper architecture GPUs with excellent software support:
@app.function(gpu="H100")
def train():
...
Automatic H200 Upgrades
Modal may upgrade gpu="H100" to H200 at no extra cost. H200 provides:
- 141 GB memory (vs 80 GB for H100)
- 4.8 TB/s bandwidth (vs 3.35 TB/s)
To avoid automatic upgrades (e.g., for benchmarking):
@app.function(gpu="H100!")
def benchmark():
...
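To confirm at runtime which device was actually allocated (useful when relying on, or opting out of, automatic upgrades), the device name can be inspected, for example:

@app.function(gpu="H100")
def which_gpu():
    import torch
    # Prints e.g. "NVIDIA H100 80GB HBM3", or an H200 name if upgraded
    print(torch.cuda.get_device_name(0))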
A100 GPUs
Ampere architecture with 40GB or 80GB variants:
# May be automatically upgraded to 80GB
@app.function(gpu="A100")
def qwen_7b():
    ...

# Specific variants
@app.function(gpu="A100-40GB")
def model_40gb():
    ...

@app.function(gpu="A100-80GB")
def llama_70b():
    ...
GPU Fallbacks
Specify multiple GPU types with fallback:
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
# Tries H100 first, falls back to 2x A100-40GB
...
Modal respects the ordering of the list and allocates the most-preferred GPU type that is available.
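Because the function may land on either configuration, code that cares about the layout can branch on the device count at runtime; a minimal sketch:

@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    import torch
    if torch.cuda.device_count() == 1:
        ...  # one H100: all 80 GB on a single device
    else:
        ...  # two A100-40GB: shard work across both devices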
Multi-GPU Training
Modal supports multi-GPU training on a single node. Multi-node training is in closed beta.
PyTorch Example
Some training frameworks re-execute the Python entrypoint in each worker process, which conflicts with Modal's function-based entrypoint. For those, launch the training script as a subprocess (or use a framework strategy that avoids re-execution):
@app.function(gpu="A100:2")
def train():
import subprocess
import sys
subprocess.run(
["python", "train.py"],
stdout=sys.stdout,
stderr=sys.stderr,
check=True,
)
For PyTorch Lightning, set the Trainer's strategy to "ddp_spawn" or "ddp_notebook".
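A minimal, self-contained Lightning sketch under that constraint; the model and data are illustrative, and it assumes lightning is installed in the image. "ddp_notebook" forks workers rather than re-executing the entrypoint, so the nested class does not need to be picklable:

@app.function(gpu="A100:2")
def train_lightning():
    import lightning as L
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class LitModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    # Toy dataset: 256 random samples of 32 features each
    data = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)
    trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook", max_epochs=1)
    trainer.fit(LitModel(), data)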
Performance Considerations
Memory-Bound vs Compute-Bound:
- Running models at small batch sizes is typically memory-bound rather than compute-bound
- On newer GPUs, arithmetic throughput has grown faster than memory bandwidth
- For memory-bound workloads, the speedup from newer hardware may not justify the extra cost
Optimization:
- Use batching when possible (see the sketch after this list)
- Consider L40S before jumping to H100/B200
- Profile to identify bottlenecks
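As a sketch of the batching point above, Modal's modal.batched decorator can aggregate concurrent calls into a single GPU batch; the parameters and function body here are illustrative:

@app.function(gpu="L40S")
@modal.batched(max_batch_size=8, wait_ms=100)
def embed(prompts: list[str]) -> list[list[float]]:
    # Concurrent calls are grouped into one list and processed
    # in a single forward pass; return one result per input.
    ...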