# GPU Acceleration on Modal

## Quick Start

Run functions on GPUs with the `gpu` parameter:

```python
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App(image=image)


@app.function(gpu="A100")
def run():
    import torch
    assert torch.cuda.is_available()
```
## Available GPU Types

Modal supports the following GPUs:

- `T4` - Entry-level GPU
- `L4` - Balanced performance and cost
- `A10` - Up to 4 GPUs, 96 GB total
- `A100` - 40 GB or 80 GB variants
- `A100-40GB` - Specific 40 GB variant
- `A100-80GB` - Specific 80 GB variant
- `L40S` - 48 GB, excellent for inference
- `H100` / `H100!` - Top-tier Hopper architecture
- `H200` - Improved Hopper with more memory
- `B200` - Latest Blackwell architecture

See https://modal.com/pricing for pricing.
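
To confirm exactly which card and driver a container received, you can shell out to `nvidia-smi` from inside any GPU function. A minimal sketch; the function name is illustrative:

```python
@app.function(gpu="A100")
def show_gpu():
    import subprocess

    # Prints the allocated GPU model, driver version, and memory.
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```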
## GPU Count

Request multiple GPUs per container with the `:n` syntax:

```python
@app.function(gpu="H100:8")
def run_llama_405b():
    # 8 H100 GPUs available
    ...
```

Supported counts:

- B200, H200, H100, A100, L4, T4, L40S: up to 8 GPUs (up to 1,536 GB)
- A10: up to 4 GPUs (up to 96 GB)

Note: Requesting >2 GPUs may result in longer wait times.
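
Inside the container, the requested devices show up as ordinary CUDA devices, so a quick sanity check is possible. A minimal sketch; the GPU type and count are arbitrary examples:

```python
@app.function(gpu="L4:4")
def check_gpus():
    import torch

    # An "L4:4" request should expose 4 visible CUDA devices.
    assert torch.cuda.device_count() == 4
```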
## GPU Selection Guide

**For Inference (Recommended)**: Start with L40S (see the sketch at the end of this section)

- Excellent cost/performance
- 48 GB memory
- Good for LLaMA, Stable Diffusion, etc.

**For Training**: Consider H100 or A100

- High compute throughput
- Large memory for batch processing

**For Memory-Bound Tasks**: H200 or A100-80GB

- More memory capacity
- Better for large models
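
As a concrete starting point for the L40S inference recommendation, here is a minimal sketch of an image-generation function. The extra pip packages, the Stable Diffusion checkpoint, and reloading the pipeline on every call are illustrative simplifications, not Modal requirements:

```python
inference_image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "accelerate"
)


@app.function(gpu="L40S", image=inference_image)
def generate(prompt: str) -> bytes:
    import io

    import torch
    from diffusers import StableDiffusionPipeline

    # 48 GB of L40S memory comfortably fits an fp16 Stable Diffusion pipeline.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    result = pipe(prompt).images[0]
    buf = io.BytesIO()
    result.save(buf, format="PNG")
    return buf.getvalue()
```

In a real deployment you would load the pipeline once per container (for example, in a Modal lifecycle hook) rather than on every call.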
## B200 GPUs

NVIDIA's flagship Blackwell chip:

```python
@app.function(gpu="B200:8")
def run_deepseek():
    # Most powerful option
    ...
```
## H200 and H100 GPUs

Hopper architecture GPUs with excellent software support:

```python
@app.function(gpu="H100")
def train():
    ...
```
### Automatic H200 Upgrades

Modal may upgrade `gpu="H100"` to H200 at no extra cost. H200 provides:

- 141 GB memory (vs 80 GB for H100)
- 4.8 TB/s memory bandwidth (vs 3.35 TB/s)

To avoid automatic upgrades (e.g., for benchmarking):

```python
@app.function(gpu="H100!")
def benchmark():
    ...
```
## A100 GPUs

Ampere architecture with 40 GB or 80 GB variants:

```python
# May be automatically upgraded to 80GB
@app.function(gpu="A100")
def qwen_7b():
    ...


# Specific variants
@app.function(gpu="A100-40GB")
def model_40gb():
    ...


@app.function(gpu="A100-80GB")
def llama_70b():
    ...
```
## GPU Fallbacks

Specify multiple acceptable GPU types in fallback order:

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    # Tries H100 first, falls back to 2x A100-40GB
    ...
```

Modal respects the order of the list and allocates the most-preferred GPU type that is available.
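
Because the GPU type is only decided when the container is scheduled, it can be useful to log what was actually allocated. A minimal sketch; the logging is illustrative:

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    import torch

    # Reports either one H100 or two A100-40GB devices, depending on availability.
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
    ...
```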
## Multi-GPU Training

Modal supports multi-GPU training on a single node. Multi-node training is in closed beta.

### PyTorch Example

For frameworks that re-execute the training entrypoint in worker processes, launch training in a subprocess or use a spawn-based strategy:

```python
@app.function(gpu="A100:2")
def train():
    import subprocess
    import sys

    # train.py is expected to launch its own workers across the two visible GPUs.
    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout,
        stderr=sys.stderr,
        check=True,
    )
```
For PyTorch Lightning, set the strategy to `ddp_spawn` or `ddp_notebook`, for example:
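
A minimal sketch of the Lightning variant, assuming the `lightning` package has been added to the image; `MyLightningModule` and `make_dataloader` are hypothetical placeholders for your own model and data code:

```python
@app.function(gpu="A100:2", image=image)
def train_lightning():
    import lightning as L

    model = MyLightningModule()  # your LightningModule (placeholder)
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,             # matches the "A100:2" request
        strategy="ddp_spawn",  # spawn-based DDP, safe inside a Modal function
    )
    trainer.fit(model, train_dataloaders=make_dataloader())  # placeholder helper
```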
## Performance Considerations

**Memory-Bound vs Compute-Bound**:

- Running models at small batch sizes is typically memory-bound: throughput is limited by how fast weights can be read from GPU memory
- Newer GPUs improve arithmetic throughput faster than memory bandwidth
- For memory-bound workloads, the speedup from newer hardware may therefore not justify the extra cost
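
To make the memory-bound case concrete, here is a rough back-of-envelope sketch; the model size is an illustrative assumption, and the bandwidth figures are the ones quoted above:

```python
# Single-stream LLM decoding is roughly bandwidth-bound: every generated token
# reads all model weights from GPU memory.
weights_gb = 14       # e.g. a 7B-parameter model in 16-bit precision (assumption)
h100_bw_gb_s = 3350   # H100 memory bandwidth, ~3.35 TB/s
h200_bw_gb_s = 4800   # H200 memory bandwidth, ~4.8 TB/s

# Upper bound on tokens/second at batch size 1, ignoring compute and overhead.
print(h100_bw_gb_s / weights_gb)  # ~240 tokens/s
print(h200_bw_gb_s / weights_gb)  # ~340 tokens/s

# The ceiling scales with bandwidth (~1.4x here), not with the much larger jump
# in raw compute, which is why batching matters for cost efficiency.
```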
**Optimization**:

- Use batching when possible (see the sketch below)
- Consider L40S before jumping to H100/B200
- Profile to identify bottlenecks
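
A minimal sketch of both ideas, timing a per-item loop against one batched forward pass with CUDA events; the linear layer is a stand-in for a real model:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().half()  # placeholder model
inputs = torch.randn(256, 4096, device="cuda", dtype=torch.half)


def timed_ms(fn):
    # Time GPU work with CUDA events rather than wall-clock time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)


with torch.no_grad():
    per_item = timed_ms(lambda: [model(x.unsqueeze(0)) for x in inputs])
    batched = timed_ms(lambda: model(inputs))

# One batched pass amortizes weight reads across all inputs, so it should be
# far faster than 256 single-item passes.
print(f"per-item: {per_item:.1f} ms, batched: {batched:.1f} ms")
```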