# GPU Acceleration on Modal

## Quick Start

Run functions on GPUs by passing the `gpu` parameter:

```python
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App(image=image)

@app.function(gpu="A100")
def run():
    import torch
    assert torch.cuda.is_available()
```

## Available GPU Types

Modal supports the following GPUs:

- `T4` - Entry-level GPU
- `L4` - Balanced performance and cost
- `A10` - Up to 4 GPUs per container, 96 GB total
- `A100` - 40 GB or 80 GB variants
- `A100-40GB` - Specific 40 GB variant
- `A100-80GB` - Specific 80 GB variant
- `L40S` - 48 GB, excellent for inference
- `H100` / `H100!` - Top-tier Hopper architecture
- `H200` - Improved Hopper with more memory
- `B200` - Latest Blackwell architecture

See https://modal.com/pricing for pricing.

## GPU Count

Request multiple GPUs per container with the `:n` suffix:

```python
@app.function(gpu="H100:8")
def run_llama_405b():
    # 8 H100 GPUs available
    ...
```

Supported counts:

- `B200`, `H200`, `H100`, `A100`, `L4`, `T4`, `L40S`: up to 8 GPUs (up to 1,536 GB of GPU memory)
- `A10`: up to 4 GPUs (up to 96 GB)

Note: requesting more than 2 GPUs per container may result in longer wait times.
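
To confirm a container sees every requested device, a minimal check using `torch.cuda.device_count()` (assuming `torch` is installed in the image, as in the Quick Start):

```python
@app.function(gpu="A10:4")
def count_gpus() -> int:
    import torch
    # Should return 4 for an ":4" request
    return torch.cuda.device_count()
```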

## GPU Selection Guide

**For Inference (Recommended):** start with `L40S` (see the sizing sketch after this list)

- Excellent cost/performance
- 48 GB memory
- Good for LLaMA, Stable Diffusion, etc.

**For Training:** consider `H100` or `A100`

- High compute throughput
- Large memory for batch processing

**For Memory-Bound Tasks:** `H200` or `A100-80GB`

- More memory capacity
- Better for large models
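
As a rough sizing aid, you can estimate whether a model's weights fit a given GPU before picking a type. A back-of-envelope sketch (the 20% headroom for activations and overhead is an assumption, not a Modal rule):

```python
def fits(params_billion: float, bytes_per_param: float, gpu_mem_gb: float) -> bool:
    # Weights-only estimate, plus ~20% headroom for activations/overhead.
    weight_gb = params_billion * bytes_per_param
    return weight_gb * 1.2 <= gpu_mem_gb

# A 7B model in fp16 (2 bytes/param, ~14 GB) fits a 48 GB L40S:
assert fits(7, 2, 48)
# A 70B model in fp16 (~140 GB) does not fit a single 80 GB A100:
assert not fits(70, 2, 80)
```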

## B200 GPUs

NVIDIA's flagship Blackwell chip:

```python
@app.function(gpu="B200:8")
def run_deepseek():
    # Most powerful option
    ...
```

## H200 and H100 GPUs

Hopper architecture GPUs with excellent software support:

```python
@app.function(gpu="H100")
def train():
    ...
```

### Automatic H200 Upgrades

Modal may upgrade `gpu="H100"` requests to an H200 at no extra cost. The H200 provides:

- 141 GB memory (vs. 80 GB for the H100)
- 4.8 TB/s memory bandwidth (vs. 3.35 TB/s)
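
Because the upgrade is transparent, a container can report which card it actually received. A minimal sketch using `torch.cuda.get_device_name` (assuming `torch` in the image, as in the Quick Start):

```python
@app.function(gpu="H100")
def which_gpu() -> str:
    import torch
    # Returns e.g. "NVIDIA H100 80GB HBM3" or "NVIDIA H200"
    return torch.cuda.get_device_name(0)
```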

To avoid automatic upgrades (e.g., for benchmarking), pin the type with `!`:

```python
@app.function(gpu="H100!")
def benchmark():
    ...
```

## A100 GPUs

Ampere architecture with 40 GB or 80 GB variants:

```python
# May be automatically upgraded to 80GB
@app.function(gpu="A100")
def qwen_7b():
    ...

# Specific variants
@app.function(gpu="A100-40GB")
def model_40gb():
    ...

@app.function(gpu="A100-80GB")
def llama_70b():
    ...
```

## GPU Fallbacks

Specify multiple acceptable GPU types as a list, in order of preference:

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    # Tries H100 first, falls back to 2x A100-40GB
    ...
```

Modal respects the ordering and allocates the most preferred GPU type that is available.

## Multi-GPU Training

Modal supports multi-GPU training on a single node. Multi-node training is in closed beta.

### PyTorch Example

Some frameworks re-execute their training entrypoint when spawning workers, which conflicts with how Modal functions start. Either launch training in a subprocess or use a spawn-based strategy:

```python
@app.function(gpu="A100:2")
def train():
    import subprocess
    import sys

    # Run the training script in a subprocess so the framework can
    # re-execute it freely without re-entering the Modal function.
    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout,
        stderr=sys.stderr,
        check=True,
    )
```

For PyTorch Lightning, set the strategy to `ddp_spawn` or `ddp_notebook`, as in the sketch below.
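
A minimal configuration sketch (assuming `lightning` has been added to the image; the model and dataloader are omitted):

```python
@app.function(gpu="A100:2")
def train_lightning():
    import lightning as L  # assumption: `lightning` installed in the image

    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,              # matches the A100:2 request
        strategy="ddp_spawn",   # spawn-based, avoids entrypoint re-execution
    )
    # trainer.fit(model, dataloader) would run training here; supplying
    # the model and dataloader is left to your training code.
```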

## Performance Considerations

**Memory-Bound vs. Compute-Bound:**

- Running models at small batch sizes is typically memory-bound
- On newer GPUs, arithmetic throughput has grown faster than memory bandwidth
- For memory-bound workloads, the speedup from newer hardware may not justify the cost

**Optimization:**

- Use batching when possible
- Consider `L40S` before jumping to `H100`/`B200`
- Profile to identify bottlenecks (a minimal timing sketch follows this list)
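
One way to profile along these lines: time a representative kernel on a cheaper GPU before committing to a more expensive one. A minimal sketch (the matrix size, dtype, and iteration count are illustrative assumptions; swap the `gpu` string to compare types):

```python
@app.function(gpu="L40S")
def time_matmul(n: int = 8192) -> float:
    import time
    import torch

    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()  # finish allocation before timing

    start = time.perf_counter()
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()  # kernels are async; wait before reading the clock
    return (time.perf_counter() - start) / 10  # mean seconds per matmul
```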