# GPU Acceleration on Modal

## Quick Start

Run functions on GPUs with the `gpu` parameter:

```python
import modal

image = modal.Image.debian_slim().pip_install("torch")
app = modal.App(image=image)


@app.function(gpu="A100")
def run():
    import torch

    assert torch.cuda.is_available()
```
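
Assuming the snippet is saved as `app.py`, it can be run with `modal run app.py`; the assertion passes only when CUDA is visible inside the container.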

## Available GPU Types

Modal supports the following GPUs:

- `T4` - Entry-level GPU
- `L4` - Balanced performance and cost
- `A10` - Up to 4 GPUs per container, 96 GB total
- `A100` - 40 GB or 80 GB variants
- `A100-40GB` - Specific 40 GB variant
- `A100-80GB` - Specific 80 GB variant
- `L40S` - 48 GB, excellent for inference
- `H100` / `H100!` - Top-tier Hopper architecture
- `H200` - Improved Hopper with more memory
- `B200` - Latest Blackwell architecture

See https://modal.com/pricing for pricing.
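
Any of these identifiers can be passed to `gpu=`, and a variant can be combined with a count (described below). A minimal sketch reusing the `app` from the Quick Start; the function names are illustrative:

```python
@app.function(gpu="T4")
def small_job():
    ...


@app.function(gpu="A100-80GB:4")
def big_job():
    ...
```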

## GPU Count

Request multiple GPUs per container with `:n` syntax:

```python
@app.function(gpu="H100:8")
def run_llama_405b():
    # 8 H100 GPUs available
    ...
```

Supported counts:

- B200, H200, H100, A100, L4, T4, L40S: up to 8 GPUs (up to 1,536 GB of GPU memory)
- A10: up to 4 GPUs (up to 96 GB of GPU memory)

Note: Requesting more than 2 GPUs per container may result in longer wait times.
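
A minimal sketch for confirming the allocated count inside a multi-GPU container (reusing the `app` and `torch` image from the Quick Start):

```python
@app.function(gpu="A100:4")
def check_gpus():
    import torch

    # Should report 4 when Modal grants the requested count.
    print(f"visible GPUs: {torch.cuda.device_count()}")
```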

## GPU Selection Guide

**For Inference (Recommended)**: Start with L40S
- Excellent cost/performance
- 48 GB memory
- Good for LLaMA, Stable Diffusion, etc.

**For Training**: Consider H100 or A100
- High compute throughput
- Large memory for batch processing

**For Memory-Bound Tasks**: H200 or A100-80GB
- More memory capacity
- Better for large models
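
One way to encode these recommendations is a plain mapping from workload type to a GPU request string; a hedged sketch (the `WORKLOAD_GPU` dict and function name are illustrative, not part of Modal's API):

```python
# Illustrative mapping from workload type to a GPU request string.
WORKLOAD_GPU = {
    "inference": "L40S",
    "training": "H100",
    "memory_bound": "A100-80GB",
}


@app.function(gpu=WORKLOAD_GPU["inference"])
def serve():
    ...
```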

## B200 GPUs

NVIDIA's flagship Blackwell chip:

```python
@app.function(gpu="B200:8")
def run_deepseek():
    # Most powerful option
    ...
```

## H200 and H100 GPUs

Hopper architecture GPUs with excellent software support:

```python
@app.function(gpu="H100")
def train():
    ...
```

### Automatic H200 Upgrades

Modal may upgrade `gpu="H100"` to an H200 at no extra cost. The H200 provides:

- 141 GB memory (vs 80 GB for H100)
- 4.8 TB/s memory bandwidth (vs 3.35 TB/s)
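
To see at runtime whether the upgrade happened, a minimal sketch (reusing the `app` and `torch` image from the Quick Start; the printed string is illustrative):

```python
@app.function(gpu="H100")
def which_gpu():
    import torch

    # Prints e.g. "NVIDIA H100 80GB HBM3", or an H200 name if upgraded.
    print(torch.cuda.get_device_name(0))
```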

To avoid automatic upgrades (e.g., for benchmarking):

```python
@app.function(gpu="H100!")
def benchmark():
    ...
```

## A100 GPUs

Ampere architecture GPUs, available in 40 GB and 80 GB variants:

```python
# May be automatically upgraded to 80GB
@app.function(gpu="A100")
def qwen_7b():
    ...


# Specific variants
@app.function(gpu="A100-40GB")
def model_40gb():
    ...


@app.function(gpu="A100-80GB")
def llama_70b():
    ...
```

## GPU Fallbacks

Specify multiple acceptable GPU types in fallback order:

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    # Tries H100 first, falls back to 2x A100-40GB
    ...
```

Modal respects the ordering and allocates the most preferred GPU type that is available.
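
Note that in the example above, the primary path sees one H100 while the fallback path sees two A100s, so code that cares about topology should check `torch.cuda.device_count()` rather than assume a fixed number of devices.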

## Multi-GPU Training

Modal supports multi-GPU training on a single node. Multi-node training is in closed beta.

### PyTorch Example

Some frameworks re-execute the Python entrypoint when spawning workers, which conflicts with how Modal starts containers. For those, launch training in a subprocess or pick a spawn-based strategy:

```python
@app.function(gpu="A100:2")
def train():
    import subprocess
    import sys

    # Run the training script in a subprocess so any entrypoint
    # re-execution happens there rather than in the Modal container.
    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout,
        stderr=sys.stderr,
        check=True,
    )
```

For PyTorch Lightning, set the strategy to `ddp_spawn` or `ddp_notebook`.
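
A minimal sketch of that Lightning setup, assuming `lightning` is installed in the image and that `MyModel` and `train_loader` are defined elsewhere:

```python
@app.function(gpu="A100:2")
def train_lightning():
    import lightning as L

    # "ddp_spawn" starts workers via multiprocessing spawn instead of
    # re-executing the entrypoint.
    trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
    # MyModel and train_loader are assumed to be defined elsewhere.
    trainer.fit(MyModel(), train_loader)
```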

## Performance Considerations

**Memory-Bound vs Compute-Bound**:
- Running models with small batch sizes is typically memory-bound
- On newer GPUs, arithmetic throughput has grown faster than memory bandwidth
- For memory-bound workloads, the speedup from newer hardware may not justify the extra cost
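
A back-of-the-envelope sketch of why small-batch decoding is memory-bound; the 3.35 TB/s figure is the H100 bandwidth quoted above, and the 7B-parameter fp16 model is an assumed example:

```python
# Each decoded token must stream all model weights through memory once.
params = 7e9           # assumed 7B-parameter model
bytes_per_param = 2    # fp16
bandwidth = 3.35e12    # H100 memory bandwidth, bytes/s

bytes_per_token = params * bytes_per_param    # ~14 GB per token
ceiling = bandwidth / bytes_per_token         # ~239 tokens/s at batch size 1
print(f"memory-bandwidth ceiling: {ceiling:.0f} tokens/s")
```

Batching amortizes that single weight read across many tokens, which is why the first optimization below matters.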

**Optimization**:
- Use batching when possible
- Consider L40S before jumping to H100/B200
- Profile to identify bottlenecks
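
A minimal PyTorch illustration of the batching point (the layer and sizes are arbitrary stand-ins):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().half()
inputs = torch.randn(64, 4096, device="cuda", dtype=torch.half)

with torch.no_grad():
    # One batched call reads the weights once for all 64 inputs,
    # instead of once per input across 64 separate calls.
    out = model(inputs)
```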