---
name: using-pytorch-engineering
description: Routes to appropriate PyTorch specialist skill based on symptoms and problem type
mode: true
---
# Using PyTorch Engineering
## Overview
This meta-skill routes you to the right PyTorch specialist based on symptoms. PyTorch engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter PyTorch-specific issues but aren't sure which specialized skill to use.
**Core Principle**: Different PyTorch problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions—route to the expert.
## When to Use
Load this skill when:
- Working with PyTorch and encountering problems
- User mentions: "PyTorch", "torch", "CUDA", "GPU", "distributed training"
- Need to implement PyTorch models or optimize performance
- Debugging PyTorch training issues
- Setting up production PyTorch infrastructure
**Don't use for**: Framework-agnostic ML theory, non-PyTorch frameworks, algorithm selection (use training-optimization or other packs)
---
## Routing by Symptom
### Memory Issues
**Symptoms**:
- "CUDA out of memory"
- "OOM error"
- "RuntimeError: CUDA out of memory"
- "GPU memory usage too high"
- "tensor memory leak"
- "memory consumption increasing"
**Route to**: See [tensor-operations-and-memory.md](tensor-operations-and-memory.md) for memory management and optimization.
**Why**: Memory management is foundational. Must understand tensor lifecycles, efficient operations, and profiling before other optimizations.
**Example queries**:
- "Getting OOM after a few batches"
- "How to reduce memory usage?"
- "Memory grows over time during training"
---
### Module and Model Design
**Symptoms**:
- "How to structure my PyTorch model?"
- "Custom layer implementation"
- "nn.Module best practices"
- "Forward/backward pass design"
- "Model architecture implementation"
- "Parameter initialization"
**Route to**: See [module-design-patterns.md](module-design-patterns.md) for model architecture and nn.Module patterns.
**Why**: Proper module design prevents bugs and enables features like checkpointing, distributed training, and serialization.
**Example queries**:
- "Building custom ResNet variant"
- "How to organize model components?"
- "Module initialization best practices"
---
### Distributed Training Setup
**Symptoms**:
- "Multiple GPUs"
- "DistributedDataParallel"
- "DDP"
- "Multi-node training"
- "Scale training to N GPUs"
- "torch.distributed"
- "NCCL"
**Route to**: See [distributed-training-strategies.md](distributed-training-strategies.md) for DDP setup and multi-GPU training.
**Why**: Distributed training has unique setup requirements, synchronization patterns, and pitfalls. Generic advice breaks in distributed settings.
**Example queries**:
- "Setup DDP for 8 GPUs"
- "Multi-node training not working"
- "How to launch distributed training?"
---
### Performance and Speed
**Symptoms**:
- "Training too slow"
- "Low GPU utilization"
- "Iterations per second"
- "Throughput"
- "Performance optimization"
- "Speed up training"
**Route to**: See [performance-profiling.md](performance-profiling.md) FIRST for systematic bottleneck identification.
**Why**: MUST profile before optimizing. Many "performance" problems are actually data loading or other non-compute bottlenecks. Profile to identify the real bottleneck.
**After profiling**, may route to:
- [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) if compute-bound
- [tensor-operations-and-memory.md](tensor-operations-and-memory.md) if memory-bound
- [distributed-training-strategies.md](distributed-training-strategies.md) if need to scale
**Example queries**:
- "Training is slow, how to speed up?"
- "GPU usage is only 30%"
- "Bottleneck in my training loop"
---
### Mixed Precision and Optimization
**Symptoms**:
- "Mixed precision"
- "FP16", "BF16"
- "torch.cuda.amp"
- "Automatic mixed precision"
- "AMP"
- "TF32"
**Route to**: See [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) for AMP and numerical stability.
**Why**: Mixed precision requires careful handling of numerical stability, gradient scaling, and operation compatibility.
**Example queries**:
- "How to use mixed precision training?"
- "AMP causing NaN losses"
- "FP16 vs BF16 for my model"
---
### Training Instability and NaN
**Symptoms**:
- "NaN loss"
- "Inf gradients"
- "Loss exploding"
- "Training becomes unstable"
- "Gradients are NaN"
- "Model diverging"
**Route to**: See [debugging-techniques.md](debugging-techniques.md) for systematic NaN/Inf debugging.
**Why**: NaN/Inf issues require systematic debugging—checking gradients layer by layer, identifying numerical instability sources, and targeted fixes.
**Example queries**:
- "Loss becomes NaN after epoch 3"
- "How to debug gradient explosion?"
- "Model outputs Inf values"
---
### Checkpointing and State Management
**Symptoms**:
- "Save model"
- "Resume training"
- "Checkpoint"
- "Reproducible training"
- "Save optimizer state"
- "Load pretrained weights"
**Route to**: See [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) for complete state management.
**Why**: Proper checkpointing requires saving ALL state (model, optimizer, scheduler, RNG states). Reproducibility requires deterministic operations and careful seed management.
**Example queries**:
- "How to checkpoint training properly?"
- "Resume from checkpoint"
- "Make training reproducible"
---
### Custom Operations and Autograd
**Symptoms**:
- "Custom backward pass"
- "torch.autograd.Function"
- "Define custom gradient"
- "Efficient custom operation"
- "Non-differentiable operation"
- "Custom CUDA kernel"
**Route to**: See [custom-autograd-functions.md](custom-autograd-functions.md) for custom backward passes.
**Why**: Custom autograd functions require understanding the autograd engine, proper gradient computation, and numerical stability.
**Example queries**:
- "Implement custom activation with gradient"
- "Efficient backwards pass for my operation"
- "How to use torch.autograd.Function?"
---
## Cross-Cutting Scenarios
### Multiple Skills Needed
Some scenarios require multiple specialized skills in sequence:
**Distributed training with memory constraints**:
1. Route to [distributed-training-strategies.md](distributed-training-strategies.md) (setup)
2. THEN [tensor-operations-and-memory.md](tensor-operations-and-memory.md) (optimize per-GPU memory)
**Performance optimization**:
1. Route to [performance-profiling.md](performance-profiling.md) (identify bottleneck)
2. THEN appropriate skill based on bottleneck:
- Compute → [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md)
- Memory → [tensor-operations-and-memory.md](tensor-operations-and-memory.md)
- Scale → [distributed-training-strategies.md](distributed-training-strategies.md)
**Custom module with proper patterns**:
1. Route to [module-design-patterns.md](module-design-patterns.md) (structure)
2. THEN [custom-autograd-functions.md](custom-autograd-functions.md) if custom backward needed
**Training instability with mixed precision**:
1. Route to [debugging-techniques.md](debugging-techniques.md) (diagnose root cause)
2. May need [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) for gradient scaling
**Load in order of execution**: Setup before optimization, diagnosis before fixes, structure before customization.
---
## Ambiguous Queries - Ask First
When the symptom is unclear, ask ONE clarifying question:
**"Fix my PyTorch training"**
→ Ask: "What specific issue? Memory? Speed? Accuracy? NaN?"
**"Optimize my model"**
→ Ask: "Optimize what? Training speed? Memory usage? Inference?"
**"Setup distributed training"**
→ Ask: "Single-node multi-GPU or multi-node? What's not working?"
**"Model not working"**
→ Ask: "What's broken? Training fails? Wrong outputs? Performance?"
**Never guess when ambiguous. Ask once, route accurately.**
---
## Common Routing Mistakes
| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "Training slow" | mixed-precision | performance-profiling FIRST | Don't optimize without profiling |
| "OOM in distributed" | tensor-memory | distributed-strategies FIRST | Distributed setup might be wrong |
| "Custom layer slow" | performance-profiling | module-design-patterns FIRST | Design might be inefficient |
| "NaN with AMP" | mixed-precision | debugging-techniques FIRST | Debug NaN source, then fix AMP |
| "Save model" | module-design | checkpointing FIRST | Checkpointing is specialized topic |
**Key principle**: Diagnosis before solutions, setup before optimization, root cause before fixes.
---
## Red Flags - Stop and Route
If you catch yourself about to:
- Suggest reducing batch size → Route to [tensor-operations-and-memory.md](tensor-operations-and-memory.md) for systematic approach
- Show basic DDP code → Route to [distributed-training-strategies.md](distributed-training-strategies.md) for complete setup
- Guess at optimizations → Route to [performance-profiling.md](performance-profiling.md) to measure first
- List possible NaN fixes → Route to [debugging-techniques.md](debugging-techniques.md) for diagnostic methodology
- Show torch.save example → Route to [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) for complete solution
**All of these mean: You're about to give incomplete advice. Route to the specialist instead.**
---
## Common Rationalizations (Don't Do These)
| Excuse | Reality | What To Do |
|--------|---------|------------|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes minutes. | Route anyway - specialists have quick diagnostics |
| "They already tried X" | May have done X wrong, misunderstood, or X wasn't applicable. | Route to specialist to verify X was done correctly |
| "Authority/senior says Y" | Authority can misdiagnose bottlenecks without profiling. | Profile first, authority second. Respect skills over seniority. |
| "User is tired, don't ask" | Exhaustion makes clarity MORE important, not less. | Ask ONE clarifying question - saves time overall |
| "User suggested Z" | Z might not be best option for their specific case. | Route to specialist to evaluate if Z is right approach |
| "Too complex, can't route" | Complex scenarios need specialists MORE, not less. | Use cross-cutting section - route to multiple skills in sequence |
| "User sounds confident" | Confidence about custom autograd often precedes subtle bugs. | Route to specialist for systematic verification |
| "Just a quick question" | No such thing - symptoms need diagnosis. | Quick questions deserve correct answers - route properly |
| "Simple issue" | Simple symptoms can have complex root causes. | Route based on symptoms, not perceived complexity |
| "Direct answer is helpful" | Wrong direct answer wastes time and frustrates user. | Routing to specialist IS the helpful answer |
**If you catch yourself thinking ANY of these, STOP and route to the specialist.**
---
## Red Flags Checklist - Self-Check Before Answering
Before giving ANY PyTorch advice, ask yourself:
1. **Did I identify the symptom?**
   - If no → Read query again, identify symptoms
2. **Is this symptom in my routing table?**
   - If yes → Route to that specialist
   - If no → Ask clarifying question
3. **Am I about to give advice directly?**
   - If yes → STOP. Why am I not routing?
   - Check rationalization table - am I making excuses?
4. **Is this a diagnosis issue or solution issue?**
   - Diagnosis → Route to profiling/debugging skill FIRST
   - Solution → Route to appropriate implementation skill
5. **Is query ambiguous?**
   - If yes → Ask ONE clarifying question
   - If no → Route confidently
6. **Am I feeling pressure to skip routing?**
   - Time pressure → Route anyway (faster overall)
   - Sunk cost → Route anyway (verify first attempt)
   - Authority → Route anyway (verify diagnosis)
   - Exhaustion → Route anyway (clarity more important)
**If you failed ANY check above, do NOT give direct advice. Route to specialist or ask clarifying question.**
---
## When NOT to Use PyTorch Skills
**Skip PyTorch pack when**:
- Choosing algorithms (use training-optimization or algorithm packs)
- Model architecture selection (use neural-architectures)
- Framework-agnostic training issues (use training-optimization)
- Production deployment (use ml-production)
**PyTorch pack is for**: PyTorch-specific implementation, infrastructure, debugging, and optimization issues.
---
## Diagnosis-First Principle
**Critical**: Many PyTorch issues require diagnosis before solutions:
| Issue Type | Diagnosis Skill | Then Solution Skill |
|------------|----------------|---------------------|
| Performance | performance-profiling | mixed-precision / distributed |
| Memory | tensor-memory (profiling section) | tensor-memory (optimization) |
| NaN/Inf | debugging-techniques | mixed-precision / module-design |
| Training bugs | debugging-techniques | Appropriate fix |
**If unclear what's wrong, route to diagnostic skill first.**
---
## PyTorch Engineering Specialist Skills
After routing, load the appropriate specialist skill for detailed guidance:
1. [tensor-operations-and-memory.md](tensor-operations-and-memory.md) - Memory management, efficient operations, profiling
2. [module-design-patterns.md](module-design-patterns.md) - Model structure, nn.Module best practices, initialization
3. [distributed-training-strategies.md](distributed-training-strategies.md) - DDP setup, multi-node, synchronization patterns
4. [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) - AMP, FP16/BF16, gradient scaling, numerical stability
5. [performance-profiling.md](performance-profiling.md) - PyTorch profiler, bottleneck identification, optimization strategies
6. [debugging-techniques.md](debugging-techniques.md) - NaN/Inf debugging, gradient checking, systematic troubleshooting
7. [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) - Complete checkpointing, RNG state, determinism
8. [custom-autograd-functions.md](custom-autograd-functions.md) - torch.autograd.Function, custom gradients, efficient backward
---
## Integration Notes
**Phase 1 - Standalone**: PyTorch skills are self-contained
**Future cross-references**:
- training-optimization (framework-agnostic training techniques)
- neural-architectures (architecture selection before implementation)
- ml-production (deployment after training)
**Current focus**: Route within PyTorch pack only. Other packs handle other concerns.