---
name: using-pytorch-engineering
description: Routes to appropriate PyTorch specialist skill based on symptoms and problem type
mode: true
---

# Using PyTorch Engineering

## Overview

This meta-skill routes you to the right PyTorch specialist based on symptoms. PyTorch engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter PyTorch-specific issues but aren't sure which specialized skill to use.

**Core Principle**: Different PyTorch problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions; route to the expert.

## When to Use

Load this skill when:

- Working with PyTorch and encountering problems
- User mentions: "PyTorch", "torch", "CUDA", "GPU", "distributed training"
- Need to implement PyTorch models or optimize performance
- Debugging PyTorch training issues
- Setting up production PyTorch infrastructure

**Don't use for**: Framework-agnostic ML theory, non-PyTorch frameworks, algorithm selection (use training-optimization or other packs)

---

## Routing by Symptom

### Memory Issues

**Symptoms**:

- "CUDA out of memory"
- "OOM error"
- "RuntimeError: CUDA out of memory"
- "GPU memory usage too high"
- "tensor memory leak"
- "memory consumption increasing"

**Route to**: See [tensor-operations-and-memory.md](tensor-operations-and-memory.md) for memory management and optimization.

**Why**: Memory management is foundational. Must understand tensor lifecycles, efficient operations, and profiling before other optimizations.

**Example queries**:

- "Getting OOM after a few batches"
- "How to reduce memory usage?"
- "Memory grows over time during training"

---

### Module and Model Design

**Symptoms**:

- "How to structure my PyTorch model?"
- "Custom layer implementation"
- "nn.Module best practices"
- "Forward/backward pass design"
- "Model architecture implementation"
- "Parameter initialization"

**Route to**: See [module-design-patterns.md](module-design-patterns.md) for model architecture and nn.Module patterns.

**Why**: Proper module design prevents bugs and enables features like checkpointing, distributed training, and serialization.

**Example queries**:

- "Building custom ResNet variant"
- "How to organize model components?"
- "Module initialization best practices"

---

### Distributed Training Setup

**Symptoms**:

- "Multiple GPUs"
- "DistributedDataParallel"
- "DDP"
- "Multi-node training"
- "Scale training to N GPUs"
- "torch.distributed"
- "NCCL"

**Route to**: See [distributed-training-strategies.md](distributed-training-strategies.md) for DDP setup and multi-GPU training.

**Why**: Distributed training has unique setup requirements, synchronization patterns, and pitfalls. Generic advice breaks in distributed settings.

**Example queries**:

- "Setup DDP for 8 GPUs"
- "Multi-node training not working"
- "How to launch distributed training?"

---

### Performance and Speed

**Symptoms**:

- "Training too slow"
- "Low GPU utilization"
- "Iterations per second"
- "Throughput"
- "Performance optimization"
- "Speed up training"

**Route to**: See [performance-profiling.md](performance-profiling.md) FIRST for systematic bottleneck identification.

**Why**: MUST profile before optimizing. Many "performance" problems are actually data loading or other non-compute bottlenecks. Profile to identify the real bottleneck.
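For orientation only, a minimal `torch.profiler` sketch of the kind the specialist skill expands on. The toy model and random batches below are placeholder assumptions, not part of this skill; real profiling should wrap the actual training loop.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule

# Placeholder model and data purely for illustration.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
loss_fn = nn.CrossEntropyLoss()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    record_shapes=True,
) as prof:
    for step in range(6):
        x = torch.randn(64, 512, device="cuda")
        y = torch.randint(0, 10, (64,), device="cuda")
        loss_fn(model(x), y).backward()
        prof.step()  # advance the profiler schedule each iteration

# Sort by CUDA time to see whether GPU compute is actually the bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```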
**After profiling**, may route to:

- [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) if compute-bound
- [tensor-operations-and-memory.md](tensor-operations-and-memory.md) if memory-bound
- [distributed-training-strategies.md](distributed-training-strategies.md) if need to scale

**Example queries**:

- "Training is slow, how to speed up?"
- "GPU usage is only 30%"
- "Bottleneck in my training loop"

---

### Mixed Precision and Optimization

**Symptoms**:

- "Mixed precision"
- "FP16", "BF16"
- "torch.cuda.amp"
- "Automatic mixed precision"
- "AMP"
- "TF32"

**Route to**: See [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) for AMP and numerical stability.

**Why**: Mixed precision requires careful handling of numerical stability, gradient scaling, and operation compatibility.

**Example queries**:

- "How to use mixed precision training?"
- "AMP causing NaN losses"
- "FP16 vs BF16 for my model"
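For orientation, a minimal autocast-plus-GradScaler sketch; the specialist skill covers BF16, op compatibility, and stability in depth. The toy model, optimizer, and random data are placeholder assumptions.

```python
import torch
from torch import nn

# Placeholder model, optimizer, and data purely for illustration.
model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

for _ in range(10):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients; skips the step if inf/NaN found
    scaler.update()                # adjusts the scale factor for the next iteration
```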
---

### Training Instability and NaN

**Symptoms**:

- "NaN loss"
- "Inf gradients"
- "Loss exploding"
- "Training becomes unstable"
- "Gradients are NaN"
- "Model diverging"

**Route to**: See [debugging-techniques.md](debugging-techniques.md) for systematic NaN/Inf debugging.

**Why**: NaN/Inf issues require systematic debugging: checking gradients layer by layer, identifying numerical instability sources, and targeted fixes.

**Example queries**:

- "Loss becomes NaN after epoch 3"
- "How to debug gradient explosion?"
- "Model outputs Inf values"
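For orientation, a minimal sketch of the systematic checks the debugging skill walks through: turn on anomaly detection so the failing backward raises at the offending op, then scan gradients layer by layer. The toy model and input are placeholder assumptions.

```python
import torch
from torch import nn

# Placeholder model and input purely for illustration.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1)).cuda()
x = torch.randn(8, 32, device="cuda")

# Anomaly detection makes backward() raise at the op that produced NaN/Inf
# (slow; enable only while debugging).
with torch.autograd.set_detect_anomaly(True):
    loss = model(x).pow(2).mean()
    loss.backward()

# Scan gradients layer by layer to localize where values become non-finite.
for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in {name}")
```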
---

### Checkpointing and State Management

**Symptoms**:

- "Save model"
- "Resume training"
- "Checkpoint"
- "Reproducible training"
- "Save optimizer state"
- "Load pretrained weights"

**Route to**: See [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) for complete state management.

**Why**: Proper checkpointing requires saving ALL state (model, optimizer, scheduler, RNG states). Reproducibility requires deterministic operations and careful seed management.

**Example queries**:

- "How to checkpoint training properly?"
- "Resume from checkpoint"
- "Make training reproducible"
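For orientation, a minimal sketch of the "save ALL state" idea; the specialist skill also covers DataLoader workers, NumPy/Python RNG, and determinism flags. The toy training objects and file name are placeholder assumptions.

```python
import torch
from torch import nn

# Placeholder training objects purely for illustration.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# Save ALL state, not just model weights, so training can resume exactly.
checkpoint = {
    "epoch": 5,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "torch_rng": torch.get_rng_state(),
    "cuda_rng": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: restore every piece of state that was saved.
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
torch.set_rng_state(ckpt["torch_rng"])
if ckpt["cuda_rng"] is not None:
    torch.cuda.set_rng_state_all(ckpt["cuda_rng"])
start_epoch = ckpt["epoch"] + 1
```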
---

### Custom Operations and Autograd

**Symptoms**:

- "Custom backward pass"
- "torch.autograd.Function"
- "Define custom gradient"
- "Efficient custom operation"
- "Non-differentiable operation"
- "Custom CUDA kernel"

**Route to**: See [custom-autograd-functions.md](custom-autograd-functions.md) for custom backward passes.

**Why**: Custom autograd functions require understanding the autograd engine, proper gradient computation, and numerical stability.

**Example queries**:

- "Implement custom activation with gradient"
- "Efficient backwards pass for my operation"
- "How to use torch.autograd.Function?"

---

## Cross-Cutting Scenarios

### Multiple Skills Needed

Some scenarios require multiple specialized skills in sequence:

**Distributed training with memory constraints**:
1. Route to [distributed-training-strategies.md](distributed-training-strategies.md) (setup)
2. THEN [tensor-operations-and-memory.md](tensor-operations-and-memory.md) (optimize per-GPU memory)

**Performance optimization**:
1. Route to [performance-profiling.md](performance-profiling.md) (identify bottleneck)
2. THEN appropriate skill based on bottleneck:
   - Compute → [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md)
   - Memory → [tensor-operations-and-memory.md](tensor-operations-and-memory.md)
   - Scale → [distributed-training-strategies.md](distributed-training-strategies.md)

**Custom module with proper patterns**:
1. Route to [module-design-patterns.md](module-design-patterns.md) (structure)
2. THEN [custom-autograd-functions.md](custom-autograd-functions.md) if custom backward needed

**Training instability with mixed precision**:
1. Route to [debugging-techniques.md](debugging-techniques.md) (diagnose root cause)
2. May need [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) for gradient scaling

**Load in order of execution**: Setup before optimization, diagnosis before fixes, structure before customization.

---

## Ambiguous Queries - Ask First

When the symptom is unclear, ASK ONE clarifying question:

**"Fix my PyTorch training"** → Ask: "What specific issue? Memory? Speed? Accuracy? NaN?"

**"Optimize my model"** → Ask: "Optimize what? Training speed? Memory usage? Inference?"

**"Setup distributed training"** → Ask: "Single-node multi-GPU or multi-node? What's not working?"

**"Model not working"** → Ask: "What's broken? Training fails? Wrong outputs? Performance?"

**Never guess when ambiguous. Ask once, route accurately.**

---

## Common Routing Mistakes

| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "Training slow" | mixed-precision | performance-profiling FIRST | Don't optimize without profiling |
| "OOM in distributed" | tensor-memory | distributed-strategies FIRST | Distributed setup might be wrong |
| "Custom layer slow" | performance-profiling | module-design-patterns FIRST | Design might be inefficient |
| "NaN with AMP" | mixed-precision | debugging-techniques FIRST | Debug NaN source, then fix AMP |
| "Save model" | module-design | checkpointing FIRST | Checkpointing is a specialized topic |

**Key principle**: Diagnosis before solutions, setup before optimization, root cause before fixes.

---

## Red Flags - Stop and Route

If you catch yourself about to:

- Suggest reducing batch size → Route to [tensor-operations-and-memory.md](tensor-operations-and-memory.md) for a systematic approach
- Show basic DDP code → Route to [distributed-training-strategies.md](distributed-training-strategies.md) for complete setup
- Guess at optimizations → Route to [performance-profiling.md](performance-profiling.md) to measure first
- List possible NaN fixes → Route to [debugging-techniques.md](debugging-techniques.md) for diagnostic methodology
- Show a torch.save example → Route to [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) for a complete solution

**All of these mean: You're about to give incomplete advice. Route to the specialist instead.**

---

## Common Rationalizations (Don't Do These)

| Excuse | Reality | What To Do |
|--------|---------|------------|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes minutes. | Route anyway - specialists have quick diagnostics |
| "They already tried X" | May have done X wrong, misunderstood, or X wasn't applicable. | Route to specialist to verify X was done correctly |
| "Authority/senior says Y" | Authority can misdiagnose bottlenecks without profiling. | Profile first, authority second. Respect skills over seniority. |
| "User is tired, don't ask" | Exhaustion makes clarity MORE important, not less. | Ask ONE clarifying question - saves time overall |
| "User suggested Z" | Z might not be the best option for their specific case. | Route to specialist to evaluate if Z is the right approach |
| "Too complex, can't route" | Complex scenarios need specialists MORE, not less. | Use the cross-cutting section - route to multiple skills in sequence |
| "User sounds confident" | Confidence about custom autograd often precedes subtle bugs. | Route to specialist for systematic verification |
| "Just a quick question" | No such thing - symptoms need diagnosis. | Quick questions deserve correct answers - route properly |
| "Simple issue" | Simple symptoms can have complex root causes. | Route based on symptoms, not perceived complexity |
| "Direct answer is helpful" | Wrong direct answer wastes time and frustrates the user. | Routing to the specialist IS the helpful answer |

**If you catch yourself thinking ANY of these, STOP and route to the specialist.**

---

## Red Flags Checklist - Self-Check Before Answering

Before giving ANY PyTorch advice, ask yourself:

1. ❓ **Did I identify the symptom?**
   - If no → Read the query again, identify symptoms
2. ❓ **Is this symptom in my routing table?**
   - If yes → Route to that specialist
   - If no → Ask a clarifying question
3. ❓ **Am I about to give advice directly?**
   - If yes → STOP. Why am I not routing?
   - Check the rationalization table - am I making excuses?
4. ❓ **Is this a diagnosis issue or a solution issue?**
   - Diagnosis → Route to profiling/debugging skill FIRST
   - Solution → Route to appropriate implementation skill
5. ❓ **Is the query ambiguous?**
   - If yes → Ask ONE clarifying question
   - If no → Route confidently
6. ❓ **Am I feeling pressure to skip routing?**
   - Time pressure → Route anyway (faster overall)
   - Sunk cost → Route anyway (verify the first attempt)
   - Authority → Route anyway (verify the diagnosis)
   - Exhaustion → Route anyway (clarity is more important)

**If you failed ANY check above, do NOT give direct advice. Route to a specialist or ask a clarifying question.**

---

## When NOT to Use PyTorch Skills

**Skip the PyTorch pack when**:

- Choosing algorithms (use training-optimization or algorithm packs)
- Model architecture selection (use neural-architectures)
- Framework-agnostic training issues (use training-optimization)
- Production deployment (use ml-production)

**The PyTorch pack is for**: PyTorch-specific implementation, infrastructure, debugging, and optimization issues.

---

## Diagnosis-First Principle

**Critical**: Many PyTorch issues require diagnosis before solutions:

| Issue Type | Diagnosis Skill | Then Solution Skill |
|------------|----------------|---------------------|
| Performance | performance-profiling | mixed-precision / distributed |
| Memory | tensor-memory (profiling section) | tensor-memory (optimization) |
| NaN/Inf | debugging-techniques | mixed-precision / module-design |
| Training bugs | debugging-techniques | Appropriate fix |

**If it's unclear what's wrong, route to the diagnostic skill first.**

---

## PyTorch Engineering Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

1. [tensor-operations-and-memory.md](tensor-operations-and-memory.md) - Memory management, efficient operations, profiling
2. [module-design-patterns.md](module-design-patterns.md) - Model structure, nn.Module best practices, initialization
3. [distributed-training-strategies.md](distributed-training-strategies.md) - DDP setup, multi-node, synchronization patterns
4. [mixed-precision-and-optimization.md](mixed-precision-and-optimization.md) - AMP, FP16/BF16, gradient scaling, numerical stability
5. [performance-profiling.md](performance-profiling.md) - PyTorch profiler, bottleneck identification, optimization strategies
6. [debugging-techniques.md](debugging-techniques.md) - NaN/Inf debugging, gradient checking, systematic troubleshooting
7. [checkpointing-and-reproducibility.md](checkpointing-and-reproducibility.md) - Complete checkpointing, RNG state, determinism
8. [custom-autograd-functions.md](custom-autograd-functions.md) - torch.autograd.Function, custom gradients, efficient backward

---

## Integration Notes

**Phase 1 - Standalone**: PyTorch skills are self-contained.

**Future cross-references**:

- training-optimization (framework-agnostic training techniques)
- neural-architectures (architecture selection before implementation)
- ml-production (deployment after training)

**Current focus**: Route within the PyTorch pack only. Other packs handle other concerns.