---
name: using-neural-architectures
description: Architecture selection router: CNNs, Transformers, RNNs, GANs, GNNs by data modality and constraints
mode: true
pack: neural-architectures
faction: yzmir
---

# Using Neural Architectures: Architecture Selection Router

Architecture selection comes BEFORE training optimization. Wrong architecture = no amount of training will fix it.

This meta-skill routes you to the right architecture guidance based on:

- Data modality (images, sequences, graphs, etc.)
- Problem type (classification, generation, regression)
- Constraints (data size, compute, latency, interpretability)

Load this skill when architecture decisions are needed.

## When to Use This Skill

Use this skill when:

- ✅ Selecting an architecture for a new problem
- ✅ Comparing architecture families (CNN vs Transformer, RNN vs Transformer, etc.)
- ✅ Designing custom network topology
- ✅ Troubleshooting architectural instability (deep networks, gradient issues)
- ✅ Understanding when to use specialized architectures (GNNs, generative models)

DO NOT use for:

- ❌ Training/optimization issues (use training-optimization pack)
- ❌ PyTorch implementation details (use pytorch-engineering pack)
- ❌ Production deployment (use ml-production pack)

**When in doubt:** If choosing WHAT architecture → this skill. If training/deploying architecture → different pack.

---

## Core Routing Logic

### Step 1: Identify Data Modality

**Question to ask:** "What type of data are you working with?"

| Data Type | Route To | Why |
|-----------|----------|-----|
| Images (photos, medical scans, etc.) | [cnn-families-and-selection.md](cnn-families-and-selection.md) | CNNs excel at spatial hierarchies |
| Sequences (time series, text, audio) | [sequence-models-comparison.md](sequence-models-comparison.md) | Temporal dependencies need sequential models |
| Graphs (social networks, molecules) | [graph-neural-networks-basics.md](graph-neural-networks-basics.md) | Graph structure requires GNNs |
| Generation task (create images, text) | [generative-model-families.md](generative-model-families.md) | Generative models are specialized |
| Multiple modalities (text + images) | [architecture-design-principles.md](architecture-design-principles.md) | Need custom design |
| Unclear / Generic | [architecture-design-principles.md](architecture-design-principles.md) | Start with fundamentals |

### Step 2: Check for Special Requirements

**If any of these apply, address FIRST:**

| Requirement | Route To | Priority |
|-------------|----------|----------|
| Deep network (> 20 layers) unstable | [normalization-techniques.md](normalization-techniques.md) | CRITICAL - fix before continuing |
| Need attention mechanisms | [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) | Specialized component |
| Custom architecture design | [architecture-design-principles.md](architecture-design-principles.md) | Foundation before specifics |
| Transformer-specific question | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | Specialized architecture |

### Step 3: Consider Problem Characteristics

**Clarify BEFORE routing:**

Ask:

- "How large is your dataset?" (Small < 10k, Medium 10k-1M, Large > 1M)
- "What are your computational constraints?" (Edge device, cloud, GPU availability)
- "What are your latency requirements?" (Real-time, batch, offline)
- "Do you need interpretability?" (Clinical, research, production)

These answers determine architecture appropriateness.
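The three steps above can be sketched as a small lookup. This is only an illustration, not part of the pack: the function names `route` and `dataset_size_bucket`, the dictionary keys, and the choice to count exactly 1M examples as "medium" are assumptions. The file names and size thresholds are taken from the tables above.

```python
# Hypothetical sketch of Steps 1-3. Keys and function names are illustrative;
# only the target file names and thresholds come from the tables above.

# Step 1: data modality -> specialist skill
MODALITY_ROUTES = {
    "images": "cnn-families-and-selection.md",
    "sequences": "sequence-models-comparison.md",
    "graphs": "graph-neural-networks-basics.md",
    "generation": "generative-model-families.md",
    "multimodal": "architecture-design-principles.md",
}
# Unclear or generic problems fall back to the fundamentals skill.
FALLBACK_ROUTE = "architecture-design-principles.md"

# Step 2: special requirements, addressed before the modality route
SPECIAL_ROUTES = {
    "unstable_deep_network": "normalization-techniques.md",
    "attention": "attention-mechanisms-catalog.md",
    "custom_design": "architecture-design-principles.md",
    "transformer_question": "transformer-architecture-deepdive.md",
}


def route(modality, special_requirements=()):
    """Return skills to load, special requirements first, without duplicates."""
    routes = []
    for requirement in special_requirements:
        target = SPECIAL_ROUTES[requirement]
        if target not in routes:
            routes.append(target)
    primary = MODALITY_ROUTES.get(modality, FALLBACK_ROUTE)
    if primary not in routes:
        routes.append(primary)
    return routes


def dataset_size_bucket(n_examples):
    """Step 3: bucket dataset size (Small < 10k, Medium 10k-1M, Large > 1M)."""
    if n_examples < 10_000:
        return "small"
    if n_examples <= 1_000_000:
        return "medium"
    return "large"
```

A lookup like this covers only the unambiguous cases. When the modality or constraints are not stated, the clarifying questions in Step 3 still come first.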
---

## Routing by Data Modality

### Images → CNN Families

**Symptoms triggering this route:**

- "classify images"
- "object detection"
- "semantic segmentation"
- "medical imaging"
- "computer vision"

**Route to:** See [cnn-families-and-selection.md](cnn-families-and-selection.md) for CNN architecture selection and comparison.

**When to route here:**

- ANY vision task (CNNs are default for spatial data)
- Even if considering Transformers, check CNN families first (often better with less data)

**Clarifying questions:**

- "Dataset size?" (< 10k → Start with proven CNNs, > 100k → Consider ViT)
- "Deployment target?" (Edge → EfficientNet, Cloud → Anything)
- "Task type?" (Classification → ResNet/EfficientNet, Detection → YOLO/Faster-RCNN)

---

### Sequences → Sequence Models Comparison

**Symptoms triggering this route:**

- "time series"
- "forecasting"
- "natural language" (NLP)
- "sequential data"
- "temporal patterns"
- "RNN vs LSTM vs Transformer"

**Route to:** See [sequence-models-comparison.md](sequence-models-comparison.md) for sequential model selection (RNN, LSTM, Transformer, TCN).

**When to route here:**

- ANY sequential data
- When user asks "RNN vs LSTM" (skill will present modern alternatives)
- Time-dependent patterns

**Clarifying questions:**

- "Sequence length?" (< 100 → RNN/LSTM/TCN, 100-1000 → Transformer, > 1000 → Sparse Transformers)
- "Latency requirements?" (Real-time → TCN/LSTM, Offline → Transformer)
- "Data volume?" (Small → Simpler models, Large → Transformers)

**CRITICAL:** Challenge "RNN vs LSTM" premise if they ask. Modern alternatives (Transformers, TCN) often better.

---

### Graphs → Graph Neural Networks

**Symptoms triggering this route:**

- "social network"
- "molecular structure"
- "knowledge graph"
- "graph data"
- "node classification"
- "link prediction"
- "graph embeddings"

**Route to:** See [graph-neural-networks-basics.md](graph-neural-networks-basics.md) for GNN architectures and graph learning.

**When to route here:**

- Data has explicit graph structure (nodes + edges)
- Relational information is important
- Network topology matters

**Red flag:** If treating graph as tabular data (extracting features and ignoring edges) → WRONG. Route to GNN skill.

---

### Generation → Generative Model Families

**Symptoms triggering this route:**

- "generate images"
- "synthesize data"
- "GAN vs VAE vs Diffusion"
- "image-to-image translation"
- "style transfer"
- "generative modeling"

**Route to:** See [generative-model-families.md](generative-model-families.md) for GANs, VAEs, and Diffusion models.

**When to route here:**

- Goal is to CREATE data, not classify/predict
- Need to sample from distribution
- Data augmentation through generation

**Clarifying questions:**

- "Use case?" (Real-time game → GAN, Art/research → Diffusion, Fast training → VAE)
- "Quality vs speed?" (Quality → Diffusion, Speed → GAN)
- "Controllability?" (Fine control → StyleGAN/Conditional models)

**CRITICAL:** Different generative models have VERY different trade-offs. Must clarify requirements.

---

## Routing by Architecture Component

### Attention Mechanisms

**Symptoms triggering this route:**

- "when to use attention"
- "self-attention vs cross-attention"
- "attention in CNNs"
- "attention bottleneck"
- "multi-head attention"

**Route to:** See [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) for attention mechanism selection and design.

**When to route here:**

- Designing custom architecture that might benefit from attention
- Understanding where attention helps vs hinders
- Comparing attention variants

**NOT for:** General Transformer questions → [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) instead

---

### Transformer Deep Dive

**Symptoms triggering this route:**

- "how do transformers work"
- "Vision Transformer (ViT)"
- "BERT architecture"
- "positional encoding"
- "transformer blocks"
- "scaling transformers"

**Route to:** See [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) for Transformer internals and implementation.

**When to route here:**

- Implementing/customizing transformers
- Understanding transformer internals
- Debugging transformer-specific issues

**Cross-reference:**

- For sequence models generally → [sequence-models-comparison.md](sequence-models-comparison.md) (includes transformers in context)
- For LLMs specifically → `yzmir/llm-specialist/transformer-for-llms` (LLM-specific transformers)

---

### Normalization Techniques

**Symptoms triggering this route:**

- "gradient explosion"
- "training instability in deep network"
- "BatchNorm vs LayerNorm"
- "normalization layers"
- "50+ layer network won't train"

**Route to:** See [normalization-techniques.md](normalization-techniques.md) for deep network stability and normalization methods.

**When to route here:**

- Deep networks (> 20 layers) with training instability
- Choosing between normalization methods
- Architectural stability issues

**CRITICAL:** This is often the ROOT CAUSE of "training won't work" - fix architecture before blaming hyperparameters.
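The symptom lists for the attention, Transformer, and normalization routes above can be read as a phrase-to-skill table. The sketch below is one possible way to encode them, assuming plain case-insensitive substring matching. The name `match_component_routes` and the matching strategy are hypothetical; only the phrases and file names come from this skill.

```python
# Hypothetical sketch: naive substring matching over the symptom phrases
# listed above. This is an illustration, not the pack's actual mechanism.
COMPONENT_SYMPTOMS = {
    "attention-mechanisms-catalog.md": [
        "when to use attention",
        "self-attention vs cross-attention",
        "attention in cnns",
        "attention bottleneck",
        "multi-head attention",
    ],
    "transformer-architecture-deepdive.md": [
        "how do transformers work",
        "vision transformer",
        "bert architecture",
        "positional encoding",
        "transformer blocks",
        "scaling transformers",
    ],
    "normalization-techniques.md": [
        "gradient explosion",
        "training instability in deep network",
        "batchnorm vs layernorm",
        "normalization layers",
        "50+ layer network won't train",
    ],
}


def match_component_routes(query):
    """Return every component skill whose symptom phrase occurs in the query."""
    text = query.lower()
    return [
        skill
        for skill, phrases in COMPONENT_SYMPTOMS.items()
        if any(phrase in text for phrase in phrases)
    ]


# An empty result means no listed symptom matched, so ask a clarifying
# question instead of guessing a route.
print(match_component_routes("Should I use BatchNorm vs LayerNorm?"))
```

Keyword matching of this kind is deliberately crude. It cannot tell a selection question from a training or implementation question, so it complements the clarifying questions and does not replace them.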
---

### Architecture Design Principles

**Symptoms triggering this route:**

- "how to design architecture"
- "architecture best practices"
- "when to use skip connections"
- "how deep should network be"
- "custom architecture for [novel task]"
- Unclear problem modality

**Route to:** See [architecture-design-principles.md](architecture-design-principles.md) for custom architecture design fundamentals.

**When to route here:**

- Designing custom architectures
- Novel problems without established architecture
- Understanding WHY architectures work
- User is unsure what modality/problem type they have

**This is the foundational skill** - route here if other specific skills don't match.

---

## Multi-Modal / Cross-Pack Routing

### When Problem Spans Multiple Modalities

**Example:** "Text + image classification" (multimodal)

**Route to ALL THREE, in order:**

1. [sequence-models-comparison.md](sequence-models-comparison.md) (for text)
2. [cnn-families-and-selection.md](cnn-families-and-selection.md) (for images)
3. [architecture-design-principles.md](architecture-design-principles.md) (for fusion strategy)

**Order matters:** Understand individual modalities BEFORE fusion.

### When Architecture + Other Concerns

**Example:** "Select architecture AND optimize training"

**Route order:**

1. Architecture skill FIRST (this pack)
2. Training-optimization SECOND (after architecture chosen)

**Why:** Wrong architecture can't be fixed by better training.

**Example:** "Select architecture AND deploy efficiently"

**Route order:**

1. Architecture skill FIRST
2. ML-production SECOND (quantization, serving)

**Deployment constraints might influence architecture choice** - if so, note constraints during architecture selection.
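The ordering rules in this section (individual modalities first, then fusion, then other packs) can be sketched as a small planner. This is an illustration under assumptions: `plan_routes`, the modality keys, and the follow-up keys are hypothetical names. The targets and their order follow the examples above.

```python
# Hypothetical sketch of the ordering rules above; names are illustrative.
MODALITY_SKILLS = {
    "text": "sequence-models-comparison.md",
    "images": "cnn-families-and-selection.md",
}
FUSION_SKILL = "architecture-design-principles.md"

# Other packs come only after the architecture has been chosen.
FOLLOW_UP_PACKS = {
    "training": "training-optimization",
    "deployment": "ml-production",
}


def plan_routes(modalities, follow_ups=()):
    """Order routes: per-modality skills, then fusion, then other packs."""
    plan = [MODALITY_SKILLS[m] for m in modalities]
    if len(modalities) > 1:
        # Fusion strategy is considered only after each modality is understood.
        plan.append(FUSION_SKILL)
    plan.extend(FOLLOW_UP_PACKS[f] for f in follow_ups)
    return plan


print(plan_routes(["text", "images"], follow_ups=["training"]))
```

Keeping the follow-up packs at the end of the plan mirrors the rule that a wrong architecture cannot be fixed by better training or serving.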
---

## Common Routing Mistakes (DON'T DO THESE)

| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "My transformer won't train" | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | training-optimization | Training issue, not architecture understanding |
| "Deploy image classifier" | [cnn-families-and-selection.md](cnn-families-and-selection.md) | ml-production | Deployment, not selection |
| "ViT vs ResNet for medical imaging" | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | [cnn-families-and-selection.md](cnn-families-and-selection.md) | Comparative selection, not single architecture detail |
| "Implement BatchNorm in PyTorch" | [normalization-techniques.md](normalization-techniques.md) | pytorch-engineering | Implementation, not architecture concept |
| "GAN won't converge" | [generative-model-families.md](generative-model-families.md) | training-optimization | Training stability, not architecture selection |
| "Which optimizer for CNN" | [cnn-families-and-selection.md](cnn-families-and-selection.md) | training-optimization | Optimization, not architecture |

**Rule:** Architecture pack is for CHOOSING and DESIGNING architectures. Training/deployment/implementation are other packs.

---

## Red Flags: Stop and Clarify

If query contains these patterns, ASK clarifying questions before routing:

| Pattern | Why Clarify | What to Ask |
|---------|-------------|-------------|
| "Best architecture for X" | "Best" depends on constraints | "What are your data size, compute, and latency constraints?" |
| Generic problem description | Can't route without modality | "What type of data? (images, sequences, graphs, etc.)" |
| Latest trend mentioned (ViT, Diffusion) | Recency bias risk | "Have you considered alternatives? What are your specific requirements?" |
| "Should I use X or Y" | May be wrong question | "What's the underlying problem? There might be option Z." |
| Very deep network (> 50 layers) | Likely needs normalization first | "Are you using normalization layers? Skip connections?" |

**Never guess modality or constraints. Always clarify.**

---

## Recency Bias: Resistance Table

| Trendy Architecture | When NOT to Use | Better Alternative |
|---------------------|-----------------|--------------------|
| **Vision Transformers (ViT)** | Small datasets (< 10k images) | CNNs (ResNet, EfficientNet) |
| **Vision Transformers (ViT)** | Edge deployment (latency/power) | EfficientNets, MobileNets |
| **Transformers (general)** | Very small datasets | RNNs, CNNs (less capacity, less overfit) |
| **Diffusion Models** | Real-time generation needed | GANs (1 forward pass vs 50-1000 steps) |
| **Diffusion Models** | Limited compute for training | VAEs (faster training) |
| **Graph Transformers** | Small graphs (< 100 nodes) | Standard GNNs (GCN, GAT) simpler and effective |
| **LLMs (GPT-style)** | < 1M tokens of training data | Simpler language models or fine-tuning |

**Counter-narrative:** "New architecture ≠ better for your use case. Match architecture to constraints."

---

## Decision Tree

```
Start here: What's your primary goal?

┌─ SELECT architecture for task
│  ├─ Data modality?
│  │  ├─ Images → [cnn-families-and-selection.md](cnn-families-and-selection.md)
│  │  ├─ Sequences → [sequence-models-comparison.md](sequence-models-comparison.md)
│  │  ├─ Graphs → [graph-neural-networks-basics.md](graph-neural-networks-basics.md)
│  │  ├─ Generation → [generative-model-families.md](generative-model-families.md)
│  │  └─ Unknown/Multiple → [architecture-design-principles.md](architecture-design-principles.md)
│  └─ Special requirements?
│     ├─ Deep network (>20 layers) unstable → [normalization-techniques.md](normalization-techniques.md) (CRITICAL)
│     ├─ Need attention mechanism → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md)
│     └─ None → Proceed with modality-based route
│
├─ UNDERSTAND specific architecture
│  ├─ Transformers → [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md)
│  ├─ Attention → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md)
│  ├─ Normalization → [normalization-techniques.md](normalization-techniques.md)
│  └─ General principles → [architecture-design-principles.md](architecture-design-principles.md)
│
├─ DESIGN custom architecture
│  └─ [architecture-design-principles.md](architecture-design-principles.md) (start here always)
│
└─ COMPARE architectures
   ├─ CNNs (ResNet vs EfficientNet) → [cnn-families-and-selection.md](cnn-families-and-selection.md)
   ├─ Sequence models (RNN vs Transformer) → [sequence-models-comparison.md](sequence-models-comparison.md)
   ├─ Generative (GAN vs Diffusion) → [generative-model-families.md](generative-model-families.md)
   └─ General comparison → [architecture-design-principles.md](architecture-design-principles.md)
```

---

## Workflow

**Standard Architecture Selection Workflow:**

```
1. Clarify Problem
   ☐ What data modality? (images, sequences, graphs, etc.)
   ☐ What's the task? (classification, generation, regression, etc.)
   ☐ Dataset size?
   ☐ Computational constraints?
   ☐ Latency requirements?
   ☐ Interpretability needs?

2. Route Based on Modality
   ☐ Images → [cnn-families-and-selection.md](cnn-families-and-selection.md)
   ☐ Sequences → [sequence-models-comparison.md](sequence-models-comparison.md)
   ☐ Graphs → [graph-neural-networks-basics.md](graph-neural-networks-basics.md)
   ☐ Generation → [generative-model-families.md](generative-model-families.md)
   ☐ Custom/Unclear → [architecture-design-principles.md](architecture-design-principles.md)

3. Check for Critical Issues
   ☐ Deep network unstable? → [normalization-techniques.md](normalization-techniques.md) FIRST
   ☐ Need specialized component? → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) or [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md)

4. Apply Architecture Skill
   ☐ Follow guidance from routed skill
   ☐ Consider trade-offs (accuracy vs speed vs data requirements)

5. Cross-Pack if Needed
   ☐ Architecture chosen → training-optimization (for training)
   ☐ Architecture chosen → ml-production (for deployment)
```

---

## Rationalization Table

| Rationalization | Reality | Counter |
|-----------------|---------|---------|
| "Transformers are SOTA, recommend them" | SOTA on benchmark ≠ best for user's constraints | "Ask about dataset size and compute first" |
| "User said RNN vs LSTM, answer that" | Question premise might be outdated | "Challenge: Have you considered Transformers or TCN?" |
| "Just recommend latest architecture" | Latest ≠ appropriate | "Match architecture to requirements, not trends" |
| "Architecture doesn't matter, training matters" | Wrong architecture can't be fixed by training | "Architecture is foundation - get it right first" |
| "They seem rushed, skip clarification" | Wrong route wastes more time than clarification | "30 seconds to clarify saves hours of wasted effort" |
| "Generic architecture advice is safe" | Generic = useless for specific domains | "Route to domain-specific skill for actionable guidance" |

---

## Integration with Other Packs

### After Architecture Selection

Once architecture is chosen, route to:

**Training the architecture:**
→ `yzmir/training-optimization/using-training-optimization`

- Optimizer selection
- Learning rate schedules
- Debugging training issues

**Implementing in PyTorch:**
→ `yzmir/pytorch-engineering/using-pytorch-engineering`

- Module design patterns
- Performance optimization
- Custom components

**Deploying to production:**
→ `yzmir/ml-production/using-ml-production`

- Model serving
- Quantization
- Inference optimization

### Before Architecture Selection

If problem involves:

**Reinforcement learning:**
→ `yzmir/deep-rl/using-deep-rl` FIRST

- RL algorithms dictate architecture requirements
- Value networks vs policy networks have different needs

**Large language models:**
→ `yzmir/llm-specialist/using-llm-specialist` FIRST

- LLM architectures are specialized transformers
- Different considerations than general sequence models

**Architecture is downstream of algorithm choice in RL and LLMs.**

---

## Summary

**Use this meta-skill to:**

- ✅ Route architecture queries to appropriate specialized skill
- ✅ Identify data modality and problem type
- ✅ Clarify constraints before recommending
- ✅ Resist recency bias (latest ≠ best)
- ✅ Recognize when architecture is the problem (vs training/implementation)

## Neural Architecture Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

1. [architecture-design-principles.md](architecture-design-principles.md) - Custom design, architectural best practices, skip connections, network depth fundamentals
2. [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) - Self-attention, cross-attention, multi-head attention, attention in CNNs, attention variants comparison
3. [cnn-families-and-selection.md](cnn-families-and-selection.md) - ResNet, EfficientNet, MobileNet, YOLO, computer vision architecture selection
4. [generative-model-families.md](generative-model-families.md) - GANs, VAEs, Diffusion models, image generation, style transfer, generative modeling trade-offs
5. [graph-neural-networks-basics.md](graph-neural-networks-basics.md) - GCN, GAT, node classification, link prediction, graph embeddings, molecular structures
6. [normalization-techniques.md](normalization-techniques.md) - BatchNorm, LayerNorm, GroupNorm, training stability for deep networks (>20 layers)
7. [sequence-models-comparison.md](sequence-models-comparison.md) - RNN, LSTM, Transformer, TCN comparison, time series, NLP, sequential data
8. [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) - Transformer internals, ViT, BERT, positional encoding, scaling transformers

**Critical principle:** Architecture comes BEFORE training. Get this right first.

---

**END OF SKILL**