Initial commit
12
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,12 @@
{
  "name": "yzmir-neural-architectures",
  "description": "Neural architectures - CNNs, Transformers, RNNs, selection guidance - 9 skills",
  "version": "1.0.1",
  "author": {
    "name": "tachyon-beep",
    "url": "https://github.com/tachyon-beep"
  },
  "skills": [
    "./skills"
  ]
}
3
README.md
Normal file
@@ -0,0 +1,3 @@
# yzmir-neural-architectures

Neural architectures - CNNs, Transformers, RNNs, selection guidance - 9 skills
77
plugin.lock.json
Normal file
@@ -0,0 +1,77 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:tachyon-beep/skillpacks:plugins/yzmir-neural-architectures",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "52681919f25de737a14d958d80b685171797dfdb",
    "treeHash": "fb6fcb8194fb2cba8912c22cc1b73554c78da695cf2417a0331443a4d8e719bc",
    "generatedAt": "2025-11-28T10:28:34.235937Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "yzmir-neural-architectures",
    "description": "Neural architectures - CNNs, Transformers, RNNs, selection guidance - 9 skills",
    "version": "1.0.1"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "cf12418cc5cc226683c463dca6c54e98f97f2ddf57482d593839f430acc36b86"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "9893529f120b1582ad62d899531401988bdf89d6b221e057239e09399d47e255"
      },
      {
        "path": "skills/using-neural-architectures/architecture-design-principles.md",
        "sha256": "bd1cde3e2cf997b6e42dc52bc28638e644ee387c81b07464edc7fb4cf33d4329"
      },
      {
        "path": "skills/using-neural-architectures/sequence-models-comparison.md",
        "sha256": "5c041c08261cf04e47edd3ec092db04a8b4c725fb6bb91620df54005881cbb66"
      },
      {
        "path": "skills/using-neural-architectures/normalization-techniques.md",
        "sha256": "c87c82e926951a963d7909321c60a62b5cf6d09dbcbb576472eda46d72993766"
      },
      {
        "path": "skills/using-neural-architectures/graph-neural-networks-basics.md",
        "sha256": "85a968972f0c1c226d6bcb155b231560af0eab52326f419d308b1a6a22ebd7c5"
      },
      {
        "path": "skills/using-neural-architectures/generative-model-families.md",
        "sha256": "7c527707a0d3ecab99e83079880b1166c4310de62a743823ae747a03545180b3"
      },
      {
        "path": "skills/using-neural-architectures/cnn-families-and-selection.md",
        "sha256": "acce2a55a84012259119e70d3eb035aee08beb124d0df1aab009ae7fdd01ff51"
      },
      {
        "path": "skills/using-neural-architectures/SKILL.md",
        "sha256": "b8dc5fac750ca1740bd97591409701ca6162e0d5dc5faaf7a2cfcf7feef432b3"
      },
      {
        "path": "skills/using-neural-architectures/attention-mechanisms-catalog.md",
        "sha256": "3eab5267c7e5a1263d08be2dca4ef236ad66b554e08aec0e0cd85d5c1f8b5500"
      },
      {
        "path": "skills/using-neural-architectures/transformer-architecture-deepdive.md",
        "sha256": "1913626a82ff02f638d10aa00644fcc59fc9836329dc152a7becd9aa030e0937"
      }
    ],
    "dirSha256": "fb6fcb8194fb2cba8912c22cc1b73554c78da695cf2417a0331443a4d8e719bc"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}
496
skills/using-neural-architectures/SKILL.md
Normal file
@@ -0,0 +1,496 @@
---
name: using-neural-architectures
description: "Architecture selection router: CNNs, Transformers, RNNs, GANs, GNNs by data modality and constraints"
mode: true
pack: neural-architectures
faction: yzmir
---

# Using Neural Architectures: Architecture Selection Router

<CRITICAL_CONTEXT>
Architecture selection comes BEFORE training optimization. Wrong architecture = no amount of training will fix it.

This meta-skill routes you to the right architecture guidance based on:
- Data modality (images, sequences, graphs, etc.)
- Problem type (classification, generation, regression)
- Constraints (data size, compute, latency, interpretability)

Load this skill when architecture decisions are needed.
</CRITICAL_CONTEXT>

## When to Use This Skill

Use this skill when:
- ✅ Selecting an architecture for a new problem
- ✅ Comparing architecture families (CNN vs Transformer, RNN vs Transformer, etc.)
- ✅ Designing custom network topology
- ✅ Troubleshooting architectural instability (deep networks, gradient issues)
- ✅ Understanding when to use specialized architectures (GNNs, generative models)

DO NOT use for:
- ❌ Training/optimization issues (use training-optimization pack)
- ❌ PyTorch implementation details (use pytorch-engineering pack)
- ❌ Production deployment (use ml-production pack)

**When in doubt:** If choosing WHAT architecture → this skill. If training/deploying architecture → different pack.

---

## Core Routing Logic

### Step 1: Identify Data Modality

**Question to ask:** "What type of data are you working with?"

| Data Type | Route To | Why |
|-----------|----------|-----|
| Images (photos, medical scans, etc.) | [cnn-families-and-selection.md](cnn-families-and-selection.md) | CNNs excel at spatial hierarchies |
| Sequences (time series, text, audio) | [sequence-models-comparison.md](sequence-models-comparison.md) | Temporal dependencies need sequential models |
| Graphs (social networks, molecules) | [graph-neural-networks-basics.md](graph-neural-networks-basics.md) | Graph structure requires GNNs |
| Generation task (create images, text) | [generative-model-families.md](generative-model-families.md) | Generative models are specialized |
| Multiple modalities (text + images) | [architecture-design-principles.md](architecture-design-principles.md) | Need custom design |
| Unclear / Generic | [architecture-design-principles.md](architecture-design-principles.md) | Start with fundamentals |

### Step 2: Check for Special Requirements

**If any of these apply, address FIRST:**

| Requirement | Route To | Priority |
|-------------|----------|----------|
| Deep network (> 20 layers) unstable | [normalization-techniques.md](normalization-techniques.md) | CRITICAL - fix before continuing |
| Need attention mechanisms | [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) | Specialized component |
| Custom architecture design | [architecture-design-principles.md](architecture-design-principles.md) | Foundation before specifics |
| Transformer-specific question | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | Specialized architecture |

### Step 3: Consider Problem Characteristics

**Clarify BEFORE routing:**

Ask:
- "How large is your dataset?" (Small < 10k, Medium 10k-1M, Large > 1M)
- "What are your computational constraints?" (Edge device, cloud, GPU availability)
- "What are your latency requirements?" (Real-time, batch, offline)
- "Do you need interpretability?" (Clinical, research, production)

These answers determine architecture appropriateness.

---
## Routing by Data Modality

### Images → CNN Families

**Symptoms triggering this route:**
- "classify images"
- "object detection"
- "semantic segmentation"
- "medical imaging"
- "computer vision"

**Route to:** See [cnn-families-and-selection.md](cnn-families-and-selection.md) for CNN architecture selection and comparison.

**When to route here:**
- ANY vision task (CNNs are default for spatial data)
- Even if considering Transformers, check CNN families first (often better with less data)

**Clarifying questions:**
- "Dataset size?" (< 10k → Start with proven CNNs, > 100k → Consider ViT)
- "Deployment target?" (Edge → EfficientNet, Cloud → Anything)
- "Task type?" (Classification → ResNet/EfficientNet, Detection → YOLO/Faster-RCNN)

---

### Sequences → Sequence Models Comparison

**Symptoms triggering this route:**
- "time series"
- "forecasting"
- "natural language" (NLP)
- "sequential data"
- "temporal patterns"
- "RNN vs LSTM vs Transformer"

**Route to:** See [sequence-models-comparison.md](sequence-models-comparison.md) for sequential model selection (RNN, LSTM, Transformer, TCN).

**When to route here:**
- ANY sequential data
- When user asks "RNN vs LSTM" (skill will present modern alternatives)
- Time-dependent patterns

**Clarifying questions:**
- "Sequence length?" (< 100 → RNN/LSTM/TCN, 100-1000 → Transformer, > 1000 → Sparse Transformers)
- "Latency requirements?" (Real-time → TCN/LSTM, Offline → Transformer)
- "Data volume?" (Small → Simpler models, Large → Transformers)

**CRITICAL:** Challenge "RNN vs LSTM" premise if they ask. Modern alternatives (Transformers, TCN) often better.

---

### Graphs → Graph Neural Networks

**Symptoms triggering this route:**
- "social network"
- "molecular structure"
- "knowledge graph"
- "graph data"
- "node classification"
- "link prediction"
- "graph embeddings"

**Route to:** See [graph-neural-networks-basics.md](graph-neural-networks-basics.md) for GNN architectures and graph learning.

**When to route here:**
- Data has explicit graph structure (nodes + edges)
- Relational information is important
- Network topology matters

**Red flag:** If treating graph as tabular data (extracting features and ignoring edges) → WRONG. Route to GNN skill.

---
### Generation → Generative Model Families

**Symptoms triggering this route:**
- "generate images"
- "synthesize data"
- "GAN vs VAE vs Diffusion"
- "image-to-image translation"
- "style transfer"
- "generative modeling"

**Route to:** See [generative-model-families.md](generative-model-families.md) for GANs, VAEs, and Diffusion models.

**When to route here:**
- Goal is to CREATE data, not classify/predict
- Need to sample from distribution
- Data augmentation through generation

**Clarifying questions:**
- "Use case?" (Real-time game → GAN, Art/research → Diffusion, Fast training → VAE)
- "Quality vs speed?" (Quality → Diffusion, Speed → GAN)
- "Controllability?" (Fine control → StyleGAN/Conditional models)

**CRITICAL:** Different generative models have VERY different trade-offs. Must clarify requirements.

---

## Routing by Architecture Component

### Attention Mechanisms

**Symptoms triggering this route:**
- "when to use attention"
- "self-attention vs cross-attention"
- "attention in CNNs"
- "attention bottleneck"
- "multi-head attention"

**Route to:** See [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) for attention mechanism selection and design.

**When to route here:**
- Designing custom architecture that might benefit from attention
- Understanding where attention helps vs hinders
- Comparing attention variants

**NOT for:** General Transformer questions → [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) instead

---

### Transformer Deep Dive

**Symptoms triggering this route:**
- "how do transformers work"
- "Vision Transformer (ViT)"
- "BERT architecture"
- "positional encoding"
- "transformer blocks"
- "scaling transformers"

**Route to:** See [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) for Transformer internals and implementation.

**When to route here:**
- Implementing/customizing transformers
- Understanding transformer internals
- Debugging transformer-specific issues

**Cross-reference:**
- For sequence models generally → [sequence-models-comparison.md](sequence-models-comparison.md) (includes transformers in context)
- For LLMs specifically → `yzmir/llm-specialist/transformer-for-llms` (LLM-specific transformers)

---

### Normalization Techniques

**Symptoms triggering this route:**
- "gradient explosion"
- "training instability in deep network"
- "BatchNorm vs LayerNorm"
- "normalization layers"
- "50+ layer network won't train"

**Route to:** See [normalization-techniques.md](normalization-techniques.md) for deep network stability and normalization methods.

**When to route here:**
- Deep networks (> 20 layers) with training instability
- Choosing between normalization methods
- Architectural stability issues

**CRITICAL:** This is often the ROOT CAUSE of "training won't work" - fix architecture before blaming hyperparameters.

---

### Architecture Design Principles

**Symptoms triggering this route:**
- "how to design architecture"
- "architecture best practices"
- "when to use skip connections"
- "how deep should network be"
- "custom architecture for [novel task]"
- Unclear problem modality

**Route to:** See [architecture-design-principles.md](architecture-design-principles.md) for custom architecture design fundamentals.

**When to route here:**
- Designing custom architectures
- Novel problems without established architecture
- Understanding WHY architectures work
- User is unsure what modality/problem type they have

**This is the foundational skill** - route here if other specific skills don't match.

---
## Multi-Modal / Cross-Pack Routing

### When Problem Spans Multiple Modalities

**Example:** "Text + image classification" (multimodal)

**Route to BOTH:**
1. [sequence-models-comparison.md](sequence-models-comparison.md) (for text)
2. [cnn-families-and-selection.md](cnn-families-and-selection.md) (for images)
3. [architecture-design-principles.md](architecture-design-principles.md) (for fusion strategy)

**Order matters:** Understand individual modalities BEFORE fusion.

### When Architecture + Other Concerns

**Example:** "Select architecture AND optimize training"

**Route order:**
1. Architecture skill FIRST (this pack)
2. Training-optimization SECOND (after architecture chosen)

**Why:** Wrong architecture can't be fixed by better training.

**Example:** "Select architecture AND deploy efficiently"

**Route order:**
1. Architecture skill FIRST
2. ML-production SECOND (quantization, serving)

**Deployment constraints might influence architecture choice** - if so, note constraints during architecture selection.

---

## Common Routing Mistakes (DON'T DO THESE)

| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "My transformer won't train" | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | training-optimization | Training issue, not architecture understanding |
| "Deploy image classifier" | [cnn-families-and-selection.md](cnn-families-and-selection.md) | ml-production | Deployment, not selection |
| "ViT vs ResNet for medical imaging" | [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) | [cnn-families-and-selection.md](cnn-families-and-selection.md) | Comparative selection, not single architecture detail |
| "Implement BatchNorm in PyTorch" | [normalization-techniques.md](normalization-techniques.md) | pytorch-engineering | Implementation, not architecture concept |
| "GAN won't converge" | [generative-model-families.md](generative-model-families.md) | training-optimization | Training stability, not architecture selection |
| "Which optimizer for CNN" | [cnn-families-and-selection.md](cnn-families-and-selection.md) | training-optimization | Optimization, not architecture |

**Rule:** Architecture pack is for CHOOSING and DESIGNING architectures. Training/deployment/implementation are other packs.

---

## Red Flags: Stop and Clarify

If query contains these patterns, ASK clarifying questions before routing:

| Pattern | Why Clarify | What to Ask |
|---------|-------------|--------------|
| "Best architecture for X" | "Best" depends on constraints | "What are your data size, compute, and latency constraints?" |
| Generic problem description | Can't route without modality | "What type of data? (images, sequences, graphs, etc.)" |
| Latest trend mentioned (ViT, Diffusion) | Recency bias risk | "Have you considered alternatives? What are your specific requirements?" |
| "Should I use X or Y" | May be wrong question | "What's the underlying problem? There might be option Z." |
| Very deep network (> 50 layers) | Likely needs normalization first | "Are you using normalization layers? Skip connections?" |

**Never guess modality or constraints. Always clarify.**

---

## Recency Bias: Resistance Table

| Trendy Architecture | When NOT to Use | Better Alternative |
|---------------------|------------------|-------------------|
| **Vision Transformers (ViT)** | Small datasets (< 10k images) | CNNs (ResNet, EfficientNet) |
| **Vision Transformers (ViT)** | Edge deployment (latency/power) | EfficientNets, MobileNets |
| **Transformers (general)** | Very small datasets | RNNs, CNNs (less capacity, less overfit) |
| **Diffusion Models** | Real-time generation needed | GANs (1 forward pass vs 50-1000 steps) |
| **Diffusion Models** | Limited compute for training | VAEs (faster training) |
| **Graph Transformers** | Small graphs (< 100 nodes) | Standard GNNs (GCN, GAT), simpler and effective |
| **LLMs (GPT-style)** | < 1M tokens of training data | Simpler language models or fine-tuning |

**Counter-narrative:** "New architecture ≠ better for your use case. Match architecture to constraints."

---
## Decision Tree

```
Start here: What's your primary goal?

┌─ SELECT architecture for task
│  ├─ Data modality?
│  │  ├─ Images → [cnn-families-and-selection.md](cnn-families-and-selection.md)
│  │  ├─ Sequences → [sequence-models-comparison.md](sequence-models-comparison.md)
│  │  ├─ Graphs → [graph-neural-networks-basics.md](graph-neural-networks-basics.md)
│  │  ├─ Generation → [generative-model-families.md](generative-model-families.md)
│  │  └─ Unknown/Multiple → [architecture-design-principles.md](architecture-design-principles.md)
│  └─ Special requirements?
│     ├─ Deep network (>20 layers) unstable → [normalization-techniques.md](normalization-techniques.md) (CRITICAL)
│     ├─ Need attention mechanism → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md)
│     └─ None → Proceed with modality-based route
│
├─ UNDERSTAND specific architecture
│  ├─ Transformers → [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md)
│  ├─ Attention → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md)
│  ├─ Normalization → [normalization-techniques.md](normalization-techniques.md)
│  └─ General principles → [architecture-design-principles.md](architecture-design-principles.md)
│
├─ DESIGN custom architecture
│  └─ [architecture-design-principles.md](architecture-design-principles.md) (start here always)
│
└─ COMPARE architectures
   ├─ CNNs (ResNet vs EfficientNet) → [cnn-families-and-selection.md](cnn-families-and-selection.md)
   ├─ Sequence models (RNN vs Transformer) → [sequence-models-comparison.md](sequence-models-comparison.md)
   ├─ Generative (GAN vs Diffusion) → [generative-model-families.md](generative-model-families.md)
   └─ General comparison → [architecture-design-principles.md](architecture-design-principles.md)
```

---

## Workflow

**Standard Architecture Selection Workflow:**

```
1. Clarify Problem
   ☐ What data modality? (images, sequences, graphs, etc.)
   ☐ What's the task? (classification, generation, regression, etc.)
   ☐ Dataset size?
   ☐ Computational constraints?
   ☐ Latency requirements?
   ☐ Interpretability needs?

2. Route Based on Modality
   ☐ Images → [cnn-families-and-selection.md](cnn-families-and-selection.md)
   ☐ Sequences → [sequence-models-comparison.md](sequence-models-comparison.md)
   ☐ Graphs → [graph-neural-networks-basics.md](graph-neural-networks-basics.md)
   ☐ Generation → [generative-model-families.md](generative-model-families.md)
   ☐ Custom/Unclear → [architecture-design-principles.md](architecture-design-principles.md)

3. Check for Critical Issues
   ☐ Deep network unstable? → [normalization-techniques.md](normalization-techniques.md) FIRST
   ☐ Need specialized component? → [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) or [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md)

4. Apply Architecture Skill
   ☐ Follow guidance from routed skill
   ☐ Consider trade-offs (accuracy vs speed vs data requirements)

5. Cross-Pack if Needed
   ☐ Architecture chosen → training-optimization (for training)
   ☐ Architecture chosen → ml-production (for deployment)
```

---

## Rationalization Table

| Rationalization | Reality | Counter |
|-----------------|---------|---------|
| "Transformers are SOTA, recommend them" | SOTA on benchmark ≠ best for user's constraints | "Ask about dataset size and compute first" |
| "User said RNN vs LSTM, answer that" | Question premise might be outdated | "Challenge: Have you considered Transformers or TCN?" |
| "Just recommend latest architecture" | Latest ≠ appropriate | "Match architecture to requirements, not trends" |
| "Architecture doesn't matter, training matters" | Wrong architecture can't be fixed by training | "Architecture is foundation - get it right first" |
| "They seem rushed, skip clarification" | Wrong route wastes more time than clarification | "30 seconds to clarify saves hours of wasted effort" |
| "Generic architecture advice is safe" | Generic = useless for specific domains | "Route to domain-specific skill for actionable guidance" |

---

## Integration with Other Packs

### After Architecture Selection

Once architecture is chosen, route to:

**Training the architecture:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection
- Learning rate schedules
- Debugging training issues

**Implementing in PyTorch:**
→ `yzmir/pytorch-engineering/using-pytorch-engineering`
- Module design patterns
- Performance optimization
- Custom components

**Deploying to production:**
→ `yzmir/ml-production/using-ml-production`
- Model serving
- Quantization
- Inference optimization

### Before Architecture Selection

If problem involves:

**Reinforcement learning:**
→ `yzmir/deep-rl/using-deep-rl` FIRST
- RL algorithms dictate architecture requirements
- Value networks vs policy networks have different needs

**Large language models:**
→ `yzmir/llm-specialist/using-llm-specialist` FIRST
- LLM architectures are specialized transformers
- Different considerations than general sequence models

**Architecture is downstream of algorithm choice in RL and LLMs.**

---

## Summary

**Use this meta-skill to:**
- ✅ Route architecture queries to appropriate specialized skill
- ✅ Identify data modality and problem type
- ✅ Clarify constraints before recommending
- ✅ Resist recency bias (latest ≠ best)
- ✅ Recognize when architecture is the problem (vs training/implementation)

## Neural Architecture Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

1. [architecture-design-principles.md](architecture-design-principles.md) - Custom design, architectural best practices, skip connections, network depth fundamentals
2. [attention-mechanisms-catalog.md](attention-mechanisms-catalog.md) - Self-attention, cross-attention, multi-head attention, attention in CNNs, attention variants comparison
3. [cnn-families-and-selection.md](cnn-families-and-selection.md) - ResNet, EfficientNet, MobileNet, YOLO, computer vision architecture selection
4. [generative-model-families.md](generative-model-families.md) - GANs, VAEs, Diffusion models, image generation, style transfer, generative modeling trade-offs
5. [graph-neural-networks-basics.md](graph-neural-networks-basics.md) - GCN, GAT, node classification, link prediction, graph embeddings, molecular structures
6. [normalization-techniques.md](normalization-techniques.md) - BatchNorm, LayerNorm, GroupNorm, training stability for deep networks (>20 layers)
7. [sequence-models-comparison.md](sequence-models-comparison.md) - RNN, LSTM, Transformer, TCN comparison, time series, NLP, sequential data
8. [transformer-architecture-deepdive.md](transformer-architecture-deepdive.md) - Transformer internals, ViT, BERT, positional encoding, scaling transformers

**Critical principle:** Architecture comes BEFORE training. Get this right first.

---

**END OF SKILL**
960
skills/using-neural-architectures/architecture-design-principles.md
Normal file
@@ -0,0 +1,960 @@
# Architecture Design Principles

## Context

You're designing a neural network architecture or debugging why your network isn't learning. Common mistakes:
- **Ignoring inductive biases**: Using MLP for images (should use CNN)
- **Over-engineering**: Using Transformer for 100 samples (should use linear regression)
- **No skip connections**: 50-layer plain network fails (should use ResNet)
- **Wrong depth-width balance**: 100 layers × 8 channels bottlenecks capacity
- **Ignoring constraints**: 1.5B parameter model doesn't fit 24GB GPU

**This skill provides principled architecture design: match structure to problem, respect constraints, avoid over-engineering.**


## Core Principle: Inductive Biases

**Inductive bias = assumptions baked into architecture about problem structure**

**Key insight**: The right inductive bias makes learning dramatically easier. Wrong bias makes learning impossible.

### What are Inductive Biases?

```python
# Example: Image classification

# MLP (no inductive bias):
# - Treats each pixel independently
# - No concept of "spatial locality" or "translation"
# - Must learn from scratch that nearby pixels are related
# - Learns "cat at position (10,10)" and "cat at (50,50)" separately
# Parameters: 150M, Accuracy: 75%

# CNN (strong inductive bias):
# - Assumes spatial locality (nearby pixels related)
# - Assumes translation invariance (cat is cat anywhere)
# - Shares filters across spatial positions
# - Hierarchical feature learning (edges → textures → objects)
# Parameters: 11M, Accuracy: 95%

# CNN's inductive bias: 14× fewer parameters, 20% better accuracy!
```

**Principle**: Match your architecture's inductive biases to your problem's structure.


## Architecture Families and Their Inductive Biases

### 1. Fully Connected (MLP)

**Inductive bias:** None (general-purpose)

**Structure:**
```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
```

**When to use:**
- ✅ Tabular data (independent features)
- ✅ Small datasets (< 10,000 samples)
- ✅ Baseline / proof of concept

**When NOT to use:**
- ❌ Images (use CNN)
- ❌ Sequences (use RNN/Transformer)
- ❌ Graphs (use GNN)

**Strengths:**
- Simple and interpretable
- Fast training
- Works for any input type (flattened)

**Weaknesses:**
- No structural assumptions (must learn everything from data)
- Parameter explosion (input_size × hidden_size can be huge)
- Doesn't leverage problem structure
### 2. Convolutional Neural Networks (CNN)

**Inductive bias:** Spatial locality + Translation invariance

**Structure:**
```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(128 * 56 * 56, 1000)  # assumes 224×224 input

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 112×112
        x = self.pool(F.relu(self.conv2(x)))  # 56×56
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
```

**Inductive biases:**
1. **Local connectivity**: Neurons see only nearby pixels (spatial locality)
2. **Translation invariance**: Same filter slides across image (parameter sharing)
3. **Hierarchical features**: Stack layers to build complex features from simple ones

**When to use:**
- ✅ Images (classification, detection, segmentation)
- ✅ Spatial data (maps, medical scans)
- ✅ Any grid-structured data

**When NOT to use:**
- ❌ Sequences with long-range dependencies (use Transformer)
- ❌ Graphs (irregular structure, use GNN)
- ❌ Tabular data (no spatial structure)

**Strengths:**
- Parameter efficient (filter sharing)
- Translation invariant (cat anywhere = cat)
- Hierarchical feature learning

**Weaknesses:**
- Fixed receptive field (limited by kernel size)
- Not suitable for variable-length inputs
- Requires grid structure
### 3. Recurrent Neural Networks (RNN/LSTM)

**Inductive bias:** Temporal dependencies

**Structure:**
```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Use last hidden state
        output = self.fc(h_n[-1])
        return output
```

**Inductive bias:** Sequential processing (earlier elements influence later elements)

**When to use:**
- ✅ Time series (stock prices, sensor data)
- ✅ Short sequences (< 100 timesteps)
- ✅ Online processing (process one timestep at a time)

**When NOT to use:**
- ❌ Long sequences (> 1000 timesteps, use Transformer)
- ❌ Non-sequential data (images, tabular)
- ❌ When parallel processing needed (use Transformer)

**Strengths:**
- Natural for sequential data
- Constant memory (doesn't grow with sequence length)
- Online processing capability

**Weaknesses:**
- Slow (sequential, can't parallelize)
- Vanishing gradients (long sequences)
- Struggles with long-range dependencies
### 4. Transformers

**Inductive bias:** Minimal (self-attention is general-purpose)

**Structure:**
```python
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, d_model, num_heads, num_layers, num_classes):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_layers
        )
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = self.encoder(x)
        # Global average pooling
        x = x.mean(dim=1)
        return self.fc(x)
```

**Inductive bias:** Minimal (learns relationships from data via attention)

**When to use:**
- ✅ Long sequences (> 100 tokens)
- ✅ Language (text, code)
- ✅ Large datasets (> 100k samples)
- ✅ When relationships are complex and data-dependent

**When NOT to use:**
- ❌ Small datasets (< 10k samples, use RNN or MLP)
- ❌ Strong structural priors available (images → CNN)
- ❌ Very long sequences (> 16k tokens, use sparse attention)
- ❌ Low-latency requirements (RNN faster)

**Strengths:**
- Parallel processing (fast training)
- Long-range dependencies (attention)
- State-of-the-art for language

**Weaknesses:**
- Quadratic complexity O(n²) with sequence length
- Requires large datasets (weak inductive bias)
- High memory usage
### 5. Graph Neural Networks (GNN)

**Inductive bias:** Message passing over graph structure

**Structure:**
```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv  # requires torch-geometric

class SimpleGNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        # x: node features (num_nodes, input_dim)
        # edge_index: graph structure (2, num_edges)
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x
```

**Inductive bias:** Nodes influenced by neighbors (message passing)

**When to use:**
- ✅ Graph data (social networks, molecules, knowledge graphs)
- ✅ Irregular connectivity (different # of neighbors per node)
- ✅ Relational reasoning

**When NOT to use:**
- ❌ Grid data (images → CNN)
- ❌ Sequences (text → Transformer)
- ❌ If graph structure doesn't help (test MLP baseline first!)

**Strengths:**
- Handles irregular structure
- Permutation invariant
- Natural for relational data

**Weaknesses:**
- Requires meaningful graph structure
- Over-smoothing (too many layers)
- Scalability challenges (large graphs)
## Decision Tree: Architecture Selection

```
START
|
├─ Is data grid-structured (images)?
│   ├─ YES → Use CNN
│   │   └─ ResNet (general), EfficientNet (mobile), ViT (very large datasets)
│   └─ NO → Continue
│
├─ Is data sequential (text, time series)?
│   ├─ YES → Check sequence length
│   │   ├─ < 100 timesteps → LSTM/GRU
│   │   ├─ 100-4000 tokens → Transformer
│   │   └─ > 4000 tokens → Sparse Transformer (Longformer)
│   └─ NO → Continue
│
├─ Is data graph-structured (molecules, social networks)?
│   ├─ YES → Check if structure helps
│   │   ├─ Test MLP baseline first
│   │   └─ If structure helps → GNN (GCN, GraphSAGE, GAT)
│   └─ NO → Continue
│
└─ Is data tabular (independent features)?
    └─ YES → Start simple
        ├─ < 1000 samples → Linear / Ridge regression
        ├─ 1000-100k samples → Small MLP (2-3 layers)
        └─ > 100k samples → Larger MLP or Gradient Boosting (XGBoost)
```


## Principle: Start Simple, Add Complexity Only When Needed

**Occam's Razor**: Simplest model that solves the problem is best.

### Progression:

```python
import torch.nn as nn

# Step 1: Linear baseline (ALWAYS start here!)
model = nn.Linear(input_size, num_classes)
# Train and evaluate

# Step 2: IF linear insufficient, add small MLP
if linear_accuracy < target:
    model = nn.Sequential(
        nn.Linear(input_size, 128),
        nn.ReLU(),
        nn.Linear(128, num_classes)
    )

# Step 3: IF small MLP insufficient, add depth/width
if mlp_accuracy < target:
    model = nn.Sequential(
        nn.Linear(input_size, 256),
        nn.ReLU(),
        nn.Linear(256, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes)
    )

# Step 4: IF simple models fail, use specialized architecture
if simple_models_fail:
    # Images → CNN
    # Sequences → RNN/Transformer
    # Graphs → GNN
    pass

# NEVER skip to Step 4 without testing Steps 1-3!
```
### Why Start Simple?

1. **Faster iteration**: Linear model trains in seconds, Transformer in hours
2. **Baseline**: Know if complexity helps (compare complex vs simple)
3. **Occam's Razor**: Simple model generalizes better (less overfitting)
4. **Debugging**: Easy to verify simple model works correctly

### Example: House Price Prediction

```python
# Dataset: 1000 samples, 20 features

# WRONG: Start with Transformer
model = HugeTransformer(20, 512, 6, 1)  # 10M parameters (illustrative)
# Result: Overfits (10M params / 1000 samples = 10,000:1 ratio!)

# RIGHT: Start simple
# Step 1: Linear
model = nn.Linear(20, 1)  # 21 parameters
# Trains in 1 second, achieves R² = 0.85 (good!)

# Conclusion: Linear sufficient, stop here. No need for Transformer!
```

**Rule**: Add complexity only when simple models demonstrably fail.
## Principle: Deep Networks Need Skip Connections

**Problem**: Plain networks > 10 layers suffer from vanishing gradients and degradation.

### Vanishing Gradients:

```python
# Gradient flow in plain 50-layer network:
gradient_layer_1 = gradient_output × (∂L50/∂L49) × (∂L49/∂L48) × ... × (∂L2/∂L1)

# Each term < 1 (due to activations):
# If each ≈ 0.9, then: 0.9^50 ≈ 0.005 (vanishes!)

# Result: Early layers don't learn (gradients too small)
```

### Degradation:

```python
# Empirical observation (ResNet paper):
20-layer plain network: 85% accuracy
56-layer plain network: 78% accuracy  # WORSE with more layers!

# This is NOT overfitting (training accuracy also drops)
# This is optimization difficulty
```
### Solution: Skip Connections (Residual Networks)

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x  # Save input

        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out = out + identity  # Skip connection!
        out = F.relu(out)
        return out
```

**Why skip connections work:**
```python
# Gradient flow with skip connections:
∂loss/∂x = ∂loss/∂out × (1 + ∂F/∂x)
#                        ↑
#                        Always flows! ("+1" term)

# Even if ∂F/∂x ≈ 0, gradient flows through identity path
```

**Results:**
```python
# Without skip connections:
20-layer plain: 85% accuracy
50-layer plain: 78% accuracy (worse!)

# With skip connections (ResNet):
20-layer ResNet: 87% accuracy
50-layer ResNet: 92% accuracy (better!)
152-layer ResNet: 95% accuracy (even better!)
```

**Rule**: For networks > 10 layers, ALWAYS use skip connections.
### Skip Connection Variants:

**1. Residual (ResNet):**
```python
out = x + F(x)  # Add input to output
```

**2. Dense (DenseNet):**
```python
out = torch.cat([x, F(x)], dim=1)  # Concatenate input and output
```

**3. Highway:**
```python
gate = sigmoid(W_gate @ x)
out = gate * F(x) + (1 - gate) * x  # Learned gating
```

**Most common**: Residual (simple, effective)
## Principle: Balance Depth and Width

**Depth = # of layers**
**Width = # of channels/neurons per layer**

### Capacity Formula:

```python
# Approximate capacity (for CNNs):
capacity ≈ depth × width²

# Why width²?
# Each layer: input_channels × output_channels × kernel_size²
# Doubling width → 4× parameters per layer
```

### Trade-offs:

**Too deep, too narrow:**
```python
# 100 layers × 8 channels
# Problems:
# - Information bottleneck (8 channels can't represent complex features)
# - Harder to optimize (more layers)
# - Slow inference (100 layers sequential)

# Example:
model = VeryDeepNarrow(num_layers=100, channels=8)
# Result: 60% accuracy (bottleneck!)
```

**Too shallow, too wide:**
```python
# 2 layers × 1024 channels
# Problems:
# - Under-utilizes depth (no hierarchical features)
# - Memory explosion (1024 × 1024 = 1M parameters per layer!)

# Example:
model = VeryWideShallow(num_layers=2, channels=1024)
# Result: 70% accuracy (doesn't leverage depth)
```

**Balanced:**
```python
# 18 layers, gradually increasing width: 64 → 128 → 256 → 512
# Benefits:
# - Hierarchical features (depth)
# - Sufficient capacity (width)
# - Good optimization (not too deep)

# Example (ResNet-18):
model = ResNet18()
# Layers: 18, Channels: 64-512 (average ~200)
# Result: 95% accuracy (optimal balance!)
```

### Standard Patterns:

```python
# CNNs: Gradually increase channels as spatial dims decrease
# Input:   224×224×3
# Layer 1: 224×224×64  (same spatial size, more channels)
# Layer 2: 112×112×128 (half spatial, double channels)
# Layer 3: 56×56×256   (half spatial, double channels)
# Layer 4: 28×28×512   (half spatial, double channels)

# Why? Compensate for spatial information loss with channel information
```

**Rule**: Balance depth and width. Standard pattern: 12-50 layers, 64-512 channels.
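The halve-spatial/double-channels pattern above is easy to express directly. A minimal sketch (the `stage` helper and the specific widths are illustrative, not prescriptive):

```python
import torch.nn as nn

def stage(in_ch, out_ch):
    # Halve spatial size (stride 2) while doubling channels
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

# 112×112×64 → 56×56×128 → 28×28×256 → 14×14×512
backbone = nn.Sequential(stage(64, 128), stage(128, 256), stage(256, 512))
```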
## Principle: Match Capacity to Data Size

**Capacity = # of learnable parameters**

### Parameter Budget:

```python
# Rule of thumb: parameters should be 0.01-0.1× dataset size

# Example 1: MNIST (60,000 images)
# Budget: 600 - 6,000 parameters
# Simple CNN: 60,000 parameters (10× the budget) → Works, but might overfit
# LeNet: 60,000 parameters → Classic, works well

# Example 2: ImageNet (1.2M images)
# Budget: 12,000 - 120,000 parameters
# ResNet-50: 25M parameters (200×) → Works (aggressive augmentation helps)

# Example 3: Tabular (100 samples, 20 features)
# Budget: 1 - 10 parameters
# Linear: 21 parameters → Perfect fit!
# MLP: 1,000 parameters → Overfits horribly
```

### Overfitting Detection:

```python
# Training accuracy >> Validation accuracy (gap > 5%)
train_acc = 99%, val_acc = 70%  # 29% gap → OVERFITTING!

# Solutions:
# 1. Reduce model capacity (fewer layers/channels)
# 2. Add regularization (dropout, weight decay)
# 3. Collect more data
# 4. Data augmentation

# Order: Try (1) first (simplest), then (2), then (3)/(4)
```
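If reducing capacity alone isn't enough, solution (2) is a few lines in PyTorch. A sketch (layer sizes, `p=0.5`, and the 0.01 weight decay are illustrative defaults, not tuned values):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes activations during training
    nn.Linear(128, 1),
)
# Weight decay (L2 penalty) is applied through the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```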
### Underfitting Detection:

```python
# Training accuracy < target (model too simple)
train_acc = 60%, val_acc = 58%  # Both low → UNDERFITTING!

# Solutions:
# 1. Increase model capacity (more layers/channels)
# 2. Train longer
# 3. Reduce regularization

# Order: Try (2) first (cheapest), then (1), then (3)
```

**Rule**: Match parameters to data size. Start small, increase capacity only if underfitting.
## Principle: Design for Compute Constraints

**Constraints:**
1. **Memory**: Model + gradients + optimizer states < GPU VRAM
2. **Latency**: Inference time < requirement (e.g., < 100ms for real-time)
3. **Throughput**: Samples/second > requirement

### Memory Budget:

```python
# Memory calculation (training):
# 1. Model parameters (FP32): params × 4 bytes
# 2. Gradients: params × 4 bytes
# 3. Optimizer states (Adam): params × 8 bytes (2× weights)
# 4. Activations: batch_size × feature_maps × spatial_size × 4 bytes

# Example: ResNet-50
params = 25M
memory_params = 25M × 4 = 100 MB
memory_gradients = 100 MB
memory_optimizer = 200 MB
memory_activations = batch_size × 64 × 7×7 × 4 ≈ batch_size × 12 KB
# (final feature map only; summed across ALL layers, activations are far
# larger in practice, so leave generous headroom)

# Total (batch=32): 100 + 100 + 200 + 0.4 = 400 MB
# Fits easily on 4GB GPU (before the full activation cost)

# Example: GPT-3 (175B parameters)
memory_params = 175B × 4 = 700 GB
memory_total = 700 + 700 + 1400 = 2800 GB = 2.8 TB!
# Requires 35× A100 (80GB each)
```

**Rule**: Calculate memory before training. Don't design models that don't fit.
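A rough estimator for the arithmetic above (assumes FP32 weights and Adam; activations are deliberately excluded, so treat the result as a floor, not a ceiling):

```python
def training_memory_mb(num_params, bytes_per_param=4):
    """Weights + gradients + two Adam moment buffers, in MB.
    Excludes activations and allocator overhead, which can dominate."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    adam_states = 2 * num_params * bytes_per_param
    return (weights + gradients + adam_states) / 1e6

print(training_memory_mb(25e6))   # ResNet-50: ~400 MB (before activations)
print(training_memory_mb(175e9))  # GPT-3: ~2.8 million MB = 2.8 TB
```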
### Latency Budget:

```python
# Inference latency = # operations / throughput

# Example: Mobile app (< 100ms latency requirement)

# ResNet-50:
# Operations: 4B FLOPs
# Mobile CPU: 10 GFLOPS
# Latency: 4B / 10G = 0.4 seconds (FAILS!)

# MobileNetV2:
# Operations: 300M FLOPs
# Mobile CPU: 10 GFLOPS
# Latency: 300M / 10G = 0.03 seconds = 30ms (PASSES!)

# Solution: Use efficient architectures (MobileNet, EfficientNet) for mobile
```

**Rule**: Measure latency. Use efficient architectures if latency-constrained.
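FLOP counts only estimate; measuring is straightforward. A CPU-timing sketch (shapes and iteration counts are illustrative; for GPU timing, add `torch.cuda.synchronize()` before reading the clock):

```python
import time
import torch

@torch.no_grad()
def latency_ms(model, input_shape=(1, 3, 224, 224), warmup=10, iters=50):
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):   # warmup amortizes one-time costs
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1000
```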
## Common Architectural Patterns

### 1. Bottleneck (ResNet)

**Structure:**
```python
# Standard: 3×3 conv (256 channels) → 3×3 conv (256 channels)
# Parameters: 2 × (256 × 256 × 3 × 3) ≈ 1.18M

# Bottleneck: 1×1 (256→64) → 3×3 (64→64) → 1×1 (64→256)
# Parameters: 256×64 + 64×64×3×3 + 64×256 = 16K + 37K + 16K = 69K
# Reduction: 1.18M → 69K (~17× fewer!)
```

**Purpose**: Reduce parameters while maintaining capacity

**When to use**: Deep networks (> 50 layers) where parameters are a concern
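A minimal sketch of the 1×1 → 3×3 → 1×1 block counted above (channel sizes follow the example; real ResNet bottlenecks also add BatchNorm):

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)   # 256 → 64
        self.conv = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1)
        self.restore = nn.Conv2d(reduced, channels, kernel_size=1)  # 64 → 256

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.conv(out))
        return F.relu(self.restore(out) + x)  # widths match, so the residual add works

# ~69K weights (ignoring biases), matching the arithmetic above
print(sum(p.numel() for p in Bottleneck().parameters()))
```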
### 2. Inverted Bottleneck (MobileNetV2)

**Structure:**
```python
# Bottleneck (ResNet): Wide → Narrow → Wide (256 → 64 → 256)
# Inverted: Narrow → Wide → Narrow (64 → 256 → 64)

# Why? Efficient for mobile (depthwise separable convolutions)
```

**Purpose**: Maximize efficiency (FLOPs per parameter)

**When to use**: Mobile/edge deployment
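A sketch of the narrow → wide → narrow pattern with a depthwise 3×3 in the wide middle (MobileNetV2-style; the expansion factor of 4 is illustrative):

```python
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    def __init__(self, channels=64, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)       # 64 → 256
        self.depthwise = nn.Conv2d(hidden, hidden, kernel_size=3,
                                   padding=1, groups=hidden)           # one filter per channel
        self.project = nn.Conv2d(hidden, channels, kernel_size=1)      # 256 → 64
        self.act = nn.ReLU6()

    def forward(self, x):
        out = self.act(self.expand(x))
        out = self.act(self.depthwise(out))
        return self.project(out) + x  # linear projection, residual connection
```

`groups=hidden` is what makes the middle convolution depthwise, which is where the FLOP savings come from.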
### 3. Multi-scale Features (Inception)

**Structure:**
```python
# Parallel branches with different kernel sizes:
# Branch 1: 1×1 conv
# Branch 2: 3×3 conv
# Branch 3: 5×5 conv
# Branch 4: 3×3 max pool
# Concatenate all branches

# Captures features at multiple scales simultaneously
```

**Purpose**: Capture multi-scale patterns

**When to use**: When features exist at multiple scales (object detection)
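A minimal sketch of the parallel-branch idea (branch widths of 32 are arbitrary; GoogLeNet's actual modules also insert 1×1 reductions before the larger kernels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.b2 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
        self.b3 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)
        self.pool_proj = nn.Conv2d(in_ch, 32, kernel_size=1)

    def forward(self, x):
        pooled = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
        branches = [self.b1(x), self.b2(x), self.b3(x), self.pool_proj(pooled)]
        # Every branch preserves spatial size, so channel concat is valid
        return torch.cat([F.relu(b) for b in branches], dim=1)  # 4 × 32 = 128 channels
```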
### 4. Attention (Transformers, SE-Net)

**Structure:**
```python
# Squeeze-and-Excitation (SE) block:
# 1. Global average pooling (spatial → channel descriptor)
# 2. FC layer (bottleneck)
# 3. FC layer (restore channels)
# 4. Sigmoid (attention weights)
# 5. Multiply input channels by attention weights

# Result: Emphasize important channels, suppress irrelevant
```

**Purpose**: Learn importance of features (channels or positions)

**When to use**: When not all features equally important
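The five steps above, written out as a module (the reduction ratio of 16 is a common default, not a requirement):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) → (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # restore
            nn.Sigmoid(),  # per-channel attention weights in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # reweight channels by learned importance
```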
## Debugging Architectures

### Problem 1: Network doesn't learn (loss stays constant)

**Diagnosis:**
```python
# Check gradient flow
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_mean={param.grad.mean():.6f}, grad_std={param.grad.std():.6f}")

# Vanishing: grad_mean ≈ 0, grad_std ≈ 0 → Add skip connections
# Exploding: grad_mean > 1, grad_std > 1 → Gradient clipping or lower LR
```

**Solutions:**
- Add skip connections (ResNet)
- Check initialization (Xavier or He initialization)
- Lower learning rate
- Check data preprocessing (normalized inputs?)

### Problem 2: Overfitting (train >> val)

**Diagnosis:**
```python
train_acc, val_acc = 0.99, 0.70  # 29-point gap → overfitting

# Check parameter/data ratio:
num_params = sum(p.numel() for p in model.parameters())
data_size = len(train_dataset)
ratio = num_params / data_size

# If ratio > 1: the model has more parameters than data points!
```

**Solutions (in order):**
1. Reduce capacity (fewer layers/channels)
2. Add dropout / weight decay
3. Data augmentation
4. Collect more data

### Problem 3: Underfitting (train and val both low)

**Diagnosis:**
```python
train_acc, val_acc = 0.65, 0.63  # Both low → underfitting

# Model too simple for the task complexity
```

**Solutions (in order):**
1. Train longer (more epochs)
2. Increase capacity (more layers/channels)
3. Reduce regularization (lower dropout/weight decay)
4. Check learning rate (too low?)

### Problem 4: Slow training

**Diagnosis:**
```python
# Profile forward/backward pass
import time

start = time.time()
loss = criterion(model(inputs), targets)
forward_time = time.time() - start

start = time.time()
loss.backward()
backward_time = time.time() - start

# Note: on GPU, call torch.cuda.synchronize() before reading the timers,
# since CUDA kernels run asynchronously.

# If backward_time >> forward_time: gradient computation is the bottleneck
```

**Solutions:**
- Use mixed precision (FP16)
- Reduce batch size (if memory-bound)
- Use gradient accumulation (simulate a large batch)
- Simplify architecture (fewer layers)
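
A minimal mixed-precision training step sketch (using `torch.cuda.amp`; `model`, `optimizer`, `criterion`, and `loader` are assumed to already exist):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run forward pass in FP16 where safe
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()            # scale loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```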

## Design Checklist

Before finalizing an architecture:

### ☐ Match inductive bias to problem
- Images → CNN
- Sequences → RNN/Transformer
- Graphs → GNN
- Tabular → MLP

### ☐ Start simple, add complexity only when needed
- Test linear baseline first
- Add complexity incrementally
- Compare performance at each step

### ☐ Use skip connections for deep networks (> 10 layers)
- ResNet for CNNs
- Pre-norm for Transformers
- Gradient flow is critical

### ☐ Balance depth and width
- Not too deep and narrow (bottleneck)
- Not too shallow and wide (under-utilizes depth)
- Standard: 12-50 layers, 64-512 channels

### ☐ Match capacity to data size
- Parameters ≈ 0.01-0.1× dataset size
- Monitor train/val gap (overfitting indicator)

### ☐ Respect compute constraints
- Memory: Model + gradients + optimizer + activations < VRAM
- Latency: Inference time < requirement
- Use efficient architectures if constrained (MobileNet, EfficientNet)

### ☐ Verify gradient flow
- Check gradients in early layers (should be non-zero)
- Use skip connections if vanishing

### ☐ Benchmark against baselines
- Compare to a simple model (linear, small MLP)
- Ensure complexity adds value (> 5% improvement)

## Anti-Patterns

### Anti-pattern 1: "Architecture X is state-of-the-art, so I'll use it"

**Wrong:**
```python
# Transformer is SOTA for NLP, so use it for tabular data (100 samples)
model = HugeTransformer(...)  # 10M parameters
# Result: Overfits horribly (100 samples / 10M params = 0.00001 ratio!)
```

**Right:**
```python
# Match architecture to problem AND data size
# Tabular + small data → Linear or small MLP
model = nn.Linear(20, 1)  # 21 parameters (appropriate!)
```

### Anti-pattern 2: "More layers = better"

**Wrong:**
```python
# 100-layer plain network (no skip connections)
for i in range(100):
    layers.append(nn.Conv2d(64, 64, 3, padding=1))
# Result: Doesn't train (vanishing gradients)
```

**Right:**
```python
# 50-layer ResNet (with skip connections)
# Each block: out = x + F(x)  # Skip connection
# Result: Trains well, high accuracy
```

### Anti-pattern 3: "Deeper + narrower = efficient"

**Wrong:**
```python
# 100 layers × 8 channels = information bottleneck
model = VeryDeepNarrow(100, 8)
# Result: 60% accuracy (8 channels insufficient)
```

**Right:**
```python
# 18 layers, 64-512 channels (balanced)
model = ResNet18()  # Balanced depth and width
# Result: 95% accuracy
```

### Anti-pattern 4: "Ignore constraints, optimize later"

**Wrong:**
```python
# Design a 1.5B parameter model for a 24GB GPU
model = HugeModel(1.5e9)
# Result: OOM (out of memory), can't train
```

**Right:**
```python
# Calculate memory first:
# 1.5B params × 4 bytes = 6GB (weights)
# + 6GB (gradients) + 12GB (Adam) + 8GB (activations) = 32GB
# > 24GB → Doesn't fit!

# Design for the hardware:
model = ReasonableSizeModel(200e6)  # 200M parameters (fits!)
```

### Anti-pattern 5: "Hyperparameters will fix architectural problems"

**Wrong:**
```python
# Architecture: MLP for images (wrong inductive bias)
# Response: "Just tune the learning rate!"
for lr in [0.1, 0.01, 0.001, 0.0001]:
    train(model, lr=lr)
# Result: All fail (the architecture is wrong!)
```

**Right:**
```python
# Fix the architecture first (use a CNN for images)
model = ResNet18()  # Correct inductive bias
# Then tune hyperparameters
```

## Summary

**Core principles:**

1. **Inductive bias**: Match architecture to problem structure (CNN for images, RNN/Transformer for sequences, GNN for graphs)

2. **Occam's Razor**: Start simple (linear, small MLP), add complexity only when needed

3. **Skip connections**: Use for networks > 10 layers (ResNet, DenseNet)

4. **Depth-width balance**: Not too deep+narrow (bottleneck) or too shallow+wide (under-utilizes depth)

5. **Capacity**: Match parameters to data size (0.01-0.1× dataset size)

6. **Constraints**: Design for available memory, latency, throughput

**Decision framework:**
- Images → CNN (ResNet, EfficientNet)
- Short sequences → LSTM
- Long sequences → Transformer
- Graphs → GNN (test if structure helps first!)
- Tabular → Linear or small MLP

**Key insight**: Architecture design is about matching structural assumptions to problem structure, not about using the "best" or "most complex" model. Simple models often win.

**When in doubt**: Start with the simplest model that could plausibly work. Add complexity only when you have evidence it helps.

@@ -0,0 +1,824 @@

# Attention Mechanisms Catalog

## When to Use This Skill

Use this skill when you need to:
- ✅ Select an attention mechanism for long sequences (> 2k tokens)
- ✅ Optimize memory usage (GPU OOM errors)
- ✅ Speed up training or inference
- ✅ Understand exact vs approximate attention trade-offs
- ✅ Choose between Flash, sparse, or linear attention
- ✅ Implement cross-attention for multimodal models

**Do NOT use this skill for:**
- ❌ Basic Transformer understanding (use `transformer-architecture-deepdive`)
- ❌ High-level architecture selection (use `using-neural-architectures`)
- ❌ LLM-specific optimization (use `llm-specialist/llm-inference-optimization`)

## Core Principle

**Not all attention is O(n²).** Standard self-attention has quadratic complexity, but modern variants achieve:
- **O(n²) with less memory**: Flash Attention (exact, 4x less memory)
- **O(n × w)**: Sparse attention (exact, sliding window of width w)
- **O(n)**: Linear attention (approximate, 1-3% accuracy loss)

**Default recommendation:** Flash Attention (exact + fast + memory-efficient)

## Part 1: Complexity Hierarchy

### Standard Self-Attention (Baseline)

**Formula:**
```python
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
```

**Complexity:**
- Time: O(n² · d) where n = seq_len, d = d_model
- Memory: O(n²) for the attention matrix
- Exact: Yes (no approximation)

**Memory breakdown (4k tokens, d=768):**
```
Attention scores: 4096² × 4 bytes = 64MB per head
Multi-head (12 heads): 64MB × 12 = 768MB per layer
16 layers: 768MB × 16 = 12GB just for attention!
Batch size 8: 12GB × 8 = 96GB (impossible on a single GPU)
```

**When to use:**
- Sequence length < 2k tokens
- Standard use case (most models)
- Pair with the Flash Attention optimization

**Limitations:**
- Memory explosion for long sequences
- Quadratic scaling impractical beyond 4k tokens
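
A minimal sketch of the baseline, to make the O(n²) matrix explicit (shapes only; no masking or multi-head splitting):

```python
import math
import torch

def attention(Q, K, V):
    # Q, K, V: (batch, n, d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (batch, n, n) — the O(n²) matrix
    return torch.softmax(scores, dim=-1) @ V                  # (batch, n, d_k)
```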

## Part 2: Flash Attention ⭐ (Modern Default)

### What is Flash Attention?

**Breakthrough (2022):** Exact attention with 4x less memory, 2-3x faster

**Key insight:**
- Standard attention is **memory-bound** (not compute-bound)
- GPUs: fast compute (TFLOPS), slow memory bandwidth (GB/s)
- Bottleneck: moving the n² attention matrix to/from HBM

**Solution:**
- Tile the attention computation
- Recompute instead of storing intermediate values
- Fuse operations (reduce memory transfers)
- Result: same O(n²) compute, O(n) memory

### Algorithm

```
Standard attention (3 memory operations):
1. Compute scores: S = Q K^T   (store n² matrix)
2. Softmax: P = softmax(S)     (store n² matrix)
3. Output: O = P V             (store n×d matrix)

Flash Attention (tiled):
1. Divide Q, K, V into blocks
2. For each Q block:
   - Load block into SRAM (fast on-chip memory)
   - For each K, V block:
     - Compute attention for this tile
     - Update output incrementally
   - Never materialize the full n² matrix!
3. Result: same output, O(n) memory
```

### Performance

**Benchmarks (A100 GPU, 2k tokens):**

Standard attention:
- Memory: 4GB for batch_size=8
- Speed: 150ms/batch
- Max batch size: 16

Flash Attention:
- Memory: 1GB for batch_size=8 **(4x reduction)**
- Speed: 75ms/batch **(2x faster)**
- Max batch size: 64 **(4x larger)**

**Flash Attention 2 (2023 update):**
- Further optimized: 2-3x faster than Flash Attention 1
- Better parallelism
- Supports more head dimensions

### When to Use

✅ **ALWAYS use Flash Attention when:**
- Sequence length < 16k tokens
- Need exact attention (no approximation)
- Available in your framework

**It's a FREE LUNCH:**
- No accuracy loss (mathematically exact)
- Faster training AND inference
- Less memory usage
- Drop-in replacement

### Implementation

**PyTorch 2.0+ (built-in):**
```python
import torch.nn.functional as F

# Automatic Flash Attention (if available)
output = F.scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=False
)
# PyTorch automatically uses Flash Attention if:
# - CUDA is available
# - Sequence length is suitable
# - No attention mask (or a causal mask)
```

**HuggingFace Transformers:**
```python
import torch
from transformers import AutoModel

# Enable Flash Attention 2
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    attn_implementation="flash_attention_2",  # Requires the flash-attn package
    torch_dtype=torch.float16
)
```

**Manual installation:**
```bash
pip install flash-attn --no-build-isolation
```

### Limitations

❌ **Flash Attention NOT suitable when:**
- Sequence length > 16k (memory is O(n), but compute still grows quadratically)
- Custom attention masks (complex patterns not supported)
- Inference on CPU (CUDA-only)

**For > 16k tokens:** Use sparse or linear attention

## Part 3: Sparse Attention (Exact for Long Sequences)

### Concept

**Idea:** Each token attends to a subset of tokens (not all)
- Sliding window: local context
- Global tokens: long-range connections
- Result: O(n × window_size) instead of O(n²)

**Key property:** Still EXACT attention (not approximate)
- Just a more structured attention pattern
- No accuracy loss if the pattern matches the task

### Variant 1: Longformer

**Pattern:** Sliding window + global attention

```
Attention pattern (window=2, global=[0]):
     0 1 2 3 4 5
0 [ 1 1 1 1 1 1 ]  ← Global token (attends to all)
1 [ 1 1 1 0 0 0 ]  ← Window: tokens 0-2
2 [ 1 1 1 1 0 0 ]  ← Window: tokens 1-3
3 [ 1 0 1 1 1 0 ]  ← Window: tokens 2-4
4 [ 1 0 0 1 1 1 ]  ← Window: tokens 3-5
5 [ 1 0 0 0 1 1 ]  ← Window: tokens 4-5

Complexity: O(n × (window + num_global))
```

**Components:**
1. **Sliding window**: Each token attends to w/2 tokens before and after
2. **Global tokens**: Special tokens (like [CLS]) attend to all tokens
3. **Dilated windows**: Optional (stride > 1 for longer context)

**Implementation:**
```python
import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Attention mask (shape: batch, seq_len)
# Longformer convention: 0 = no attention, 1 = local, 2 = global
attention_mask = torch.ones(batch_size, seq_len)  # batch_size, seq_len assumed defined
attention_mask[:, 0] = 2  # Global attention for the [CLS] token

output = model(input_ids, attention_mask=attention_mask)
```

**Memory comparison (4k tokens, window=512):**
```
Standard:   4096² = 16M elements → 64MB
Longformer: 4096 × 512 = 2M elements → 8MB (8x reduction!)
```

**When to use:**
- Documents: 4k-16k tokens (legal, scientific papers)
- Need full context but can't fit O(n²)
- Task has local + global structure

**Pretrained models:**
- `allenai/longformer-base-4096`: max 4096 tokens
- `allenai/longformer-large-4096`: larger version

### Variant 2: BigBird

**Pattern:** Random + window + global

```
Attention pattern:
- Sliding window: like Longformer
- Random connections: each token attends to r random tokens
- Global tokens: special tokens attend to all

Complexity: O(n × (window + r + num_global))
```

**Key difference from Longformer:**
- Random connections help information flow
- Theoretically proven to approximate full attention

**When to use:**
- Similar to Longformer
- Slightly better for tasks needing long-range interactions
- Less widely adopted than Longformer

**Implementation:**
```python
from transformers import BigBirdModel

model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse"  # or "original_full"
)
```

### Sparse Attention Decision

```
Sequence length < 4k:
  → Flash Attention (exact, no pattern needed)

Sequence length 4k-16k:
  → Longformer (sliding window + global)
  → Best for: documents, long-form text

Sequence length > 16k:
  → Longformer if possible
  → Linear attention if Longformer is too slow
```

## Part 4: Linear Attention (Approximate, for Very Long Sequences)

### Concept

**Idea:** Approximate softmax attention with linear operations
- Complexity: O(n × k) where k << n
- Trade-off: 1-3% accuracy loss
- Benefit: can handle very long sequences (> 16k)

**Key property:** APPROXIMATE (not exact)
- Do NOT use if accuracy is critical
- Good for extremely long sequences where exact attention is impossible

### Variant 1: Performer

**Method:** Random Fourier Features to approximate softmax(Q K^T)

**Formula:**
```python
# Standard attention
Attention(Q, K, V) = softmax(Q K^T) V

# Performer approximation: a random feature map φ such that
φ(Q) φ(K)^T ≈ softmax(Q K^T)
Attention(Q, K, V) ≈ φ(Q) (φ(K)^T V)

# Complexity: O(n × k) where k = feature dimension
```

**Key trick:**
- Compute φ(K)^T V first: a (k × d) matrix (small!)
- Then multiply by φ(Q): O(n × k × d) instead of O(n² × d)
- Never materialize the n² attention matrix

**Implementation:**
```python
# From the performer-pytorch library
from performer_pytorch import Performer

model = Performer(
    dim=512,
    depth=6,
    heads=8,
    dim_head=64,
    causal=False,
    nb_features=256  # k = number of random features
)
```

**Accuracy:**
- Typical loss: 1-2% vs standard attention
- Depends on nb_features (more features = better approximation)
- k=256 is usually sufficient

**When to use:**
- Sequence length > 16k tokens
- Accuracy loss acceptable (not a critical task)
- No structural assumptions wanted (unlike sparse attention patterns)
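
The computation-order trick generalizes beyond Performer's random features. A minimal sketch with a simple positive feature map (elu+1, borrowed from the "Transformers are RNNs" line of linear attention — an assumption here, not Performer's exact φ):

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V):
    # Q, K, V: (batch, n, d). phi must be positive; Performer would use
    # random features here instead of elu+1.
    Qp, Kp = F.elu(Q) + 1, F.elu(K) + 1
    kv = Kp.transpose(-2, -1) @ V           # (batch, d, d_v) — small, no n×n matrix
    z = Qp @ Kp.sum(dim=1).unsqueeze(-1)    # (batch, n, 1) softmax-style normalizer
    return (Qp @ kv) / z                    # O(n·k·d) total
```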

### Variant 2: Linformer

**Method:** Project K and V to a lower dimension

**Formula:**
```python
# Standard attention (n × n attention matrix)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V

# Linformer (project K, V from n × d down to k × d, where k << n)
K_proj = E K  # E: (k × n) projection matrix
V_proj = F V  # F: (k × n) projection matrix

Attention(Q, K, V) ≈ softmax(Q K_proj^T / √d_k) V_proj
# Attention matrix: (n × k) instead of (n × n)
```

**Complexity:**
- Time: O(n × k × d) where k << n
- Memory: O(n × k) instead of O(n²)

**Implementation:**
```python
# From the linformer library
from linformer import Linformer

model = Linformer(
    dim=512,
    seq_len=8192,
    depth=12,
    heads=8,
    k=256  # projected dimension
)
```

**Accuracy:**
- Typical loss: 1-3% vs standard attention
- More loss than Performer
- Fixed sequence length (k is tied to max_seq_len)

**When to use:**
- Fixed-length long sequences
- Memory more critical than speed
- Accuracy loss OK (2-3%)

### Linear Attention Decision

```
Need exact attention:
  → Flash Attention or Sparse Attention (NOT linear)

Sequence > 16k, accuracy critical:
  → Sparse Attention (Longformer)

Sequence > 16k, accuracy loss OK:
  → Performer (better) or Linformer

Sequence > 100k:
  → State space models (S4, Mamba — not attention)
```

## Part 5: Cross-Attention (Multimodal)

### Concept

**Self-attention:** Q, K, V from the same source
**Cross-attention:** Q from one source, K/V from another

**Use cases:**
- Multimodal: vision → language (image captioning)
- Seq2seq: source language → target language (translation)
- RAG: query → document retrieval
- Conditioning: generation conditioned on context

### Architecture

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, query_source, key_value_source, mask=None):
        # query_source: (batch, n_q, d_model) - e.g., text tokens
        # key_value_source: (batch, n_kv, d_model) - e.g., image patches

        # Q from the query source
        Q = self.W_q(query_source)

        # K, V from the key-value source
        K = self.W_k(key_value_source)
        V = self.W_v(key_value_source)

        # Attention output: (batch, n_q, d_model)
        # (single-head for brevity; real implementations split into heads)
        output = F.scaled_dot_product_attention(Q, K, V, attn_mask=mask)
        return output
```

### Example: Image Captioning

**Task:** Generate a caption from an image

**Architecture:**
1. **Image Encoder:** ViT processes the image → image features (n_patches × d)
2. **Text Decoder:** Autoregressive text generation
3. **Cross-Attention:** Text queries the image features

```python
class ImageCaptioningDecoder(nn.Module):
    # (Submodules assumed defined in __init__; pseudocode for the data flow.)
    def forward(self, text_tokens, image_features):
        text = text_tokens

        # 1. Self-attention on text (causal)
        text = self.text_self_attention(
            query=text,
            key=text,
            value=text,
            causal_mask=True  # Don't see future words
        )

        # 2. Cross-attention (text queries image)
        text = self.cross_attention(
            query=text,            # From text decoder
            key=image_features,    # From image encoder
            value=image_features   # From image encoder
            # No causal mask! Can attend to all image patches
        )

        # 3. Feed-forward
        text = self.feed_forward(text)

        return text
```

**Attention flow:**
- Text token "cat" → high attention to the cat region in the image
- Text token "sitting" → high attention to posture in the image

### Example: Retrieval-Augmented Generation (RAG)

**Task:** Generate an answer using retrieved documents

```python
class RAGDecoder(nn.Module):
    # (Submodules assumed defined in __init__; pseudocode for the data flow.)
    def forward(self, query_tokens, document_embeddings):
        # 1. Self-attention on the query
        query = self.query_self_attention(query_tokens, query_tokens, query_tokens)

        # 2. Cross-attention (query → documents)
        query = self.cross_attention(
            query=query,                 # What we're generating
            key=document_embeddings,     # Retrieved docs
            value=document_embeddings    # Retrieved docs
        )

        # The query learns to extract relevant info from the docs

        return query
```

### When to Use Cross-Attention

✅ **Use cross-attention when:**
- Two different modalities (vision + language)
- Conditioning generation on context (RAG)
- Seq2seq with different input/output (translation)
- Query-document matching

❌ **Don't use cross-attention when:**
- Same modality (use self-attention)
- No clear query vs key-value separation

## Part 6: Other Attention Variants

### Axial Attention (2D Images)

**Idea:** For 2D data (images), attend along each axis separately

```
Standard 2D attention: H×W tokens → (HW)² attention matrix
Axial attention:
- Row attention: each row attends to itself (H × W²)
- Column attention: each column attends to itself (W × H²)
- Total: O(HW × (H + W)) << O((HW)²)
```

**When to use:**
- High-resolution images
- 2D positional structure important

### Block-Sparse Attention

**Idea:** Divide attention into blocks, attend only within/across blocks

**Pattern:**
```
Block size = 64 tokens
- Local block: attend within the same block
- Vertical stripe: attend to the corresponding position in other blocks
```

**Used in:** Sparse Transformer (OpenAI), GPT-3

### Multi-Query Attention (MQA)

**Idea:** One K/V head shared across all Q heads

**Benefit:**
- Smaller KV cache during inference
- Much faster decoding (4-8x)
- Trade-off: ~1% accuracy loss

**Used in:** PaLM, Falcon

### Grouped-Query Attention (GQA)

**Idea:** Middle ground between multi-head and multi-query
- Groups of Q heads share K/V heads
- Example: 32 Q heads → 8 K/V heads (4:1 ratio)

**Benefit:**
- 4x smaller KV cache
- Minimal accuracy loss (< 0.5%)

**Used in:** LLaMA-2, Mistral
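
To see why MQA/GQA matter for inference, here is a rough KV-cache size sketch (the layer count, head count, context length, and head dimension below are illustrative, not from any specific model):

```python
# KV cache per sequence: K and V, per layer, per KV head (fp16 = 2 bytes).
def kv_cache_gb(n_layers, n_kv_heads, seq_len, d_head, bytes_per=2):
    return 2 * n_layers * n_kv_heads * seq_len * d_head * bytes_per / 1e9

print(kv_cache_gb(32, 32, 4096, 128))  # MHA, 32 KV heads: ~2.1 GB per sequence
print(kv_cache_gb(32, 8, 4096, 128))   # GQA, 8 KV heads:  ~0.5 GB (4x smaller)
print(kv_cache_gb(32, 1, 4096, 128))   # MQA, 1 KV head:   ~0.07 GB
```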

## Part 7: Decision Framework

### By Sequence Length

```
< 2k tokens:
  → Flash Attention
    Exact, fast, standard

2k-4k tokens:
  → Flash Attention
    Still manageable with modern GPUs

4k-16k tokens:
  → Sparse Attention (Longformer, BigBird)
    Exact, designed for documents
  → OR Flash Attention if batch size = 1

> 16k tokens:
  → Sparse Attention
    If the task has local structure
  → Linear Attention (Performer)
    If accuracy loss OK (1-2%)
  → State Space Models (S4, Mamba)
    If sequence > 100k
```

### By Memory Constraints

```
GPU OOM with standard attention:
1. Try Flash Attention (4x less memory, free lunch)
2. If still OOM, reduce batch size
3. If batch size = 1 and still OOM, use sparse attention
4. Last resort: linear attention (if accuracy loss OK)

DON'T:
- Use gradient checkpointing first (slower; use Flash Attention instead)
- Throw more GPUs at it (algorithmic problem, not hardware)
```

### By Accuracy Requirements

```
Must be exact (no approximation):
  → Flash Attention or Sparse Attention
    Never use linear attention!

Accuracy loss acceptable (1-3%):
  → Linear Attention (Performer, Linformer)
    Only for very long sequences (> 16k)

Critical task (medical, legal):
  → Exact attention only
    Flash Attention or Sparse Attention
```

### By Task Type

```
Classification / Understanding:
  → Standard + Flash Attention
    Sequence usually < 2k

Document processing:
  → Longformer (4096 tokens)
    Designed for documents

Generation (LLM):
  → Flash Attention for training
  → + GQA/MQA for inference (faster decoding)

Multimodal (vision + language):
  → Cross-attention for modality fusion
  → Self-attention within each modality

Retrieval-augmented:
  → Cross-attention (query → documents)
```

## Part 8: Implementation Checklist

### Using Flash Attention

**PyTorch 2.0+:**
```python
import torch.nn.functional as F

# Automatic (recommended)
output = F.scaled_dot_product_attention(query, key, value)

# Verify Flash Attention is used
import torch.backends.cuda
print(torch.backends.cuda.flash_sdp_enabled())  # Should be True
```

**HuggingFace:**
```python
model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16  # Flash Attention needs fp16/bf16
)
```

**Requirements:**
- CUDA GPU (not CPU)
- PyTorch >= 2.0 OR the flash-attn package
- fp16 or bf16 dtype (not fp32)

### Using Sparse Attention

**Longformer:**
```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Attention mask
# 0 = no attention, 1 = local attention, 2 = global attention
attention_mask = torch.ones(batch_size, seq_len)
attention_mask[:, 0] = 2  # [CLS] token gets global attention

outputs = model(input_ids, attention_mask=attention_mask)
```

**Custom sparse pattern:**
```python
import torch

# Create a custom block-sparse mask
def create_block_sparse_mask(seq_len, block_size):
    num_blocks = seq_len // block_size
    mask = torch.zeros(seq_len, seq_len)

    for i in range(num_blocks):
        start = i * block_size
        end = start + block_size
        mask[start:end, start:end] = 1  # Local block

    return mask
```

### Using Cross-Attention

```python
import torch.nn as nn

class DecoderWithCrossAttention(nn.Module):
    # MultiHeadAttention: any implementation with this query/key/value/mask
    # signature (assumed, e.g. a wrapper around nn.MultiheadAttention).
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)

    def forward(self, decoder_input, encoder_output, causal_mask=None):
        # Self-attention (causal)
        x = self.self_attn(
            query=decoder_input,
            key=decoder_input,
            value=decoder_input,
            mask=causal_mask
        )

        # Cross-attention (Q from decoder, K/V from encoder)
        x = self.cross_attn(
            query=x,                # From decoder
            key=encoder_output,     # From encoder
            value=encoder_output,   # From encoder
            mask=None               # No causal mask for cross-attention!
        )

        return x
```

## Part 9: Common Mistakes

### Mistake 1: Ignoring Flash Attention

**Symptom:** Training slow, high memory usage
**Fix:** Always use Flash Attention for < 16k tokens

### Mistake 2: Using Linear Attention Unnecessarily

**Symptom:** 1-3% accuracy loss for no reason
**Fix:** Use Flash Attention (exact) unless sequence > 16k

### Mistake 3: Gradient Checkpointing Instead of Flash Attention

**Symptom:** Training 20% slower
**Fix:** Flash Attention gives memory savings AND speed

### Mistake 4: Cross-Attention with Causal Mask

**Symptom:** Decoder can't attend to the encoder properly
**Fix:** Causal mask only for self-attention, NOT cross-attention

### Mistake 5: Accepting O(n²) Memory

**Symptom:** GPU OOM for > 4k tokens
**Fix:** Use sparse or Flash Attention; don't just add GPUs

## Summary: Quick Reference

### Attention Selection

```
Sequence length:
  < 2k   → Flash Attention (default)
  2-4k   → Flash Attention
  4-16k  → Longformer (documents) or Flash Attention (batch=1)
  > 16k  → Sparse or Linear Attention

Memory constrained:
  First: try Flash Attention (4x less memory)
  Still OOM: use sparse attention (Longformer)
  Last resort: linear attention (accuracy loss)

Speed critical:
  Training: Flash Attention (2x faster)
  Inference: Flash Attention + GQA/MQA

Accuracy critical:
  Use exact attention only (Flash or Sparse)
  NEVER linear attention

Multimodal:
  Cross-attention for modality fusion
```

### Implementation

```
PyTorch 2.0+:
  F.scaled_dot_product_attention()  # Auto Flash Attention

HuggingFace:
  attn_implementation="flash_attention_2"

Longformer:
  LongformerModel.from_pretrained("allenai/longformer-base-4096")

Custom:
  Inherit from nn.Module, implement forward()
```

## Next Steps

After mastering this skill:
- `llm-specialist/llm-inference-optimization`: apply attention optimizations to inference
- `llm-specialist/context-window-management`: manage long contexts in LLMs
- `architecture-design-principles`: understand broader design trade-offs

**Remember:** Flash Attention is the modern default. Use it unless you have a specific reason not to (> 16k tokens, custom patterns).

622
skills/using-neural-architectures/cnn-families-and-selection.md
Normal file
@@ -0,0 +1,622 @@

# CNN Families and Selection: Choosing the Right Convolutional Network

<CRITICAL_CONTEXT>
CNNs are the foundation of computer vision. Different families have vastly different trade-offs:
- Accuracy vs speed vs size
- Dataset size requirements
- Deployment target (cloud vs edge vs mobile)
- Task type (classification vs detection vs segmentation)

This skill helps you choose the RIGHT CNN for YOUR constraints.
</CRITICAL_CONTEXT>

## When to Use This Skill

Use this skill when:
- ✅ Selecting a CNN for a vision task (classification, detection, segmentation)
- ✅ Comparing CNN families (ResNet vs EfficientNet vs MobileNet)
- ✅ Optimizing for specific constraints (latency, size, accuracy)
- ✅ Understanding CNN evolution (why newer architectures exist)
- ✅ Deployment-specific selection (cloud, edge, mobile)

DO NOT use for:
- ❌ Non-vision tasks (use sequence-models-comparison or other skills)
- ❌ Training optimization (use the training-optimization pack)
- ❌ Implementation details (use the pytorch-engineering pack)

**When in doubt:** If choosing WHICH CNN → this skill. If implementing/training a CNN → other skills.

## Selection Framework

### Step 1: Identify Constraints

**Before recommending ANY architecture, ask:**

| Constraint | Question | Impact |
|------------|----------|--------|
| **Deployment** | Where will the model run? | Cloud → Any, Edge → MobileNet/EfficientNet-Lite, Mobile → MobileNetV3 |
| **Latency** | Speed requirement? | Real-time (< 10ms) → MobileNet, Batch (> 100ms) → Any |
| **Model Size** | Parameter/memory budget? | < 10M params → MobileNet, < 50M → ResNet/EfficientNet, Any → large models OK |
| **Dataset Size** | Training images? | < 10k → small models, 10k-100k → medium, > 100k → large |
| **Accuracy** | Required accuracy? | Competitive → EfficientNet-B4+, Production → ResNet-50/EfficientNet-B2 |
| **Task Type** | Classification/detection/segmentation? | Detection → FPN-compatible, Segmentation → multi-scale |

**Critical:** Get answers to these BEFORE recommending an architecture.

### Step 2: Apply Decision Tree

```
START: What's your primary constraint?

┌─ DEPLOYMENT TARGET
│  ├─ Cloud / Server
│  │  └─ Dataset size?
│  │     ├─ Small (< 10k) → ResNet-18, EfficientNet-B0
│  │     ├─ Medium (10k-100k) → ResNet-50, EfficientNet-B2
│  │     └─ Large (> 100k) → ResNet-101, EfficientNet-B4, ViT
│  │
│  ├─ Edge Device (Jetson, Coral)
│  │  └─ Latency requirement?
│  │     ├─ Real-time (< 10ms) → MobileNetV3-Small, EfficientNet-Lite0
│  │     ├─ Medium (10-50ms) → MobileNetV3-Large, EfficientNet-Lite2
│  │     └─ Relaxed (> 50ms) → EfficientNet-B0, ResNet-18
│  │
│  └─ Mobile (iOS/Android)
│     └─ MobileNetV3-Small (fastest), MobileNetV3-Large (balanced)
│        + INT8 quantization (route to ml-production)
│
├─ ACCURACY PRIORITY (cloud deployment assumed)
│  ├─ Maximum accuracy → EfficientNet-B7, ResNet-152, ViT-Large
│  ├─ Balanced → EfficientNet-B2/B3, ResNet-50
│  └─ Fast training → ResNet-18, EfficientNet-B0
│
├─ EFFICIENCY PRIORITY
│  └─ Best accuracy per FLOP → EfficientNet family (B0-B7)
│     (EfficientNet dominates ResNet on the Pareto frontier)
│
└─ TASK TYPE
   ├─ Classification → Any CNN (use constraint-based selection above)
   ├─ Object Detection → ResNet + FPN, EfficientDet, YOLOv8 (CSPDarknet)
   └─ Segmentation → ResNet + U-Net, EfficientNet + DeepLabV3
```

## CNN Family Catalog

### 1. ResNet Family (2015) - The Standard Baseline

**Architecture:** Residual connections (skip connections) enable very deep networks

**Variants:**
- ResNet-18: 11M params, 1.8 GFLOPs, 69.8% ImageNet
- ResNet-34: 22M params, 3.7 GFLOPs, 73.3% ImageNet
- ResNet-50: 25M params, 4.1 GFLOPs, 76.1% ImageNet
- ResNet-101: 44M params, 7.8 GFLOPs, 77.4% ImageNet
- ResNet-152: 60M params, 11.6 GFLOPs, 78.3% ImageNet

**When to Use:**
- ✅ **Baseline choice**: Well-tested, widely supported
- ✅ **Transfer learning**: Excellent pre-trained weights available
- ✅ **Object detection**: Standard backbone for Faster R-CNN, Mask R-CNN
- ✅ **Interpretability**: Simple architecture, easy to understand

**When NOT to Use:**
- ❌ **Edge/mobile deployment**: Too large and slow
- ❌ **Efficiency priority**: EfficientNet beats ResNet on accuracy/FLOP
- ❌ **Small datasets (< 10k)**: Use ResNet-18, not ResNet-50+

**Key Insight:** Skip connections solve the vanishing gradient problem and enable depth

**Code Example:**
```python
import torchvision.models as models

# For cloud/server (good dataset)
model = models.resnet50(pretrained=True)

# For a small dataset or faster training
model = models.resnet18(pretrained=True)

# For maximum accuracy (cloud only)
model = models.resnet101(pretrained=True)
```

### 2. EfficientNet Family (2019) - Best Efficiency

**Architecture:** Compound scaling (depth + width + resolution) optimized via neural architecture search

**Variants:**
- EfficientNet-B0: 5M params, 0.4 GFLOPs, 77.3% ImageNet
- EfficientNet-B1: 8M params, 0.7 GFLOPs, 79.2% ImageNet
- EfficientNet-B2: 9M params, 1.0 GFLOPs, 80.3% ImageNet
- EfficientNet-B3: 12M params, 1.8 GFLOPs, 81.7% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs, 82.9% ImageNet
- EfficientNet-B7: 66M params, 37 GFLOPs, 84.4% ImageNet

**When to Use:**
- ✅ **Efficiency matters**: Best accuracy per FLOP/parameter
- ✅ **Cloud deployment**: B2-B4 is the sweet spot for production
- ✅ **Limited compute**: B0 matches ResNet-50 accuracy at 10x fewer FLOPs
- ✅ **Scaling needs**: Want to scale the model up/down systematically

**When NOT to Use:**
- ❌ **Real-time mobile**: Use MobileNet (EfficientNet has more layers)
- ❌ **Very small datasets**: Can overfit despite efficiency
- ❌ **Simplicity needed**: More complex than ResNet

**Key Insight:** Compound scaling balances depth, width, and resolution optimally

**Efficiency Comparison:**
```
Same accuracy as ResNet-50 (76%):
- ResNet-50: 25M params, 4.1 GFLOPs
- EfficientNet-B0: 5M params, 0.4 GFLOPs (10x more efficient!)

Better accuracy (82.9%):
- ResNet-152: 60M params, 11.6 GFLOPs → 78.3% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs → 82.9% ImageNet
  (Better accuracy with 3x fewer params and 3x less compute)
```

**Code Example:**
```python
import timm  # PyTorch Image Models library

# Balanced choice (production)
model = timm.create_model('efficientnet_b2', pretrained=True)

# Efficiency priority (edge)
model = timm.create_model('efficientnet_b0', pretrained=True)

# Accuracy priority (research)
model = timm.create_model('efficientnet_b4', pretrained=True)
```

### 3. MobileNet Family (2017-2019) - Mobile Optimized

**Architecture:** Depthwise separable convolutions (drastically reduce compute)

**Variants:**
- MobileNetV1: 4.2M params, 0.6 GFLOPs, 70.6% ImageNet
- MobileNetV2: 3.5M params, 0.3 GFLOPs, 72.0% ImageNet
- MobileNetV3-Small: 2.5M params, 0.06 GFLOPs, 67.4% ImageNet
- MobileNetV3-Large: 5.4M params, 0.2 GFLOPs, 75.2% ImageNet

**When to Use:**
- ✅ **Mobile deployment**: iOS/Android apps
- ✅ **Edge devices**: Raspberry Pi, Jetson Nano
- ✅ **Real-time inference**: < 100ms latency
- ✅ **Extreme efficiency**: < 10M parameter budget

**When NOT to Use:**
- ❌ **Cloud deployment with no constraints**: EfficientNet or ResNet gives better accuracy
- ❌ **Accuracy priority**: Sacrifices accuracy for speed
- ❌ **Large datasets with compute**: Can afford better models

**Key Insight:** Depthwise separable convolutions = standard conv split into depthwise + pointwise (9x fewer operations)

**Deployment Performance:**
```
Raspberry Pi 4 inference (224×224 image):
- ResNet-50: ~2000ms (unusable)
- ResNet-18: ~600ms (slow)
- MobileNetV2: ~150ms (acceptable)
- MobileNetV3-Large: ~80ms (good)
- MobileNetV3-Small: ~40ms (fast)

With INT8 quantization:
- MobileNetV3-Large: ~30ms (production-ready)
- MobileNetV3-Small: ~15ms (real-time)
```

**Code Example:**
```python
import torchvision.models as models

# For mobile deployment
model = models.mobilenet_v3_large(pretrained=True)

# For ultra-low latency (sacrifice accuracy)
model = models.mobilenet_v3_small(pretrained=True)

# Quantization for mobile (route to the ml-production skill for details)
# Achieves 2-4x speedup with minimal accuracy loss
```
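
To make the depthwise + pointwise factorization concrete, here is a minimal sketch (ours, not the torchvision implementation; norm and activation layers omitted):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)       # one 3x3 filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # 1x1 channel mixing

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter count vs a standard 3x3 conv (in_ch = out_ch = 256):
#   standard:  256*256*9       ≈ 590K
#   separable: 256*9 + 256*256 ≈ 68K (≈9x fewer)
```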

### 4. Inception Family (2014-2016) - Multi-Scale Features

**Architecture:** Multi-scale convolutions in parallel (inception modules)

**Variants:**
- InceptionV3: 24M params, 5.7 GFLOPs, 77.5% ImageNet
- InceptionV4: 42M params, 12.3 GFLOPs, 80.0% ImageNet
- Inception-ResNet: hybrid with residual connections

**When to Use:**
- ✅ **Multi-scale features**: Objects at different sizes
- ✅ **Object detection**: Good backbone for detection
- ✅ **Historical interest**: Understanding multi-scale approaches

**When NOT to Use:**
- ❌ **Simplicity needed**: Complex architecture, hard to modify
- ❌ **Efficiency priority**: EfficientNet is better
- ❌ **Modern projects**: Largely superseded by ResNet/EfficientNet

**Key Insight:** Parallel multi-scale convolutions (1×1, 3×3, 5×5) capture different receptive fields

**Status:** Mostly historical - ResNet and EfficientNet have replaced Inception in practice

### 5. DenseNet Family (2017) - Dense Connections

**Architecture:** Every layer connects to every other layer (dense connections)

**Variants:**
- DenseNet-121: 8M params, 2.9 GFLOPs, 74.4% ImageNet
- DenseNet-169: 14M params, 3.4 GFLOPs, 75.6% ImageNet
- DenseNet-201: 20M params, 4.3 GFLOPs, 76.9% ImageNet

**When to Use:**
- ✅ **Parameter efficiency**: Good accuracy with few parameters
- ✅ **Feature reuse**: Dense connections enable feature reuse
- ✅ **Small datasets**: Better gradient flow helps with limited data

**When NOT to Use:**
- ❌ **Inference speed priority**: Dense connections are slow (high memory bandwidth)
- ❌ **Training speed**: Slower to train than ResNet
- ❌ **Production deployment**: Less mature ecosystem than ResNet

**Key Insight:** Dense connections improve gradient flow and enable feature reuse, but slow inference

**Status:** Theoretically elegant, but ResNet/EfficientNet are more practical

### 6. VGG Family (2014) - Historical Baseline

**Architecture:** Very deep (16-19 layers), small 3×3 convolutions, many parameters

**Variants:**
- VGG-16: 138M params, 15.5 GFLOPs, 71.5% ImageNet
- VGG-19: 144M params, 19.6 GFLOPs, 71.1% ImageNet

**When to Use:**
- ❌ **DON'T use VGG for new projects**
- Historical understanding only

**Why NOT to Use:**
- Massive parameter count (138M vs ResNet-50's 25M)
- Poor accuracy for its size
- Superseded by ResNet (2015)

**Key Insight:** Proved that depth matters, but skip connections (ResNet) are better

**Status:** **Obsolete** - use ResNet or EfficientNet instead

## Practical Selection Guide

### Scenario 1: Cloud/Server Deployment

**Goal:** Best accuracy, no compute constraints

**Recommendation:**
```
Small dataset (< 10k images):
  → EfficientNet-B0 or ResNet-18
    (Avoid overfitting with a smaller model)

Medium dataset (10k-100k images):
  → EfficientNet-B2 or ResNet-50
    (Balanced accuracy and efficiency)

Large dataset (> 100k images):
  → EfficientNet-B4 or ResNet-101
    (Can afford a larger model)

Maximum accuracy (research):
  → EfficientNet-B7 or Vision Transformer
    (If dataset > 1M images and compute unlimited)
```

### Scenario 2: Edge Deployment (Jetson, Coral TPU)

**Goal:** Optimize for edge hardware latency

**Recommendation:**
```
Real-time requirement (< 10ms):
  → MobileNetV3-Small or EfficientNet-Lite0
    + INT8 quantization

Medium latency (10-50ms):
  → MobileNetV3-Large or EfficientNet-Lite2

Relaxed latency (> 50ms):
  → EfficientNet-B0 or ResNet-18
```

**Critical:** Profile on the actual edge hardware. Quantization is mandatory (route to ml-production).

### Scenario 3: Mobile Deployment (iOS/Android)

**Goal:** On-device inference, minimal battery drain

**Recommendation:**
```
All mobile deployments:
  → MobileNetV3-Large (balanced)
  → MobileNetV3-Small (fastest, less accurate)

Always use:
- INT8 quantization (2-4x speedup)
- CoreML (iOS) or TFLite (Android) optimization
- Benchmark on the target device before deploying
```

**Expected latency (iPhone 12, INT8 quantized):**
- MobileNetV3-Small: 5-10ms
- MobileNetV3-Large: 15-25ms

### Scenario 4: Object Detection

**Goal:** Select a backbone for a detection framework

**Recommendation:**
```
Faster R-CNN:
  → ResNet-50 + FPN (standard)
  → ResNet-101 + FPN (more accuracy)

YOLOv8:
  → CSPDarknet (built-in, optimized)

EfficientDet:
  → EfficientNet + BiFPN (best efficiency)

Custom detection:
  → ResNet or EfficientNet as backbone
  → Add a Feature Pyramid Network (FPN) for multi-scale features
```

**Note:** Detection adds significant compute on top of the backbone. Choose an efficient backbone.

### Scenario 5: Semantic Segmentation

**Goal:** Dense pixel-wise prediction

**Recommendation:**
```
U-Net style:
  → ResNet-18/34 as encoder (fast)
  → EfficientNet-B0 as encoder (efficient)

DeepLabV3:
  → ResNet-50 (standard)
  → MobileNetV3 (mobile deployment)

Key: Segmentation requires multi-scale features
  → Ensure the backbone has skip connections or an FPN
```

## Trade-Off Analysis

### Accuracy vs Efficiency (Pareto Frontier)

**ImageNet Top-1 Accuracy vs FLOPs:**

```
Efficiency Winners (best accuracy per FLOP):
1. EfficientNet-B0: 77.3% @ 0.4 GFLOPs (best efficiency)
2. EfficientNet-B2: 80.3% @ 1.0 GFLOPs
3. EfficientNet-B4: 82.9% @ 4.2 GFLOPs

Accuracy Winners (best absolute accuracy):
1. ViT-Large: 85.2% @ 190 GFLOPs (requires a huge dataset)
2. EfficientNet-B7: 84.4% @ 37 GFLOPs
3. ResNet-152: 78.3% @ 11.6 GFLOPs (dominated by EfficientNet)

Speed Winners (lowest latency):
1. MobileNetV3-Small: 67.4% @ 0.06 GFLOPs (~50ms on mobile)
2. MobileNetV3-Large: 75.2% @ 0.2 GFLOPs (~100ms on mobile)
3. EfficientNet-Lite0: 75.0% @ 0.4 GFLOPs
```

**Key Takeaway:** EfficientNet dominates ResNet on the Pareto frontier (better accuracy at the same compute).

### Parameters vs Accuracy

**For the same ~75% ImageNet accuracy:**
```
VGG-16:            138M params (❌ terrible efficiency)
ResNet-50:         25M params
EfficientNet-B0:   5M params (✅ 5x fewer parameters!)
MobileNetV3-Large: 5M params (fast inference)
```

**Conclusion:** Modern architectures (EfficientNet, MobileNet) achieve the same accuracy with far fewer parameters.

## Common Pitfalls

### Pitfall 1: Defaulting to ResNet-50
**Symptom:** Using ResNet-50 without considering alternatives

**Why it's wrong:** EfficientNet-B0 matches ResNet-50 accuracy with 10x less compute

**Fix:** Consider the EfficientNet family first (better efficiency)

### Pitfall 2: Choosing a Large Model for a Small Dataset
**Symptom:** Using ResNet-101 with < 10k images

**Why it's wrong:** The model will overfit (too many parameters for the data)

**Fix:**
- < 10k images → ResNet-18 or EfficientNet-B0
- 10k-100k → ResNet-50 or EfficientNet-B2
- > 100k → Can use larger models

### Pitfall 3: Using a Desktop Model on Mobile
**Symptom:** Trying to run ResNet-50 on a mobile device

**Why it's wrong:** A ~2000ms inference time is unusable

**Fix:** Use MobileNetV3 + quantization for mobile (15-30ms)

### Pitfall 4: Ignoring Task Type
**Symptom:** Using a standard CNN for object detection without an FPN

**Why it's wrong:** Detection needs multi-scale features

**Fix:** Use detection-specific frameworks (YOLOv8, Faster R-CNN) with an appropriate backbone

### Pitfall 5: Believing "Bigger = Better"
**Symptom:** Choosing ResNet-152 over ResNet-50 without justification

**Why it's wrong:** Diminishing returns - roughly 3x the compute for ~2% more accuracy, and it will overfit on small data

**Fix:** Match model capacity to dataset size; consider efficiency
|
||||||
|
|
||||||
|
|
||||||
|
## Evolution and Historical Context

**Why CNNs evolved the way they did:**

```
2012: AlexNet
  → Proved deep learning works for vision
  → 8 layers, 60M params

2014: VGG
  → Deeper is better (16-19 layers)
  → But: 138M params (too many)

2014: Inception/GoogLeNet
  → Multi-scale convolutions
  → More efficient than VGG

2015: ResNet ★
  → Skip connections enable very deep networks (152 layers)
  → Solved vanishing gradient problem
  → Became standard baseline

2017: MobileNet
  → Mobile deployment needs
  → Depthwise separable convolutions (9x fewer ops)

2017: DenseNet
  → Dense connections for feature reuse
  → Parameter efficient but slow inference

2019: EfficientNet ★
  → Compound scaling (depth + width + resolution)
  → Neural architecture search
  → Dominates Pareto frontier (best accuracy per FLOP)
  → New standard for efficiency

2020: Vision Transformer
  → Attention-based (no convolutions)
  → Requires very large datasets (> 1M images)
  → For research/large-scale applications
```

**Current Recommendations (2025):**
- Cloud: **EfficientNet** (best efficiency) or ResNet (simplicity)
- Edge: **EfficientNet-Lite** or MobileNetV3
- Mobile: **MobileNetV3** + quantization
- Detection: **EfficientDet** or YOLOv8
- Baseline: **ResNet** (simple, well-tested)
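Loading these defaults is one line each with torchvision; a minimal sketch (the `weights="DEFAULT"` string assumes torchvision 0.13 or newer):

```python
from torchvision.models import efficientnet_b2, mobilenet_v3_large, resnet50

cloud_model = efficientnet_b2(weights="DEFAULT")    # cloud default: best efficiency
edge_model = mobilenet_v3_large(weights="DEFAULT")  # edge/mobile default
baseline = resnet50(weights="DEFAULT")              # simple, well-tested baseline
```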
## Decision Checklist

Before choosing a CNN, answer these:

```
☐ Deployment target? (cloud/edge/mobile)
☐ Latency requirement? (< 10ms / 10-100ms / > 100ms)
☐ Model size budget? (< 10M / 10-50M / unlimited params)
☐ Dataset size? (< 10k / 10k-100k / > 100k images)
☐ Accuracy priority? (maximum / production / fast iteration)
☐ Task type? (classification / detection / segmentation)
☐ Efficiency matters? (yes → EfficientNet, no → flexibility)

Based on answers:
  → Mobile → MobileNetV3
  → Edge → EfficientNet-Lite or MobileNetV3
  → Cloud + efficiency → EfficientNet
  → Cloud + simplicity → ResNet
  → Maximum accuracy → EfficientNet-B7 or ViT
  → Small dataset → Small models (ResNet-18, EfficientNet-B0)
```

## Integration with Other Skills

**After selecting CNN architecture:**

**Training the model:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection (Adam, SGD, AdamW)
- Learning rate schedules
- Data augmentation

**Implementing in PyTorch:**
→ `yzmir/pytorch-engineering/using-pytorch-engineering`
- Custom modifications to pre-trained models
- Multi-GPU training
- Performance optimization

**Deploying to production:**
→ `yzmir/ml-production/using-ml-production`
- Quantization (INT8, FP16)
- Model serving (TorchServe, ONNX)
- Optimization for edge/mobile (TFLite, CoreML)

**If architecture is unstable (very deep):**
→ `yzmir/neural-architectures/normalization-techniques`
- Normalization layers (BatchNorm, LayerNorm)
- Skip connections
- Initialization strategies

## Summary

**CNN Selection in One Table:**

| Scenario | Recommendation | Why |
|----------|----------------|-----|
| Cloud, balanced | EfficientNet-B2 | Best efficiency, 80% accuracy |
| Cloud, max accuracy | EfficientNet-B4 | 83% accuracy, reasonable compute |
| Cloud, simple baseline | ResNet-50 | Well-tested, widely used |
| Edge device | MobileNetV3-Large | Optimized for edge, 75% accuracy |
| Mobile app | MobileNetV3-Small + quantization | < 20ms inference |
| Small dataset (< 10k) | ResNet-18 or EfficientNet-B0 | Avoid overfitting |
| Object detection | ResNet-50 + FPN, EfficientDet | Multi-scale features |
| Segmentation | ResNet + U-Net, DeepLabV3 | Dense prediction |

**Key Principles:**
1. **Match model capacity to dataset size** (small data → small model)
2. **EfficientNet dominates ResNet on efficiency** (same accuracy, less compute)
3. **Mobile needs mobile-specific architectures** (MobileNet, quantization)
4. **Task type matters** (detection/segmentation need multi-scale features)
5. **Bigger ≠ always better** (diminishing returns, overfitting risk)

**When in doubt:** Start with **EfficientNet-B2** (cloud) or **MobileNetV3-Large** (edge/mobile).

**END OF SKILL**
811
skills/using-neural-architectures/generative-model-families.md
Normal file
@@ -0,0 +1,811 @@
# Generative Model Families

## When to Use This Skill

Use this skill when you need to:
- ✅ Select generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand modern landscape (Stable Diffusion, StyleGAN, etc.)

**Do NOT use this skill for:**
- ❌ Text generation (use `llm-specialist` pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use `using-neural-architectures`)

## Core Principle

**Generative models have fundamental trade-offs:**
- **Quality vs Stability**: GANs sharp but unstable, VAEs blurry but stable
- **Quality vs Speed**: Diffusion high-quality but slow, GANs fast
- **Explicitness vs Flexibility**: Autoregressive/Flow have likelihood, GANs don't

**Modern default (2025):** Diffusion models (best quality + stability)

## Part 1: Model Family Overview

### The Five Families

**1. VAE (Variational Autoencoder)**
- **Approach**: Learn latent space with encoder-decoder
- **Quality**: Blurry (6/10)
- **Training**: Very stable
- **Use**: Latent space exploration, NOT high-quality generation

**2. GAN (Generative Adversarial Network)**
- **Approach**: Adversarial game (generator vs discriminator)
- **Quality**: Sharp (9/10)
- **Training**: Unstable (adversarial dynamics)
- **Use**: High-quality generation, fast inference

**3. Diffusion Models**
- **Approach**: Iterative denoising
- **Quality**: Very sharp (9.5/10)
- **Training**: Stable
- **Use**: Modern default for high-quality generation

**4. Autoregressive Models**
- **Approach**: Sequential generation (pixel-by-pixel, token-by-token)
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Explicit likelihood, sequential data

**5. Flow Models**
- **Approach**: Invertible transformations
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Exact likelihood, invertibility needed

### Quick Comparison

| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|-------|---------|-------------------|----------------|---------------|------------|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |
## Part 2: VAE (Variational Autoencoder)

### Architecture

**Components:**
1. **Encoder**: x → z (image to latent)
2. **Latent space**: z ~ N(μ, σ²)
3. **Decoder**: z → x' (latent to reconstruction)

**Loss function:**
```python
# ELBO (Evidence Lower Bound), written as a loss to minimize
loss = reconstruction_loss + KL_divergence

# Reconstruction: How well decoder reconstructs input
reconstruction_loss = MSE(x, x_reconstructed)

# KL: How close latent is to standard normal
KL_divergence = KL(q(z|x) || p(z))
```

### Why VAE is Blurry

**Problem**: MSE loss encourages pixel-wise averaging

**Example:**
- Dataset: Faces with both smiles and no smiles
- VAE learns: "Average face has half-smile blur"
- Result: Blurry, hedges between modes

**Mathematical reason:**
- MSE minimization = mean prediction
- Mean of sharp images = blurry image

### When to Use VAE

✅ **Use VAE for:**
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)

❌ **DON'T use VAE for:**
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs

### Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )

        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)

        # Sample latent
        z = self.reparameterize(mu, logvar)

        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)

        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')

        # KL divergence
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        return recon_loss + kl_loss
```
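Usage sketch for the class above (assumes the imports from the implementation have run): new images come from decoding latents drawn from the prior. The 64×64 output size follows the layer comments above.

```python
vae = VAE(latent_dim=128)
vae.eval()

# Sample from the prior p(z) = N(0, I) and decode.
z = torch.randn(16, 128)
h = vae.fc_decode(z).view(-1, 128, 8, 8)
samples = vae.decoder(h)  # (16, 3, 64, 64), values in [0, 1] from the Sigmoid
```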
## Part 3: GAN (Generative Adversarial Network)

### Architecture

**Components:**
1. **Generator**: z → x (noise to image)
2. **Discriminator**: x → [0, 1] (image to real/fake probability)

**Adversarial Training:**
```python
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))

# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))

# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
```

### Training Instability

**Problem**: Adversarial dynamics are unstable

**Common issues:**
1. **Mode collapse**: Generator produces limited variety
2. **Non-convergence**: Oscillation, never settles
3. **Vanishing gradients**: Discriminator too strong, generator can't learn
4. **Hyperparameter sensitivity**: Learning rates critical

**Solutions:**
- Spectral normalization (StyleGAN2; see the sketch below)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)
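Spectral normalization is one line per layer in PyTorch. A minimal sketch of a spectrally-normalized discriminator (layer sizes are illustrative, assuming 64×64 inputs):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer has its weight rescaled so its largest singular value
# stays near 1, which bounds the discriminator's Lipschitz constant.
disc = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),    # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),  # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),
)
```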
### Mode Collapse

**What is it?**
- Generator produces subset of distribution
- Example: Face GAN only generates 10 face types

**Why it happens:**
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes

**Detection (pseudocode):**
```python
# Check diversity: Generate many samples, then compare them pairwise.
# (generator.generate and compute_pairwise_distance are placeholders.)
samples = generator.generate(n=1000)
diversity = compute_pairwise_distance(samples)
if diversity < threshold:
    print("Mode collapse detected!")
```

**Solutions:**
- Minibatch discrimination
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)

### Modern GANs

**StyleGAN2 (2020):**
- State-of-the-art for faces
- Style-based generator
- Spectral normalization for stability
- Resolution: 1024×1024

**StyleGAN3 (2021):**
- Alias-free architecture
- Better animation/video

**When to use GAN:**
✅ Fast inference needed (50ms per image)
✅ Pretrained model available (StyleGAN2)
✅ Can tolerate training difficulty

❌ Training instability unacceptable
❌ Mode collapse problematic
❌ Starting from scratch (use diffusion instead)

### Implementation (Basic GAN)

```python
class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop (assumes optimizer_D, optimizer_G, latent_dim are defined)
for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator
    optimizer_D.zero_grad()
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach: don't update G here

    D_loss = -torch.mean(torch.log(D_real) + torch.log(1 - D_fake))
    D_loss.backward()
    optimizer_D.step()

    # Train generator
    optimizer_G.zero_grad()
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake))
    G_loss.backward()
    optimizer_G.step()
```
## Part 4: Diffusion Models (Modern Default)

### Architecture

**Concept**: Learn to reverse a diffusion (noising) process

**Forward process** (fixed):
```python
# Gradually add noise to image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)

# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I), β_t = noise schedule
```

**Reverse process** (learned):
```python
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)

# Model predicts: ε_θ(x_t, t)
# Then: x_{t-1} = (x_t - √β_t * ε_θ(x_t, t)) / √(1 - β_t)
```

**Training:**
```python
# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))

# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
```

### Why Diffusion is Excellent

**Advantages:**
1. **High quality**: State-of-the-art (better than GAN)
2. **Stable training**: Standard MSE loss (no adversarial dynamics)
3. **No mode collapse**: By design, covers full distribution
4. **Controllable**: Easy to add conditioning (text, class, etc.)

**Disadvantages:**
1. **Slow inference**: 50-1000 denoising steps (vs GAN's 1 step)
2. **Compute intensive**: T forward passes (T = 50-1000)

**Speed comparison:**
```
GAN:                1 forward pass      = 50ms
Diffusion (T=50):   50 forward passes   = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds
```

**Speedup techniques:**
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler)
- Latent diffusion (Stable Diffusion, denoise in latent space)
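As a concrete illustration of the DDIM speedup, a minimal sketch with Hugging Face `diffusers` (model id and scheduler-swap pattern reflect common diffusers usage; verify names against your installed version):

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in a DDIM scheduler so ~25-50 steps suffice instead of ~1000.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a watercolor fox in a forest", num_inference_steps=25).images[0]
image.save("fox.png")
```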
### Modern Diffusion Models

**Stable Diffusion (2022+):**
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable

**DALL-E 2 (2022):**
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)

**Imagen (2022, Google):**
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)

**When to use Diffusion:**
✅ High-quality generation (best quality)
✅ Stable training (standard loss)
✅ Diversity needed (no mode collapse)
✅ Conditioning (text-to-image, class-conditional)

❌ Need fast inference (< 1 second)
❌ Real-time generation

### Implementation (DDPM)

```python
class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net architecture (UNet is assumed defined elsewhere)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict noise ε at timestep t
        return self.model(x_t, t)

# Training (alpha_schedule holds cumulative products ᾱ_t = ∏_{s≤t} (1 - β_s))
def train_step(model, x_0):
    # Sample random timestep
    t = torch.randint(0, T, (batch_size,))

    # Sample noise
    ε = torch.randn_like(x_0)

    # Create noisy image x_t
    ᾱ_t = alpha_schedule[t]
    x_t = torch.sqrt(ᾱ_t) * x_0 + torch.sqrt(1 - ᾱ_t) * ε

    # Predict noise
    ε_pred = model(x_t, t)

    # Loss: MSE between actual and predicted noise
    loss = F.mse_loss(ε_pred, ε)
    return loss

# Sampling (generation); alphas, betas, alpha_schedule assumed precomputed
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)

    # Iteratively denoise
    for t in reversed(range(T)):
        # Predict noise
        ε_pred = model(x_t, t)

        # Denoise one step (α_t = 1 - β_t, ᾱ_t = cumulative product)
        α_t = alphas[t]
        ᾱ_t = alpha_schedule[t]
        x_t = (x_t - (1 - α_t) / torch.sqrt(1 - ᾱ_t) * ε_pred) / torch.sqrt(α_t)

        # Add noise (except last step)
        if t > 0:
            x_t += torch.sqrt(betas[t]) * torch.randn_like(x_t)

    return x_t  # x_0 (generated image)
```
## Part 5: Autoregressive Models

### Concept

**Idea**: Model probability as product of conditionals
```
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
```

**For images**: Generate pixel-by-pixel (or patch-by-patch)

**Architectures:**
- **PixelCNN**: Convolutional with masked kernels
- **PixelCNN++**: Improved with mixture of logistics
- **VQ-VAE + PixelCNN**: Two-stage (learn discrete codes, model codes)
- **ImageGPT**: GPT-style Transformer for images

### Advantages

✅ **Explicit likelihood**: Can compute p(x) exactly
✅ **Stable training**: Standard cross-entropy loss
✅ **Theoretical guarantees**: Proper probability model

### Disadvantages

❌ **Very slow generation**: Sequential (can't parallelize)
❌ **Limited quality**: Worse than GAN/Diffusion for high-res
❌ **Resolution scaling**: Impractical for 1024×1024 (1M pixels!)

**Speed comparison:**
```
GAN:      Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds

For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!
```

### When to Use

✅ **Use autoregressive for:**
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)

❌ **Don't use for:**
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)

### Modern Usage

**Two-stage approach (DALL-E, VQ-GAN):**
1. **Stage 1**: VQ-VAE learns discrete codes
   - Image → 32×32 grid of codes (instead of 1M pixels)
2. **Stage 2**: Autoregressive model (Transformer) on codes
   - Much faster (32×32 = 1024 codes, not 1M pixels)
## Part 6: Flow Models

### Concept

**Idea**: Invertible transformations
```
z ~ N(0, I) ←→ x ~ p_data

f:   z → x (forward)
f⁻¹: x → z (inverse)
```

**Requirement**: f must be invertible and differentiable

**Advantage**: Exact likelihood via change-of-variables
```
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
```

### Architectures

**RealNVP (2017):**
- Coupling layers (affine transformations)
- Invertible by design

**Glow (2018, OpenAI):**
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture

**When to use Flow:**
✅ Exact likelihood needed (better than VAE)
✅ Invertibility needed (both z→x and x→z)
✅ Stable training (standard loss)

❌ Architecture constraints (must be invertible)
❌ Quality not as good as GAN/Diffusion

### Modern Status

**Mostly superseded by Diffusion:**
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications
## Part 7: Decision Framework

### By Primary Goal

```
Goal: High-quality images
  → Diffusion (modern default)
  → OR GAN if pretrained available

Goal: Fast inference
  → GAN (50ms per image)
  → Avoid Diffusion (too slow for real-time)

Goal: Training stability
  → Diffusion or VAE (standard loss)
  → Avoid GAN (adversarial training hard)

Goal: Latent space exploration
  → VAE (smooth interpolation)
  → Avoid GAN (no encoder)

Goal: Explicit likelihood
  → Autoregressive or Flow
  → For evaluation, compression

Goal: Diversity (no mode collapse)
  → Diffusion (by design)
  → OR VAE (stable)
  → Avoid GAN (mode collapse common)
```

### By Data Type

```
Images (high-quality):
  → Diffusion (Stable Diffusion)
  → OR GAN (StyleGAN2)

Images (small, 32×32):
  → Any model works
  → Try VAE first (simplest)

Audio waveforms:
  → WaveGAN
  → OR Diffusion (WaveGrad)

Video:
  → Video Diffusion (limited)
  → OR GAN (StyleGAN-V)

Text:
  → Autoregressive (GPT)
  → NOT VAE/GAN/Diffusion (discrete tokens)
```

### By Training Budget

```
Large budget (millions $, pretrain from scratch):
  → Diffusion (Stable Diffusion scale)
  → Billions of images, weeks on cluster

Medium budget (thousands $, train from scratch):
  → GAN or Diffusion
  → 10k-1M images, days on GPU

Small budget (hundreds $, fine-tune):
  → Fine-tune Stable Diffusion (LoRA)
  → 1k-10k images, hours on consumer GPU

Tiny budget (research, small scale):
  → VAE (simplest, most stable)
  → Few thousand images, CPU possible
```

### Modern Recommendations (2025)

**For new projects:**
1. **Default: Diffusion**
   - Fine-tune Stable Diffusion or train from scratch
   - Best quality + stability

2. **If need speed: GAN**
   - Use pretrained StyleGAN2 if available
   - Or train GAN (if can tolerate instability)

3. **If need latent space: VAE**
   - For interpolation, not generation quality

**AVOID:**
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images
## Part 8: Training from Scratch vs Fine-Tuning

### Stable Diffusion Example

**Pretraining (what Stability AI did):**
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- **DON'T DO THIS!**

**Fine-tuning (what users do):**
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- **DO THIS!**

**LoRA (Low-Rank Adaptation):**
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- **Best for small budgets!** (see the sketch below)
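Once a LoRA adapter has been trained, applying it is a few lines with `diffusers` (the checkpoint path is hypothetical; `load_lora_weights` exists in recent diffusers releases, but verify against your version):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach a trained LoRA adapter on top of the frozen base weights.
pipe.load_lora_weights("./my-domain-lora")  # hypothetical local checkpoint

image = pipe("a product shot in my brand style", num_inference_steps=30).images[0]
```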
### Decision

```
Have pretrained model in your domain:
  → Fine-tune (don't retrain!)

No pretrained model:
  → Train from scratch (small model)
  → OR find closest pretrained and fine-tune

Budget < $1000:
  → LoRA fine-tuning
  → OR train small model (64×64)

Budget < $100:
  → LoRA with free Colab
  → OR VAE from scratch (cheap)
```

## Part 9: Common Mistakes

### Mistake 1: VAE for High-Quality Generation

**Symptom:** Blurry outputs
**Fix:** Use GAN or Diffusion for quality
**VAE is for:** Latent space, not generation

### Mistake 2: Ignoring Mode Collapse

**Symptom:** GAN generates same images
**Fix:** Spectral norm, minibatch discrimination
**Better:** Switch to Diffusion (no mode collapse)

### Mistake 3: Training Stable Diffusion from Scratch

**Symptom:** Burning money, poor results
**Fix:** Fine-tune pretrained model
**Reality:** Pretraining costs $600k+

### Mistake 4: Slow Inference with Diffusion

**Symptom:** 50 seconds per image
**Fix:** Use DDIM (fewer steps, 10-50)
**OR:** Use GAN if speed critical

### Mistake 5: Wrong Loss for GAN

**Symptom:** Training diverges
**Fix:** Use Wasserstein loss (WGAN)
**OR:** Spectral normalization
**Better:** Switch to Diffusion (standard loss)

## Summary: Quick Reference

### Model Selection

```
High quality + stable training:
  → Diffusion (modern default)

Fast inference required:
  → GAN (if pretrained) or trained GAN

Latent space exploration:
  → VAE

Explicit likelihood:
  → Autoregressive or Flow

Small images (< 64×64):
  → Any model (start with VAE)

Large images (> 256×256):
  → Diffusion or GAN (avoid autoregressive)
```

### Quality Ranking

```
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
```

### Training Stability Ranking

```
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
```

### Modern Stack (2025)

```
Image generation: Stable Diffusion (fine-tuned)
Fast inference:   StyleGAN2 (if available)
Latent space:     VAE
Research:         Diffusion (easiest to train)
```

## Next Steps

After mastering this skill:
- `llm-specialist/llm-finetuning-strategies`: Apply to text generation
- `architecture-design-principles`: Understand design trade-offs
- `training-optimization`: Optimize training for your chosen model

**Remember:** Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).
625
skills/using-neural-architectures/graph-neural-networks-basics.md
Normal file
@@ -0,0 +1,625 @@
# Graph Neural Networks Basics

## When to Use This Skill

Use this skill when you need to:
- ✅ Work with graph-structured data (molecules, social networks, citations)
- ✅ Understand why CNN/RNN don't work on graphs
- ✅ Learn message passing framework
- ✅ Choose between GCN, GraphSAGE, GAT
- ✅ Decide if GNN is appropriate (vs simple model)
- ✅ Implement permutation-invariant aggregations

**Do NOT use this skill for:**
- ❌ Sequential data (use RNN/Transformer)
- ❌ Grid data (use CNN)
- ❌ High-level architecture selection (use `using-neural-architectures`)

## Core Principle

**Graphs have irregular structure.** CNN (grid) and RNN (sequence) don't work.

**GNN solution:** Message passing
- Nodes aggregate information from neighbors
- Multiple layers = multi-hop neighborhoods
- Permutation invariant (order doesn't matter)

**Critical question:** Does graph structure actually help? (Test: Compare with/without edges)
## Part 1: Why GNN (Not CNN/RNN)

### Problem: Graph Structure

**Graph components:**
- **Nodes**: Entities (atoms, users, papers)
- **Edges**: Relationships (bonds, friendships, citations)
- **Features**: Node/edge attributes

**Key property:** Irregular structure
- Each node has variable number of neighbors
- No fixed spatial arrangement
- Permutation invariant (node order doesn't matter)

### Why CNN Doesn't Work

**CNN assumption:** Regular grid structure

**Example:** Image (2D grid)
```
Every interior pixel has exactly 8 neighbors:
[■][■][■]
[■][X][■]  ← Center pixel has 8 neighbors (fixed!)
[■][■][■]

CNN kernel: 3×3 (fixed size, fixed positions)
```

**Graph reality:** Irregular neighborhoods
```
Node A: 2 neighbors (H, C)
Node B: 4 neighbors (C, C, C, H)
Node C: 1 neighbor (H)

No fixed kernel size or position!
```

**CNN limitations:**
- Requires fixed-size neighborhoods → Graphs have variable-size
- Assumes spatial locality → Graphs have arbitrary connectivity
- Depends on node ordering → Should be permutation invariant

### Why RNN Doesn't Work

**RNN assumption:** Sequential structure

**Example:** Text (1D sequence)
```
"The cat sat" → [The] → [cat] → [sat]
Clear sequential order, temporal dependencies
```

**Graph reality:** No inherent sequence
```
Social network:
A — B — C
|       |
D ——————E

What's the "sequence"? A→B→C? A→D→E? No natural ordering!
```

**RNN limitations:**
- Requires sequential order → Graphs have no natural order
- Processes one element at a time → Graphs have parallel connections
- Order-dependent → Should be permutation invariant

### GNN Solution

**Key innovation:** Message passing on graph structure
- Operate directly on nodes and edges
- Variable-size neighborhoods (handled naturally)
- Permutation invariant aggregations
## Part 2: Message Passing Framework

### Core Mechanism

**Message passing in 3 steps:**

**1. Aggregate neighbor messages**
```python
# Node i aggregates from neighbors N(i)
messages = [h_j for j in neighbors(i)]
aggregated = aggregate(messages)  # e.g., mean, sum, max
```

**2. Update node representation**
```python
# Combine own features with aggregated messages
h_i_new = update(h_i_old, aggregated)  # e.g., neural network
```

**3. Repeat for L layers**
- Layer 1: Node sees 1-hop neighbors
- Layer 2: Node sees 2-hop neighbors
- Layer L: Node sees L-hop neighborhood

### Concrete Example: Social Network

**Task:** Predict user interests

**Graph:**
```
     B (sports)
     |
A ---+--- C (cooking)
     |
     D (music)
```

**Layer 1: 1-hop neighbors**
```python
# Node A aggregates from direct friends
h_A_layer1 = update(
    h_A,
    aggregate([h_B, h_C, h_D])
)
# Now h_A includes friend interests
```

**Layer 2: 2-hop neighbors (friends of friends)**
```python
# B's friends: E, F
# C's friends: G, H
# D's friends: I

h_A_layer2 = update(
    h_A_layer1,
    aggregate([h_B', h_C', h_D'])  # h_B' includes E, F
)
# Now h_A includes friends-of-friends!
```

**Key insight:** More layers = larger receptive field (L-hop neighborhood)

### Permutation Invariance

**Critical property:** Same graph → same output (regardless of node ordering)

**Example:**
```python
Graph: A-B, B-C

Node list 1: [A, B, C]
Node list 2: [C, B, A]

Output MUST be identical! (Same graph, different ordering)
```

**Invariant aggregations:**
- ✅ Mean: `mean([1, 2, 3]) == mean([3, 2, 1])`
- ✅ Sum: `sum([1, 2, 3]) == sum([3, 2, 1])`
- ✅ Max: `max([1, 2, 3]) == max([3, 2, 1])`

**NOT invariant:**
- ❌ LSTM: `LSTM([1, 2, 3]) != LSTM([3, 2, 1])`
- ❌ Concatenate: `[1, 2, 3] != [3, 2, 1]`

**Implementation:**
```python
# CORRECT: Permutation invariant
def aggregate(neighbor_features):
    return torch.mean(neighbor_features, dim=0)

# WRONG: Order-dependent!
def aggregate(neighbor_features):
    return LSTM(neighbor_features)  # Output depends on order
```
## Part 3: GNN Architectures

### Architecture 1: GCN (Graph Convolutional Network)

**Key idea:** Spectral convolution on graphs (simplified)

**Formula:**
```python
h_i^(l+1) = σ(∑_{j∈N(i)} W^(l) h_j^(l) / √(|N(i)| |N(j)|))

# Normalize by degree (√(deg(i) * deg(j)))
```

**Aggregation:** Weighted mean (degree-normalized)

**Properties:**
- Transductive (needs full graph at training)
- Computationally efficient
- Good baseline

**When to use:**
- Full graph available at training time
- Starting point (simplest GNN)
- Small to medium graphs

**Implementation:**
```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # x: Node features (N, in_channels)
        # edge_index: Graph connectivity (2, E)

        # Layer 1
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)

        # Layer 2
        x = self.conv2(x, edge_index)

        return x
```

### Architecture 2: GraphSAGE

**Key idea:** Sample and aggregate (inductive learning)

**Formula:**
```python
# Sample fixed-size neighborhood
neighbors_sampled = sample(neighbors(i), k=10)

# Aggregate
h_N = aggregate({h_j for j in neighbors_sampled})

# Concatenate and transform
h_i^(l+1) = σ(W^(l) [h_i^(l); h_N])
```

**Aggregation:** Mean, max, or LSTM (but mean/max preferred for invariance)

**Key innovation:** Sampling
- Sample fixed number of neighbors (e.g., 10)
- Makes computation tractable for large graphs
- Enables inductive learning (generalizes to unseen nodes)

**When to use:**
- Large graphs (millions of nodes)
- Need inductive capability (new nodes appear)
- Training on subset, testing on full graph

**Implementation:**
```python
from torch_geometric.nn import SAGEConv

class GraphSAGE(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
```

### Architecture 3: GAT (Graph Attention Network)

**Key idea:** Learn attention weights for neighbors

**Formula:**
```python
# Attention scores
α_ij = attention(h_i, h_j)  # How important is neighbor j to node i?

# Normalize (softmax)
α_ij = softmax_j(α_ij)

# Weighted aggregation
h_i^(l+1) = σ(∑_{j∈N(i)} α_ij W h_j^(l))
```

**Key innovation:** Learned neighbor importance
- Not all neighbors equally important
- Attention mechanism decides weights
- Multi-head attention (like Transformer)

**When to use:**
- Neighbors have varying importance
- Need interpretability (attention weights)
- Have sufficient data (attention needs more data to learn)

**Implementation:**
```python
from torch_geometric.nn import GATConv

class GAT(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads=heads)
        self.conv2 = GATConv(hidden_channels * heads, out_channels, heads=1)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return x
```

### Architecture Comparison

| Feature | GCN | GraphSAGE | GAT |
|---------|-----|-----------|-----|
| Aggregation | Degree-weighted mean | Mean/max/LSTM | Attention-weighted |
| Neighbor weighting | Fixed (by degree) | Equal | Learned |
| Inductive | No | Yes | Yes |
| Scalability | Medium | High (sampling) | Medium |
| Interpretability | Low | Low | High (attention) |
| Complexity | Low | Medium | High |

### Decision Tree

```
Starting out / Small graph:
  → GCN (simplest baseline)

Large graph (millions of nodes):
  → GraphSAGE (sampling enables scalability)

Need inductive learning (new nodes):
  → GraphSAGE or GAT

Neighbors have different importance:
  → GAT (attention learns importance)

Need interpretability:
  → GAT (attention weights explain predictions)

Production deployment:
  → GraphSAGE (most robust and scalable)
```
## Part 4: When NOT to Use GNN

### Critical Question

**Does graph structure actually help?**

**Test:** Compare model with and without edges
```python
# Baseline: MLP on node features only
mlp_accuracy = train_mlp(node_features, labels)

# GNN: Use node features + graph structure
gnn_accuracy = train_gnn(node_features, edges, labels)

# Decision (accuracies as fractions, so 0.02 = 2 percentage points):
if gnn_accuracy - mlp_accuracy < 0.02:
    print("Graph structure doesn't help much")
    print("Use simpler model (MLP or XGBoost)")
else:
    print("Graph structure adds value")
    print("Use GNN")
```

### Scenarios Where GNN Doesn't Help

**1. Node features dominate**
```
User churn prediction:
- Node features: Usage hours, demographics, subscription → Highly predictive
- Graph edges: Sparse user interactions → Weak signal
- Result: MLP 85%, GNN 86% (not worth complexity!)
```

**2. Sparse graphs**
```
Graph with 1000 nodes, 100 edges (~0.02% density):
- Most nodes have 0-1 neighbors
- No information to aggregate
- GNN reduces to MLP
```

**3. Random graph structure**
```
If edges are random (no homophily):
- Neighbor labels uncorrelated
- Aggregation adds noise
- Simple model better
```

### When GNN DOES Help

✅ **Molecular property prediction**
- Structure is PRIMARY signal
- Atom types + bonds determine properties
- GNN: Huge improvement over fingerprints

✅ **Citation networks**
- Paper quality correlated with neighbors
- "You are what you cite"
- Clear homophily

✅ **Social recommendation**
- Friends have similar preferences
- Graph structure informative
- GNN: Moderate to large improvement

✅ **Knowledge graphs**
- Entities connected by relations
- Multi-hop reasoning valuable
- GNN captures complex patterns

### Decision Framework

```
1. Start simple:
   - Try MLP or XGBoost on node features
   - Establish baseline performance

2. Check graph structure value:
   - Does edge information correlate with target?
   - Is there homophily (similar nodes connected)?
   - Test: Remove edges, compare performance

3. Use GNN if:
   - Graph structure adds >2-5% accuracy
   - Structure is interpretable (not random)
   - Have enough nodes for GNN to learn

4. Stick with simple if:
   - Node features alone sufficient
   - Graph structure weak/random
   - Small dataset (< 1000 nodes)
```
## Part 5: Practical Implementation

### Using PyTorch Geometric

**Installation:**
```bash
pip install torch-geometric
```

**Basic workflow:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# 1. Create graph data
x = torch.tensor([[feature1], [feature2], ...])    # Node features
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])  # Edges (COO format)
y = torch.tensor([label1, label2, ...])            # Node labels

data = Data(x=x, edge_index=edge_index, y=y)

# 2. Define model (in_features, num_classes assumed defined for your data)
class GNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(in_features, 64)
        self.conv2 = GCNConv(64, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# 3. Train (train_mask marks which nodes carry training labels)
model = GNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[train_mask], data.y[train_mask])
    loss.backward()
    optimizer.step()
```

### Edge Index Format

**COO (Coordinate) format:**
```python
# Edge list: (0→1), (1→2), (2→0)
edge_index = torch.tensor([
    [0, 1, 2],  # Source nodes
    [1, 2, 0]   # Target nodes
])

# For undirected graph, include both directions:
edge_index = torch.tensor([
    [0, 1, 1, 2, 2, 0],  # Source
    [1, 0, 2, 1, 0, 2]   # Target
])
```
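Rather than writing reverse edges by hand, PyTorch Geometric ships a helper for this; a small sketch:

```python
import torch
from torch_geometric.utils import to_undirected

edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
edge_index = to_undirected(edge_index)  # adds reverse edges, deduplicated
```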
### Mini-batching Graphs

**Problem:** Graphs have different sizes

**Solution:** Batch graphs as one large disconnected graph
```python
from torch_geometric.data import DataLoader

# Create dataset
dataset = [Data(...), Data(...), ...]

# DataLoader handles batching
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # batch contains multiple graphs as one large graph
    # batch.batch: Indicator which nodes belong to which graph
    out = model(batch.x, batch.edge_index)
```
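For graph-level prediction, the `batch.batch` vector feeds a pooling readout; a minimal sketch with PyTorch Geometric's `global_mean_pool`, assuming `out` holds per-node embeddings from the loop above:

```python
from torch_geometric.nn import global_mean_pool

# Collapse node embeddings into one vector per graph:
# batch.batch maps each node to the index of the graph it belongs to.
graph_embeddings = global_mean_pool(out, batch.batch)  # (num_graphs, hidden_dim)
```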
## Part 6: Common Mistakes

### Mistake 1: LSTM Aggregation

**Symptom:** Different outputs for same graph with reordered nodes
**Fix:** Use mean/sum/max aggregation (permutation invariant)

### Mistake 2: Forgetting Edge Direction

**Symptom:** Information flows wrong way
**Fix:** For undirected graphs, add edges in both directions

### Mistake 3: Too Many Layers

**Symptom:** Performance degrades, over-smoothing
**Fix:** Use 2-3 layers (most graphs have small diameter)
**Explanation:** Too many layers → all nodes converge to same representation

### Mistake 4: Not Testing Simple Baseline

**Symptom:** Complex GNN with minimal improvement
**Fix:** Always test MLP on node features first

### Mistake 5: Using GNN on Euclidean Data

**Symptom:** CNN/RNN would work better
**Fix:** Use GNN only for irregular graph structure (not grids/sequences)

## Part 7: Summary

### Quick Reference

**When to use GNN:**
- Graph-structured data (molecules, social networks, citations)
- Irregular neighborhoods (not grid/sequence)
- Graph structure informative (test this!)

**Architecture selection:**
```
Start:                GCN (simplest)
Large graph:          GraphSAGE (scalable)
Inductive learning:   GraphSAGE or GAT
Neighbor importance:  GAT (attention)
```

**Key principles:**
- Message passing: Aggregate neighbors + Update node
- Permutation invariance: Use mean/sum/max (not LSTM)
- Test baseline: MLP first, GNN if structure helps
- Layers: 2-3 sufficient (more = over-smoothing)

**Implementation:**
- PyTorch Geometric: Standard library
- COO format: Edge index as 2×E tensor
- Batching: Merge graphs into one large graph

## Next Steps

After mastering this skill:
- `transformer-architecture-deepdive`: Understand attention (used in GAT)
- `architecture-design-principles`: Design principles for graph architectures
- Advanced GNNs: Graph Transformers, Equivariant GNNs

**Remember:** Not all graph data needs GNN. Test if graph structure actually helps! (Compare with MLP baseline)
915 skills/using-neural-architectures/normalization-techniques.md Normal file
@@ -0,0 +1,915 @@
# Normalization Techniques

## Context

You're designing a neural network or debugging training instability. Someone suggests "add BatchNorm" without considering:
- **Batch size dependency**: BatchNorm fails with small batches (< 8)
- **Architecture mismatch**: BatchNorm breaks RNNs/Transformers (use LayerNorm)
- **Task-specific needs**: Style transfer needs InstanceNorm, not BatchNorm
- **Modern alternatives**: RMSNorm simpler and faster than LayerNorm for LLMs

**This skill prevents normalization cargo-culting and provides architecture-specific selection.**

## Why Normalization Matters

**Problem: Internal Covariate Shift**
During training, layer input distributions shift as previous layers update. This causes:
- Vanishing/exploding gradients (deep networks)
- Slow convergence (small learning rates required)
- Training instability (loss spikes)

**Solution: Normalization**
Normalize activations to have stable statistics (mean=0, std=1). Benefits:
- **10x faster convergence**: Can use larger learning rates
- **Better generalization**: Regularization effect
- **Enables deep networks**: 50+ layers without gradient issues
- **Less sensitive to initialization**: Weights can start further from optimal

**Key insight**: Normalization is NOT optional for modern deep learning. The question is WHICH normalization, not WHETHER to normalize.

## Normalization Families

### 1. Batch Normalization (BatchNorm)

**What it does:**
Normalizes across the batch dimension for each channel/feature.

**Formula:**
```
Given input x with shape (B, C, H, W):  # Batch, Channel, Height, Width

For each channel c:
    μ_c = mean(x[:, c, :, :])  # Mean over batch + spatial dims
    σ_c = std(x[:, c, :, :])   # Std over batch + spatial dims

    x_norm[:, c, :, :] = (x[:, c, :, :] - μ_c) / √(σ_c² + ε)

    # Learnable scale and shift
    y[:, c, :, :] = γ_c * x_norm[:, c, :, :] + β_c
```
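
As a quick sanity check of the formula, the per-channel statistics can be computed by hand and compared against the module output (a minimal sketch; the affine γ/β transform is disabled so only the raw normalization is compared):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 28, 28)
bn = nn.BatchNorm2d(64, affine=False)
bn.train()
y = bn(x)

# Manual normalization: one mean/std per channel over batch + spatial dims
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```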

**When to use:**
- ✅ CNNs for classification (ResNet, EfficientNet)
- ✅ Large batch sizes (≥ 16)
- ✅ IID data (image classification, object detection)

**When NOT to use:**
- ❌ Small batch sizes (< 8): Noisy statistics cause training failure
- ❌ RNNs/LSTMs: Breaks temporal dependencies
- ❌ Transformers: Batch dependency problematic for variable-length sequences
- ❌ Style transfer: Batch statistics erase style information

**Batch size dependency:**
```python
batch_size = 32   # ✓ Works well (stable statistics)
batch_size = 16   # ✓ Acceptable
batch_size = 8    # ✓ Marginal (consider GroupNorm)
batch_size = 4    # ✗ Unstable (use GroupNorm)
batch_size = 2    # ✗ FAILS! (noisy statistics)
batch_size = 1    # ✗ Undefined (no batch to normalize over!)
```
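
To see the failure mode concretely: PyTorch refuses to compute batch statistics when each channel has only a single value (a minimal sketch; with larger spatial dimensions batch_size=1 technically runs, but the statistics remain unreliable):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
bn.train()

try:
    bn(torch.randn(1, 64, 1, 1))  # Only one value per channel
except ValueError as err:
    print(err)  # PyTorch raises: needs >1 value per channel in training mode
```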

**PyTorch example:**
```python
import torch
import torch.nn as nn

# For Conv2d
bn = nn.BatchNorm2d(num_features=64)  # 64 channels
x = torch.randn(32, 64, 28, 28)       # Batch=32, Channels=64
y = bn(x)

# For Linear
bn = nn.BatchNorm1d(num_features=128)  # 128 features
x = torch.randn(32, 128)               # Batch=32, Features=128
y = bn(x)
```

**Inference mode:**
```python
# Training: Uses batch statistics
model.train()
y = bn(x)  # Normalizes using current batch mean/std

# Inference: Uses running statistics (accumulated during training)
model.eval()
y = bn(x)  # Normalizes using running_mean/running_var
```

### 2. Layer Normalization (LayerNorm)

**What it does:**
Normalizes across the feature dimension for each sample independently.

**Formula:**
```
Given input x with shape (B, C):  # Batch, Features

For each sample b:
    μ_b = mean(x[b, :])  # Mean over features
    σ_b = std(x[b, :])   # Std over features

    x_norm[b, :] = (x[b, :] - μ_b) / √(σ_b² + ε)

    # Learnable scale and shift
    y[b, :] = γ * x_norm[b, :] + β
```

**When to use:**
- ✅ Transformers (BERT, GPT, T5)
- ✅ RNNs/LSTMs (maintains temporal independence)
- ✅ Small batch sizes (batch-independent!)
- ✅ Variable-length sequences
- ✅ Reinforcement learning (batch_size=1 common)

**Advantages over BatchNorm:**
- ✅ **Batch-independent**: Works with batch_size=1
- ✅ **No running statistics**: Inference = training (no mode switching)
- ✅ **Sequence-friendly**: Doesn't mix information across timesteps

**PyTorch example:**
```python
import torch
import torch.nn as nn

# For Transformer
ln = nn.LayerNorm(normalized_shape=512)  # d_model=512
x = torch.randn(32, 128, 512)  # Batch=32, SeqLen=128, d_model=512
y = ln(x)  # Normalizes last dimension independently per (batch, position)

# For RNN hidden states
ln = nn.LayerNorm(normalized_shape=256)  # hidden_size=256
h = torch.randn(32, 256)  # Batch=32, Hidden=256
h_norm = ln(h)
```

**Key difference from BatchNorm:**
```python
# BatchNorm: Normalizes across batch dimension
# Given (B=32, C=64, H=28, W=28)
# Computes 64 means/stds (one per channel, across batch + spatial)

# LayerNorm: Normalizes across feature dimension
# Given (B=32, L=128, D=512)
# Computes 32×128 means/stds (one per (batch, position), across features)
```
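
The per-position statistics can also be verified by hand (a minimal sketch with the affine transform disabled, mirroring the BatchNorm check above):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 128, 512)
ln = nn.LayerNorm(512, elementwise_affine=False)

mu = x.mean(dim=-1, keepdim=True)                  # One mean per (batch, position)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # One variance per (batch, position)
y_manual = (x - mu) / torch.sqrt(var + ln.eps)

print(torch.allclose(ln(x), y_manual, atol=1e-5))  # True
```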

### 3. Group Normalization (GroupNorm)

**What it does:**
Normalizes channels in groups, batch-independent.

**Formula:**
```
Given input x with shape (B, C, H, W):
Divide C channels into G groups (C must be divisible by G)

For each sample b and group g:
    channels = x[b, g*(C/G):(g+1)*(C/G), :, :]  # Channels in group g
    μ_{b,g} = mean(channels)  # Mean over channels in group + spatial
    σ_{b,g} = std(channels)   # Std over channels in group + spatial

    x_norm[b, g*(C/G):(g+1)*(C/G), :, :] = (channels - μ_{b,g}) / √(σ_{b,g}² + ε)
```

**When to use:**
- ✅ Small batch sizes (< 8)
- ✅ CNNs with batch_size=1 (style transfer, RL)
- ✅ Object detection/segmentation (often use small batches)
- ✅ When BatchNorm unstable but want spatial normalization

**Group size selection:**
```python
# num_groups trade-off:
num_groups = 1    # = LayerNorm (all channels together)
num_groups = C    # = InstanceNorm (each channel separate)
num_groups = 32   # Standard choice (good balance)

# Rule: C must be divisible by num_groups
# channels = 64, num_groups = 32:  ✓ 64/32 = 2 channels per group
# channels = 64, num_groups = 16:  ✓ 64/16 = 4 channels per group
# channels = 64, num_groups = 30:  ✗ 64/30 not integer!
```

**PyTorch example:**
```python
import torch
import torch.nn as nn

# For small batch CNN
gn = nn.GroupNorm(num_groups=32, num_channels=64)
x = torch.randn(2, 64, 28, 28)  # Batch=2 (small!)
y = gn(x)  # Works well even with batch=2

# Compare behavior across batch sizes:
batch_sizes = [1, 2, 4, 8, 16, 32]

bn = nn.BatchNorm2d(64)
gn = nn.GroupNorm(32, 64)

for bs in batch_sizes:
    x = torch.randn(bs, 64, 28, 28)
    # BatchNorm statistics get noisier as the batch shrinks;
    # GroupNorm is consistent across all batch sizes
    y_bn, y_gn = bn(x), gn(x)
```

**Empirical results (Wu & He, 2018):**
```
ImageNet classification with ResNet-50:
batch_size = 32: BatchNorm = 76.5%, GroupNorm = 76.3% (tie)
batch_size = 8:  BatchNorm = 75.8%, GroupNorm = 76.1% (GroupNorm wins!)
batch_size = 2:  BatchNorm = 72.1%, GroupNorm = 75.3% (GroupNorm wins!)
```

### 4. Instance Normalization (InstanceNorm)

**What it does:**
Normalizes each sample and channel independently (no batch mixing).

**Formula:**
```
Given input x with shape (B, C, H, W):

For each sample b and channel c:
    μ_{b,c} = mean(x[b, c, :, :])  # Mean over spatial dimensions only
    σ_{b,c} = std(x[b, c, :, :])   # Std over spatial dimensions only

    x_norm[b, c, :, :] = (x[b, c, :, :] - μ_{b,c}) / √(σ_{b,c}² + ε)
```

**When to use:**
- ✅ Style transfer (neural style, CycleGAN, pix2pix)
- ✅ Image-to-image translation
- ✅ When batch/channel mixing destroys information

**Why for style transfer:**
```python
# Style transfer goal: Transfer style while preserving content
# BatchNorm: Mixes statistics across batch (erases individual style!)
# InstanceNorm: Per-image statistics (preserves each image's style)

# Example: Neural style transfer
content_image = load_image("photo.jpg")
style_image = load_image("starry_night.jpg")

# With BatchNorm: Output loses content image's unique characteristics
# With InstanceNorm: Content characteristics preserved, style applied
```

**PyTorch example:**
```python
import torch
import torch.nn as nn

# For style transfer generator
in_norm = nn.InstanceNorm2d(num_features=64)
x = torch.randn(1, 64, 256, 256)  # Single image
y = in_norm(x)  # Normalizes each channel independently

# CycleGAN generator architecture
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 7, padding=3)
        self.in1 = nn.InstanceNorm2d(64)  # NOT BatchNorm!
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv1(x)
        x = self.in1(x)  # Per-image normalization
        x = self.relu(x)
        return x
```

**Relation to GroupNorm:**
```python
# InstanceNorm is GroupNorm with num_groups = num_channels
InstanceNorm2d(C) == GroupNorm(num_groups=C, num_channels=C)
```
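
This equivalence can be checked numerically (a small sketch; affine parameters are off by default in `InstanceNorm2d` and disabled explicitly in `GroupNorm` for a clean comparison):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8, 16, 16)
inorm = nn.InstanceNorm2d(8)  # affine=False by default
gnorm = nn.GroupNorm(num_groups=8, num_channels=8, affine=False)

print(torch.allclose(inorm(x), gnorm(x), atol=1e-5))  # True
```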

### 5. RMS Normalization (RMSNorm)

**What it does:**
Simplified LayerNorm that only rescales (no recentering), faster and simpler.

**Formula:**
```
Given input x:

# LayerNorm (2 steps):
x_centered = x - mean(x)       # 1. Center
x_norm = x_centered / std(x)   # 2. Scale

# RMSNorm (1 step):
rms = sqrt(mean(x²))  # Root Mean Square
x_norm = x / rms      # Only scale, no centering
```

**When to use:**
- ✅ Modern LLMs (LLaMA, Mistral, Gemma)
- ✅ When speed matters (15-20% faster than LayerNorm)
- ✅ Large Transformer models (billions of parameters)

**Advantages:**
- ✅ **Simpler**: One operation instead of two
- ✅ **Faster**: ~15-20% speedup over LayerNorm
- ✅ **Numerically stable**: No subtraction (avoids catastrophic cancellation)
- ✅ **Same performance**: Empirically matches LayerNorm quality

**PyTorch implementation:**
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Compute RMS
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)

        # Normalize
        x_norm = x / rms

        # Scale (learnable)
        return self.weight * x_norm

# Usage in Transformer
rms = RMSNorm(dim=512)  # d_model=512
x = torch.randn(32, 128, 512)  # Batch, SeqLen, d_model
y = rms(x)
```

**Speed comparison (LLaMA-7B, A100 GPU):**
```
LayerNorm: 1000 tokens/sec
RMSNorm:   1180 tokens/sec  # 18% faster!

# For large models, this adds up:
# 1 million tokens: ~150 seconds saved (1000 s → ~847 s)
```

**Modern LLM adoption:**
```python
# LLaMA (Meta, 2023): RMSNorm
# Mistral (Mistral AI, 2023): RMSNorm
# Gemma (Google, 2024): RMSNorm
# PaLM (Google, 2022): RMSNorm

# Older models:
# GPT-2/3 (OpenAI): LayerNorm
# BERT (Google, 2018): LayerNorm
```

## Architecture-Specific Selection

### CNN (Convolutional Neural Networks)

**Default: BatchNorm**
```python
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)  # After conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)  # Normalize
        x = self.relu(x)
        return x
```

**Exception: Small batch sizes**
```python
# If batch_size < 8, use GroupNorm instead
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm = nn.GroupNorm(32, out_channels)  # GroupNorm for small batches
        self.relu = nn.ReLU(inplace=True)
```

**Exception: Style transfer**
```python
# Use InstanceNorm for style transfer
class StyleConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(out_channels)  # Per-image normalization
        self.relu = nn.ReLU(inplace=True)
```

### RNN / LSTM (Recurrent Neural Networks)

**Default: LayerNorm**
```python
import torch.nn as nn

class NormalizedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.ln = nn.LayerNorm(hidden_size)  # Normalize hidden states

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        output, (h_n, c_n) = self.lstm(x)
        # output: (batch, seq_len, hidden_size)

        # Normalize each timestep's output
        output_norm = self.ln(output)  # Applies independently per timestep
        return output_norm, (h_n, c_n)
```

**Why NOT BatchNorm:**
```python
# BatchNorm in RNN mixes information across timesteps!
# Given (batch=32, seq_len=100, hidden=256)

# BatchNorm would compute:
# mean/std over (batch × seq_len) = 3200 values
# This mixes t=0 with t=99 (destroys temporal structure!)

# LayerNorm computes:
# mean/std over hidden_size = 256 values per (batch, timestep)
# Each timestep normalized independently (preserves temporal structure)
```

**Layer-wise normalization in stacked RNN:**
```python
class StackedNormalizedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()

        for i in range(num_layers):
            in_size = input_size if i == 0 else hidden_size
            self.layers.append(nn.LSTM(in_size, hidden_size, batch_first=True))
            self.layers.append(nn.LayerNorm(hidden_size))  # After each LSTM layer

    def forward(self, x):
        for lstm, ln in zip(self.layers[::2], self.layers[1::2]):
            x, _ = lstm(x)
            x = ln(x)  # Normalize between layers
        return x
```

### Transformer

**Default: LayerNorm (or RMSNorm for modern/large models)**

**Two placement options: Pre-norm vs Post-norm**

**Post-norm (original Transformer, "Attention is All You Need"):**
```python
class TransformerLayerPostNorm(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Post-norm: Apply normalization AFTER residual
        x = self.ln1(x + self.attn(x, x, x)[0])  # Normalize after adding
        x = self.ln2(x + self.ffn(x))            # Normalize after adding
        return x
```

**Pre-norm (modern, more stable):**
```python
class TransformerLayerPreNorm(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm: Apply normalization BEFORE each sublayer
        h = self.ln1(x)                 # Normalize once before attention
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.ln2(x))   # Normalize before FFN
        return x
```

**Pre-norm vs Post-norm comparison:**
```python
# Post-norm (original):
# - Less stable (requires careful initialization + warmup)
# - Slightly better performance IF training succeeds
# - Hard to train deep models (> 12 layers)

# Pre-norm (modern):
# - More stable (easier to train deep models)
# - Standard for large models (GPT-3: 96 layers!)
# - Recommended default

# Empirical: BERT (post-norm, up to 24 layers with careful warmup)
#            GPT-2, GPT-3, T5, LLaMA (pre-norm, scales to deep stacks)
```

**Using RMSNorm instead:**
```python
class TransformerLayerRMSNorm(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.rms1 = RMSNorm(d_model)  # 15-20% faster than LayerNorm
        self.rms2 = RMSNorm(d_model)

    def forward(self, x):
        # Pre-norm with RMSNorm (LLaMA style)
        h = self.rms1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.rms2(x))
        return x
```

### GAN (Generative Adversarial Network)

**Generator: InstanceNorm or no normalization**
```python
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Use InstanceNorm for image-to-image translation
        self.conv1 = nn.Conv2d(3, 64, 7, padding=3)
        self.in1 = nn.InstanceNorm2d(64)  # Per-image normalization

    def forward(self, x):
        x = self.conv1(x)
        x = self.in1(x)  # Preserves per-image characteristics
        return x
```

**Discriminator: No normalization or LayerNorm**
```python
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Often no normalization (BatchNorm can hurt GAN training)
        self.conv1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)
        # No normalization here

    def forward(self, x):
        x = self.conv1(x)
        # Directly to activation (no norm)
        return x
```

**Why avoid BatchNorm in GANs:**
```python
# BatchNorm in discriminator:
# - Mixes real and fake samples in batch
# - Leaks information (discriminator can detect batch composition)
# - Hurts training stability

# Recommendation:
# Generator: InstanceNorm (for image translation) or no norm
# Discriminator: No normalization or LayerNorm
```

## Decision Framework

### Step 1: Check batch size

```python
if batch_size >= 8:
    consider_batchnorm = True
else:
    use_groupnorm_or_layernorm = True  # BatchNorm will be unstable
```

### Step 2: Check architecture

```python
if architecture == "CNN":
    if batch_size >= 8:
        use_batchnorm()
    else:
        use_groupnorm(num_groups=32)

    # Exception: Style transfer
    if task == "style_transfer":
        use_instancenorm()

elif architecture in ["RNN", "LSTM", "GRU"]:
    use_layernorm()  # NEVER BatchNorm!

elif architecture == "Transformer":
    if model_size == "large":  # > 1B parameters
        use_rmsnorm()  # 15-20% faster
    else:
        use_layernorm()

    # Placement: Pre-norm (more stable)
    use_prenorm_placement()

elif architecture == "GAN":
    if component == "generator":
        if task == "image_translation":
            use_instancenorm()
        else:
            use_no_norm()  # Or InstanceNorm
    elif component == "discriminator":
        use_no_norm()  # Or LayerNorm
```
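
The same logic condensed into a small runnable helper (illustrative only; the thresholds and return strings simply follow the tree above):

```python
def pick_norm(architecture: str, batch_size: int = 32,
              task: str = "", params: int = 0) -> str:
    """Suggest a normalization layer per the decision tree above."""
    if architecture == "CNN":
        if task == "style_transfer":
            return "InstanceNorm2d"
        return "BatchNorm2d" if batch_size >= 8 else "GroupNorm(32, C)"
    if architecture in ("RNN", "LSTM", "GRU"):
        return "LayerNorm"  # Never BatchNorm
    if architecture == "Transformer":
        return "RMSNorm + pre-norm" if params >= 1_000_000_000 else "LayerNorm + pre-norm"
    if architecture == "GAN":
        return "InstanceNorm2d (generator) / no norm (discriminator)"
    return "LayerNorm"  # Safe batch-independent default

print(pick_norm("CNN", batch_size=4))                  # GroupNorm(32, C)
print(pick_norm("Transformer", params=7_000_000_000))  # RMSNorm + pre-norm
```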

### Step 3: Verify placement

```python
# CNNs: After convolution, before activation
x = conv(x)
x = norm(x)  # Here!
x = relu(x)

# RNNs: After LSTM, normalize hidden states
output, (h, c) = lstm(x)
output = norm(output)  # Here!

# Transformers: Pre-norm (modern) or post-norm (original)
# Pre-norm (recommended):
x = x + sublayer(norm(x))  # Normalize before sublayer

# Post-norm (original):
x = norm(x + sublayer(x))  # Normalize after residual
```

## Implementation Checklist

### Before adding normalization:

1. ☐ **Check batch size**: If < 8, avoid BatchNorm
2. ☐ **Check architecture**: CNN→BatchNorm, RNN→LayerNorm, Transformer→LayerNorm/RMSNorm
3. ☐ **Check task**: Style transfer→InstanceNorm
4. ☐ **Verify placement**: After conv/linear, before activation (CNNs)
5. ☐ **Test training stability**: Loss should decrease smoothly

### During training:

6. ☐ **Monitor running statistics** (BatchNorm): Check running_mean/running_var are updating (see the sketch after this list)
7. ☐ **Test inference mode**: Verify model.eval() uses running stats correctly
8. ☐ **Check gradient flow**: Normalization should help, not hurt gradients
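
A small sketch for item 6, confirming that BatchNorm's running statistics actually move during training (`running_mean` and `running_var` are standard PyTorch buffers):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
before = bn.running_mean.clone()

bn.train()
for _ in range(10):
    bn(torch.randn(32, 64, 28, 28))  # Each forward pass updates running stats

print(torch.allclose(before, bn.running_mean))  # False: statistics have updated
```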

### If training is unstable:

9. ☐ **Try different normalization**: BatchNorm→GroupNorm, LayerNorm→RMSNorm
10. ☐ **Try pre-norm** (Transformers): More stable than post-norm
11. ☐ **Reduce learning rate**: Normalization allows larger LR, but start conservatively

## Common Mistakes

### Mistake 1: BatchNorm with small batches

```python
# WRONG: BatchNorm with batch_size=2
model = ResNet50(norm_layer=nn.BatchNorm2d)
dataloader = DataLoader(dataset, batch_size=2)  # Too small!

# RIGHT: GroupNorm for small batches
model = ResNet50(norm_layer=lambda channels: nn.GroupNorm(32, channels))
dataloader = DataLoader(dataset, batch_size=2)  # Works!
```

### Mistake 2: BatchNorm in RNN

```python
# WRONG: BatchNorm in LSTM
class BadLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(100, 256, batch_first=True)
        self.bn = nn.BatchNorm1d(256)  # WRONG! Mixes timesteps

    def forward(self, x):
        output, _ = self.lstm(x)
        output = output.permute(0, 2, 1)  # (B, H, T)
        output = self.bn(output)  # Mixes timesteps!
        return output

# RIGHT: LayerNorm in LSTM
class GoodLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(100, 256, batch_first=True)
        self.ln = nn.LayerNorm(256)  # Per-timestep normalization

    def forward(self, x):
        output, _ = self.lstm(x)
        output = self.ln(output)  # Independent per timestep
        return output
```

### Mistake 3: Forgetting model.eval()

```python
# WRONG: Using training mode during inference
model.train()  # BatchNorm uses batch statistics
predictions = model(test_data)  # Batch statistics from test data (leakage!)

# RIGHT: Use eval mode during inference
model.eval()  # BatchNorm uses running statistics
with torch.no_grad():
    predictions = model(test_data)  # Uses accumulated running stats
```

### Mistake 4: Post-norm for deep Transformers

```python
# WRONG: Post-norm for 24-layer Transformer (unstable!)
class DeepTransformerPostNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerLayerPostNorm(512, 8) for _ in range(24)
        ])  # Hard to train!

# RIGHT: Pre-norm for deep Transformers
class DeepTransformerPreNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerLayerPreNorm(512, 8) for _ in range(24)
        ])  # Stable training!
```

### Mistake 5: Wrong normalization for style transfer

```python
# WRONG: BatchNorm for style transfer
class StyleGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 7, padding=3)
        self.norm = nn.BatchNorm2d(64)  # WRONG! Mixes styles across batch

# RIGHT: InstanceNorm for style transfer
class StyleGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 7, padding=3)
        self.norm = nn.InstanceNorm2d(64)  # Per-image normalization
```

## Performance Impact

### Training speed:

```python
# Without normalization: 100 epochs to converge
# With normalization: 10 epochs to converge (10x faster!)

# Reason: Larger learning rates possible
lr_no_norm = 0.001   # Must be small (unstable otherwise)
lr_with_norm = 0.01  # Can be 10x larger (normalization stabilizes)
```

### Inference speed:

```python
# Normalization overhead (relative to no normalization):
# BatchNorm:    +2% (minimal, cached running stats)
# LayerNorm:    +3-5% (compute mean/std per forward pass)
# RMSNorm:      +2-3% (faster than LayerNorm)
# GroupNorm:    +5-8% (more computation than BatchNorm)
# InstanceNorm: +3-5% (similar to LayerNorm)

# For most models: Overhead is negligible compared to conv/linear layers
```

### Memory usage:

```python
# Normalization memory (per layer):
# BatchNorm: 2 × num_channels (running_mean, running_var buffers) + 2 × num_channels (γ, β)
# LayerNorm: 2 × normalized_shape (γ, β)
# RMSNorm:   1 × normalized_shape (γ only, no β)

# Example: 512 channels
# BatchNorm: 4 × 512 = 2048 stored values (half learnable, half buffers)
# LayerNorm: 2 × 512 = 1024 parameters
# RMSNorm:   1 × 512 = 512 parameters  # Most efficient!
```
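
These counts can be confirmed directly; note that PyTorch stores γ/β as learnable parameters and the running statistics as buffers (a quick sketch):

```python
import torch.nn as nn

bn = nn.BatchNorm1d(512)
ln = nn.LayerNorm(512)

print(sum(p.numel() for p in bn.parameters()))  # 1024 learnable (γ, β)
print(sum(b.numel() for b in bn.buffers()))     # 1025 buffer values (running stats + counter)
print(sum(p.numel() for p in ln.parameters()))  # 1024 learnable (γ, β)
```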

## When NOT to Normalize

**Case 1: Final output layer**
```python
# Don't normalize final predictions
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = ResNet50()  # Normalization inside
        self.fc = nn.Linear(2048, 1000)
        # NO normalization here! (final logits should be unnormalized)

    def forward(self, x):
        x = self.backbone(x)
        x = self.fc(x)  # Raw logits
        return x  # Don't normalize!
```

**Case 2: Very small networks**
```python
# Single-layer network: Normalization overkill
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)  # MNIST classifier
        # No normalization needed (network too simple)

    def forward(self, x):
        return self.fc(x)
```

**Case 3: When debugging**
```python
# Remove normalization to isolate issues
# If training fails with normalization, try without to check if:
# - Initialization is correct
# - Loss function is correct
# - Data is correctly preprocessed
```

## Modern Recommendations (2025)

### CNNs:
- **Default**: BatchNorm (if batch_size ≥ 8)
- **Small batches**: GroupNorm (num_groups=32)
- **Style transfer**: InstanceNorm

### RNNs/LSTMs:
- **Default**: LayerNorm
- **Never**: BatchNorm (breaks temporal structure)

### Transformers:
- **Small models** (< 1B): LayerNorm + pre-norm
- **Large models** (≥ 1B): RMSNorm + pre-norm (15-20% faster)
- **Avoid**: Post-norm for deep models (> 12 layers)

### GANs:
- **Generator**: InstanceNorm (image translation) or no norm
- **Discriminator**: No normalization or LayerNorm
- **Avoid**: BatchNorm (leaks information)

### Emerging:
- **RMSNorm adoption increasing**: LLaMA, Mistral, Gemma all use RMSNorm
- **Pre-norm becoming standard**: More stable for deep networks
- **GroupNorm gaining traction**: Object detection, small-batch training

## Summary

**Normalization is mandatory for modern deep learning.** The question is which normalization, not whether to normalize.

**Quick decision tree:**
1. **Batch size ≥ 8?** → Consider BatchNorm (CNNs)
2. **Batch size < 8?** → Use GroupNorm (CNNs) or LayerNorm (all)
3. **RNN/LSTM?** → LayerNorm (never BatchNorm!)
4. **Transformer?** → LayerNorm or RMSNorm with pre-norm
5. **Style transfer?** → InstanceNorm
6. **GAN?** → InstanceNorm (generator) or no norm (discriminator)

**Modern defaults:**
- CNNs: BatchNorm (batch ≥ 8) or GroupNorm (batch < 8)
- RNNs: LayerNorm
- Transformers: RMSNorm + pre-norm (large models) or LayerNorm + pre-norm (small models)
- GANs: InstanceNorm (generator), no norm (discriminator)

**Key insight**: Match normalization to architecture and batch size. Don't cargo-cult "add BatchNorm everywhere": it fails for small batches, RNNs, Transformers, and style transfer.
714 skills/using-neural-architectures/sequence-models-comparison.md Normal file
@@ -0,0 +1,714 @@
# Sequence Models Comparison: Choosing the Right Architecture for Sequential Data

<CRITICAL_CONTEXT>
Sequence modeling has evolved rapidly:
- 2014-2017: LSTM/GRU dominated
- 2017+: Transformers revolutionized the field
- 2018+: TCN emerged as efficient alternative
- 2021+: Sparse Transformers for very long sequences
- 2022+: State Space Models (S4) for extreme lengths

Don't default to LSTM (outdated) or blindly use Transformers (not always appropriate).
Match architecture to your sequence characteristics.
</CRITICAL_CONTEXT>

## When to Use This Skill

Use this skill when:
- ✅ Selecting model for sequential/temporal data
- ✅ Comparing RNN vs LSTM vs Transformer
- ✅ Deciding on sequence architecture for time series, text, audio
- ✅ Understanding modern alternatives to LSTM
- ✅ Optimizing for sequence length, speed, or accuracy

DO NOT use for:
- ❌ Vision tasks (use cnn-families-and-selection)
- ❌ Graph-structured data (use graph-neural-networks-basics)
- ❌ LLM-specific questions (use llm-specialist pack)

**When in doubt:** If data is sequential/temporal → this skill.

## Selection Framework

### Step 1: Identify Key Characteristics

**Before recommending, ask:**

| Characteristic | Question | Impact |
|----------------|----------|--------|
| **Sequence Length** | Typical length? | Short (< 100) → LSTM/CNN, Medium (100-1k) → Transformer, Long (> 1k) → Sparse Transformer/S4 |
| **Data Type** | Language, time series, audio? | Language → Transformer, Time series → TCN/Transformer, Audio → Specialized |
| **Data Volume** | Training examples? | Small (< 10k) → LSTM/TCN, Large (> 100k) → Transformer |
| **Latency** | Real-time needed? | Yes → TCN/LSTM, No → Transformer |
| **Deployment** | Cloud/edge/mobile? | Edge → TCN/LSTM, Cloud → Any |

### Step 2: Apply Decision Tree

```
START: What's your primary constraint?

┌─ SEQUENCE LENGTH
│   ├─ Short (< 100 steps)
│   │   ├─ Language → BiLSTM or small Transformer
│   │   └─ Time series → TCN or LSTM
│   │
│   ├─ Medium (100-1000 steps)
│   │   ├─ Language → Transformer (BERT-style)
│   │   └─ Time series → Transformer or TCN
│   │
│   ├─ Long (1000-10000 steps)
│   │   ├─ Sparse Transformer (Longformer, BigBird)
│   │   └─ Hierarchical models
│   │
│   └─ Very Long (> 10000 steps)
│       └─ State Space Models (S4)
│
├─ DATA TYPE
│   ├─ Natural Language
│   │   ├─ < 50k data → BiLSTM or DistilBERT
│   │   └─ > 50k data → Transformer (BERT, RoBERTa)
│   │
│   ├─ Time Series
│   │   ├─ Fast training → TCN
│   │   ├─ Long sequences → Transformer
│   │   └─ Multivariate → Transformer with cross-series attention
│   │
│   └─ Audio
│       ├─ Waveform → WaveNet (TCN-based)
│       └─ Spectrograms → CNN + Transformer
│
└─ COMPUTATIONAL CONSTRAINT
    ├─ Edge device → TCN or small LSTM
    ├─ Real-time latency → TCN (parallel inference)
    └─ Cloud, no constraint → Transformer
```

## Architecture Catalog

### 1. RNN (Recurrent Neural Networks) - Legacy Foundation

**Architecture:** Basic recurrent cell with hidden state

**Status:** **OUTDATED** - don't use for new projects

**Why it existed:**
- First neural approach to sequences
- Hidden state captures temporal information
- Theoretically can model any sequence

**Why it failed:**
- Vanishing gradient (can't learn long dependencies)
- Very slow training (sequential processing)
- Replaced by LSTM in 2014

**When to mention:**
- Historical context only
- Teaching purposes
- Never recommend for production

**Key Insight:** Proved neural nets could handle sequences, but impractical due to vanishing gradients

### 2. LSTM (Long Short-Term Memory) - Legacy Standard

**Architecture:** Gated recurrent cell (forget, input, output gates)

**Complexity:** O(n) memory, sequential processing

**Strengths:**
- Solves vanishing gradient (gates maintain long-term info)
- Works well for short-medium sequences (< 500 steps)
- Small datasets (< 10k examples)
- Low memory footprint

**Weaknesses:**
- Sequential processing (slow training, can't parallelize)
- Still struggles with very long sequences (> 1000 steps)
- Slow inference (especially bidirectional)
- Superseded by Transformers for most language tasks

**When to Use:**
- ✅ Small datasets (< 10k examples)
- ✅ Short sequences (< 100 steps)
- ✅ Edge deployment (low memory)
- ✅ Baseline comparison

**When NOT to Use:**
- ❌ Large datasets (Transformer better)
- ❌ Long sequences (> 500 steps)
- ❌ Modern NLP (Transformer standard)
- ❌ Fast training needed (TCN better)

**Code Example:**
```python
import torch.nn as nn

class SeqLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=2,
                            batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, features)
        lstm_out, _ = self.lstm(x)
        # Use last timestep
        out = self.fc(lstm_out[:, -1, :])
        return out
```

**Status:** Legacy but still useful for specific cases (small data, edge deployment)

### 3. GRU (Gated Recurrent Unit) - Simplified LSTM

**Architecture:** Simplified gating (2 gates instead of 3)

**Advantages over LSTM:**
- Fewer parameters (faster training)
- Similar performance in many tasks
- Lower memory

**Disadvantages:**
- Still sequential (same as LSTM)
- No major advantage over LSTM in practice
- Also superseded by Transformers

**When to Use:**
- Same as LSTM, but prefer LSTM for slightly better performance
- Use GRU if the computational savings matter

**Status:** Rarely recommended - if using recurrent, prefer LSTM or move to Transformer/TCN

### 4. Transformer - Modern Standard

**Architecture:** Self-attention mechanism, parallel processing

**Complexity:**
- Memory: O(n²) for sequence length n
- Compute: O(n²d) where d is embedding dimension

**Strengths:**
- ✅ Parallel processing (fast training)
- ✅ Captures long-range dependencies (better than LSTM)
- ✅ State-of-the-art for language (BERT, GPT)
- ✅ Pre-trained models available
- ✅ Scales with data (more data = better performance)

**Weaknesses:**
- ❌ Quadratic memory (struggles with sequences > 1000)
- ❌ Needs more data than LSTM (> 10k examples)
- ❌ Slower inference than TCN
- ❌ Harder to interpret than RNN

**When to Use:**
- ✅ **Natural language** (current standard)
- ✅ Medium sequences (100-1000 tokens)
- ✅ Large datasets (> 50k examples)
- ✅ Pre-training available (BERT, GPT)
- ✅ Accuracy priority

**When NOT to Use:**
- ❌ Short sequences (< 50 tokens) - LSTM/CNN competitive, simpler
- ❌ Very long sequences (> 2000) - quadratic memory explodes
- ❌ Small datasets (< 10k) - will overfit
- ❌ Edge deployment - large model size

**Memory Analysis:**
```python
# Standard Transformer attention

# For sequence length n=1000, batch_size=32, embedding_dim=512:
attention_weights = softmax(Q @ K.T / sqrt(d))  # Shape: (32, 1000, 1000)
# Memory: 32 * 1000 * 1000 * 4 bytes = 128 MB just for attention!

# For n=5000:
# Memory: 32 * 5000 * 5000 * 4 bytes = 3.2 GB per batch!
# → Impossible on most GPUs
```
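
The same arithmetic as a tiny helper, handy for budgeting before committing to an architecture (a sketch; it counts one attention-weight tensor only, ignoring heads, layers, and activations):

```python
def attention_memory_mb(batch: int, seq_len: int, bytes_per_el: int = 4) -> float:
    """Memory for one (batch, n, n) attention-weight tensor, in MB."""
    return batch * seq_len ** 2 * bytes_per_el / 1e6

print(attention_memory_mb(32, 1000))  # 128.0 MB
print(attention_memory_mb(32, 5000))  # 3200.0 MB
```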

**Code Example:**
```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Pre-trained BERT for text classification
class TransformerClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Use [CLS] token representation
        pooled = outputs.pooler_output
        return self.classifier(pooled)

# Fine-tuning
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TransformerClassifier(num_classes=2)
```

**Status:** **Current standard for NLP**, competitive for time series with large data

### 5. TCN (Temporal Convolutional Network) - Underrated Alternative

**Architecture:** 1D convolutions with dilated causal convolutions

**Complexity:** O(n) memory, fully parallel processing

**Strengths:**
- ✅ **Parallel training** (much faster than LSTM)
- ✅ **Parallel inference** (faster than LSTM/Transformer)
- ✅ Linear memory (no quadratic blow-up)
- ✅ Large receptive field (dilation)
- ✅ Works well for time series
- ✅ Simple architecture

**Weaknesses:**
- ❌ Less popular (fewer pre-trained models)
- ❌ Not standard for language (Transformer dominates)
- ❌ Fixed receptive field (vs adaptive attention)

**When to Use:**
- ✅ **Time series forecasting** (often BETTER than LSTM)
- ✅ **Fast training needed** (2-3x faster than LSTM)
- ✅ **Fast inference** (real-time applications)
- ✅ Long sequences (linear memory)
- ✅ Audio processing (WaveNet is TCN-based)

**When NOT to Use:**
- ❌ Natural language with pre-training available (use Transformer)
- ❌ Need very large receptive field (Transformer better)

**Performance Comparison:**
```
Time series forecasting (1000-step sequences):

Training speed:
- LSTM: 100% (baseline, sequential)
- TCN: 35% (2.8x faster, parallel)
- Transformer: 45% (2.2x faster)

Inference speed:
- LSTM: 100% (sequential)
- TCN: 20% (5x faster, parallel)
- Transformer: 60% (1.7x faster)

Accuracy (similar across all three):
- LSTM: Baseline
- TCN: Equal or slightly better
- Transformer: Equal or slightly better (needs more data)

Conclusion: TCN wins on speed, matches accuracy
```

**Code Example:**
```python
import torch.nn as nn

class Chomp1d(nn.Module):
    """Trim trailing padding so the convolution stays causal."""
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size].contiguous()

class TCN(nn.Module):
    def __init__(self, input_channels, num_channels, kernel_size=3):
        super().__init__()

        layers = []
        num_levels = len(num_channels)

        for i in range(num_levels):
            dilation_size = 2 ** i
            in_channels = input_channels if i == 0 else num_channels[i-1]
            out_channels = num_channels[i]
            padding = (kernel_size - 1) * dilation_size

            # Causal dilated convolution: pad, then chomp the right side
            # so outputs never see future timesteps
            layers.append(
                nn.Conv1d(in_channels, out_channels, kernel_size,
                          padding=padding,
                          dilation=dilation_size)
            )
            layers.append(Chomp1d(padding))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Usage for time series
# Input: (batch, channels, sequence_length)
model = TCN(input_channels=1, num_channels=[64, 128, 256])
```

**Key Insight:** Dilated convolutions create a receptive field that grows exponentially with depth (≈2^L for L levels) while maintaining linear memory
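
For instance, the receptive field of the stack above can be computed directly (a quick check under the doubling-dilation scheme; each level adds (kernel_size - 1) · dilation timesteps):

```python
def tcn_receptive_field(num_levels: int, kernel_size: int = 3) -> int:
    """Receptive field of stacked causal convs with dilation 2**i."""
    rf = 1
    for i in range(num_levels):
        rf += (kernel_size - 1) * (2 ** i)
    return rf

print(tcn_receptive_field(3))  # 15 timesteps for the 3-level example above
print(tcn_receptive_field(8))  # 511 timesteps with 8 levels
```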

**Status:** **Excellent for time series**, underrated, should be considered before LSTM

### 6. Sparse Transformers - Long Sequence Specialists

**Architecture:** Modified attention patterns to reduce complexity

**Variants:**
- **Longformer**: Local + global attention
- **BigBird**: Random + local + global attention
- **Linformer**: Low-rank projection of keys/values
- **Performer**: Kernel approximation of attention

**Complexity:** O(n log n) or O(n) depending on variant

**When to Use:**
- ✅ **Long sequences** (1000-10000 tokens)
- ✅ Document processing (multi-page documents)
- ✅ Long-context language modeling
- ✅ When standard Transformer runs out of memory

**Trade-offs:**
- Slightly lower accuracy than full attention (approximation)
- More complex implementation
- Fewer pre-trained models

**Example Use Cases:**
- Legal document analysis (10k+ tokens)
- Scientific paper understanding
- Long-form text generation
- Time series with thousands of steps

**Status:** Specialized for long sequences, active research area

### 7. State Space Models (S4) - Cutting Edge

**Architecture:** Structured state space with efficient recurrence

**Complexity:** O(n log n) training, O(n) inference

**Strengths:**
- ✅ **Very long sequences** (10k-100k steps)
- ✅ Linear inference complexity
- ✅ Strong theoretical foundations
- ✅ Handles continuous-time sequences

**Weaknesses:**
- ❌ Newer (less mature ecosystem)
- ❌ Complex mathematics
- ❌ Fewer pre-trained models
- ❌ Harder to implement

**When to Use:**
- ✅ Extremely long sequences (> 10k steps)
- ✅ Audio (raw waveforms, 16kHz sampling)
- ✅ Medical signals (ECG, EEG)
- ✅ Research applications

**Status:** **Cutting edge** (2022+), promising for very long sequences

## Practical Selection Guide

### Scenario 1: Natural Language Processing

**Short text (< 50 tokens, e.g., tweets, titles):**
```
Small dataset (< 10k):
→ BiLSTM or 1D CNN (simple, effective)

Large dataset (> 10k):
→ DistilBERT (smaller Transformer, 66M params)
→ Or BiLSTM if latency critical
```

**Medium text (50-512 tokens, e.g., reviews, articles):**
```
Standard approach:
→ BERT, RoBERTa, or similar (110M params)
→ Fine-tune on task-specific data

Small dataset:
→ DistilBERT (66M params, faster, similar accuracy)
```

**Long documents (> 512 tokens):**
```
→ Longformer (4096 tokens max)
→ BigBird (4096 tokens max)
→ Hierarchical: Process in chunks, aggregate
```

### Scenario 2: Time Series Forecasting

**Short sequences (< 100 steps):**
```
Fast training:
→ TCN (2-3x faster than LSTM)

Small dataset:
→ LSTM or simple models (ARIMA, Prophet)

Baseline:
→ LSTM (well-tested)
```

**Medium sequences (100-1000 steps):**
```
Best accuracy:
→ Transformer (if data > 50k examples)

Fast training/inference:
→ TCN (parallel processing)

Multivariate:
→ Transformer with cross-series attention
```

**Long sequences (> 1000 steps):**
```
→ Sparse Transformer (Informer for time series)
→ Hierarchical models (chunk + aggregate)
→ State Space Models (S4)
```

### Scenario 3: Audio Processing

**Waveform (raw audio, 16kHz):**
```
→ WaveNet (TCN-based)
→ State Space Models (S4)
```

**Spectrograms (mel-spectrograms):**
```
→ CNN + BiLSTM (traditional)
→ CNN + Transformer (modern)
```

**Speech recognition:**
```
→ Transformer (Wav2Vec 2.0, Whisper)
→ Pre-trained models available
```

## Trade-Off Analysis

### Speed Comparison

**Training speed (1000-step sequences):**
```
LSTM: 100% (baseline, sequential)
GRU: 75% (simpler gates)
TCN: 35% (2.8x faster, parallel)
Transformer: 45% (2.2x faster, parallel)

Conclusion: TCN fastest for training
```

**Inference speed:**
```
LSTM: 100% (sequential)
BiLSTM: 200% (2x passes)
TCN: 20% (5x faster, parallel)
Transformer: 60% (faster, but attention overhead)

Conclusion: TCN fastest for inference
```

### Memory Comparison

**Sequence length n=1000, batch=32:**
```
LSTM: ~500 MB (linear in n)
Transformer: ~2 GB (quadratic in n)
TCN: ~400 MB (linear in n)
Sparse Transformer: ~800 MB (n log n)

For n=5000:
LSTM: ~2 GB
Transformer: OUT OF MEMORY (50 GB needed!)
TCN: ~2 GB
Sparse Transformer: ~4 GB
```

### Accuracy vs Data Size

**Small dataset (< 10k examples):**
```
LSTM: ★★★★☆ (works well with little data)
Transformer: ★★☆☆☆ (overfits, needs more data)
TCN: ★★★★☆ (similar to LSTM)

Winner: LSTM or TCN
```

**Large dataset (> 100k examples):**
```
LSTM: ★★★☆☆ (good but plateaus)
Transformer: ★★★★★ (best, scales with data)
TCN: ★★★★☆ (competitive)

Winner: Transformer
```

## Common Pitfalls

### Pitfall 1: Using LSTM in 2025 Without Considering Modern Alternatives
**Symptom:** Defaulting to LSTM for all sequence tasks

**Why it's wrong:** Transformers (language) and TCN (time series) often better

**Fix:** Consider Transformer for language, TCN for time series, LSTM for small data/edge only

### Pitfall 2: Using Standard Transformer for Very Long Sequences
**Symptom:** Running out of memory on sequences > 1000 tokens

**Why it's wrong:** O(n²) memory explodes

**Fix:** Use Sparse Transformer (Longformer, BigBird) or hierarchical approach

### Pitfall 3: Not Trying TCN for Time Series
**Symptom:** Struggling with slow LSTM training

**Why it's wrong:** TCN is 2-3x faster, often more accurate

**Fix:** Try TCN before optimizing LSTM

### Pitfall 4: Using Transformer for Small Datasets
**Symptom:** Transformer overfits on < 10k examples

**Why it's wrong:** Transformers need large datasets to work well

**Fix:** Use LSTM or TCN for small datasets, or use pre-trained Transformer

### Pitfall 5: Ignoring Sequence Length Constraints
**Symptom:** Choosing architecture without considering typical sequence length

**Why it's wrong:** Architecture effectiveness varies dramatically with length

**Fix:** Match architecture to sequence length (short → LSTM/CNN, long → Sparse Transformer)
|
## Evolution Timeline
|
||||||
|
|
||||||
|
**Understanding why architectures evolved:**
|
||||||
|
|
||||||
|
```
|
||||||
|
2010-2013: Basic RNN
|
||||||
|
→ Vanishing gradient problem
|
||||||
|
→ Can't learn long dependencies
|
||||||
|
|
||||||
|
2014: LSTM (Hochreiter & Schmidhuber)
|
||||||
|
→ Gates solve vanishing gradient
|
||||||
|
→ Became standard for sequences
|
||||||
|
|
||||||
|
2014: GRU
|
||||||
|
→ Simplified LSTM
|
||||||
|
→ Similar performance, fewer parameters
|
||||||
|
|
||||||
|
2017: Transformer (Attention Is All You Need)
|
||||||
|
→ Self-attention replaces recurrence
|
||||||
|
→ Parallel processing (fast training)
|
||||||
|
→ Revolutionized NLP
|
||||||
|
|
||||||
|
2018: TCN (Temporal Convolutional Networks)
|
||||||
|
→ Dilated convolutions for sequences
|
||||||
|
→ Often better than LSTM for time series
|
||||||
|
→ Underrated alternative
|
||||||
|
|
||||||
|
2020: Sparse Transformers
|
||||||
|
→ Reduce quadratic complexity
|
||||||
|
→ Enable longer sequences
|
||||||
|
|
||||||
|
2021: State Space Models (S4)
|
||||||
|
→ Very long sequences (10k-100k)
|
||||||
|
→ Theoretical foundations
|
||||||
|
→ Cutting edge research
|
||||||
|
|
||||||
|
Current (2025):
|
||||||
|
- NLP: Transformer standard (BERT, GPT)
|
||||||
|
- Time Series: TCN or Transformer
|
||||||
|
- Audio: Specialized (WaveNet, Transformer)
|
||||||
|
- Edge: LSTM or TCN (low memory)
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Decision Checklist
|
||||||
|
|
||||||
|
Before choosing sequence model:
|
||||||
|
|
||||||
|
```
|
||||||
|
☐ Sequence length? (< 100 / 100-1k / > 1k)
|
||||||
|
☐ Data type? (language / time series / audio / other)
|
||||||
|
☐ Dataset size? (< 10k / 10k-100k / > 100k)
|
||||||
|
☐ Latency requirement? (real-time / batch / offline)
|
||||||
|
☐ Deployment target? (cloud / edge / mobile)
|
||||||
|
☐ Pre-trained models available? (yes / no)
|
||||||
|
☐ Training speed critical? (yes / no)
|
||||||
|
|
||||||
|
Based on answers:
|
||||||
|
→ Language + large data → Transformer
|
||||||
|
→ Language + small data → BiLSTM or DistilBERT
|
||||||
|
→ Time series + speed → TCN
|
||||||
|
→ Time series + accuracy + large data → Transformer
|
||||||
|
→ Very long sequences → Sparse Transformer or S4
|
||||||
|
→ Edge deployment → TCN or LSTM
|
||||||
|
→ Real-time latency → TCN
|
||||||
|
```
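
The same logic can be encoded as a small helper. A minimal sketch; the function name, thresholds, and return strings are illustrative, not canonical:

```python
def choose_sequence_model(task: str, seq_len: int, dataset_size: int,
                          edge: bool = False, realtime: bool = False) -> str:
    """Rough encoding of the checklist above. Thresholds are heuristics."""
    if seq_len > 1000:
        return "Sparse Transformer or S4"
    if edge:
        return "TCN or LSTM"
    if realtime:
        return "TCN"
    if task == "language":
        return "Transformer" if dataset_size > 100_000 else "BiLSTM or DistilBERT"
    if task == "time_series":
        return "Transformer" if dataset_size > 100_000 else "TCN"
    return "TCN"  # reasonable default for other sequence data

print(choose_sequence_model("language", 200, 500_000))  # Transformer
```
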
## Integration with Other Skills

**For language-specific questions:**
→ `yzmir/llm-specialist/using-llm-specialist`
- LLM-specific Transformers (GPT, BERT variants)
- Fine-tuning strategies
- Prompt engineering

**For Transformer internals:**
→ `yzmir/neural-architectures/transformer-architecture-deepdive`
- Attention mechanisms
- Positional encoding
- Transformer variants

**After selecting architecture:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection
- Learning rate schedules
- Handling sequence-specific training issues

## Summary

**Quick Reference Table:**

| Use Case | Best Choice | Alternative | Avoid |
|----------|-------------|-------------|-------|
| Short text (< 50 tokens) | BiLSTM, DistilBERT | 1D CNN | Full BERT (overkill) |
| Long text (> 512 tokens) | Longformer, BigBird | Hierarchical | Standard BERT (memory) |
| Time series (< 1k steps) | TCN, Transformer | LSTM | Basic RNN |
| Time series (> 1k steps) | Sparse Transformer, S4 | Hierarchical | Standard Transformer |
| Small dataset (< 10k) | LSTM, TCN | Simple models | Transformer (overfits) |
| Large dataset (> 100k) | Transformer | TCN | LSTM (plateaus) |
| Edge deployment | TCN, LSTM | Quantized Transformer | Large Transformer |
| Real-time inference | TCN | Small LSTM | BiLSTM, Transformer |

**Key Principles:**
1. **Don't default to LSTM** (outdated for most tasks)
2. **Transformer for language** (current standard, if data is sufficient)
3. **TCN for time series** (fast, effective, underrated)
4. **Match to sequence length** (short → LSTM/CNN, long → Sparse Transformer)
5. **Consider modern alternatives** (don't stop at LSTM vs Transformer)

**END OF SKILL**

# Transformer Architecture Deep Dive

## When to Use This Skill

Use this skill when you need to:
- ✅ Implement a Transformer from scratch
- ✅ Understand HOW and WHY self-attention works
- ✅ Choose between encoder, decoder, or encoder-decoder architectures
- ✅ Decide if Vision Transformer (ViT) is appropriate for your vision task
- ✅ Understand modern variants (RoPE, ALiBi, GQA, MQA)
- ✅ Debug Transformer implementation issues
- ✅ Optimize Transformer performance

**Do NOT use this skill for:**
- ❌ High-level architecture selection (use `using-neural-architectures`)
- ❌ Attention mechanism comparison (use `attention-mechanisms-catalog`)
- ❌ LLM-specific topics like prompt engineering (use `llm-specialist` pack)

## Core Principle

**Transformers are NOT magic.** They are:
1. Self-attention mechanism (information retrieval)
2. + Position encoding (break permutation invariance)
3. + Residual connections + Layer norm (training stability)
4. + Feed-forward networks (non-linearity)

Understanding the mechanism beats cargo-culting implementations.

## Part 1: Self-Attention Mechanism Explained

### The Information Retrieval Analogy

**Self-attention = Querying a database:**
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I have?"

**Process:**
1. Compare your query with all keys (compute similarity)
2. Weight values by similarity
3. Return weighted sum of values

**Example:** Sentence: "The cat sat on the mat"
Token "sat" (verb):
- High attention to "cat" (subject) → Learns verb-subject relationship
- High attention to "mat" (object) → Learns verb-object relationship
- Low attention to "the", "on" (function words)

### Mathematical Breakdown

**Given input X:** (batch, seq_len, d_model)

**Step 1: Project to Q, K, V**
```python
Q = X @ W_Q  # (batch, seq_len, d_k)
K = X @ W_K  # (batch, seq_len, d_k)
V = X @ W_V  # (batch, seq_len, d_v)

# Typically: d_k = d_v = d_model / num_heads
```

**Step 2: Compute attention scores** (similarity)
```python
scores = Q @ K.transpose(-2, -1)  # (batch, seq_len, seq_len)
# scores[i, j] = similarity between query_i and key_j
```

**Geometric interpretation:**
- Dot product measures vector alignment
- q · k = ||q|| ||k|| cos(θ)
- Similar vectors → Large dot product → High attention
- Orthogonal vectors → Zero dot product → No attention

**Step 3: Scale by √d_k** (CRITICAL!)
```python
scores = scores / math.sqrt(d_k)
```

**WHY scaling?**
- Dot products grow with dimension: for unit-variance components, Var(q · k) = d_k
- Example: d_k=64 → dot-product standard deviation is √64 = 8, so unscaled scores in the tens are routine
- Large scores → Softmax saturates → Gradients vanish
- Scaling: keeps scores ~ O(1) regardless of dimension

**Without scaling:** Softmax([30, 25, 20]) ≈ [0.99, 0.01, 0.00] (saturated!)
**With scaling:** Softmax([3, 2.5, 2]) ≈ [0.50, 0.30, 0.20] (healthy gradients)
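
A quick numeric check of the saturation effect (a throwaway sketch; the vectors are random, so exact values vary run to run):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64
q, k = torch.randn(5, d_k), torch.randn(5, d_k)

raw = q @ k.T                       # entry std grows like sqrt(d_k) ≈ 8
scaled = raw / d_k ** 0.5           # back to ~unit scale

print(raw.std(), scaled.std())      # ~8 vs ~1
print(F.softmax(raw[0], dim=-1))    # often near one-hot (saturated)
print(F.softmax(scaled[0], dim=-1)) # smoother distribution, healthier gradients
```
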
**Step 4: Softmax to get attention weights**
```python
attn_weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
# Each row sums to 1 (probability distribution)
# attn_weights[i, j] = "how much token i attends to token j"
```

**Step 5: Weight values**
```python
output = attn_weights @ V  # (batch, seq_len, d_v)
# Each token's output = weighted average of all values
```

**Complete formula:**
```
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
```

### Why Three Matrices (Q, K, V)?

**Could we use just one?** Attention(X, X, X)
**Yes, but Q/K/V separation enables:**

1. **Asymmetry**: Query can differ from key (search ≠ database)
2. **Decoupling**: What you search for (Q@K) ≠ what you retrieve (V)
3. **Cross-attention**: Q from one source, K/V from another
   - Example: Decoder queries encoder (translation)

**Modern optimization:** Multi-Query Attention (MQA), Grouped-Query Attention (GQA)
- Share K/V across heads (fewer parameters, faster inference)

### Implementation Example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k=None):
        super().__init__()
        self.d_k = d_k or d_model

        self.W_q = nn.Linear(d_model, self.d_k)
        self.W_k = nn.Linear(d_model, self.d_k)
        self.W_v = nn.Linear(d_model, self.d_k)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)  # (batch, seq_len, d_k)
        K = self.W_k(x)  # (batch, seq_len, d_k)
        V = self.W_v(x)  # (batch, seq_len, d_k)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores: (batch, seq_len, seq_len)

        # Apply mask if provided (for causal attention)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Attention weights
        attn_weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)

        # Weighted sum of values
        output = torch.matmul(attn_weights, V)  # (batch, seq_len, d_k)

        return output, attn_weights
```

**Complexity:** O(n² · d) where n = seq_len, d = d_model
- **Quadratic in sequence length** (bottleneck for long sequences)
- For n=1000, d=512: 1000² × 512 = 512M operations
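
To see the shapes end to end, a quick usage sketch of the `SelfAttention` module above (toy sizes):

```python
attn = SelfAttention(d_model=512)
x = torch.randn(2, 10, 512)       # (batch=2, seq_len=10, d_model=512)
out, weights = attn(x)
print(out.shape)                  # torch.Size([2, 10, 512])
print(weights.shape)              # torch.Size([2, 10, 10])
print(weights.sum(dim=-1))        # every row of attention weights sums to 1
```
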
## Part 2: Multi-Head Attention

### Why Multiple Heads?

**Single-head attention** learns one attention pattern.
**Multi-head attention** learns multiple parallel patterns:
- Head 1: Syntactic relationships (subject-verb)
- Head 2: Semantic similarity
- Head 3: Positional proximity
- Head 4: Long-range dependencies

**Analogy:** An ensemble of attention functions, each specializing in different patterns.

### Head Dimension Calculation

**CRITICAL CONSTRAINT:** num_heads must divide d_model evenly!

```python
d_model = 512
num_heads = 8
d_k = d_model // num_heads  # 512 / 8 = 64

# Each head operates in d_k dimensions
# Concatenate all heads → back to d_model dimensions
```

**Common configurations:**
- BERT-base: d_model=768, heads=12, d_k=64
- GPT-2: d_model=768, heads=12, d_k=64
- GPT-3 175B: d_model=12288, heads=96, d_k=128
- LLaMA-2 70B: d_model=8192, heads=64, d_k=128

**Rule of thumb:** d_k (head dimension) should be 64-128
- Too small (d_k < 32): Limited representational capacity
- Too large (d_k > 256): Redundant, wasteful

### Multi-Head Implementation

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Single linear layers for all heads (more efficient)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # Output projection

    def split_heads(self, x):
        # x: (batch, seq_len, d_model)
        batch_size, seq_len, d_model = x.size()
        # Reshape to (batch, seq_len, num_heads, d_k)
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        # Transpose to (batch, num_heads, seq_len, d_k)
        return x.transpose(1, 2)

    def forward(self, x, mask=None):
        batch_size = x.size(0)

        # Linear projections
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)

        # Split into multiple heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)

        # Weighted sum
        attn_output = torch.matmul(attn_weights, V)
        # attn_output: (batch, num_heads, seq_len, d_k)

        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        # (batch, seq_len, num_heads, d_k)
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        # (batch, seq_len, d_model)

        # Final linear projection
        output = self.W_o(attn_output)

        return output, attn_weights
```

### Modern Variants: GQA and MQA

**Problem:** K/V caching during inference is memory-intensive
- LLaMA-2 70B with full multi-head K/V: 2 (K + V) × 80 layers × 8192 dims ≈ 1.3M values cached per token!

**Solution 1: Multi-Query Attention (MQA)**
- **One** K/V head shared across **all** Q heads
- Benefit: Dramatically faster inference (smaller KV cache)
- Trade-off: ~1-2% accuracy loss

```python
# MQA: Single K/V projection
self.W_k = nn.Linear(d_model, d_k)  # Not d_model!
self.W_v = nn.Linear(d_model, d_k)
self.W_q = nn.Linear(d_model, d_model)  # Multiple Q heads
```

**Solution 2: Grouped-Query Attention (GQA)**
- Middle ground: Group multiple Q heads per K/V head (a code sketch follows this list)
- Example: 32 Q heads → 8 K/V heads (4 Q per K/V)
- Benefit: 4x smaller KV cache, minimal accuracy loss
**Used in:** LLaMA-2, Mistral, Mixtral
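
A minimal GQA sketch (illustrative, not any particular library's implementation; it reuses the imports from the examples above, and `repeat_interleave` expands each K/V head to serve its group of query heads):

```python
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model, num_q_heads, num_kv_heads):
        super().__init__()
        assert d_model % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
        self.d_k = d_model // num_q_heads
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.W_q = nn.Linear(d_model, num_q_heads * self.d_k)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.d_k)  # smaller K/V projections
        self.W_v = nn.Linear(d_model, num_kv_heads * self.d_k)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.W_q(x).view(B, T, self.num_q_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, T, self.num_kv_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.num_kv_heads, self.d_k).transpose(1, 2)

        group = self.num_q_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)  # each K/V head serves `group` Q heads
        v = v.repeat_interleave(group, dim=1)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.W_o(out)
```

Only the smaller K/V tensors need caching at inference time, which is exactly where the memory savings come from.
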
## Part 3: Position Encoding

### Why Position Encoding?

**Problem:** Self-attention is **permutation-invariant**
- Attention("cat sat mat") = Attention("mat cat sat")
- No inherent notion of position or order!

**Solution:** Add position information to embeddings

### Strategy 1: Sinusoidal Position Encoding (Original)

**Formula:**
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

**Implementation:**
```python
def sinusoidal_position_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         (-math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Usage: Add to input embeddings
x = token_embeddings + positional_encoding
```

**Properties:**
- Deterministic (no learned parameters)
- Extrapolates to unseen lengths (geometric properties)
- Relative positions: PE(pos+k) is a linear function of PE(pos)

**When to use:** Variable-length sequences in NLP

### Strategy 2: Learned Position Embeddings

```python
self.pos_embedding = nn.Embedding(max_seq_len, d_model)

# Usage
positions = torch.arange(seq_len, device=x.device)
x = token_embeddings + self.pos_embedding(positions)
```

**Properties:**
- Learnable (adapts to data)
- Cannot extrapolate beyond max_seq_len

**When to use:**
- Fixed-length sequences
- Vision Transformers (image patches)
- When training data covers all positions

### Strategy 3: Rotary Position Embeddings (RoPE) ⭐

**Modern approach (2021+):** Rotate Q and K in the complex plane

**Key advantages:**
- Encodes **relative** positions naturally
- Better long-range decay properties
- No addition to embeddings (applied inside attention)

**Used in:** GPT-NeoX, PaLM, LLaMA, LLaMA-2, Mistral

```python
def apply_rotary_pos_emb(x, cos, sin):
    # x: (batch, num_heads, seq_len, d_k)
    # Split features into even/odd pairs
    x1, x2 = x[..., ::2], x[..., 1::2]
    # Rotate each pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```
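
The `cos`/`sin` tables come from position-dependent angles at geometrically spaced frequencies. A minimal builder consistent with the pairing convention above (the function name and `base` default are conventional assumptions, not a specific library's API):

```python
def build_rope_cache(seq_len, d_k, base=10000.0):
    # One frequency per feature pair, geometrically spaced as in sinusoidal PE
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq_len, d_k/2)
    return angles.cos(), angles.sin()          # broadcast against (batch, heads, seq, d_k/2)

cos, sin = build_rope_cache(seq_len=128, d_k=64)
# q = apply_rotary_pos_emb(q, cos, sin); k = apply_rotary_pos_emb(k, cos, sin)
```
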
### Strategy 4: ALiBi (Attention with Linear Biases) ⭐

**Simplest modern approach:** Add a bias to the attention scores (no embeddings!)

```python
# Bias matrix: -1 × distance to each past key (future positions are
# handled by the causal mask); e.g. for seq_len=4:
# [[ 0,  ·,  ·,  ·],
#  [-1,  0,  ·,  ·],
#  [-2, -1,  0,  ·],
#  [-3, -2, -1,  0]]

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + alibi_bias
```
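
A minimal builder for that bias (a sketch following the ALiBi paper's slope scheme 2^(−8(h+1)/num_heads); the causal mask itself is applied separately):

```python
def build_alibi_bias(num_heads, seq_len):
    # Per-head slopes: geometric sequence, steeper heads decay faster
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()  # (i - j) for past keys
    return -slopes[:, None, None] * distance  # (num_heads, seq_len, seq_len)

alibi_bias = build_alibi_bias(num_heads=8, seq_len=4)
print(alibi_bias[0])  # head 0: zeros on the diagonal, increasingly negative into the past
```
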
**Key advantages:**
- **Best extrapolation** to longer sequences
- No positional embeddings (simpler)
- Per-head slopes (different decay rates)

**Used in:** BLOOM

### Position Encoding Selection Guide

| Use Case | Recommended | Why |
|----------|-------------|-----|
| NLP (variable length) | RoPE or ALiBi | Better extrapolation |
| NLP (fixed length) | Learned embeddings | Adapts to data |
| Vision (ViT) | 2D learned embeddings | Spatial structure |
| Long sequences (>2k) | ALiBi | Best extrapolation |
| Legacy/compatibility | Sinusoidal | Original Transformer |

**Modern trend (2023+):** RoPE and ALiBi dominate over sinusoidal

## Part 4: Architecture Variants

### Variant 1: Encoder-Only (Bidirectional)

**Architecture:**
- Self-attention: Each token attends to **ALL** tokens (past + future)
- No masking (bidirectional context)

**Examples:** BERT, RoBERTa, ELECTRA, DeBERTa

**Use cases:**
- Text classification
- Named entity recognition
- Question answering (extract span from context)
- Sentence embeddings

**Key property:** Sees full context → Good for **understanding**

**Implementation:**
```python
class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)  # No causal mask!
        return x

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention + residual + norm
        attn_output, _ = self.self_attn(x, mask)
        x = self.norm1(x + attn_output)

        # Feed-forward + residual + norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)

        return x
```

### Variant 2: Decoder-Only (Autoregressive)

**Architecture:**
- Self-attention with **causal masking**
- Each token attends ONLY to past tokens (not future)

**Causal mask (lower triangular):**
```python
# mask[i, j] = 1 if j <= i else 0
[[1, 0, 0, 0],  # Token 0 sees only itself
 [1, 1, 0, 0],  # Token 1 sees tokens 0-1
 [1, 1, 1, 0],  # Token 2 sees tokens 0-2
 [1, 1, 1, 1]]  # Token 3 sees all
```

**Examples:** GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral

**Use cases:**
- Text generation
- Language modeling
- Code generation
- Autoregressive prediction

**Key property:** Generates sequentially → Good for **generation**

**Implementation:**
```python
def create_causal_mask(seq_len, device):
    # Lower triangular matrix
    mask = torch.tril(torch.ones(seq_len, seq_len, device=device))
    return mask

class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        seq_len = x.size(1)
        causal_mask = create_causal_mask(seq_len, x.device)

        for layer in self.layers:
            x = layer(x, causal_mask)  # Apply causal mask!
        return x
```

**Modern trend (2023+):** Decoder-only architectures dominate
- Can do both generation AND understanding (via prompting)
- Simpler than encoder-decoder (no cross-attention)
- Scales better to massive sizes

### Variant 3: Encoder-Decoder (Seq2Seq)

**Architecture:**
- **Encoder**: Bidirectional self-attention (understands input)
- **Decoder**: Causal self-attention (generates output)
- **Cross-attention**: Decoder queries encoder outputs

**Cross-attention mechanism:**
```python
# Q from the decoder, K and V from the encoder
Q = decoder_hidden @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v

cross_attn = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1) @ V
```
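
As a runnable counterpart to that pseudocode, a minimal single-head cross-attention module (a sketch; the `EncoderDecoderLayer` below assumes something like this behind its cross-attention call):

```python
class CrossAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_k = d_model
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        # query: decoder states (batch, tgt_len, d_model)
        # key/value: encoder output (batch, src_len, d_model)
        Q, K, V = self.W_q(query), self.W_k(key), self.W_v(value)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        return F.softmax(scores, dim=-1) @ V  # (batch, tgt_len, d_model)
```
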
**Examples:** T5, BART, mT5, original Transformer (2017)

**Use cases:**
- Translation (input ≠ output language)
- Summarization (long input → short output)
- Question answering (generate answer, not extract)

**When to use:** Input and output are fundamentally different

**Implementation:**
```python
class EncoderDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)  # NEW!
        self.feed_forward = FeedForward(d_model, d_ff)  # e.g. the Linear→ReLU→Linear block from EncoderLayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, decoder_input, encoder_output, causal_mask=None):
        # 1. Self-attention (causal)
        self_attn_out, _ = self.self_attn(decoder_input, causal_mask)
        x = self.norm1(decoder_input + self_attn_out)

        # 2. Cross-attention (Q from decoder, K/V from encoder); assumes
        #    MultiHeadAttention is extended with a cross-attention forward
        #    taking separate query/key/value (see CrossAttention above)
        cross_attn_out, _ = self.cross_attn.forward_cross(
            query=x,
            key=encoder_output,
            value=encoder_output
        )
        x = self.norm2(x + cross_attn_out)

        # 3. Feed-forward
        ff_out = self.feed_forward(x)
        x = self.norm3(x + ff_out)

        return x
```

### Architecture Selection Guide

| Task | Architecture | Why |
|------|--------------|-----|
| Classification | Encoder-only | Need full bidirectional context |
| Text generation | Decoder-only | Autoregressive generation |
| Translation | Encoder-decoder or Decoder-only | Different languages, or use prompting |
| Summarization | Encoder-decoder or Decoder-only | Length mismatch, or use prompting |
| Q&A (extract) | Encoder-only | Find span in context |
| Q&A (generate) | Decoder-only | Generate freeform answer |

**2023+ trend:** Decoder-only can do everything via prompting (but less parameter-efficient for some tasks)

## Part 5: Vision Transformers (ViT)

### From Images to Sequences

**Key insight:** Treat an image as a sequence of patches

**Process:**
1. Split image into patches (e.g., 16×16 pixels)
2. Flatten each patch → 1D vector
3. Linear projection → token embeddings
4. Add 2D positional embeddings
5. Prepend [CLS] token (for classification)
6. Feed to Transformer encoder

**Example:** 224×224 image, 16×16 patches
- Number of patches: (224/16)² = 196
- Each patch: 16 × 16 × 3 = 768 dimensions
- Transformer input: 197 tokens (196 patches + 1 [CLS])
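
These counts are quick to verify (a tiny check; the variable names are just illustrative):

```python
img_size, patch_size, channels = 224, 16, 3
num_patches = (img_size // patch_size) ** 2  # 196
patch_dim = channels * patch_size ** 2       # 768
num_tokens = num_patches + 1                 # 197, including [CLS]
```
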
### ViT Implementation

```python
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 d_model=768, num_heads=12, num_layers=12, num_classes=1000):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_channels * patch_size ** 2

        # Patch embedding (linear projection of flattened patches)
        self.patch_embed = nn.Linear(patch_dim, d_model)

        # [CLS] token (learnable)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

        # Position embeddings (learnable)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))

        # Transformer encoder
        self.encoder = TransformerEncoder(d_model, num_heads,
                                          d_ff=4*d_model, num_layers=num_layers)

        # Classification head
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, channels, height, width)
        batch_size = x.size(0)

        # Divide into patches
        x = x.unfold(2, self.patch_size, self.patch_size)
        x = x.unfold(3, self.patch_size, self.patch_size)
        # (batch, channels, num_patches_h, num_patches_w, patch_size, patch_size)

        # Bring channels next to the per-patch pixels before flattening,
        # so each token holds one spatial patch across all channels
        x = x.permute(0, 2, 3, 1, 4, 5).contiguous()
        x = x.view(batch_size, -1, self.patch_size ** 2 * 3)
        # (batch, num_patches, patch_dim)

        # Linear projection
        x = self.patch_embed(x)  # (batch, num_patches, d_model)

        # Prepend [CLS] token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # (batch, num_patches+1, d_model)

        # Add positional embeddings
        x = x + self.pos_embed

        # Transformer encoder
        x = self.encoder(x)

        # Classification: Use [CLS] token
        cls_output = x[:, 0]  # (batch, d_model)
        logits = self.head(cls_output)

        return logits
```

### ViT vs CNN: Critical Differences

**1. Inductive Bias**

| Property | CNN | ViT |
|----------|-----|-----|
| Locality | Strong (conv kernel) | Weak (global attention) |
| Translation invariance | Strong (weight sharing) | Weak (position embeddings) |
| Hierarchy | Strong (pooling layers) | None (flat patches) |

**Implication:** CNNs have strong priors; ViT learns them from data

**2. Data Requirements**

| Dataset Size | CNN | ViT (from scratch) | ViT (pretrained) |
|--------------|-----|-------------------|------------------|
| Small (< 100k) | ✅ Good | ❌ Fails | ✅ Good |
| Medium (100k-1M) | ✅ Excellent | ⚠️ Poor | ✅ Good |
| Large (> 1M) | ✅ Excellent | ⚠️ OK | ✅ Excellent |
| Huge (> 100M) | ✅ Excellent | ✅ SOTA | N/A |

**Key finding:** ViT needs 100M+ images to train from scratch
- Original ViT: Trained on JFT-300M (300 million images)
- Without massive data, ViT underperforms CNNs significantly

**3. Computational Cost**

**Example: 224×224 images**

| Model | Parameters | GFLOPs | Inference (GPU) |
|-------|-----------|--------|-----------------|
| ResNet-50 | 25M | 4.1 | ~30ms |
| EfficientNet-B0 | 5M | 0.4 | ~10ms |
| ViT-B/16 | 86M | 17.6 | ~100ms |

**Implication:** ViT-B/16 costs roughly 40x more FLOPs than EfficientNet-B0!

### When to Use ViT

✅ **Use ViT when:**
- Large dataset (> 1M images) OR using pretrained weights
- Computational cost acceptable (cloud, large GPU)
- Best possible accuracy needed
- Can fine-tune from an ImageNet-21k checkpoint

❌ **Use CNN when:**
- Small/medium dataset (< 1M images) and training from scratch
- Limited compute/memory
- Edge deployment (mobile, embedded)
- Need architectural inductive biases

### Hybrid Approaches (2022-2023)

**ConvNeXt:** CNN with ViT design choices
- Matches ViT accuracy with CNN efficiency
- Works better on small datasets

**Swin Transformer:** Hierarchical ViT with local windows
- Shifted windows for efficiency
- O(n) complexity instead of O(n²)
- Better for dense prediction (segmentation)

**CoAtNet:** Mix of conv layers (early) + Transformer layers (late)
- Gets both inductive bias and global attention

## Part 6: Implementation Checklist

### Critical Details

**1. Layer Norm Placement**

**Post-norm (original):**
```python
x = x + self_attn(x)
x = layer_norm(x)
```

**Pre-norm (modern, recommended):**
```python
x = x + self_attn(layer_norm(x))
```
**Why pre-norm?** More stable training, less sensitive to learning rate
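
Put together, a pre-norm encoder block looks like this (a minimal sketch reusing the `MultiHeadAttention` module from Part 2):

```python
class PreNormEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(self.norm1(x), mask)  # norm BEFORE the sublayer
        x = x + attn_out                                   # residual on the raw stream
        x = x + self.feed_forward(self.norm2(x))
        return x
```
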
**2. Attention Dropout**

Apply dropout to the **attention weights**, not to Q/K/V!

```python
attn_weights = F.softmax(scores, dim=-1)
attn_weights = F.dropout(attn_weights, p=0.1, training=self.training)  # HERE!
output = torch.matmul(attn_weights, V)
```

**3. Feed-Forward Dimension**

Typically: d_ff = 4 × d_model
- BERT: d_model=768, d_ff=3072
- GPT-2: d_model=768, d_ff=3072

**4. Residual Connections**

ALWAYS use residual connections (essential for training)!

```python
x = x + self_attn(x)      # Residual
x = x + feed_forward(x)   # Residual
```

**5. Initialization**

Use Xavier/Glorot initialization for attention weights:
```python
nn.init.xavier_uniform_(self.W_q.weight)
nn.init.xavier_uniform_(self.W_k.weight)
nn.init.xavier_uniform_(self.W_v.weight)
```

## Part 7: When NOT to Use Transformers

### Limitation 1: Small Datasets

**Problem:** Transformers have weak inductive biases (almost everything is learned from data)

**Impact:**
- ViT: Fails on < 100k images without pretraining
- NLP: BERT needs 100M+ tokens for pretraining

**Solution:** Use models with stronger priors (CNNs for vision, smaller models for text)

### Limitation 2: Long Sequences

**Problem:** O(n²) memory complexity

**Impact:**
- Standard Transformer: n=10k → 100M attention scores
- GPU memory: 10k² × 4 bytes = 400MB per sample!

**Solution:**
- Sparse attention (Longformer, BigBird)
- Linear attention (Linformer, Performer)
- Flash Attention (memory-efficient kernel)
- State space models (S4, Mamba)

### Limitation 3: Edge Deployment

**Problem:** Large model size, high latency

**Impact:**
- ViT-B: 86M parameters, ~100ms inference
- Mobile/embedded: Need < 10M parameters, < 50ms

**Solution:** Efficient CNNs (MobileNet, EfficientNet) or distilled models

### Limitation 4: Real-Time Processing

**Problem:** Sequential generation in the decoder (cannot be parallelized at inference)

**Impact:** GPT-style models generate one token at a time

**Solution:** Non-autoregressive models, speculative decoding, or smaller models

## Part 8: Common Mistakes

### Mistake 1: Forgetting the Causal Mask

**Symptom:** Decoder "cheats" by seeing future tokens

**Fix:** Always apply a causal mask to decoder self-attention!

```python
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
```

### Mistake 2: Wrong Dimension for Multi-Head

**Symptom:** Runtime error or dimension mismatch

**Fix:** Ensure d_model % num_heads == 0

```python
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
```

### Mistake 3: Forgetting Position Encoding

**Symptom:** Model ignores word order

**Fix:** Always add position information!

```python
x = token_embeddings + positional_encoding
```

### Mistake 4: Wrong Softmax Dimension

**Symptom:** Attention weights don't sum to 1 per query

**Fix:** Softmax over the last dimension (keys)

```python
attn_weights = F.softmax(scores, dim=-1)  # Sum over keys for each query
```

### Mistake 5: No Residual Connections

**Symptom:** Training diverges or converges very slowly

**Fix:** Always add residual connections!

```python
x = x + self_attn(x)
x = x + feed_forward(x)
```

## Summary: Quick Reference

### Architecture Selection

```
Classification/Understanding → Encoder-only (BERT-style)
Generation/Autoregressive → Decoder-only (GPT-style)
Seq2Seq (input ≠ output) → Encoder-decoder (T5-style) or Decoder-only with prompting
```

### Position Encoding Selection

```
NLP (variable length) → RoPE or ALiBi
NLP (fixed length) → Learned embeddings
Vision (ViT) → 2D learned embeddings
Long sequences (> 2k) → ALiBi (best extrapolation)
```

### Multi-Head Configuration

```
Small models (d_model < 512): 4-8 heads
Medium models (d_model 512-1024): 8-12 heads
Large models (d_model > 1024): 12-32 heads
Rule: d_k (head dimension) should be 64-128
```

### ViT vs CNN

```
ViT: Large dataset (> 1M) OR pretrained weights
CNN: Small dataset (< 1M) OR edge deployment
```

### Implementation Essentials

```
✅ Pre-norm (more stable than post-norm)
✅ Residual connections (essential!)
✅ Causal mask for decoder
✅ Attention dropout (on weights, not Q/K/V)
✅ d_ff = 4 × d_model (feed-forward dimension)
✅ Check: d_model % num_heads == 0
```

## Next Steps

After mastering this skill:
- `attention-mechanisms-catalog`: Explore attention variants (sparse, linear, Flash)
- `llm-specialist/llm-finetuning-strategies`: Apply to language models
- `architecture-design-principles`: Understand design trade-offs

**Remember:** Transformers are NOT magic. Understanding the mechanism (information retrieval via Q/K/V) beats cargo-culting implementations.