# Architecture Design Principles
## Context
You're designing a neural network architecture or debugging why your network isn't learning. Common mistakes:
- **Ignoring inductive biases**: Using MLP for images (should use CNN)
- **Over-engineering**: Using Transformer for 100 samples (should use linear regression)
- **No skip connections**: 50-layer plain network fails (should use ResNet)
- **Wrong depth-width balance**: 100 layers × 8 channels bottlenecks capacity
- **Ignoring constraints**: 1.5B parameter model doesn't fit 24GB GPU
**This skill provides principled architecture design: match structure to problem, respect constraints, avoid over-engineering.**
## Core Principle: Inductive Biases
**Inductive bias = assumptions baked into architecture about problem structure**
**Key insight**: The right inductive bias makes learning dramatically easier. Wrong bias makes learning impossible.
### What are Inductive Biases?
```python
# Example: Image classification
# MLP (no inductive bias):
# - Treats each pixel independently
# - No concept of "spatial locality" or "translation"
# - Must learn from scratch that nearby pixels are related
# - Learns "cat at position (10,10)" and "cat at (50,50)" separately
# Parameters: 150M, Accuracy: 75%
# CNN (strong inductive bias):
# - Assumes spatial locality (nearby pixels related)
# - Assumes translation invariance (cat is cat anywhere)
# - Shares filters across spatial positions
# - Hierarchical feature learning (edges → textures → objects)
# Parameters: 11M, Accuracy: 95%
# CNN's inductive bias: 14× fewer parameters, 20% better accuracy!
```
**Principle**: Match your architecture's inductive biases to your problem's structure.
## Architecture Families and Their Inductive Biases
### 1. Fully Connected (MLP)
**Inductive bias:** None (general-purpose)
**Structure:**
```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
```
**When to use:**
- ✅ Tabular data (independent features)
- ✅ Small datasets (< 10,000 samples)
- ✅ Baseline / proof of concept
**When NOT to use:**
- ❌ Images (use CNN)
- ❌ Sequences (use RNN/Transformer)
- ❌ Graphs (use GNN)
**Strengths:**
- Simple and interpretable
- Fast training
- Works for any input type (flattened)
**Weaknesses:**
- No structural assumptions (must learn everything from data)
- Parameter explosion (input_size × hidden_size can be huge)
- Doesn't leverage problem structure
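The parameter-explosion weakness is easy to quantify. A back-of-envelope for a flattened 224×224 RGB image (the hidden size of 1024 is an illustrative choice):

```python
# Flattened 224×224 RGB image into one hidden layer of 1024 units
input_size = 224 * 224 * 3          # 150,528 features after flattening
hidden_size = 1024

first_layer_params = input_size * hidden_size + hidden_size  # weights + biases
print(f"{first_layer_params:,}")    # ~154M parameters in the first layer alone
```

This is where the ~150M figure for the MLP image classifier above comes from: one fully connected layer already dwarfs an entire CNN.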
### 2. Convolutional Neural Networks (CNN)
**Inductive bias:** Spatial locality + Translation invariance
**Structure:**
```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(128 * 56 * 56, 1000)  # 224×224 input pooled twice → 56×56

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 112×112
        x = self.pool(F.relu(self.conv2(x)))  # 56×56
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
```
**Inductive biases:**
1. **Local connectivity**: Neurons see only nearby pixels (spatial locality)
2. **Translation invariance**: Same filter slides across image (parameter sharing)
3. **Hierarchical features**: Stack layers to build complex features from simple ones
**When to use:**
- ✅ Images (classification, detection, segmentation)
- ✅ Spatial data (maps, medical scans)
- ✅ Any grid-structured data
**When NOT to use:**
- ❌ Sequences with long-range dependencies (use Transformer)
- ❌ Graphs (irregular structure, use GNN)
- ❌ Tabular data (no spatial structure)
**Strengths:**
- Parameter efficient (filter sharing)
- Translation invariant (cat anywhere = cat)
- Hierarchical feature learning
**Weaknesses:**
- Fixed receptive field (limited by kernel size)
- Not suitable for variable-length inputs
- Requires grid structure
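Parameter efficiency via filter sharing is easy to verify: a conv layer's parameter count depends only on channel counts and kernel size, never on image resolution. A quick check (layer sizes taken from the SimpleCNN above):

```python
def conv2d_params(in_ch, out_ch, k):
    """Parameters of a k×k conv layer: one k×k filter per (in, out) channel pair, plus biases."""
    return in_ch * out_ch * k * k + out_ch

# First layer of the SimpleCNN above: Conv2d(3, 64, kernel_size=3)
print(conv2d_params(3, 64, 3))     # 1,792 parameters -- same for 32×32 or 1024×1024 input
print(conv2d_params(64, 128, 3))   # 73,856 parameters for the second layer
```

Compare with the MLP, whose first layer grows linearly with pixel count.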
### 3. Recurrent Neural Networks (RNN/LSTM)
**Inductive bias:** Temporal dependencies
**Structure:**
```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Use last hidden state
        output = self.fc(h_n[-1])
        return output
```
**Inductive bias:** Sequential processing (earlier elements influence later elements)
**When to use:**
- ✅ Time series (stock prices, sensor data)
- ✅ Short sequences (< 100 timesteps)
- ✅ Online processing (process one timestep at a time)
**When NOT to use:**
- ❌ Long sequences (> 1000 timesteps, use Transformer)
- ❌ Non-sequential data (images, tabular)
- ❌ When parallel processing needed (use Transformer)
**Strengths:**
- Natural for sequential data
- Constant memory (doesn't grow with sequence length)
- Online processing capability
**Weaknesses:**
- Slow (sequential, can't parallelize)
- Vanishing gradients (long sequences)
- Struggles with long-range dependencies
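The "constant memory" strength shows up even in a toy scalar RNN: however long the sequence, the carried state stays one fixed-size quantity (weights and inputs below are made up for illustration):

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """One recurrence step: new state from old state and current input."""
    return math.tanh(w_h * h + w_x * x)

h = 0.0
sequence = [0.1, 0.4, -0.2, 0.3] * 250   # 1000 timesteps
for x in sequence:
    h = rnn_step(h, x)                    # state remains a single number throughout
print(h)
```

This is also why RNNs suit online processing: each new timestep updates the state in place, with no need to store the history.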
### 4. Transformers
**Inductive bias:** Minimal (self-attention is general-purpose)
**Structure:**
```python
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, d_model, num_heads, num_layers, num_classes):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_layers
        )
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = self.encoder(x)
        # Global average pooling
        x = x.mean(dim=1)
        return self.fc(x)
```
**Inductive bias:** Minimal (learns relationships from data via attention)
**When to use:**
- ✅ Long sequences (> 100 tokens)
- ✅ Language (text, code)
- ✅ Large datasets (> 100k samples)
- ✅ When relationships are complex and data-dependent
**When NOT to use:**
- ❌ Small datasets (< 10k samples, use RNN or MLP)
- ❌ Strong structural priors available (images → CNN)
- ❌ Very long sequences (> 16k tokens, use sparse attention)
- ❌ Low-latency requirements (RNN faster)
**Strengths:**
- Parallel processing (fast training)
- Long-range dependencies (attention)
- State-of-the-art for language
**Weaknesses:**
- Quadratic complexity O(n²) with sequence length
- Requires large datasets (weak inductive bias)
- High memory usage
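The O(n²) weakness is concrete: each attention head materializes an n×n score matrix. A rough FP32 estimate (single head, single example):

```python
def attention_matrix_bytes(seq_len, bytes_per_float=4):
    """FP32 size of one n×n attention score matrix."""
    return seq_len * seq_len * bytes_per_float

print(attention_matrix_bytes(1_000) / 1e6)    # 4 MB at 1k tokens
print(attention_matrix_bytes(16_000) / 1e9)   # ~1 GB at 16k tokens
print(attention_matrix_bytes(16_000) / attention_matrix_bytes(1_000))  # 256× for 16× length
```

This is why the decision tree below routes very long sequences to sparse-attention variants.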
### 5. Graph Neural Networks (GNN)
**Inductive bias:** Message passing over graph structure
**Structure:**
```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv  # PyTorch Geometric

class SimpleGNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        # x: node features (num_nodes, input_dim)
        # edge_index: graph structure (2, num_edges)
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x
```
**Inductive bias:** Nodes influenced by neighbors (message passing)
**When to use:**
- ✅ Graph data (social networks, molecules, knowledge graphs)
- ✅ Irregular connectivity (different # of neighbors per node)
- ✅ Relational reasoning
**When NOT to use:**
- ❌ Grid data (images → CNN)
- ❌ Sequences (text → Transformer)
- ❌ If graph structure doesn't help (test MLP baseline first!)
**Strengths:**
- Handles irregular structure
- Permutation invariant
- Natural for relational data
**Weaknesses:**
- Requires meaningful graph structure
- Over-smoothing (too many layers)
- Scalability challenges (large graphs)
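One round of message passing is essentially "average yourself with your neighbors". A dependency-free sketch on a 3-node path graph (features and graph are made up for illustration):

```python
def message_pass(features, neighbors):
    """One mean-aggregation round: each node averages itself with its neighbors."""
    out = []
    for node, feat in enumerate(features):
        group = [feat] + [features[n] for n in neighbors[node]]
        out.append(sum(group) / len(group))
    return out

# Path graph 0 -- 1 -- 2, scalar node features
features = [1.0, 0.0, 1.0]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(message_pass(features, neighbors))   # information spreads toward neighbors
```

Repeating this too many times drives all node features toward the same value, which is exactly the over-smoothing weakness listed above.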
## Decision Tree: Architecture Selection
```
START
|
├─ Is data grid-structured (images)?
│ ├─ YES → Use CNN
│ │ └─ ResNet (general), EfficientNet (mobile), ViT (very large datasets)
│ └─ NO → Continue
├─ Is data sequential (text, time series)?
│ ├─ YES → Check sequence length
│ │ ├─ < 100 timesteps → LSTM/GRU
│ │ ├─ 100-4000 tokens → Transformer
│ │ └─ > 4000 tokens → Sparse Transformer (Longformer)
│ └─ NO → Continue
├─ Is data graph-structured (molecules, social networks)?
│ ├─ YES → Check if structure helps
│ │ ├─ Test MLP baseline first
│ │ └─ If structure helps → GNN (GCN, GraphSAGE, GAT)
│ └─ NO → Continue
└─ Is data tabular (independent features)?
└─ YES → Start simple
├─ < 1000 samples → Linear / Ridge regression
├─ 1000-100k samples → Small MLP (2-3 layers)
└─ > 100k samples → Larger MLP or Gradient Boosting (XGBoost)
```
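The tree above can be sketched as a plain function; thresholds are taken directly from the tree, and the names are illustrative:

```python
def pick_architecture(data_type, seq_len=0, n_samples=0):
    """Rough architecture selector mirroring the decision tree above."""
    if data_type == "grid":
        return "CNN"
    if data_type == "sequence":
        if seq_len < 100:
            return "LSTM/GRU"
        if seq_len <= 4000:
            return "Transformer"
        return "Sparse Transformer"
    if data_type == "graph":
        return "GNN (after testing an MLP baseline)"
    if data_type == "tabular":
        if n_samples < 1000:
            return "Linear/Ridge"
        return "MLP or Gradient Boosting"
    return "MLP baseline"

print(pick_architecture("sequence", seq_len=2000))   # Transformer
```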
## Principle: Start Simple, Add Complexity Only When Needed
**Occam's Razor**: Simplest model that solves the problem is best.
### Progression:
```python
# Step 1: Linear baseline (ALWAYS start here!)
model = nn.Linear(input_size, num_classes)
# Train and evaluate
# Step 2: IF linear insufficient, add small MLP
if linear_accuracy < target:
    model = nn.Sequential(
        nn.Linear(input_size, 128),
        nn.ReLU(),
        nn.Linear(128, num_classes)
    )

# Step 3: IF small MLP insufficient, add depth/width
if mlp_accuracy < target:
    model = nn.Sequential(
        nn.Linear(input_size, 256),
        nn.ReLU(),
        nn.Linear(256, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes)
    )

# Step 4: IF simple models fail, use specialized architecture
if simple_models_fail:
    pass  # Images → CNN, Sequences → RNN/Transformer, Graphs → GNN

# NEVER skip to Step 4 without testing Steps 1-3!
```
### Why Start Simple?
1. **Faster iteration**: Linear model trains in seconds, Transformer in hours
2. **Baseline**: Know if complexity helps (compare complex vs simple)
3. **Occam's Razor**: Simple model generalizes better (less overfitting)
4. **Debugging**: Easy to verify simple model works correctly
### Example: House Price Prediction
```python
# Dataset: 1000 samples, 20 features
# WRONG: Start with Transformer
model = HugeTransformer(20, 512, 6, 1) # 10M parameters
# Result: Overfits (10M params / 1000 samples = 10,000:1 ratio!)
# RIGHT: Start simple
# Step 1: Linear
model = nn.Linear(20, 1) # 21 parameters
# Trains in 1 second, achieves R² = 0.85 (good!)
# Conclusion: Linear sufficient, stop here. No need for Transformer!
```
**Rule**: Add complexity only when simple models demonstrably fail.
## Principle: Deep Networks Need Skip Connections
**Problem**: Plain networks > 10 layers suffer from vanishing gradients and degradation.
### Vanishing Gradients:
```python
# Gradient flow in plain 50-layer network:
# Chain rule:
# ∂loss/∂layer1 = ∂loss/∂layer50 × ∂layer50/∂layer49 × ... × ∂layer2/∂layer1
# Each factor is typically < 1 (saturating activations):
# If each ≈ 0.9, then 0.9^50 ≈ 0.005 (vanishes!)
# Result: Early layers barely learn (gradients too small)
```
### Degradation:
```python
# Empirical observation (ResNet paper):
20-layer plain network: 85% accuracy
56-layer plain network: 78% accuracy # WORSE with more layers!
# This is NOT overfitting (training accuracy also drops)
# This is optimization difficulty
```
### Solution: Skip Connections (Residual Networks)
```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x  # Save input
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = out + identity  # Skip connection!
        out = F.relu(out)
        return out
```
**Why skip connections work:**
```python
# Gradient flow with skip connections:
# ∂loss/∂x = ∂loss/∂out × (1 + ∂F/∂x)
#                          ↑
#            Always flows! ("+1" identity term)
# Even if ∂F/∂x ≈ 0, gradient flows through identity path
```
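The "+1" term can be checked numerically by chaining 50 per-layer Jacobian factors. In a plain network each factor is the layer derivative; in a residual network it is 1 plus the (often tiny) branch derivative. Values below are illustrative:

```python
plain_factor = 0.9          # each plain layer shrinks the gradient a bit
branch_deriv = 0.0001       # residual branch derivative ≈ 0 (worst case)

plain_gradient = plain_factor ** 50            # ≈ 0.005: vanishes
residual_gradient = (1 + branch_deriv) ** 50   # ≈ 1.005: flows through identity

print(plain_gradient, residual_gradient)
```

Even when the residual branches contribute almost nothing, the identity path keeps the gradient at full strength.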
**Results:**
```python
# Without skip connections:
20-layer plain: 85% accuracy
50-layer plain: 78% accuracy (worse!)
# With skip connections (ResNet):
20-layer ResNet: 87% accuracy
50-layer ResNet: 92% accuracy (better!)
152-layer ResNet: 95% accuracy (even better!)
```
**Rule**: For networks > 10 layers, ALWAYS use skip connections.
### Skip Connection Variants:
**1. Residual (ResNet):**
```python
out = x + F(x) # Add input to output
```
**2. Dense (DenseNet):**
```python
out = torch.cat([x, F(x)], dim=1) # Concatenate input and output
```
**3. Highway:**
```python
gate = sigmoid(W_gate @ x)
out = gate * F(x) + (1 - gate) * x # Learned gating
```
**Most common**: Residual (simple, effective)
## Principle: Balance Depth and Width
**Depth = # of layers**
**Width = # of channels/neurons per layer**
### Capacity Formula:
```python
# Approximate capacity (for CNNs):
capacity ∝ depth × width²
# Why width²?
# Each layer: input_channels × output_channels × kernel_size²
# Doubling width → 4× parameters per layer
```
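The width² term is easy to verify: doubling a conv layer's channels quadruples its weights (biases ignored, since they don't affect the ratio):

```python
def conv_weights(in_ch, out_ch, k=3):
    """Weight count of a k×k conv layer, biases ignored."""
    return in_ch * out_ch * k * k

narrow = conv_weights(64, 64)       # 36,864
wide = conv_weights(128, 128)       # 147,456
print(wide / narrow)                # 4.0 -- doubling width quadruples parameters
```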
### Trade-offs:
**Too deep, too narrow:**
```python
# 100 layers × 8 channels
# Problems:
# - Information bottleneck (8 channels can't represent complex features)
# - Harder to optimize (more layers)
# - Slow inference (100 layers sequential)
# Example:
model = VeryDeepNarrow(num_layers=100, channels=8)
# Result: 60% accuracy (bottleneck!)
```
**Too shallow, too wide:**
```python
# 2 layers × 1024 channels
# Problems:
# - Under-utilizes depth (no hierarchical features)
# - Memory explosion (1024 × 1024 = 1M parameters per layer!)
# Example:
model = VeryWideShallow(num_layers=2, channels=1024)
# Result: 70% accuracy (doesn't leverage depth)
```
**Balanced:**
```python
# 18 layers, gradually increasing width: 64 → 128 → 256 → 512
# Benefits:
# - Hierarchical features (depth)
# - Sufficient capacity (width)
# - Good optimization (not too deep)
# Example (ResNet-18):
model = ResNet18()
# Layers: 18, Channels: 64-512 (average ~200)
# Result: 95% accuracy (optimal balance!)
```
### Standard Patterns:
```python
# CNNs: Gradually increase channels as spatial dims decrease
# Input: 224×224×3
# Layer 1: 224×224×64 (same spatial size, more channels)
# Layer 2: 112×112×128 (half spatial, double channels)
# Layer 3: 56×56×256 (half spatial, double channels)
# Layer 4: 28×28×512 (half spatial, double channels)
# Why? Compensate for spatial information loss with channel information
```
**Rule**: Balance depth and width. Standard pattern: 12-50 layers, 64-512 channels.
## Principle: Match Capacity to Data Size
**Capacity = # of learnable parameters**
### Parameter Budget:
```python
# Rule of thumb: parameters should be 0.01-0.1× dataset size
# Example 1: MNIST (60,000 images)
# Budget: 600 - 6,000 parameters
# Simple CNN: 60,000 parameters (10×) → Works, but might overfit
# LeNet: 60,000 parameters → Classic, works well
# Example 2: ImageNet (1.2M images)
# Budget: 12,000 - 120,000 parameters
# ResNet-50: 25M parameters (200×) → Works (aggressive augmentation helps)
# Example 3: Tabular (100 samples, 20 features)
# Budget: 1 - 10 parameters
# Linear: 21 parameters → Perfect fit!
# MLP: 1,000 parameters → Overfits horribly
```
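A one-line helper makes this check routine; the ratio, more than the exact thresholds, is what matters:

```python
def param_data_ratio(num_params, num_samples):
    """Parameters per training sample; values >> 1 are a red flag for overfitting."""
    return num_params / num_samples

print(param_data_ratio(1_000, 100))   # 10.0 -- MLP on 100 tabular rows: danger
print(param_data_ratio(21, 100))      # 0.21 -- linear model: reasonable
```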
### Overfitting Detection:
```python
# Training accuracy >> Validation accuracy (gap > 5%)
train_acc = 99%, val_acc = 70% # 29% gap → OVERFITTING!
# Solutions:
# 1. Reduce model capacity (fewer layers/channels)
# 2. Add regularization (dropout, weight decay)
# 3. Collect more data
# 4. Data augmentation
# Order: Try (1) first (simplest), then (2), then (3)/(4)
```
### Underfitting Detection:
```python
# Training accuracy < target (model too simple)
train_acc = 60%, val_acc = 58% # Both low → UNDERFITTING!
# Solutions:
# 1. Increase model capacity (more layers/channels)
# 2. Train longer
# 3. Reduce regularization
# Order: Try (2) first (cheapest), then (1), then (3)
```
**Rule**: Match parameters to data size. Start small, increase capacity only if underfitting.
## Principle: Design for Compute Constraints
**Constraints:**
1. **Memory**: Model + gradients + optimizer states < GPU VRAM
2. **Latency**: Inference time < requirement (e.g., < 100ms for real-time)
3. **Throughput**: Samples/second > requirement
### Memory Budget:
```python
# Memory calculation (training):
# 1. Model parameters (FP32): params × 4 bytes
# 2. Gradients: params × 4 bytes
# 3. Optimizer states (Adam): params × 8 bytes (2× weights)
# 4. Activations: batch_size × feature_maps × spatial_size × 4 bytes
# Example: ResNet-50
params = 25M
memory_params = 25M × 4 = 100 MB
memory_gradients = 100 MB
memory_optimizer = 200 MB
memory_activations ≈ batch_size × 64 × 7 × 7 × 4 bytes ≈ batch_size × 12.5 KB  # final feature map only
# Total (batch=32): 100 + 100 + 200 + ~0.4 ≈ 400 MB (plus intermediate activations)
# Weights, gradients, and optimizer states fit easily on a 4GB GPU
# Example: GPT-3 (175B parameters)
memory_params = 175B × 4 = 700 GB
memory_total = 700 + 700 + 1400 = 2800 GB = 2.8 TB!
# Requires 35×A100 (80GB each)
```
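The fixed-cost part of that calculation (weights + gradients + Adam moment buffers, activations excluded) can be wrapped up as a sketch:

```python
def training_memory_gb(num_params, bytes_per_param=4):
    """FP32 training memory for weights + gradients + Adam moment buffers (no activations)."""
    weights = num_params * bytes_per_param
    gradients = weights
    adam_states = 2 * weights          # first and second moments
    return (weights + gradients + adam_states) / 1e9

print(training_memory_gb(25e6))        # ResNet-50: ~0.4 GB before activations
print(training_memory_gb(175e9))       # GPT-3: ~2800 GB before activations
```

Run this before you write any training code: if the result already exceeds your VRAM, no batch size will save you.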
**Rule**: Calculate memory before training. Don't design models that don't fit.
### Latency Budget:
```python
# Inference latency = # operations / throughput
# Example: Mobile app (< 100ms latency requirement)
# ResNet-50:
# Operations: 4B FLOPs
# Mobile CPU: 10 GFLOPS
# Latency: 4B / 10G = 0.4 seconds (FAILS!)
# MobileNetV2:
# Operations: 300M FLOPs
# Mobile CPU: 10 GFLOPS
# Latency: 300M / 10G = 0.03 seconds = 30ms (PASSES!)
# Solution: Use efficient architectures (MobileNet, EfficientNet) for mobile
```
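The same back-of-envelope as code; the FLOP counts and throughput are the rough figures quoted above, not measurements:

```python
def latency_seconds(flops, device_flops_per_sec):
    """Idealized latency: total operations / device throughput."""
    return flops / device_flops_per_sec

mobile_cpu = 10e9                            # ~10 GFLOPS, rough figure
print(latency_seconds(4e9, mobile_cpu))      # ResNet-50: 0.4 s (fails a 100 ms budget)
print(latency_seconds(300e6, mobile_cpu))    # MobileNetV2: 0.03 s (passes)
```

Real latency also depends on memory bandwidth and kernel efficiency, so always confirm with an on-device measurement.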
**Rule**: Measure latency. Use efficient architectures if latency-constrained.
## Common Architectural Patterns
### 1. Bottleneck (ResNet)
**Structure:**
```python
# Standard: 3×3 conv (256 channels) → 3×3 conv (256 channels)
# Parameters: 256 × 256 × 3 × 3 = 590K
# Bottleneck: 1×1 (256→64) → 3×3 (64→64) → 1×1 (64→256)
# Parameters: 256×64 + 64×64×3×3 + 64×256 = 16K + 37K + 16K = 69K
# Reduction: 590K → 69K (8.5× fewer!)
```
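Reproducing that count in code (1×1 and 3×3 weight tensors, biases ignored as in the figures above):

```python
def conv_w(in_ch, out_ch, k):
    """Weight count of a k×k conv layer, biases ignored, matching the figures above."""
    return in_ch * out_ch * k * k

standard = conv_w(256, 256, 3)                                   # 589,824 (~590K)
bottleneck = conv_w(256, 64, 1) + conv_w(64, 64, 3) + conv_w(64, 256, 1)
print(standard, bottleneck, round(standard / bottleneck, 1))     # ~8.5× fewer
```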
**Purpose**: Reduce parameters while maintaining capacity
**When to use**: Deep networks (> 50 layers) where parameters are a concern
### 2. Inverted Bottleneck (MobileNetV2)
**Structure:**
```python
# Bottleneck (ResNet): Wide → Narrow → Wide (256 → 64 → 256)
# Inverted: Narrow → Wide → Narrow (64 → 256 → 64)
# Why? Efficient for mobile (depthwise separable convolutions)
```
**Purpose**: Maximize efficiency (FLOPs per parameter)
**When to use**: Mobile/edge deployment
### 3. Multi-scale Features (Inception)
**Structure:**
```python
# Parallel branches with different kernel sizes:
# Branch 1: 1×1 conv
# Branch 2: 3×3 conv
# Branch 3: 5×5 conv
# Branch 4: 3×3 max pool
# Concatenate all branches
# Captures features at multiple scales simultaneously
```
**Purpose**: Capture multi-scale patterns
**When to use**: When features exist at multiple scales (object detection)
### 4. Attention (Transformers, SE-Net)
**Structure:**
```python
# Squeeze-and-Excitation (SE) block:
# 1. Global average pooling (spatial → channel descriptor)
# 2. FC layer (bottleneck)
# 3. FC layer (restore channels)
# 4. Sigmoid (attention weights)
# 5. Multiply input channels by attention weights
# Result: Emphasize important channels, suppress irrelevant
```
**Purpose**: Learn importance of features (channels or positions)
**When to use**: When not all features equally important
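A minimal SE block following those five steps; the reduction ratio of 16 and the tensor shapes are typical choices, not prescribed by the text:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by learned importance."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # bottleneck
        self.fc2 = nn.Linear(channels // reduction, channels)  # restore

    def forward(self, x):                      # x: (batch, C, H, W)
        w = x.mean(dim=(2, 3))                 # 1. squeeze: global average pool → (batch, C)
        w = torch.relu(self.fc1(w))            # 2-3. excitation MLP
        w = torch.sigmoid(self.fc2(w))         # 4. attention weights in (0, 1)
        return x * w[:, :, None, None]         # 5. rescale each channel

out = SEBlock(64)(torch.randn(2, 64, 8, 8))
print(out.shape)                               # same shape in, same shape out
```

The block is shape-preserving, so it can be dropped after any conv stage with only a small parameter cost.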
## Debugging Architectures
### Problem 1: Network doesn't learn (loss stays constant)
**Diagnosis:**
```python
# Check gradient flow
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_mean={param.grad.mean():.6f}, grad_std={param.grad.std():.6f}")
# Vanishing: grad_mean ≈ 0, grad_std ≈ 0 → Add skip connections
# Exploding: grad_mean > 1, grad_std > 1 → Gradient clipping or lower LR
```
**Solutions:**
- Add skip connections (ResNet)
- Check initialization (Xavier or He initialization)
- Lower learning rate
- Check data preprocessing (normalized inputs?)
### Problem 2: Overfitting (train >> val)
**Diagnosis:**
```python
train_acc = 99%, val_acc = 70% # 29% gap → Overfitting
# Check parameter/data ratio:
num_params = sum(p.numel() for p in model.parameters())
data_size = len(train_dataset)
ratio = num_params / data_size
# If ratio > 1: Model has more parameters than data points!
```
**Solutions (in order):**
1. Reduce capacity (fewer layers/channels)
2. Add dropout / weight decay
3. Data augmentation
4. Collect more data
### Problem 3: Underfitting (train and val both low)
**Diagnosis:**
```python
train_acc = 65%, val_acc = 63% # Both low → Underfitting
# Model too simple for task complexity
```
**Solutions (in order):**
1. Train longer (more epochs)
2. Increase capacity (more layers/channels)
3. Reduce regularization (lower dropout/weight decay)
4. Check learning rate (too low?)
### Problem 4: Slow training
**Diagnosis:**
```python
# Profile forward/backward pass
import time
import torch

# Note: CUDA ops run asynchronously -- synchronize before reading the clock
start = time.time()
loss = criterion(model(inputs), targets)
if torch.cuda.is_available():
    torch.cuda.synchronize()
forward_time = time.time() - start

start = time.time()
loss.backward()
if torch.cuda.is_available():
    torch.cuda.synchronize()
backward_time = time.time() - start

# If backward_time >> forward_time: Gradient computation bottleneck
```
**Solutions:**
- Use mixed precision (FP16)
- Reduce batch size (if memory-bound)
- Use gradient accumulation (simulate large batch)
- Simplify architecture (fewer layers)
## Design Checklist
Before finalizing an architecture:
### ☐ Match inductive bias to problem
- Images → CNN
- Sequences → RNN/Transformer
- Graphs → GNN
- Tabular → MLP
### ☐ Start simple, add complexity only when needed
- Test linear baseline first
- Add complexity incrementally
- Compare performance at each step
### ☐ Use skip connections for deep networks (> 10 layers)
- ResNet for CNNs
- Pre-norm for Transformers
- Gradient flow is critical
### ☐ Balance depth and width
- Not too deep and narrow (bottleneck)
- Not too shallow and wide (under-utilizes depth)
- Standard: 12-50 layers, 64-512 channels
### ☐ Match capacity to data size
- Parameters ≈ 0.01-0.1× dataset size
- Monitor train/val gap (overfitting indicator)
### ☐ Respect compute constraints
- Memory: Model + gradients + optimizer + activations < VRAM
- Latency: Inference time < requirement
- Use efficient architectures if constrained (MobileNet, EfficientNet)
### ☐ Verify gradient flow
- Check gradients in early layers (should be non-zero)
- Use skip connections if vanishing
### ☐ Benchmark against baselines
- Compare to simple model (linear, small MLP)
- Ensure complexity adds value (% improvement > 5%)
## Anti-Patterns
### Anti-pattern 1: "Architecture X is state-of-the-art, so I'll use it"
**Wrong:**
```python
# Transformer is SOTA for NLP, so use for tabular data (100 samples)
model = HugeTransformer(...) # 10M parameters
# Result: Overfits horribly (100 samples / 10M params = 0.00001 ratio!)
```
**Right:**
```python
# Match architecture to problem AND data size
# Tabular + small data → Linear or small MLP
model = nn.Linear(20, 1) # 21 parameters (appropriate!)
```
### Anti-pattern 2: "More layers = better"
**Wrong:**
```python
# 100-layer plain network (no skip connections)
layers = []
for i in range(100):
    layers.append(nn.Conv2d(64, 64, 3, padding=1))
# Result: Doesn't train (vanishing gradients)
```
**Right:**
```python
# 50-layer ResNet (with skip connections)
# Each block: out = x + F(x) # Skip connection
# Result: Trains well, high accuracy
```
### Anti-pattern 3: "Deeper + narrower = efficient"
**Wrong:**
```python
# 100 layers × 8 channels = information bottleneck
model = VeryDeepNarrow(100, 8)
# Result: 60% accuracy (8 channels insufficient)
```
**Right:**
```python
# 18 layers, 64-512 channels (balanced)
model = ResNet18() # Balanced depth and width
# Result: 95% accuracy
```
### Anti-pattern 4: "Ignore constraints, optimize later"
**Wrong:**
```python
# Design 1.5B parameter model for 24GB GPU
model = HugeModel(1.5e9)
# Result: OOM (out of memory), can't train
```
**Right:**
```python
# Calculate memory first:
# 1.5B params × 4 bytes = 6GB (weights)
# + 6GB (gradients) + 12GB (Adam) + 8GB (activations) = 32GB
# > 24GB → Doesn't fit!
# Design for hardware:
model = ReasonableSizeModel(200e6) # 200M parameters (fits!)
```
### Anti-pattern 5: "Hyperparameters will fix architectural problems"
**Wrong:**
```python
# Architecture: MLP for images (wrong inductive bias)
# Response: "Just tune learning rate!"
for lr in [0.1, 0.01, 0.001, 0.0001]:
    train(model, lr=lr)
# Result: All fail (architecture is wrong!)
```
**Right:**
```python
# Fix architecture first (use CNN for images)
model = ResNet18() # Correct inductive bias
# Then tune hyperparameters
```
## Summary
**Core principles:**
1. **Inductive bias**: Match architecture to problem structure (CNN for images, RNN/Transformer for sequences, GNN for graphs)
2. **Occam's Razor**: Start simple (linear, small MLP), add complexity only when needed
3. **Skip connections**: Use for networks > 10 layers (ResNet, DenseNet)
4. **Depth-width balance**: Not too deep+narrow (bottleneck) or too shallow+wide (under-utilizes depth)
5. **Capacity**: Match parameters to data size (0.01-0.1× dataset size)
6. **Constraints**: Design for available memory, latency, throughput
**Decision framework:**
- Images → CNN (ResNet, EfficientNet)
- Short sequences → LSTM
- Long sequences → Transformer
- Graphs → GNN (test if structure helps first!)
- Tabular → Linear or small MLP
**Key insight**: Architecture design is about matching structural assumptions to problem structure, not about using the "best" or "most complex" model. Simple models often win.
**When in doubt**: Start with the simplest model that could plausibly work. Add complexity only when you have evidence it helps.