# Transformer Architecture Deep Dive
## When to Use This Skill
Use this skill when you need to:
- ✅ Implement a Transformer from scratch
- ✅ Understand HOW and WHY self-attention works
- ✅ Choose between encoder, decoder, or encoder-decoder architectures
- ✅ Decide if Vision Transformer (ViT) is appropriate for your vision task
- ✅ Understand modern variants (RoPE, ALiBi, GQA, MQA)
- ✅ Debug Transformer implementation issues
- ✅ Optimize Transformer performance
**Do NOT use this skill for:**
- ❌ High-level architecture selection (use `using-neural-architectures`)
- ❌ Attention mechanism comparison (use `attention-mechanisms-catalog`)
- ❌ LLM-specific topics like prompt engineering (use `llm-specialist` pack)
## Core Principle
**Transformers are NOT magic.** They are:
1. Self-attention mechanism (information retrieval)
2. + Position encoding (break permutation invariance)
3. + Residual connections + Layer norm (training stability)
4. + Feed-forward networks (non-linearity)
Understanding the mechanism beats cargo-culting implementations.
## Part 1: Self-Attention Mechanism Explained
### The Information Retrieval Analogy
**Self-attention = Querying a database:**
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I have?"
**Process:**
1. Compare your query with all keys (compute similarity)
2. Weight values by similarity
3. Return weighted sum of values
**Example:** Sentence: "The cat sat on the mat"
Token "sat" (verb):
- High attention to "cat" (subject) → Learns verb-subject relationship
- High attention to "mat" (object) → Learns verb-object relationship
- Low attention to "the", "on" (function words)
### Mathematical Breakdown
**Given input X:** (batch, seq_len, d_model)
**Step 1: Project to Q, K, V**
```python
Q = X @ W_Q # (batch, seq_len, d_k)
K = X @ W_K # (batch, seq_len, d_k)
V = X @ W_V # (batch, seq_len, d_v)
# Typically: d_k = d_v = d_model / num_heads
```
**Step 2: Compute attention scores** (similarity)
```python
scores = Q @ K.transpose(-2, -1) # (batch, seq_len, seq_len)
# scores[i, j] = similarity between query_i and key_j
```
**Geometric interpretation:**
- Dot product measures vector alignment
- q · k = ||q|| ||k|| cos(θ)
- Similar vectors → Large dot product → High attention
- Orthogonal vectors → Zero dot product → No attention
**Step 3: Scale by √d_k** (CRITICAL!)
```python
scores = scores / math.sqrt(d_k)
```
**WHY scaling?**
- Dot products grow with dimension: Var(q · k) = d_k
- Example: d_k=64 → random dot products have standard deviation √64 = 8 (and it keeps growing with d_k)
- Large scores → Softmax saturates → Gradients vanish
- Scaling: Keep scores ~ O(1) regardless of dimension
**Without scaling:** Softmax([30, 25, 20]) ≈ [0.99, 0.01, 0.00] (saturated!)
**With scaling:** Softmax([3, 2.5, 2]) ≈ [0.50, 0.30, 0.20] (healthy gradients)
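A quick empirical check (a standalone sketch, not part of the original text): with unit-variance random vectors, the spread of q · k grows like √d_k, and dividing by √d_k brings it back to O(1).
```python
import torch

torch.manual_seed(0)
for d_k in [16, 64, 256]:
    q = torch.randn(10_000, d_k)   # unit-variance random queries
    k = torch.randn(10_000, d_k)   # unit-variance random keys
    dots = (q * k).sum(dim=-1)     # one dot product per (q, k) pair
    print(f"d_k={d_k:3d}  std(q.k)={dots.std().item():6.2f}  "
          f"std(q.k / sqrt(d_k))={(dots / d_k ** 0.5).std().item():.2f}")
# std(q.k) comes out near 4, 8, 16 (i.e. sqrt(d_k)), while the scaled version stays near 1
```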
**Step 4: Softmax to get attention weights**
```python
attn_weights = F.softmax(scores, dim=-1) # (batch, seq_len, seq_len)
# Each row sums to 1 (probability distribution)
# attn_weights[i, j] = "how much token i attends to token j"
```
**Step 5: Weight values**
```python
output = attn_weights @ V # (batch, seq_len, d_v)
# Each token's output = weighted average of all values
```
**Complete formula:**
```python
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
```
### Why Three Matrices (Q, K, V)?
**Could we use just one?** Attention(X, X, X)
**Yes, but Q/K/V separation enables:**
1. **Asymmetry**: Query can differ from key (search ≠ database)
2. **Decoupling**: What you search for (Q@K) ≠ what you retrieve (V)
3. **Cross-attention**: Q from one source, K/V from another
- Example: Decoder queries encoder (translation)
**Modern optimization:** Multi-Query Attention (MQA), Grouped-Query Attention (GQA)
- Share K/V across heads (fewer parameters, faster inference)
### Implementation Example
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class SelfAttention(nn.Module):
def __init__(self, d_model, d_k=None):
super().__init__()
self.d_k = d_k or d_model
self.W_q = nn.Linear(d_model, self.d_k)
self.W_k = nn.Linear(d_model, self.d_k)
self.W_v = nn.Linear(d_model, self.d_k)
def forward(self, x, mask=None):
# x: (batch, seq_len, d_model)
Q = self.W_q(x) # (batch, seq_len, d_k)
K = self.W_k(x) # (batch, seq_len, d_k)
V = self.W_v(x) # (batch, seq_len, d_k)
# Attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
# scores: (batch, seq_len, seq_len)
# Apply mask if provided (for causal attention)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
# Attention weights
attn_weights = F.softmax(scores, dim=-1) # (batch, seq_len, seq_len)
# Weighted sum of values
output = torch.matmul(attn_weights, V) # (batch, seq_len, d_k)
return output, attn_weights
```
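A minimal shape check for the module above (illustrative sizes only):
```python
attn = SelfAttention(d_model=512)
x = torch.randn(2, 10, 512)                 # (batch=2, seq_len=10, d_model=512)
out, weights = attn(x)
print(out.shape)      # torch.Size([2, 10, 512])  (d_k defaults to d_model here)
print(weights.shape)  # torch.Size([2, 10, 10])   (one row per query; each row sums to 1)
```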
**Complexity:** O(n² · d) where n = seq_len, d = d_model
- **Quadratic in sequence length** (bottleneck for long sequences)
- For n=1000, d=512: 1000² × 512 = 512M operations
## Part 2: Multi-Head Attention
### Why Multiple Heads?
**Single-head attention** learns one attention pattern.
**Multi-head attention** learns multiple parallel patterns:
- Head 1: Syntactic relationships (subject-verb)
- Head 2: Semantic similarity
- Head 3: Positional proximity
- Head 4: Long-range dependencies
**Analogy:** Ensemble of attention functions, each specializing in different patterns.
### Head Dimension Calculation
**CRITICAL CONSTRAINT:** num_heads must divide d_model evenly!
```python
d_model = 512
num_heads = 8
d_k = d_model // num_heads # 512 / 8 = 64
# Each head operates in d_k dimensions
# Concatenate all heads → back to d_model dimensions
```
**Common configurations:**
- BERT-base: d_model=768, heads=12, d_k=64
- GPT-2: d_model=768, heads=12, d_k=64
- GPT-3 175B: d_model=12288, heads=96, d_k=128
- LLaMA-2 70B: d_model=8192, heads=64, d_k=128
**Rule of thumb:** d_k (head dimension) should be 64-128
- Too small (d_k < 32): Limited representational capacity
- Too large (d_k > 256): Redundant, wasteful
### Multi-Head Implementation
```python
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads
# Single linear layers for all heads (more efficient)
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model) # Output projection
def split_heads(self, x):
# x: (batch, seq_len, d_model)
batch_size, seq_len, d_model = x.size()
# Reshape to (batch, seq_len, num_heads, d_k)
x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
# Transpose to (batch, num_heads, seq_len, d_k)
return x.transpose(1, 2)
def forward(self, x, mask=None):
batch_size = x.size(0)
# Linear projections
Q = self.W_q(x) # (batch, seq_len, d_model)
K = self.W_k(x)
V = self.W_v(x)
# Split into multiple heads
Q = self.split_heads(Q) # (batch, num_heads, seq_len, d_k)
K = self.split_heads(K)
V = self.split_heads(V)
# Attention
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
attn_weights = F.softmax(scores, dim=-1)
# Weighted sum
attn_output = torch.matmul(attn_weights, V)
# attn_output: (batch, num_heads, seq_len, d_k)
# Concatenate heads
attn_output = attn_output.transpose(1, 2).contiguous()
# (batch, seq_len, num_heads, d_k)
attn_output = attn_output.view(batch_size, -1, self.d_model)
# (batch, seq_len, d_model)
# Final linear projection
output = self.W_o(attn_output)
return output, attn_weights
```
### Modern Variants: GQA and MQA
**Problem:** K/V caching during inference is memory-intensive
- LLaMA-2 70B: 2 (K + V) × 8192 (d_model) = 16,384 values cached per token, per layer; across all layers that is over a million values per token
**Solution 1: Multi-Query Attention (MQA)**
- **One** K/V head shared across **all** Q heads
- Benefit: Dramatically faster inference (smaller KV cache)
- Trade-off: ~1-2% accuracy loss
```python
# MQA: Single K/V projection
self.W_k = nn.Linear(d_model, d_k) # Not d_model!
self.W_v = nn.Linear(d_model, d_k)
self.W_q = nn.Linear(d_model, d_model) # Multiple Q heads
```
**Solution 2: Grouped-Query Attention (GQA)**
- Middle ground: Group multiple Q heads per K/V head
- Example: 32 Q heads → 8 K/V heads (4 Q per K/V)
- Benefit: 4x smaller KV cache, minimal accuracy loss
**Used in:** LLaMA-2, Mistral, Mixtral
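A minimal GQA sketch (my own illustration, reusing the imports from earlier in this file): project to fewer K/V heads, then repeat each K/V head so that a group of query heads shares it. MQA is the special case num_kv_heads=1.
```python
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model, num_q_heads, num_kv_heads):
        super().__init__()
        assert d_model % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
        self.d_k = d_model // num_q_heads
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.W_q = nn.Linear(d_model, num_q_heads * self.d_k)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.d_k)  # fewer K heads
        self.W_v = nn.Linear(d_model, num_kv_heads * self.d_k)  # fewer V heads
        self.W_o = nn.Linear(num_q_heads * self.d_k, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.W_q(x).view(B, T, self.num_q_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, T, self.num_kv_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.num_kv_heads, self.d_k).transpose(1, 2)
        # Repeat each K/V head so every group of query heads shares it
        group_size = self.num_q_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        return self.W_o(out)
```
With num_q_heads=32 and num_kv_heads=8 this matches the "32 Q heads → 8 K/V heads" example above, and only the 8 K/V heads need to be cached during generation.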
## Part 3: Position Encoding
### Why Position Encoding?
**Problem:** Self-attention is **permutation-invariant**
- Attention("cat sat mat") = Attention("mat cat sat")
- No inherent notion of position or order!
**Solution:** Add position information to embeddings
### Strategy 1: Sinusoidal Position Encoding (Original)
**Formula:**
```python
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
**Implementation:**
```python
def sinusoidal_position_encoding(seq_len, d_model):
pe = torch.zeros(seq_len, d_model)
position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
(-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
return pe
# Usage: Add to input embeddings
x = token_embeddings + positional_encoding
```
**Properties:**
- Deterministic (no learned parameters)
- Extrapolates to unseen lengths (geometric properties)
- Relative positions: PE(pos+k) is linear function of PE(pos)
**When to use:** Variable-length sequences in NLP
### Strategy 2: Learned Position Embeddings
```python
self.pos_embedding = nn.Embedding(max_seq_len, d_model)
# Usage
positions = torch.arange(seq_len, device=x.device)
x = token_embeddings + self.pos_embedding(positions)
```
**Properties:**
- Learnable (adapts to data)
- Cannot extrapolate beyond max_seq_len
**When to use:**
- Fixed-length sequences
- Vision Transformers (image patches)
- When training data covers all positions
### Strategy 3: Rotary Position Embeddings (RoPE) ⭐
**Modern approach (2021+):** Rotate Q and K in complex plane
**Key advantages:**
- Encodes **relative** positions naturally
- Better long-range decay properties
- No addition to embeddings (applied in attention)
**Used in:** GPT-NeoX, PaLM, LLaMA, LLaMA-2, Mistral
```python
def apply_rotary_pos_emb(x, cos, sin):
    # x: (batch, num_heads, seq_len, d_k)
    # cos, sin: position-dependent tables broadcastable to (seq_len, d_k/2)
    # Treat each (even, odd) channel pair as a 2D point and rotate it by a
    # position-dependent angle
    x1, x2 = x[..., ::2], x[..., 1::2]
    # Rotate; Q and K must use the same channel-layout convention
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```
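A sketch of how the cos/sin tables are typically precomputed (the function name and layout are mine, chosen to match the even/odd split above):
```python
def build_rope_cache(seq_len, d_k, base=10000.0):
    # One rotation frequency per (even, odd) channel pair
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))  # (d_k/2,)
    positions = torch.arange(seq_len).float()                           # (seq_len,)
    angles = torch.outer(positions, inv_freq)                           # (seq_len, d_k/2)
    return angles.cos(), angles.sin()

# Applied to Q and K (not V) inside each attention layer:
# cos, sin = build_rope_cache(seq_len, d_k)
# Q = apply_rotary_pos_emb(Q, cos, sin)
# K = apply_rotary_pos_emb(K, cos, sin)
```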
### Strategy 4: ALiBi (Attention with Linear Biases) ⭐
**Simplest modern approach:** Add bias to attention scores (no embeddings!)
```python
# Causal ALiBi bias: bias[i, j] = -m * (i - j) for j <= i, where m is a fixed
# per-head slope (future positions are masked as usual). With m = 1:
# [[  0,   .,   .,   .],
#  [ -1,   0,   .,   .],
#  [ -2,  -1,   0,   .],
#  [ -3,  -2,  -1,   0]]   ("." marks masked future positions)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + alibi_bias
```
**Key advantages:**
- **Best extrapolation** to longer sequences
- No positional embeddings (simpler)
- Per-head slopes (different decay rates)
**Used in:** BLOOM
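A small sketch of building the causal ALiBi bias with per-head slopes (names are mine; the geometric slope schedule is the commonly used one for power-of-two head counts):
```python
def build_alibi_bias(num_heads, seq_len):
    # Per-head slopes: a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = pos.view(-1, 1) - pos.view(1, -1)      # distance[i, j] = i - j
    bias = -slopes.view(-1, 1, 1) * distance          # (num_heads, seq_len, seq_len)
    # Masking future positions here doubles as the causal mask
    return bias.masked_fill(distance < 0, float("-inf"))

# Then, per head h: scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + bias[h]
```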
### Position Encoding Selection Guide
| Use Case | Recommended | Why |
|----------|-------------|-----|
| NLP (variable length) | RoPE or ALiBi | Better extrapolation |
| NLP (fixed length) | Learned embeddings | Adapts to data |
| Vision (ViT) | 2D learned embeddings | Spatial structure |
| Long sequences (>2k) | ALiBi | Best extrapolation |
| Legacy/compatibility | Sinusoidal | Original Transformer |
**Modern trend (2023+):** RoPE and ALiBi dominate over sinusoidal
## Part 4: Architecture Variants
### Variant 1: Encoder-Only (Bidirectional)
**Architecture:**
- Self-attention: Each token attends to **ALL** tokens (past + future)
- No masking (bidirectional context)
**Examples:** BERT, RoBERTa, ELECTRA, DeBERTa
**Use cases:**
- Text classification
- Named entity recognition
- Question answering (extract span from context)
- Sentence embeddings
**Key property:** Sees full context → Good for **understanding**
**Implementation:**
```python
class TransformerEncoder(nn.Module):
def __init__(self, d_model, num_heads, d_ff, num_layers):
super().__init__()
self.layers = nn.ModuleList([
EncoderLayer(d_model, num_heads, d_ff)
for _ in range(num_layers)
])
def forward(self, x, mask=None):
for layer in self.layers:
x = layer(x, mask) # No causal mask!
return x
class EncoderLayer(nn.Module):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.feed_forward = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model)
)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, x, mask=None):
# Self-attention + residual + norm
attn_output, _ = self.self_attn(x, mask)
x = self.norm1(x + attn_output)
# Feed-forward + residual + norm
ff_output = self.feed_forward(x)
x = self.norm2(x + ff_output)
return x
```
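Quick usage sketch for the encoder above (sizes are illustrative; in practice you would also pass a padding mask for variable-length batches):
```python
encoder = TransformerEncoder(d_model=256, num_heads=8, d_ff=1024, num_layers=4)
tokens = torch.randn(2, 20, 256)   # (batch, seq_len, d_model), already embedded
output = encoder(tokens)           # no causal mask: every token sees every token
print(output.shape)                # torch.Size([2, 20, 256])
```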
### Variant 2: Decoder-Only (Autoregressive)
**Architecture:**
- Self-attention with **causal masking**
- Each token attends ONLY to past tokens (not future)
**Causal mask (lower triangular):**
```python
# mask[i, j] = 1 if j <= i else 0
[[1, 0, 0, 0], # Token 0 sees only itself
[1, 1, 0, 0], # Token 1 sees tokens 0-1
[1, 1, 1, 0], # Token 2 sees tokens 0-2
[1, 1, 1, 1]] # Token 3 sees all
```
**Examples:** GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral
**Use cases:**
- Text generation
- Language modeling
- Code generation
- Autoregressive prediction
**Key property:** Generates sequentially → Good for **generation**
**Implementation:**
```python
def create_causal_mask(seq_len, device):
# Lower triangular matrix
mask = torch.tril(torch.ones(seq_len, seq_len, device=device))
return mask
class TransformerDecoder(nn.Module):
def __init__(self, d_model, num_heads, d_ff, num_layers):
super().__init__()
self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff)  # same structure as EncoderLayer above; only the causal mask differs
for _ in range(num_layers)
])
def forward(self, x):
seq_len = x.size(1)
causal_mask = create_causal_mask(seq_len, x.device)
for layer in self.layers:
x = layer(x, causal_mask) # Apply causal mask!
return x
```
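To make "generates sequentially" concrete, here is a minimal greedy decoding loop. `embed`, `lm_head`, and `decoder` are placeholders (a token embedding, the decoder above, and a vocabulary projection); they are not defined elsewhere in this file.
```python
@torch.no_grad()
def greedy_generate(decoder, embed, lm_head, prompt_ids, max_new_tokens=20):
    # prompt_ids: (1, prompt_len) token ids
    ids = prompt_ids
    for _ in range(max_new_tokens):
        hidden = decoder(embed(ids))            # causal mask applied inside the decoder
        next_logits = lm_head(hidden[:, -1])    # logits for the next token only
        next_id = next_logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)  # append and repeat, one token per step
    return ids
```
Real implementations cache K/V instead of re-running the full prefix at every step, which is exactly why the MQA/GQA cache-size discussion in Part 2 matters.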
**Modern trend (2023+):** Decoder-only architectures dominate
- Can do both generation AND understanding (via prompting)
- Simpler than encoder-decoder (no cross-attention)
- Scales better to massive sizes
### Variant 3: Encoder-Decoder (Seq2Seq)
**Architecture:**
- **Encoder**: Bidirectional self-attention (understands input)
- **Decoder**: Causal self-attention (generates output)
- **Cross-attention**: Decoder queries encoder outputs
**Cross-attention mechanism:**
```python
# Q from decoder, K and V from encoder
Q = decoder_hidden @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v
cross_attn = softmax(Q K^T / sqrt(d_k)) V
```
**Examples:** T5, BART, mT5, original Transformer (2017)
**Use cases:**
- Translation (input ≠ output language)
- Summarization (long input → short output)
- Question answering (generate answer, not extract)
**When to use:** Input and output are fundamentally different
**Implementation:**
```python
class EncoderDecoderLayer(nn.Module):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads)
self.cross_attn = MultiHeadAttention(d_model, num_heads) # NEW!
        self.feed_forward = FeedForward(d_model, d_ff)  # same Linear -> ReLU -> Linear block as in EncoderLayer
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.norm3 = nn.LayerNorm(d_model)
def forward(self, decoder_input, encoder_output, causal_mask=None):
# 1. Self-attention (causal)
self_attn_out, _ = self.self_attn(decoder_input, causal_mask)
x = self.norm1(decoder_input + self_attn_out)
        # 2. Cross-attention: Q from the decoder, K/V from the encoder.
        # Assumes an attention module that accepts separate query/key/value
        # inputs (a minimal version is sketched after this block)
        cross_attn_out, _ = self.cross_attn(
            query=x,
            key=encoder_output,
            value=encoder_output
        )
x = self.norm2(x + cross_attn_out)
# 3. Feed-forward
ff_out = self.feed_forward(x)
x = self.norm3(x + ff_out)
return x
```
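The layer above assumes an attention module that takes separate query/key/value tensors, unlike the self-attention-only `MultiHeadAttention` from Part 2. A minimal single-head sketch of that generalization (my own naming; splitting into heads works exactly as before):
```python
class CrossAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        # query: decoder states; key/value: encoder outputs
        Q, K, V = self.W_q(query), self.W_k(key), self.W_v(value)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = F.softmax(scores, dim=-1)
        return weights @ V, weights
```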
### Architecture Selection Guide
| Task | Architecture | Why |
|------|--------------|-----|
| Classification | Encoder-only | Need full bidirectional context |
| Text generation | Decoder-only | Autoregressive generation |
| Translation | Encoder-decoder or Decoder-only | Different languages, or use prompting |
| Summarization | Encoder-decoder or Decoder-only | Length mismatch, or use prompting |
| Q&A (extract) | Encoder-only | Find span in context |
| Q&A (generate) | Decoder-only | Generate freeform answer |
**2023+ trend:** Decoder-only can do everything via prompting (but less parameter-efficient for some tasks)
## Part 5: Vision Transformers (ViT)
### From Images to Sequences
**Key insight:** Treat image as sequence of patches
**Process:**
1. Split image into patches (e.g., 16×16 pixels)
2. Flatten each patch → 1D vector
3. Linear projection → token embeddings
4. Add 2D positional embeddings
5. Prepend [CLS] token (for classification)
6. Feed to Transformer encoder
**Example:** 224×224 image, 16×16 patches
- Number of patches: (224/16)² = 196
- Each patch: 16 × 16 × 3 = 768 dimensions
- Transformer input: 197 tokens (196 patches + 1 [CLS])
### ViT Implementation
```python
class VisionTransformer(nn.Module):
def __init__(self, img_size=224, patch_size=16, in_channels=3,
d_model=768, num_heads=12, num_layers=12, num_classes=1000):
super().__init__()
self.patch_size = patch_size
num_patches = (img_size // patch_size) ** 2
patch_dim = in_channels * patch_size ** 2
# Patch embedding (linear projection of flattened patches)
self.patch_embed = nn.Linear(patch_dim, d_model)
# [CLS] token (learnable)
self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
# Position embeddings (learnable)
self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))
# Transformer encoder
self.encoder = TransformerEncoder(d_model, num_heads,
d_ff=4*d_model, num_layers=num_layers)
# Classification head
self.head = nn.Linear(d_model, num_classes)
def forward(self, x):
# x: (batch, channels, height, width)
batch_size = x.size(0)
        # Divide into patches and flatten
        x = x.unfold(2, self.patch_size, self.patch_size)
        x = x.unfold(3, self.patch_size, self.patch_size)
        # (batch, channels, num_patches_h, num_patches_w, patch_size, patch_size)
        # Move channels next to the per-patch pixels before flattening, so that
        # each row really is one patch across all channels
        x = x.permute(0, 2, 3, 1, 4, 5).contiguous()
        x = x.view(batch_size, -1, self.patch_size ** 2 * 3)
        # (batch, num_patches, patch_dim)
# Linear projection
x = self.patch_embed(x) # (batch, num_patches, d_model)
# Prepend [CLS] token
cls_tokens = self.cls_token.expand(batch_size, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # (batch, num_patches+1, d_model)
# Add positional embeddings
x = x + self.pos_embed
# Transformer encoder
x = self.encoder(x)
# Classification: Use [CLS] token
cls_output = x[:, 0] # (batch, d_model)
logits = self.head(cls_output)
return logits
```
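A common equivalent for the unfold-and-project steps above: a strided convolution with kernel size and stride equal to the patch size performs the patch split and the linear projection in one call (a sketch using the ViT-B/16 sizes from this section):
```python
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
images = torch.randn(2, 3, 224, 224)
patches = patch_embed(images)                 # (2, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)  # (2, 196, 768): one token per patch
```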
### ViT vs CNN: Critical Differences
**1. Inductive Bias**
| Property | CNN | ViT |
|----------|-----|-----|
| Locality | Strong (conv kernel) | Weak (global attention) |
| Translation invariance | Strong (weight sharing) | Weak (position embeddings) |
| Hierarchy | Strong (pooling layers) | None (flat patches) |
**Implication:** CNN has strong priors, ViT learns from data
**2. Data Requirements**
| Dataset Size | CNN | ViT (from scratch) | ViT (pretrained) |
|--------------|-----|-------------------|------------------|
| Small (< 100k) | ✅ Good | ❌ Fails | ✅ Good |
| Medium (100k-1M) | ✅ Excellent | ⚠️ Poor | ✅ Good |
| Large (> 1M) | ✅ Excellent | ⚠️ OK | ✅ Excellent |
| Huge (> 100M) | ✅ Excellent | ✅ SOTA | N/A |
**Key finding:** ViT needs 100M+ images to train from scratch
- Original ViT: Trained on JFT-300M (300 million images)
- Without massive data, ViT underperforms CNNs significantly
**3. Computational Cost**
**Example: 224×224 images**
| Model | Parameters | GFLOPs | Inference (GPU) |
|-------|-----------|--------|-----------------|
| ResNet-50 | 25M | 4.1 | ~30ms |
| EfficientNet-B0 | 5M | 0.4 | ~10ms |
| ViT-B/16 | 86M | 17.6 | ~100ms |
**Implication:** ViT is 40x more expensive than EfficientNet!
### When to Use ViT
**Use ViT when:**
- Large dataset (> 1M images) OR using pretrained weights
- Computational cost acceptable (cloud, large GPU)
- Best possible accuracy needed
- Can fine-tune from ImageNet-21k checkpoint
**Use CNN when:**
- Small/medium dataset (< 1M images) and training from scratch
- Limited compute/memory
- Edge deployment (mobile, embedded)
- Need architectural inductive biases
### Hybrid Approaches (2022-2023)
**ConvNeXt:** CNN with ViT design choices
- Matches ViT accuracy with CNN efficiency
- Works better on small datasets
**Swin Transformer:** Hierarchical ViT with local windows
- Shifted windows for efficiency
- O(n) complexity instead of O(n²)
- Better for dense prediction (segmentation)
**CoAtNet:** Mix conv layers (early) + Transformer layers (late)
- Gets both inductive bias and global attention
## Part 6: Implementation Checklist
### Critical Details
**1. Layer Norm Placement**
**Post-norm (original):**
```python
x = x + self_attn(x)
x = layer_norm(x)
```
**Pre-norm (modern, recommended):**
```python
x = x + self_attn(layer_norm(x))
```
**Why pre-norm?** More stable training, less sensitive to learning rate
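A pre-norm encoder block sketch, reusing the `MultiHeadAttention` module defined in Part 2:
```python
class PreNormEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Norm sits inside the residual branch (pre-norm)
        attn_out, _ = self.self_attn(self.norm1(x), mask)
        x = x + attn_out
        x = x + self.feed_forward(self.norm2(x))
        return x
```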
**2. Attention Dropout**
Apply dropout to **attention weights**, not Q/K/V!
```python
attn_weights = F.softmax(scores, dim=-1)
attn_weights = F.dropout(attn_weights, p=0.1, training=self.training) # HERE!
output = torch.matmul(attn_weights, V)
```
**3. Feed-Forward Dimension**
Typically: d_ff = 4 × d_model
- BERT: d_model=768, d_ff=3072
- GPT-2: d_model=768, d_ff=3072
**4. Residual Connections**
ALWAYS use residual connections (essential for training)!
```python
x = x + self_attn(x) # Residual
x = x + feed_forward(x) # Residual
```
**5. Initialization**
Use Xavier/Glorot initialization for attention weights:
```python
nn.init.xavier_uniform_(self.W_q.weight)
nn.init.xavier_uniform_(self.W_k.weight)
nn.init.xavier_uniform_(self.W_v.weight)
```
## Part 7: When NOT to Use Transformers
### Limitation 1: Small Datasets
**Problem:** Transformers have weak inductive bias (learn from data)
**Impact:**
- ViT: Fails on < 100k images without pretraining
- NLP: BERT needs 100M+ tokens for pretraining
**Solution:** Use models with stronger priors (CNN for vision, smaller models for text)
### Limitation 2: Long Sequences
**Problem:** O(n²) memory complexity
**Impact:**
- Standard Transformer: n=10k → 100M attention scores
- GPU memory: 10k² × 4 bytes = 400MB per sample!
**Solution:**
- Sparse attention (Longformer, BigBird)
- Linear attention (Linformer, Performer)
- Flash Attention (memory-efficient kernel)
- State space models (S4, Mamba)
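For many long-sequence workloads the easiest first step is PyTorch's built-in fused attention (`F.scaled_dot_product_attention`, available in PyTorch 2.x), which can dispatch to Flash-style or memory-efficient kernels; a minimal sketch:
```python
# q, k, v: (batch, num_heads, seq_len, d_k)
q, k, v = (torch.randn(1, 8, 4096, 64) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# On GPU with fp16/bf16 inputs this can use fused kernels that avoid
# materializing the full 4096 x 4096 score matrix.
```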
### Limitation 3: Edge Deployment
**Problem:** Large model size, high latency
**Impact:**
- ViT-B: 86M parameters, ~100ms inference
- Mobile/embedded: Need < 10M parameters, < 50ms
**Solution:** Efficient CNNs (MobileNet, EfficientNet) or distilled models
### Limitation 4: Real-Time Processing
**Problem:** Sequential generation in decoder (cannot parallelize at inference)
**Impact:** GPT-style models generate one token at a time
**Solution:** Non-autoregressive models, speculative decoding, or smaller models
## Part 8: Common Mistakes
### Mistake 1: Forgetting Causal Mask
**Symptom:** Decoder "cheats" by seeing future tokens
**Fix:** Always apply causal mask to decoder self-attention!
```python
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
```
### Mistake 2: Wrong Dimension for Multi-Head
**Symptom:** Runtime error or dimension mismatch
**Fix:** Ensure d_model % num_heads == 0
```python
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
```
### Mistake 3: Forgetting Position Encoding
**Symptom:** Model ignores word order
**Fix:** Always add position information!
```python
x = token_embeddings + positional_encoding
```
### Mistake 4: Wrong Softmax Dimension
**Symptom:** Attention weights don't sum to 1 per query
**Fix:** Softmax over last dimension (keys)
```python
attn_weights = F.softmax(scores, dim=-1) # Sum over keys for each query
```
### Mistake 5: No Residual Connections
**Symptom:** Training diverges or converges very slowly
**Fix:** Always add residual connections!
```python
x = x + self_attn(x)
x = x + feed_forward(x)
```
## Summary: Quick Reference
### Architecture Selection
```
Classification/Understanding → Encoder-only (BERT-style)
Generation/Autoregressive → Decoder-only (GPT-style)
Seq2Seq (input ≠ output) → Encoder-decoder (T5-style) or Decoder-only with prompting
```
### Position Encoding Selection
```
NLP (variable length) → RoPE or ALiBi
NLP (fixed length) → Learned embeddings
Vision (ViT) → 2D learned embeddings
Long sequences (> 2k) → ALiBi (best extrapolation)
```
### Multi-Head Configuration
```
Small models (d_model < 512): 4-8 heads
Medium models (d_model 512-1024): 8-12 heads
Large models (d_model > 1024): 12-32 heads
Rule: d_k (head dimension) should be 64-128
```
### ViT vs CNN
```
ViT: Large dataset (> 1M) OR pretrained weights
CNN: Small dataset (< 1M) OR edge deployment
```
### Implementation Essentials
```
✅ Pre-norm (more stable than post-norm)
✅ Residual connections (essential!)
✅ Causal mask for decoder
✅ Attention dropout (on weights, not Q/K/V)
✅ d_ff = 4 × d_model (feed-forward dimension)
✅ Check: d_model % num_heads == 0
```
## Next Steps
After mastering this skill:
- `attention-mechanisms-catalog`: Explore attention variants (sparse, linear, Flash)
- `llm-specialist/llm-finetuning-strategies`: Apply to language models
- `architecture-design-principles`: Understand design trade-offs
**Remember:** Transformers are NOT magic. Understanding the mechanism (information retrieval via Q/K/V) beats cargo-culting implementations.