# Transformer Architecture Deep Dive

## When to Use This Skill

Use this skill when you need to:
- ✅ Implement a Transformer from scratch
- ✅ Understand HOW and WHY self-attention works
- ✅ Choose between encoder, decoder, or encoder-decoder architectures
- ✅ Decide if Vision Transformer (ViT) is appropriate for your vision task
- ✅ Understand modern variants (RoPE, ALiBi, GQA, MQA)
- ✅ Debug Transformer implementation issues
- ✅ Optimize Transformer performance

**Do NOT use this skill for:**
- ❌ High-level architecture selection (use `using-neural-architectures`)
- ❌ Attention mechanism comparison (use `attention-mechanisms-catalog`)
- ❌ LLM-specific topics like prompt engineering (use `llm-specialist` pack)

## Core Principle

**Transformers are NOT magic.** They are:
1. Self-attention mechanism (information retrieval)
2. + Position encoding (break permutation invariance)
3. + Residual connections + Layer norm (training stability)
4. + Feed-forward networks (non-linearity)

Understanding the mechanism beats cargo-culting implementations.

## Part 1: Self-Attention Mechanism Explained

### The Information Retrieval Analogy

**Self-attention = Querying a database:**
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I have?"

**Process:**
1. Compare your query with all keys (compute similarity)
2. Weight values by similarity
3. Return weighted sum of values

**Example:** Sentence: "The cat sat on the mat"

Token "sat" (verb):
- High attention to "cat" (subject) → Learns verb-subject relationship
- High attention to "mat" (object) → Learns verb-object relationship
- Low attention to "the", "on" (function words)

### Mathematical Breakdown

**Given input X:** (batch, seq_len, d_model)

**Step 1: Project to Q, K, V**
```python
Q = X @ W_Q  # (batch, seq_len, d_k)
K = X @ W_K  # (batch, seq_len, d_k)
V = X @ W_V  # (batch, seq_len, d_v)
# Typically: d_k = d_v = d_model / num_heads
```

**Step 2: Compute attention scores** (similarity)
```python
scores = Q @ K.transpose(-2, -1)  # (batch, seq_len, seq_len)
# scores[i, j] = similarity between query_i and key_j
```

**Geometric interpretation:**
- Dot product measures vector alignment
- q · k = ||q|| ||k|| cos(θ)
- Similar vectors → Large dot product → High attention
- Orthogonal vectors → Zero dot product → No attention

**Step 3: Scale by √d_k** (CRITICAL!)
```python
scores = scores / math.sqrt(d_k)
```

**WHY scaling?**
- Dot products grow with dimension: Var(q · k) = d_k
- Example: d_k=64 → random dot products have std √64 = 8, so raw scores in the tens are common
- Large scores → Softmax saturates → Gradients vanish
- Scaling: Keep scores ~ O(1) regardless of dimension

**Without scaling:** Softmax([30, 25, 20]) ≈ [0.99, 0.01, 0.00] (saturated!)

**With scaling:** Softmax([3, 2.5, 2]) ≈ [0.50, 0.30, 0.20] (healthy gradients)

**Step 4: Softmax to get attention weights**
```python
attn_weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
# Each row sums to 1 (probability distribution)
# attn_weights[i, j] = "how much token i attends to token j"
```

**Step 5: Weight values**
```python
output = attn_weights @ V  # (batch, seq_len, d_v)
# Each token's output = weighted average of all values
```

**Complete formula:**
```python
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
```

### Why Three Matrices (Q, K, V)?

**Could we use just one?** Attention(X, X, X)

**Yes, but Q/K/V separation enables:**
1. **Asymmetry**: Query can differ from key (search ≠ database)
2. **Decoupling**: What you search for (Q@K) ≠ what you retrieve (V)
3. **Cross-attention**: Q from one source, K/V from another
   - Example: Decoder queries encoder (translation)
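To make point 3 concrete: the same formula runs unchanged when Q is computed from one sequence and K/V from another. A minimal sketch (the function and projection names are illustrative, not from any library):

```python
import math
import torch
import torch.nn.functional as F

def cross_attention(query_seq, context_seq, W_q, W_k, W_v):
    # Q comes from one sequence (e.g., decoder states),
    # K and V from another (e.g., encoder outputs)
    Q = query_seq @ W_q        # (batch, tgt_len, d_k)
    K = context_seq @ W_k      # (batch, src_len, d_k)
    V = context_seq @ W_v      # (batch, src_len, d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = F.softmax(scores, dim=-1)   # (batch, tgt_len, src_len)
    return weights @ V                    # (batch, tgt_len, d_k)
```

The only structural change versus self-attention is the shape of the score matrix: (tgt_len, src_len) instead of (seq_len, seq_len).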
**Modern optimization:** Multi-Query Attention (MQA), Grouped-Query Attention (GQA)
- Share K/V across heads (fewer parameters, faster inference)

### Implementation Example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k=None):
        super().__init__()
        self.d_k = d_k or d_model
        self.W_q = nn.Linear(d_model, self.d_k)
        self.W_k = nn.Linear(d_model, self.d_k)
        self.W_v = nn.Linear(d_model, self.d_k)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)  # (batch, seq_len, d_k)
        K = self.W_k(x)  # (batch, seq_len, d_k)
        V = self.W_v(x)  # (batch, seq_len, d_k)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores: (batch, seq_len, seq_len)

        # Apply mask if provided (for causal attention)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Attention weights
        attn_weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)

        # Weighted sum of values
        output = torch.matmul(attn_weights, V)  # (batch, seq_len, d_k)

        return output, attn_weights
```

**Complexity:** O(n² · d) where n = seq_len, d = d_model
- **Quadratic in sequence length** (bottleneck for long sequences)
- For n=1000, d=512: 1000² × 512 = 512M operations

## Part 2: Multi-Head Attention

### Why Multiple Heads?

**Single-head attention** learns one attention pattern.

**Multi-head attention** learns multiple parallel patterns:
- Head 1: Syntactic relationships (subject-verb)
- Head 2: Semantic similarity
- Head 3: Positional proximity
- Head 4: Long-range dependencies

**Analogy:** Ensemble of attention functions, each specializing in different patterns.

### Head Dimension Calculation

**CRITICAL CONSTRAINT:** num_heads must divide d_model evenly!
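PyTorch's built-in `nn.MultiheadAttention` enforces the same divisibility rule; a quick sanity check (the exact error type is an implementation detail of current PyTorch, so it is caught broadly here):

```python
import torch.nn as nn

nn.MultiheadAttention(embed_dim=512, num_heads=8)       # OK: 512 / 8 = 64 per head
try:
    nn.MultiheadAttention(embed_dim=512, num_heads=7)   # 512 is not divisible by 7
except Exception as e:
    print(e)  # complains that embed_dim must be divisible by num_heads
```

The per-head dimension that results from this constraint is worked out below.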
```python
d_model = 512
num_heads = 8
d_k = d_model // num_heads  # 512 / 8 = 64

# Each head operates in d_k dimensions
# Concatenate all heads → back to d_model dimensions
```

**Common configurations:**
- BERT-base: d_model=768, heads=12, d_k=64
- GPT-2: d_model=768, heads=12, d_k=64
- GPT-3 175B: d_model=12288, heads=96, d_k=128
- LLaMA-2 70B: d_model=8192, heads=64, d_k=128

**Rule of thumb:** d_k (head dimension) should be 64-128
- Too small (d_k < 32): Limited representational capacity
- Too large (d_k > 256): Redundant, wasteful

### Multi-Head Implementation

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Single linear layers for all heads (more efficient)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # Output projection

    def split_heads(self, x):
        # x: (batch, seq_len, d_model)
        batch_size, seq_len, d_model = x.size()
        # Reshape to (batch, seq_len, num_heads, d_k)
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        # Transpose to (batch, num_heads, seq_len, d_k)
        return x.transpose(1, 2)

    def forward(self, x, mask=None):
        batch_size = x.size(0)

        # Linear projections
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)

        # Split into multiple heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(scores, dim=-1)

        # Weighted sum
        attn_output = torch.matmul(attn_weights, V)
        # attn_output: (batch, num_heads, seq_len, d_k)

        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        # (batch, seq_len, num_heads, d_k)
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        # (batch, seq_len, d_model)

        # Final linear projection
        output = self.W_o(attn_output)

        return output, attn_weights
```

### Modern Variants: GQA and MQA

**Problem:** K/V caching during inference is memory-intensive
- LLaMA-2 70B with full multi-head K/V: 2 (K + V) × 80 layers × 8192 dims ≈ 1.3M cached values per token!

**Solution 1: Multi-Query Attention (MQA)**
- **One** K/V head shared across **all** Q heads
- Benefit: Dramatically faster inference (smaller KV cache)
- Trade-off: ~1-2% accuracy loss

```python
# MQA: Single K/V projection
self.W_k = nn.Linear(d_model, d_k)  # Not d_model!
self.W_v = nn.Linear(d_model, d_k)
self.W_q = nn.Linear(d_model, d_model)  # Multiple Q heads
```

**Solution 2: Grouped-Query Attention (GQA)**
- Middle ground: Group multiple Q heads per K/V head
- Example: 32 Q heads → 8 K/V heads (4 Q per K/V)
- Benefit: 4x smaller KV cache, minimal accuracy loss

**Used in:** LLaMA-2, Mistral, Mixtral

## Part 3: Position Encoding

### Why Position Encoding?

**Problem:** Self-attention is **permutation-invariant** (more precisely, permutation-equivariant: shuffling the input just shuffles the output)
- Attention("cat sat mat") = Attention("mat cat sat")
- No inherent notion of position or order!
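A quick way to see this, reusing the `SelfAttention` sketch from Part 1 (a toy check, not a benchmark): permuting the input tokens produces exactly the permuted outputs, so the model cannot tell the two orderings apart.

```python
import torch

torch.manual_seed(0)
attn = SelfAttention(d_model=16)          # module from Part 1
x = torch.randn(1, 5, 16)                 # 5 "tokens"
perm = torch.tensor([4, 2, 0, 3, 1])      # an arbitrary reordering

out, _ = attn(x)
out_perm, _ = attn(x[:, perm, :])

# Permuted inputs give the permuted outputs: no notion of position
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-6))  # True
```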
**Solution:** Add position information to the embeddings

### Strategy 1: Sinusoidal Position Encoding (Original)

**Formula:**
```python
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

**Implementation:**
```python
def sinusoidal_position_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Usage: Add to input embeddings
x = token_embeddings + positional_encoding
```

**Properties:**
- Deterministic (no learned parameters)
- Extrapolates to unseen lengths (geometric properties)
- Relative positions: PE(pos+k) is a linear function of PE(pos)

**When to use:** Variable-length sequences in NLP

### Strategy 2: Learned Position Embeddings

```python
self.pos_embedding = nn.Embedding(max_seq_len, d_model)

# Usage
positions = torch.arange(seq_len, device=x.device)
x = token_embeddings + self.pos_embedding(positions)
```

**Properties:**
- Learnable (adapts to data)
- Cannot extrapolate beyond max_seq_len

**When to use:**
- Fixed-length sequences
- Vision Transformers (image patches)
- When training data covers all positions

### Strategy 3: Rotary Position Embeddings (RoPE) ⭐

**Modern approach (2021+):** Rotate Q and K in the complex plane

**Key advantages:**
- Encodes **relative** positions naturally
- Better long-range decay properties
- No addition to embeddings (applied inside attention)

**Used in:** GPT-NeoX, PaLM, LLaMA, LLaMA-2, Mistral

```python
def apply_rotary_pos_emb(x, cos, sin):
    # x: (batch, num_heads, seq_len, d_k)
    # Split features into even/odd pairs
    x1, x2 = x[..., ::2], x[..., 1::2]
    # Rotate each pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```
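The cos/sin tables consumed above are pure functions of position. A minimal sketch of building them, matching the pairing convention of the snippet (the helper name `rope_cos_sin` and base 10000 are assumptions, mirroring the sinusoidal formula):

```python
def rope_cos_sin(seq_len, d_k, base=10000.0):
    # One rotation frequency per feature pair (d_k/2 pairs)
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))  # (d_k/2,)
    positions = torch.arange(seq_len).float()                            # (seq_len,)
    angles = torch.outer(positions, inv_freq)                            # (seq_len, d_k/2)
    # Shaped to broadcast over (batch, num_heads, seq_len, d_k/2)
    return angles.cos()[None, None], angles.sin()[None, None]

# Usage: rotate Q and K (not V) before computing attention scores
cos, sin = rope_cos_sin(seq_len=128, d_k=64)
# Q = apply_rotary_pos_emb(Q, cos, sin); K = apply_rotary_pos_emb(K, cos, sin)
```

Only Q and K are rotated; V is left untouched, which is why the resulting attention scores depend on relative position only.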
### Strategy 4: ALiBi (Attention with Linear Biases) ⭐

**Simplest modern approach:** Add a bias to the attention scores (no embeddings!)

```python
# Causal bias matrix: -(distance to the attended past token)
# [[ 0,  ·,  ·,  ·],
#  [-1,  0,  ·,  ·],
#  [-2, -1,  0,  ·],
#  [-3, -2, -1,  0]]
# Each head scales this bias by its own slope m
scores = Q @ K^T / √d_k + m * alibi_bias
```

**Key advantages:**
- **Best extrapolation** to longer sequences
- No positional embeddings (simpler)
- Per-head slopes (different decay rates)

**Used in:** BLOOM

### Position Encoding Selection Guide

| Use Case | Recommended | Why |
|----------|-------------|-----|
| NLP (variable length) | RoPE or ALiBi | Better extrapolation |
| NLP (fixed length) | Learned embeddings | Adapts to data |
| Vision (ViT) | 2D learned embeddings | Spatial structure |
| Long sequences (>2k) | ALiBi | Best extrapolation |
| Legacy/compatibility | Sinusoidal | Original Transformer |

**Modern trend (2023+):** RoPE and ALiBi dominate over sinusoidal

## Part 4: Architecture Variants

### Variant 1: Encoder-Only (Bidirectional)

**Architecture:**
- Self-attention: Each token attends to **ALL** tokens (past + future)
- No masking (bidirectional context)

**Examples:** BERT, RoBERTa, ELECTRA, DeBERTa

**Use cases:**
- Text classification
- Named entity recognition
- Question answering (extract span from context)
- Sentence embeddings

**Key property:** Sees full context → Good for **understanding**

**Implementation:**
```python
class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)  # No causal mask!
        return x


class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention + residual + norm
        attn_output, _ = self.self_attn(x, mask)
        x = self.norm1(x + attn_output)

        # Feed-forward + residual + norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)

        return x
```

### Variant 2: Decoder-Only (Autoregressive)

**Architecture:**
- Self-attention with **causal masking**
- Each token attends ONLY to past tokens (not future)

**Causal mask (lower triangular):**
```python
# mask[i, j] = 1 if j <= i else 0
[[1, 0, 0, 0],   # Token 0 sees only itself
 [1, 1, 0, 0],   # Token 1 sees tokens 0-1
 [1, 1, 1, 0],   # Token 2 sees tokens 0-2
 [1, 1, 1, 1]]   # Token 3 sees all
```

**Examples:** GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral

**Use cases:**
- Text generation
- Language modeling
- Code generation
- Autoregressive prediction

**Key property:** Generates sequentially → Good for **generation**

**Implementation:**
```python
def create_causal_mask(seq_len, device):
    # Lower triangular matrix
    mask = torch.tril(torch.ones(seq_len, seq_len, device=device))
    return mask


class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        seq_len = x.size(1)
        causal_mask = create_causal_mask(seq_len, x.device)

        for layer in self.layers:
            x = layer(x, causal_mask)  # Apply causal mask!
        return x
```
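At inference time the decoder is used autoregressively. A minimal greedy-decoding sketch, assuming a hypothetical decoder-only `model` that maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size):

```python
@torch.no_grad()
def greedy_generate(model, prompt_ids, max_new_tokens=20):
    ids = prompt_ids                              # (batch, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                       # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # last position only
        ids = torch.cat([ids, next_id], dim=1)    # append and feed back in
    return ids
```

Each step here re-runs the full sequence; real implementations cache K/V per layer (the KV cache discussed in Part 2) so only the new token is processed.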
**Modern trend (2023+):** Decoder-only architectures dominate
- Can do both generation AND understanding (via prompting)
- Simpler than encoder-decoder (no cross-attention)
- Scales better to massive sizes

### Variant 3: Encoder-Decoder (Seq2Seq)

**Architecture:**
- **Encoder**: Bidirectional self-attention (understands input)
- **Decoder**: Causal self-attention (generates output)
- **Cross-attention**: Decoder queries encoder outputs

**Cross-attention mechanism:**
```python
# Q from decoder, K and V from encoder
Q = decoder_hidden @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v
cross_attn = softmax(Q K^T / √d_k) V
```

**Examples:** T5, BART, mT5, original Transformer (2017)

**Use cases:**
- Translation (input ≠ output language)
- Summarization (long input → short output)
- Question answering (generate answer, not extract)

**When to use:** Input and output are fundamentally different

**Implementation:**
```python
class EncoderDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)  # NEW!
        self.feed_forward = FeedForward(d_model, d_ff)  # Linear → ReLU → Linear, as in EncoderLayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, decoder_input, encoder_output, causal_mask=None):
        # 1. Self-attention (causal)
        self_attn_out, _ = self.self_attn(decoder_input, causal_mask)
        x = self.norm1(decoder_input + self_attn_out)

        # 2. Cross-attention (Q from decoder, K/V from encoder)
        #    Requires a MultiHeadAttention variant that accepts separate
        #    query/key/value inputs (the self-attention-only version above does not)
        cross_attn_out, _ = self.cross_attn.forward_cross(
            query=x,
            key=encoder_output,
            value=encoder_output
        )
        x = self.norm2(x + cross_attn_out)

        # 3. Feed-forward
        ff_out = self.feed_forward(x)
        x = self.norm3(x + ff_out)

        return x
```

### Architecture Selection Guide

| Task | Architecture | Why |
|------|--------------|-----|
| Classification | Encoder-only | Need full bidirectional context |
| Text generation | Decoder-only | Autoregressive generation |
| Translation | Encoder-decoder or Decoder-only | Different languages, or use prompting |
| Summarization | Encoder-decoder or Decoder-only | Length mismatch, or use prompting |
| Q&A (extract) | Encoder-only | Find span in context |
| Q&A (generate) | Decoder-only | Generate freeform answer |

**2023+ trend:** Decoder-only can do everything via prompting (but less parameter-efficient for some tasks)

## Part 5: Vision Transformers (ViT)

### From Images to Sequences

**Key insight:** Treat an image as a sequence of patches

**Process:**
1. Split image into patches (e.g., 16×16 pixels)
2. Flatten each patch → 1D vector
3. Linear projection → token embeddings (often implemented as a strided convolution; see the sketch after this list)
4. Add 2D positional embeddings
5. Prepend [CLS] token (for classification)
6. Feed to Transformer encoder
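Steps 1-3 are commonly folded into a single convolution whose kernel size and stride both equal the patch size. A minimal sketch (dimensions chosen to match the ViT-B/16 example below):

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)   # one projection per patch

x = torch.randn(1, 3, 224, 224)
patches = patch_embed(x)                     # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): sequence of patch tokens
```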
**Example:** 224×224 image, 16×16 patches
- Number of patches: (224/16)² = 196
- Each patch: 16 × 16 × 3 = 768 dimensions
- Transformer input: 197 tokens (196 patches + 1 [CLS])

### ViT Implementation

```python
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 d_model=768, num_heads=12, num_layers=12, num_classes=1000):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_channels * patch_size ** 2

        # Patch embedding (linear projection of flattened patches)
        self.patch_embed = nn.Linear(patch_dim, d_model)

        # [CLS] token (learnable)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

        # Position embeddings (learnable)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, d_model))

        # Transformer encoder
        self.encoder = TransformerEncoder(d_model, num_heads,
                                          d_ff=4*d_model, num_layers=num_layers)

        # Classification head
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, channels, height, width)
        batch_size = x.size(0)

        # Divide into patches and flatten
        x = x.unfold(2, self.patch_size, self.patch_size)
        x = x.unfold(3, self.patch_size, self.patch_size)
        # (batch, channels, num_patches_h, num_patches_w, patch_size, patch_size)
        x = x.permute(0, 2, 3, 1, 4, 5).contiguous()  # group channels with each patch
        x = x.view(batch_size, -1, self.patch_size ** 2 * 3)  # assumes in_channels=3 (RGB)
        # (batch, num_patches, patch_dim)

        # Linear projection
        x = self.patch_embed(x)  # (batch, num_patches, d_model)

        # Prepend [CLS] token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # (batch, num_patches+1, d_model)

        # Add positional embeddings
        x = x + self.pos_embed

        # Transformer encoder
        x = self.encoder(x)

        # Classification: Use [CLS] token
        cls_output = x[:, 0]  # (batch, d_model)
        logits = self.head(cls_output)

        return logits
```

### ViT vs CNN: Critical Differences

**1. Inductive Bias**

| Property | CNN | ViT |
|----------|-----|-----|
| Locality | Strong (conv kernel) | Weak (global attention) |
| Translation invariance | Strong (weight sharing) | Weak (position embeddings) |
| Hierarchy | Strong (pooling layers) | None (flat patches) |

**Implication:** CNN has strong priors; ViT must learn them from data

**2. Data Requirements**

| Dataset Size | CNN | ViT (from scratch) | ViT (pretrained) |
|--------------|-----|-------------------|------------------|
| Small (< 100k) | ✅ Good | ❌ Fails | ✅ Good |
| Medium (100k-1M) | ✅ Excellent | ⚠️ Poor | ✅ Good |
| Large (> 1M) | ✅ Excellent | ⚠️ OK | ✅ Excellent |
| Huge (> 100M) | ✅ Excellent | ✅ SOTA | N/A |

**Key finding:** ViT needs on the order of 100M+ images to train from scratch competitively
- Original ViT: Trained on JFT-300M (300 million images)
- Without massive data, ViT underperforms CNNs significantly

**3. Computational Cost**

**Example: 224×224 images**

| Model | Parameters | GFLOPs | Inference (GPU) |
|-------|-----------|--------|-----------------|
| ResNet-50 | 25M | 4.1 | ~30ms |
| EfficientNet-B0 | 5M | 0.4 | ~10ms |
| ViT-B/16 | 86M | 17.6 | ~100ms |

**Implication:** ViT-B/16 costs roughly 40x the FLOPs of EfficientNet-B0!
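Given the data requirements above, starting from pretrained weights is the practical default outside of web-scale datasets. A minimal fine-tuning sketch using torchvision's ViT-B/16 (assumes torchvision ≥ 0.13; the `heads.head` attribute name follows recent torchvision versions and should be verified against yours):

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)  # ImageNet-pretrained ViT-B/16
model.heads.head = nn.Linear(768, 10)  # swap the classifier for a 10-class task
# Fine-tune as usual; optionally freeze the encoder and train only the new head
```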
### When to Use ViT

✅ **Use ViT when:**
- Large dataset (> 1M images) OR using pretrained weights
- Computational cost is acceptable (cloud, large GPU)
- Best possible accuracy is needed
- Can fine-tune from an ImageNet-21k checkpoint

❌ **Use CNN when:**
- Small/medium dataset (< 1M images) and training from scratch
- Limited compute/memory
- Edge deployment (mobile, embedded)
- Need architectural inductive biases

### Hybrid Approaches (2022-2023)

**ConvNeXt:** CNN with ViT design choices
- Matches ViT accuracy with CNN efficiency
- Works better on small datasets

**Swin Transformer:** Hierarchical ViT with local windows
- Shifted windows for efficiency
- O(n) complexity instead of O(n²)
- Better for dense prediction (segmentation)

**CoAtNet:** Mix conv layers (early) + Transformer layers (late)
- Gets both inductive bias and global attention

## Part 6: Implementation Checklist

### Critical Details

**1. Layer Norm Placement**

**Post-norm (original):**
```python
x = x + self_attn(x)
x = layer_norm(x)
```

**Pre-norm (modern, recommended):**
```python
x = x + self_attn(layer_norm(x))
```

**Why pre-norm?** More stable training, less sensitive to learning rate

**2. Attention Dropout**

Apply dropout to **attention weights**, not Q/K/V!
```python
attn_weights = F.softmax(scores, dim=-1)
attn_weights = F.dropout(attn_weights, p=0.1, training=self.training)  # HERE!
output = torch.matmul(attn_weights, V)
```

**3. Feed-Forward Dimension**

Typically: d_ff = 4 × d_model
- BERT: d_model=768, d_ff=3072
- GPT-2: d_model=768, d_ff=3072

**4. Residual Connections**

ALWAYS use residual connections (essential for training)!
```python
x = x + self_attn(x)       # Residual
x = x + feed_forward(x)    # Residual
```

**5. Initialization**

Use Xavier/Glorot initialization for attention weights:
```python
nn.init.xavier_uniform_(self.W_q.weight)
nn.init.xavier_uniform_(self.W_k.weight)
nn.init.xavier_uniform_(self.W_v.weight)
```

## Part 7: When NOT to Use Transformers

### Limitation 1: Small Datasets

**Problem:** Transformers have weak inductive bias (they must learn structure from data)

**Impact:**
- ViT: Fails on < 100k images without pretraining
- NLP: BERT needs 100M+ tokens for pretraining

**Solution:** Use models with stronger priors (CNN for vision, smaller models for text)

### Limitation 2: Long Sequences

**Problem:** O(n²) memory complexity

**Impact:**
- Standard Transformer: n=10k → 100M attention scores
- GPU memory: 10k² × 4 bytes = 400MB per sample (per attention head, per layer)!

**Solution:**
- Sparse attention (Longformer, BigBird)
- Linear attention (Linformer, Performer)
- Flash Attention (memory-efficient kernel; see the sketch at the end of this part)
- State space models (S4, Mamba)

### Limitation 3: Edge Deployment

**Problem:** Large model size, high latency

**Impact:**
- ViT-B: 86M parameters, ~100ms inference
- Mobile/embedded: Need < 10M parameters, < 50ms

**Solution:** Efficient CNNs (MobileNet, EfficientNet) or distilled models

### Limitation 4: Real-Time Processing

**Problem:** Sequential generation in the decoder (cannot parallelize at inference)

**Impact:** GPT-style models generate one token at a time

**Solution:** Non-autoregressive models, speculative decoding, or smaller models
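The Flash Attention option referenced in Limitation 2 is available from stock PyTorch. A minimal sketch assuming PyTorch ≥ 2.0, where `scaled_dot_product_attention` dispatches to fused Flash/memory-efficient kernels when hardware and dtypes allow:

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, num_heads, seq_len, d_k)
q = torch.randn(2, 8, 4096, 64)
k = torch.randn(2, 8, 4096, 64)
v = torch.randn(2, 8, 4096, 64)

# Equivalent to softmax(q @ k^T / sqrt(d_k)) @ v with a causal mask,
# but computed without materializing the full (seq_len, seq_len) score
# matrix when a fused kernel is selected
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (2, 8, 4096, 64)
```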
## Part 8: Common Mistakes

### Mistake 1: Forgetting Causal Mask

**Symptom:** Decoder "cheats" by seeing future tokens

**Fix:** Always apply causal mask to decoder self-attention!
```python
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
```

### Mistake 2: Wrong Dimension for Multi-Head

**Symptom:** Runtime error or dimension mismatch

**Fix:** Ensure d_model % num_heads == 0
```python
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
```

### Mistake 3: Forgetting Position Encoding

**Symptom:** Model ignores word order

**Fix:** Always add position information!
```python
x = token_embeddings + positional_encoding
```

### Mistake 4: Wrong Softmax Dimension

**Symptom:** Attention weights don't sum to 1 per query

**Fix:** Softmax over last dimension (keys)
```python
attn_weights = F.softmax(scores, dim=-1)  # Sum over keys for each query
```

### Mistake 5: No Residual Connections

**Symptom:** Training diverges or converges very slowly

**Fix:** Always add residual connections!
```python
x = x + self_attn(x)
x = x + feed_forward(x)
```

## Summary: Quick Reference

### Architecture Selection

```
Classification/Understanding → Encoder-only (BERT-style)
Generation/Autoregressive    → Decoder-only (GPT-style)
Seq2Seq (input ≠ output)     → Encoder-decoder (T5-style) or Decoder-only with prompting
```

### Position Encoding Selection

```
NLP (variable length)  → RoPE or ALiBi
NLP (fixed length)     → Learned embeddings
Vision (ViT)           → 2D learned embeddings
Long sequences (> 2k)  → ALiBi (best extrapolation)
```

### Multi-Head Configuration

```
Small models  (d_model < 512):     4-8 heads
Medium models (d_model 512-1024):  8-12 heads
Large models  (d_model > 1024):    12-32 heads
Rule: d_k (head dimension) should be 64-128
```

### ViT vs CNN

```
ViT: Large dataset (> 1M) OR pretrained weights
CNN: Small dataset (< 1M) OR edge deployment
```

### Implementation Essentials

```
✅ Pre-norm (more stable than post-norm)
✅ Residual connections (essential!)
✅ Causal mask for decoder
✅ Attention dropout (on weights, not Q/K/V)
✅ d_ff = 4 × d_model (feed-forward dimension)
✅ Check: d_model % num_heads == 0
```

## Next Steps

After mastering this skill:
- `attention-mechanisms-catalog`: Explore attention variants (sparse, linear, Flash)
- `llm-specialist/llm-finetuning-strategies`: Apply to language models
- `architecture-design-principles`: Understand design trade-offs

**Remember:** Transformers are NOT magic. Understanding the mechanism (information retrieval via Q/K/V) beats cargo-culting implementations.