# Generative Model Families
## When to Use This Skill
Use this skill when you need to:
- ✅ Select generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand modern landscape (Stable Diffusion, StyleGAN, etc.)
**Do NOT use this skill for:**
- ❌ Text generation (use `llm-specialist` pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use `using-neural-architectures`)
## Core Principle
**Generative models have fundamental trade-offs:**
- **Quality vs Stability**: GANs sharp but unstable, VAEs blurry but stable
- **Quality vs Speed**: Diffusion high-quality but slow, GANs fast
- **Explicitness vs Flexibility**: Autoregressive and Flow models provide explicit likelihoods; GANs don't
**Modern default (2025):** Diffusion models (best quality + stability)
## Part 1: Model Family Overview
### The Five Families
**1. VAE (Variational Autoencoder)**
- **Approach**: Learn latent space with encoder-decoder
- **Quality**: Blurry (6/10)
- **Training**: Very stable
- **Use**: Latent space exploration, NOT high-quality generation
**2. GAN (Generative Adversarial Network)**
- **Approach**: Adversarial game (generator vs discriminator)
- **Quality**: Sharp (9/10)
- **Training**: Unstable (adversarial dynamics)
- **Use**: High-quality generation, fast inference
**3. Diffusion Models**
- **Approach**: Iterative denoising
- **Quality**: Very sharp (9.5/10)
- **Training**: Stable
- **Use**: Modern default for high-quality generation
**4. Autoregressive Models**
- **Approach**: Sequential generation (pixel-by-pixel, token-by-token)
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Explicit likelihood, sequential data
**5. Flow Models**
- **Approach**: Invertible transformations
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Exact likelihood, invertibility needed
### Quick Comparison
| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|-------|---------|-------------------|----------------|---------------|------------|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |
## Part 2: VAE (Variational Autoencoder)
### Architecture
**Components:**
1. **Encoder**: x → z (image to latent)
2. **Latent space**: z ~ N(μ, σ²)
3. **Decoder**: z → x' (latent to reconstruction)
**Loss function:**
```python
# ELBO (Evidence Lower Bound)
loss = reconstruction_loss + KL_divergence
# Reconstruction: How well decoder reconstructs input
reconstruction_loss = MSE(x, x_reconstructed)
# KL: How close latent is to standard normal
KL_divergence = KL(q(z|x) || p(z))
```
### Why VAE is Blurry
**Problem**: MSE loss encourages pixel-wise averaging
**Example:**
- Dataset: Faces with both smiles and no smiles
- VAE learns: the "average" face, a half-smile blur
- Result: blurry outputs that hedge between modes
**Mathematical reason:**
- MSE minimization = mean prediction
- Mean of sharp images = blurry image
### When to Use VAE
**Use VAE for:**
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)
**DON'T use VAE for:**
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs
### Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        # Sample latent
        z = self.reparameterize(mu, logvar)
        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        # KL divergence to the standard normal prior
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_loss
```
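A minimal training-loop sketch for the class above (the `dataloader` of 64×64 RGB images and the Adam learning rate are assumptions):
```python
vae = VAE(latent_dim=128)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)

for x, _ in dataloader:          # x: (batch, 3, 64, 64), values in [0, 1]
    x_recon, mu, logvar = vae(x)
    loss = vae.loss_function(x, x_recon, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```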
## Part 3: GAN (Generative Adversarial Network)
### Architecture
**Components:**
1. **Generator**: z → x (noise to image)
2. **Discriminator**: x → [0, 1] (image to real/fake probability)
**Adversarial Training:**
```python
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))
# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))
# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
```
### Training Instability
**Problem**: Adversarial dynamics are unstable
**Common issues:**
1. **Mode collapse**: Generator produces limited variety
2. **Non-convergence**: Oscillation, never settles
3. **Vanishing gradients**: Discriminator too strong, generator can't learn
4. **Hyperparameter sensitivity**: Learning rates critical
**Solutions:**
- Spectral normalization (SNGAN; see the sketch after this list)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)
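As a sketch of the first fix above, PyTorch's built-in `torch.nn.utils.spectral_norm` can wrap discriminator layers directly (layer sizes here are illustrative):
```python
import torch.nn as nn

# Spectral normalization constrains each layer's largest singular value,
# keeping the discriminator roughly 1-Lipschitz and training more stable.
discriminator_block = nn.Sequential(
    nn.utils.spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),
    nn.LeakyReLU(0.2),
)
```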
### Mode Collapse
**What is it?**
- Generator produces subset of distribution
- Example: Face GAN only generates 10 face types
**Why it happens:**
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes
**Detection:**
```python
# Check diversity: generate many samples and measure average pairwise distance
# (generator, latent_dim, and threshold are placeholders for your own setup)
z = torch.randn(1000, latent_dim)
samples = generator(z).flatten(start_dim=1)
diversity = torch.cdist(samples, samples).mean()
if diversity < threshold:
    print("Mode collapse detected!")
```
**Solutions:**
- Minibatch discrimination
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)
### Modern GANs
**StyleGAN2 (2020):**
- State-of-the-art for faces
- Style-based generator
- Weight demodulation and path length regularization for stability
- Resolution: 1024×1024
**StyleGAN3 (2021):**
- Alias-free architecture
- Better animation/video
**When to use GAN:**
✅ Fast inference needed (50ms per image)
✅ Pretrained model available (StyleGAN2)
✅ Can tolerate training difficulty
❌ Training instability unacceptable
❌ Mode collapse problematic
❌ Starting from scratch (use diffusion instead)
### Implementation (Basic GAN)
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop (dataloader yielding 64x64 image batches assumed;
# learning rates are standard DCGAN-style settings)
latent_dim = 100
generator = Generator(latent_dim=latent_dim)
discriminator = Discriminator()
optimizer_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator: classify real as real, fake as fake
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach so G gets no gradient here
    D_loss = -torch.mean(torch.log(D_real) + torch.log(1 - D_fake))
    optimizer_D.zero_grad()
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the updated discriminator
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake))
    optimizer_G.zero_grad()
    G_loss.backward()
    optimizer_G.step()
```
## Part 4: Diffusion Models (Modern Default)
### Architecture
**Concept**: Learn to reverse a diffusion (noising) process
**Forward process** (fixed):
```python
# Gradually add noise to the image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)
# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I), β_t = noise schedule
```
**Reverse process** (learned):
```python
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)
# Model predicts the noise: ε_θ(x_t, t)
# Then (DDPM step, with α_t = 1 - β_t and ᾱ_t = ∏ α_s):
# x_{t-1} = (x_t - (1 - α_t)/√(1 - ᾱ_t) * ε_θ(x_t, t)) / √α_t + σ_t * z
```
**Training:**
```python
# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))
# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
```
### Why Diffusion is Excellent
**Advantages:**
1. **High quality**: State-of-the-art (better than GAN)
2. **Stable training**: Standard MSE loss (no adversarial dynamics)
3. **No mode collapse**: By design, covers full distribution
4. **Controllable**: Easy to add conditioning (text, class, etc.)
**Disadvantages:**
1. **Slow inference**: 50-1000 denoising steps (vs GAN's 1 step)
2. **Compute intensive**: T forward passes (T = 50-1000)
**Speed comparison:**
```
GAN: 1 forward pass = 50ms
Diffusion (T=50): 50 forward passes = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds
```
**Speedup techniques:**
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler)
- Latent diffusion (Stable Diffusion, denoise in latent space)
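In practice, latent diffusion plus a fast sampler is the common combination. A minimal sketch, assuming the Hugging Face `diffusers` package and the `runwayml/stable-diffusion-v1-5` checkpoint (substitute any Stable Diffusion checkpoint; API details may differ across library versions):
```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a pretrained latent-diffusion model (Stable Diffusion)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in DDIM and use ~25 steps instead of the DDPM-style 1000
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of a red fox in the snow", num_inference_steps=25).images[0]
image.save("fox.png")
```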
### Modern Diffusion Models
**Stable Diffusion (2022+):**
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable
**DALL-E 2 (2022):**
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)
**Imagen (2022, Google):**
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)
**When to use Diffusion:**
✅ High-quality generation (best quality)
✅ Stable training (standard loss)
✅ Diversity needed (no mode collapse)
✅ Conditioning (text-to-image, class-conditional)
❌ Need fast inference (< 1 second)
❌ Real-time generation
### Implementation (DDPM)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Noise schedule (linear, as in the original DDPM paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t
alphas = 1.0 - betas                           # α_t = 1 - β_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # ᾱ_t = ∏ α_s

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone with timestep embedding (implementation assumed)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict the noise ε added at timestep t
        return self.model(x_t, t)

# Training: predict the noise that was added
def train_step(model, x_0):
    batch_size = x_0.shape[0]
    # Sample a random timestep per image and Gaussian noise
    t = torch.randint(0, T, (batch_size,))
    eps = torch.randn_like(x_0)
    # Create the noisy image: x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    # Loss: MSE between actual and predicted noise
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, eps)

# Sampling (generation): iteratively denoise from pure noise
@torch.no_grad()
def sample(model, shape):
    x_t = torch.randn(shape)  # start from pure noise x_T
    for t in reversed(range(T)):
        eps_pred = model(x_t, torch.full((shape[0],), t))
        alpha_t, a_bar_t = alphas[t], alphas_cumprod[t]
        # Denoise one step: x_{t-1} = (x_t - (1-α_t)/√(1-ᾱ_t) * ε_θ) / √α_t
        x_t = (x_t - (1 - alpha_t) / torch.sqrt(1 - a_bar_t) * eps_pred) / torch.sqrt(alpha_t)
        # Add noise (except at the final step)
        if t > 0:
            x_t += torch.sqrt(betas[t]) * torch.randn_like(x_t)
    return x_t  # x_0: the generated image
```
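A hedged usage sketch (the `UNet` backbone above is assumed to exist; shapes are illustrative):
```python
model = DiffusionModel(img_channels=3)
loss = train_step(model, x_0=torch.rand(16, 3, 64, 64))  # one step on a dummy batch
images = sample(model, shape=(16, 3, 64, 64))            # generate 16 images from noise
```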
## Part 5: Autoregressive Models
### Concept
**Idea**: Model probability as product of conditionals
```
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
```
**For images**: Generate pixel-by-pixel (or patch-by-patch)
**Architectures:**
- **PixelCNN**: Convolutional with masked kernels
- **PixelCNN++**: Improved with mixture of logistics
- **VQ-VAE + PixelCNN**: Two-stage (learn discrete codes, model codes)
- **ImageGPT**: GPT-style Transformer for images
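As a concrete illustration of the masked kernels behind PixelCNN, here is a minimal sketch (layer sizes are illustrative, not taken from any particular paper's configuration):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Type-'A' masked convolution: each output pixel may only see pixels above it
# and to its left, which enforces the factorization p(x_i | x_1, ..., x_{i-1}).
class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(1, 1, kH, kW)
        mask[:, :, kH // 2, kW // 2:] = 0  # block the current pixel and everything to its right
        mask[:, :, kH // 2 + 1:, :] = 0    # block all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# First layer of a toy PixelCNN for RGB images
layer = MaskedConv2d(3, 64, kernel_size=7, padding=3)
```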
### Advantages
**Explicit likelihood**: Can compute p(x) exactly
**Stable training**: Standard cross-entropy loss
**Theoretical guarantees**: Proper probability model
### Disadvantages
**Very slow generation**: Sequential (can't parallelize)
**Limited quality**: Worse than GAN/Diffusion for high-res
**Resolution scaling**: Impractical for 1024×1024 (1M pixels!)
**Speed comparison:**
```
GAN: Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds
For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!
```
### When to Use
**Use autoregressive for:**
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)
**Don't use for:**
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)
### Modern Usage
**Two-stage approach (DALL-E, VQ-GAN):**
1. **Stage 1**: VQ-VAE learns discrete codes
- Image → 32×32 grid of codes (instead of 1M pixels)
2. **Stage 2**: Autoregressive model (Transformer) on codes
- Much faster (32×32 = 1024 codes, not 1M pixels)
## Part 6: Flow Models
### Concept
**Idea**: Invertible transformations
```
z ~ N(0, I) ←→ x ~ p_data
f: z → x (forward)
f⁻¹: x → z (inverse)
```
**Requirement**: f must be invertible and differentiable
**Advantage**: Exact likelihood via change-of-variables
```
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
```
### Architectures
**RealNVP (2017):**
- Coupling layers (affine transformations)
- Invertible by design
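A minimal sketch of an affine coupling layer in this style (dimensions and hidden width are illustrative):
```python
import torch
import torch.nn as nn

# One half of the input passes through unchanged; the other half is scaled
# and shifted by a network conditioned on the first half. The transform is
# trivially invertible and log|det J| is just the sum of the log-scales.
class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half))
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_scale, shift = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_scale) + shift
        log_det = log_scale.sum(dim=1)  # contribution to log p(x)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_scale, shift = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - shift) * torch.exp(-log_scale)
        return torch.cat([y1, x2], dim=1)
```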
**Glow (2018, OpenAI):**
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture
**When to use Flow:**
✅ Exact likelihood needed (better than VAE)
✅ Invertibility needed (both z→x and x→z)
✅ Stable training (standard loss)
❌ Architecture constraints (must be invertible)
❌ Quality not as good as GAN/Diffusion
### Modern Status
**Mostly superseded by Diffusion:**
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications
## Part 7: Decision Framework
### By Primary Goal
```
Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if pretrained available
Goal: Fast inference
→ GAN (50ms per image)
→ Avoid Diffusion (too slow for real-time)
Goal: Training stability
→ Diffusion or VAE (standard loss)
→ Avoid GAN (adversarial training hard)
Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)
Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression
Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse common)
```
### By Data Type
```
Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)
Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)
Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)
Video:
→ Video Diffusion (limited)
→ OR GAN (StyleGAN-V)
Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)
```
### By Training Budget
```
Large budget (millions $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on cluster
Medium budget (thousands $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on GPU
Small budget (hundreds $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on consumer GPU
Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ Few thousand images, CPU possible
```
### Modern Recommendations (2025)
**For new projects:**
1. **Default: Diffusion**
- Fine-tune Stable Diffusion or train from scratch
- Best quality + stability
2. **If need speed: GAN**
- Use pretrained StyleGAN2 if available
- Or train GAN (if can tolerate instability)
3. **If need latent space: VAE**
- For interpolation, not generation quality
**AVOID:**
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images
## Part 8: Training from Scratch vs Fine-Tuning
### Stable Diffusion Example
**Pretraining (what Stability AI did):**
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- **DON'T DO THIS!**
**Fine-tuning (what users do):**
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- **DO THIS!**
**LoRA (Low-Rank Adaptation):**
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- **Best for small budgets!**
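The core idea behind LoRA, as a minimal plain-PyTorch sketch (rank, alpha, and which layers to wrap are choices you make; in practice a library such as `peft` applies this to the attention layers of Stable Diffusion):
```python
import torch
import torch.nn as nn

# Freeze the pretrained weight W and learn a low-rank update B @ A,
# so only rank * (in_features + out_features) parameters are trained.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (alpha / rank) * B A x
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```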
### Decision
```
Have pretrained model in your domain:
→ Fine-tune (don't retrain!)
No pretrained model:
→ Train from scratch (small model)
→ OR find closest pretrained and fine-tune
Budget < $1000:
→ LoRA fine-tuning
→ OR train small model (64×64)
Budget < $100:
→ LoRA with free Colab
→ OR VAE from scratch (cheap)
```
## Part 9: Common Mistakes
### Mistake 1: VAE for High-Quality Generation
**Symptom:** Blurry outputs
**Fix:** Use GAN or Diffusion for quality
**VAE is for:** Latent space, not generation
### Mistake 2: Ignoring Mode Collapse
**Symptom:** GAN generates same images
**Fix:** Spectral norm, minibatch discrimination
**Better:** Switch to Diffusion (no mode collapse)
### Mistake 3: Training Stable Diffusion from Scratch
**Symptom:** Burning money, poor results
**Fix:** Fine-tune pretrained model
**Reality:** Pretraining costs $600k+
### Mistake 4: Slow Inference with Diffusion
**Symptom:** 50 seconds per image
**Fix:** Use DDIM (fewer steps, 10-50)
**OR:** Use GAN if speed critical
### Mistake 5: Wrong Loss for GAN
**Symptom:** Training diverges
**Fix:** Use Wasserstein loss (WGAN)
**OR:** Spectral normalization
**Better:** Switch to Diffusion (standard loss)
## Summary: Quick Reference
### Model Selection
```
High quality + stable training:
→ Diffusion (modern default)
Fast inference required:
→ GAN (if pretrained) or trained GAN
Latent space exploration:
→ VAE
Explicit likelihood:
→ Autoregressive or Flow
Small images (< 64×64):
→ Any model (start with VAE)
Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)
```
### Quality Ranking
```
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
```
### Training Stability Ranking
```
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
```
### Modern Stack (2025)
```
Image generation: Stable Diffusion (fine-tuned)
Fast inference: StyleGAN2 (if available)
Latent space: VAE
Research: Diffusion (easiest to train)
```
## Next Steps
After mastering this skill:
- `llm-specialist/llm-finetuning-strategies`: Apply to text generation
- `architecture-design-principles`: Understand design trade-offs
- `training-optimization`: Optimize training for your chosen model
**Remember:** Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).