# Generative Model Families

## When to Use This Skill

Use this skill when you need to:
- ✅ Select a generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand the modern landscape (Stable Diffusion, StyleGAN, etc.)

**Do NOT use this skill for:**
- ❌ Text generation (use `llm-specialist` pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use `using-neural-architectures`)

## Core Principle

**Generative models have fundamental trade-offs:**
- **Quality vs Stability**: GANs sharp but unstable, VAEs blurry but stable
- **Quality vs Speed**: Diffusion high-quality but slow, GANs fast
- **Explicitness vs Flexibility**: Autoregressive/Flow have likelihood, GANs don't

**Modern default (2025):** Diffusion models (best quality + stability)

## Part 1: Model Family Overview

### The Five Families

**1. VAE (Variational Autoencoder)**
- **Approach**: Learn latent space with encoder-decoder
- **Quality**: Blurry (6/10)
- **Training**: Very stable
- **Use**: Latent space exploration, NOT high-quality generation

**2. GAN (Generative Adversarial Network)**
- **Approach**: Adversarial game (generator vs discriminator)
- **Quality**: Sharp (9/10)
- **Training**: Unstable (adversarial dynamics)
- **Use**: High-quality generation, fast inference

**3. Diffusion Models**
- **Approach**: Iterative denoising
- **Quality**: Very sharp (9.5/10)
- **Training**: Stable
- **Use**: Modern default for high-quality generation

**4. Autoregressive Models**
- **Approach**: Sequential generation (pixel-by-pixel, token-by-token)
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Explicit likelihood, sequential data

**5. Flow Models**
- **Approach**: Invertible transformations
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Exact likelihood, invertibility needed

### Quick Comparison

| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|-------|---------|-------------------|-----------------|---------------|------------|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |

## Part 2: VAE (Variational Autoencoder)

### Architecture

**Components:**
1. **Encoder**: x → z (image to latent)
2. **Latent space**: z ~ N(μ, σ²)
3. **Decoder**: z → x' (latent to reconstruction)

**Loss function:**

```python
# ELBO (Evidence Lower Bound); the training loss is the negative ELBO
loss = reconstruction_loss + KL_divergence

# Reconstruction: how well the decoder reconstructs the input
reconstruction_loss = MSE(x, x_reconstructed)

# KL: how close the latent posterior is to the standard normal prior
KL_divergence = KL(q(z|x) || p(z))
```

### Why VAE is Blurry

**Problem**: MSE loss encourages pixel-wise averaging

**Example:**
- Dataset: Faces with both smiles and no smiles
- VAE learns: "Average face has half-smile blur"
- Result: Blurry, hedges between modes

**Mathematical reason:**
- MSE minimization = mean prediction
- Mean of sharp images = blurry image

### When to Use VAE

✅ **Use VAE for:**
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE; see the sketch after the implementation below)
- Compression (lossy, with latent codes)

❌ **DON'T use VAE for:**
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs

### Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder (assumes 64x64 RGB inputs)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε, with ε ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)

        # Sample latent
        z = self.reparameterize(mu, logvar)

        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)

        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')

        # KL divergence (closed form for diagonal Gaussians)
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        return recon_loss + kl_loss
```
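The β-VAE variant mentioned above changes only the loss weighting: scaling the KL term by a factor β > 1 trades reconstruction fidelity for a more disentangled latent space. A minimal sketch, reusing the `VAE` class defined above (the default `beta` value is illustrative, not a recommendation):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Same two ELBO terms as the plain VAE...
    recon_loss = F.mse_loss(x_recon, x, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # ...but the KL term is up-weighted by beta (> 1), which pushes the
    # encoder toward a more factorized, disentangled latent representation
    # at the cost of blurrier reconstructions.
    return recon_loss + beta * kl_loss
```

In practice β is a hyperparameter to sweep; β = 1 recovers the standard VAE.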
## Part 3: GAN (Generative Adversarial Network)

### Architecture

**Components:**
1. **Generator**: z → x (noise to image)
2. **Discriminator**: x → [0, 1] (image to real/fake probability)

**Adversarial Training:**

```python
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))

# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))

# Minimax game:
# min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
```

### Training Instability

**Problem**: Adversarial dynamics are unstable

**Common issues:**
1. **Mode collapse**: Generator produces limited variety
2. **Non-convergence**: Oscillation, never settles
3. **Vanishing gradients**: Discriminator too strong, generator can't learn
4. **Hyperparameter sensitivity**: Learning rates critical

**Solutions:**
- Spectral normalization (SNGAN, BigGAN)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)

### Mode Collapse

**What is it?**
- Generator produces a subset of the distribution
- Example: Face GAN only generates 10 face types

**Why it happens:**
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes

**Detection:**

```python
# Check diversity: generate many samples and measure pairwise distances
z = torch.randn(1000, latent_dim)
samples = generator(z).flatten(start_dim=1)
diversity = torch.cdist(samples, samples).mean()  # mean pairwise L2 distance

if diversity < threshold:  # threshold taken from a healthy baseline run
    print("Mode collapse detected!")
```

**Solutions:**
- Minibatch discrimination
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)

### Modern GANs

**StyleGAN2 (2020):**
- State-of-the-art for faces
- Style-based generator
- Weight demodulation and path length regularization for stability
- Resolution: 1024×1024

**StyleGAN3 (2021):**
- Alias-free architecture
- Better animation/video

**When to use GAN:**
✅ Fast inference needed (50ms per image)
✅ Pretrained model available (StyleGAN2)
✅ Can tolerate training difficulty
❌ Training instability unacceptable
❌ Mode collapse problematic
❌ Starting from scratch (use diffusion instead)

### Implementation (Basic GAN)

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop
latent_dim = 100
for real_images in dataloader:
    # Train discriminator: push D(real) -> 1, D(fake) -> 0
    optimizer_D.zero_grad()
    noise = torch.randn(real_images.size(0), latent_dim)
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach: don't update G here
    D_loss = -torch.mean(torch.log(D_real) + torch.log(1 - D_fake))
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the discriminator (push D(fake) -> 1)
    optimizer_G.zero_grad()
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake))
    G_loss.backward()
    optimizer_G.step()
```
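The stabilization tricks listed earlier can be bolted onto this baseline. A minimal sketch of spectral normalization using PyTorch's built-in `torch.nn.utils.spectral_norm`, assuming the same 64×64 setup as above (the loss pairing noted in the comments is a common choice, not the only one):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNDiscriminator(nn.Module):
    """Same discriminator as above, with spectral normalization on every
    weight layer to keep its Lipschitz constant (and gradients) under control."""
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            spectral_norm(nn.Conv2d(img_channels, 32, 4, 2, 1)),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(32, 64, 4, 2, 1)),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            spectral_norm(nn.Linear(128 * 8 * 8, 1)),
            # No Sigmoid here: spectral norm is typically paired with a hinge
            # or Wasserstein objective, which operate on raw critic scores.
        )

    def forward(self, x):
        return self.model(x)
```

If you keep the original log-loss instead of switching objectives, keep the final `Sigmoid` from the baseline discriminator.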
## Part 4: Diffusion Models (Modern Default)

### Architecture

**Concept**: Learn to reverse a diffusion (noising) process

**Forward process** (fixed):

```python
# Gradually add noise to image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)

# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I), β_t = noise schedule
```

**Reverse process** (learned):

```python
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)

# Model predicts the noise: ε_θ(x_t, t)
# One denoising step (with α_t = 1 - β_t and ᾱ_t = ∏_{s≤t} α_s):
x_{t-1} = 1/√α_t * (x_t - β_t/√(1 - ᾱ_t) * ε_θ(x_t, t)) + σ_t * z,  z ~ N(0, I)
```

**Training:**

```python
# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))

# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
```

### Why Diffusion is Excellent

**Advantages:**
1. **High quality**: State-of-the-art (better than GAN)
2. **Stable training**: Standard MSE loss (no adversarial dynamics)
3. **No mode collapse**: By design, covers full distribution
4. **Controllable**: Easy to add conditioning (text, class, etc.)

**Disadvantages:**
1. **Slow inference**: 50-1000 denoising steps (vs GAN's 1 step)
2. **Compute intensive**: T forward passes (T = 50-1000)

**Speed comparison:**
```
GAN:                1 forward pass       = 50ms
Diffusion (T=50):   50 forward passes    = 2.5 seconds
Diffusion (T=1000): 1000 forward passes  = 50 seconds
```

**Speedup techniques:**
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler)
- Latent diffusion (Stable Diffusion, denoise in latent space)

### Modern Diffusion Models

**Stable Diffusion (2022+):**
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable

**DALL-E 2 (2022):**
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)

**Imagen (2022, Google):**
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)

**When to use Diffusion:**
✅ High-quality generation (best quality)
✅ Stable training (standard loss)
✅ Diversity needed (no mode collapse)
✅ Conditioning (text-to-image, class-conditional)
❌ Need fast inference (< 1 second)
❌ Real-time generation

### Implementation (DDPM)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Noise schedule: β_t linear, α_t = 1 - β_t, ᾱ_t = cumulative product of α_t
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # β_t
alphas = 1.0 - betas                            # α_t
alphas_cumprod = torch.cumprod(alphas, dim=0)   # ᾱ_t

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone (assumed to be defined elsewhere)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict the noise ε added at timestep t
        return self.model(x_t, t)

# Training
def train_step(model, x_0):
    batch_size = x_0.size(0)

    # Sample a random timestep for each example
    t = torch.randint(0, T, (batch_size,))

    # Sample noise
    eps = torch.randn_like(x_0)

    # Create noisy image: x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps

    # Predict the noise
    eps_pred = model(x_t, t)

    # Loss: MSE between actual and predicted noise
    return F.mse_loss(eps_pred, eps)

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)

    # Iteratively denoise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x_t, t_batch)

        # DDPM update: x_{t-1} = 1/√α_t * (x_t - β_t/√(1 - ᾱ_t) * ε_pred) + √β_t * z
        x_t = (x_t - betas[t] / torch.sqrt(1 - alphas_cumprod[t]) * eps_pred) / torch.sqrt(alphas[t])

        # Add noise (except on the last step)
        if t > 0:
            x_t += torch.sqrt(betas[t]) * torch.randn_like(x_t)

    return x_t  # x_0 (generated image)
```
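The DDIM speedup mentioned above is easiest to try through an off-the-shelf pipeline rather than the hand-rolled sampler. A hedged sketch using Hugging Face's `diffusers` library, assuming it is installed and a GPU is available (the model ID, step count, and guidance scale are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a pretrained latent-diffusion model (model ID is illustrative)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Swap the default scheduler for DDIM so far fewer steps are needed
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# 25-50 DDIM steps is usually a reasonable quality/speed trade-off,
# versus the ~1000 steps of the naive DDPM sampler above
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]

image.save("sample.png")
```

The same scheduler swap works for other fast samplers (e.g. DPM-Solver) by substituting the corresponding scheduler class.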
## Part 5: Autoregressive Models

### Concept

**Idea**: Model probability as product of conditionals

```
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
```

**For images**: Generate pixel-by-pixel (or patch-by-patch)

**Architectures:**
- **PixelCNN**: Convolutional with masked kernels
- **PixelCNN++**: Improved with mixture of logistics
- **VQ-VAE + PixelCNN**: Two-stage (learn discrete codes, model codes)
- **ImageGPT**: GPT-style Transformer for images

### Advantages

✅ **Explicit likelihood**: Can compute p(x) exactly
✅ **Stable training**: Standard cross-entropy loss
✅ **Theoretical guarantees**: Proper probability model

### Disadvantages

❌ **Very slow generation**: Sequential (can't parallelize)
❌ **Limited quality**: Worse than GAN/Diffusion for high-res
❌ **Resolution scaling**: Impractical for 1024×1024 (1M pixels!)

**Speed comparison:**
```
GAN:      Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds

For 1024×1024: 1M pixels × 5ms/pixel ≈ 83 minutes!
```

### When to Use

✅ **Use autoregressive for:**
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)

❌ **Don't use for:**
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)

### Modern Usage

**Two-stage approach (DALL-E, VQ-GAN):**
1. **Stage 1**: VQ-VAE learns discrete codes
   - Image → 32×32 grid of codes (instead of 1M pixels)
2. **Stage 2**: Autoregressive model (Transformer) on codes
   - Much faster (32×32 = 1024 codes, not 1M pixels)

## Part 6: Flow Models

### Concept

**Idea**: Invertible transformations

```
z ~ N(0, I)  ←→  x ~ p_data

f:   z → x  (forward)
f⁻¹: x → z  (inverse)
```

**Requirement**: f must be invertible and differentiable

**Advantage**: Exact likelihood via change-of-variables

```
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
```
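To make the invertibility requirement concrete, here is a minimal affine coupling layer, the building block of the RealNVP/Glow architectures described next. This is a sketch on flat feature vectors, not a full flow model; note that `forward` here implements f⁻¹ (x → z, the direction used for likelihood training) and `inverse` implements f (z → x, used for sampling):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer: half the dimensions pass through unchanged,
    the other half are scaled and shifted by values computed from the first
    half. Both directions are cheap, and the log-determinant of the Jacobian
    is simply the sum of the log-scales."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small network producing per-dimension log-scale and shift
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # x -> z (used for exact log-likelihood)
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        z2 = (x2 - t) * torch.exp(-log_s)
        log_det = -log_s.sum(dim=1)   # contribution to log p(x)
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):
        # z -> x (generation), exact inverse of forward
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        x2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, x2], dim=1)
```

In a full flow, many such layers are stacked with permutations (or invertible 1×1 convolutions, as in Glow) between them so that every dimension eventually gets transformed.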
### Architectures

**RealNVP (2017):**
- Coupling layers (affine transformations)
- Invertible by design

**Glow (2018, OpenAI):**
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture

**When to use Flow:**
✅ Exact likelihood needed (better than VAE)
✅ Invertibility needed (both z→x and x→z)
✅ Stable training (standard loss)
❌ Architecture constraints (must be invertible)
❌ Quality not as good as GAN/Diffusion

### Modern Status

**Mostly superseded by Diffusion:**
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications

## Part 7: Decision Framework

### By Primary Goal

```
Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if pretrained available

Goal: Fast inference
→ GAN (50ms per image)
→ Avoid Diffusion (too slow for real-time)

Goal: Training stability
→ Diffusion or VAE (standard loss)
→ Avoid GAN (adversarial training hard)

Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)

Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression

Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse common)
```

### By Data Type

```
Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)

Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)

Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)

Video:
→ Video Diffusion (limited)
→ OR GAN (StyleGAN-V)

Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)
```

### By Training Budget

```
Large budget (millions $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on cluster

Medium budget (thousands $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on GPU

Small budget (hundreds $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on consumer GPU

Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ Few thousand images, CPU possible
```

### Modern Recommendations (2025)

**For new projects:**

1. **Default: Diffusion**
   - Fine-tune Stable Diffusion or train from scratch
   - Best quality + stability
2. **If need speed: GAN**
   - Use pretrained StyleGAN2 if available
   - Or train GAN (if can tolerate instability)
3. **If need latent space: VAE**
   - For interpolation, not generation quality

**AVOID:**
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images

## Part 8: Training from Scratch vs Fine-Tuning

### Stable Diffusion Example

**Pretraining (what Stability AI did):**
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- **DON'T DO THIS!**

**Fine-tuning (what users do):**
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- **DO THIS!**

**LoRA (Low-Rank Adaptation):**
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- **Best for small budgets!**

### Decision

```
Have pretrained model in your domain:
→ Fine-tune (don't retrain!)

No pretrained model:
→ Train from scratch (small model)
→ OR find closest pretrained and fine-tune

Budget < $1000:
→ LoRA fine-tuning
→ OR train small model (64×64)

Budget < $100:
→ LoRA with free Colab
→ OR VAE from scratch (cheap)
```

## Part 9: Common Mistakes

### Mistake 1: VAE for High-Quality Generation

**Symptom:** Blurry outputs
**Fix:** Use GAN or Diffusion for quality
**VAE is for:** Latent space, not generation

### Mistake 2: Ignoring Mode Collapse

**Symptom:** GAN generates same images
**Fix:** Spectral norm, minibatch discrimination
**Better:** Switch to Diffusion (no mode collapse)

### Mistake 3: Training Stable Diffusion from Scratch

**Symptom:** Burning money, poor results
**Fix:** Fine-tune pretrained model
**Reality:** Pretraining costs $600k+

### Mistake 4: Slow Inference with Diffusion

**Symptom:** 50 seconds per image
**Fix:** Use DDIM (fewer steps, 10-50)
**OR:** Use GAN if speed critical

### Mistake 5: Wrong Loss for GAN

**Symptom:** Training diverges
**Fix:** Use Wasserstein loss (WGAN)
**OR:** Spectral normalization
**Better:** Switch to Diffusion (standard loss)

## Summary: Quick Reference

### Model Selection

```
High quality + stable training:
→ Diffusion (modern default)

Fast inference required:
→ GAN (if pretrained) or trained GAN

Latent space exploration:
→ VAE

Explicit likelihood:
→ Autoregressive or Flow

Small images (< 64×64):
→ Any model (start with VAE)

Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)
```

### Quality Ranking

```
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
```

### Training Stability Ranking

```
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
```

### Modern Stack (2025)

```
Image generation: Stable Diffusion (fine-tuned)
Fast inference:   StyleGAN2 (if available)
Latent space:     VAE
Research:         Diffusion (easiest to train)
```

## Next Steps

After mastering this skill:
- `llm-specialist/llm-finetuning-strategies`: Apply to text generation
- `architecture-design-principles`: Understand design trade-offs
- `training-optimization`: Optimize training for your chosen model

**Remember:** Diffusion models dominate in 2025. Use them unless you have a specific reason not to (speed, latent space, likelihood).