Generative Model Families
When to Use This Skill
Use this skill when you need to:
- ✅ Select generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand modern landscape (Stable Diffusion, StyleGAN, etc.)
Do NOT use this skill for:
- ❌ Text generation (use llm-specialist pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use using-neural-architectures)
Core Principle
Generative models have fundamental trade-offs:
- Quality vs Stability: GANs sharp but unstable, VAEs blurry but stable
- Quality vs Speed: Diffusion high-quality but slow, GANs fast
- Explicitness vs Flexibility: Autoregressive/Flow have likelihood, GANs don't
Modern default (2025): Diffusion models (best quality + stability)
Part 1: Model Family Overview
The Five Families
1. VAE (Variational Autoencoder)
- Approach: Learn latent space with encoder-decoder
- Quality: Blurry (6/10)
- Training: Very stable
- Use: Latent space exploration, NOT high-quality generation
2. GAN (Generative Adversarial Network)
- Approach: Adversarial game (generator vs discriminator)
- Quality: Sharp (9/10)
- Training: Unstable (adversarial dynamics)
- Use: High-quality generation, fast inference
3. Diffusion Models
- Approach: Iterative denoising
- Quality: Very sharp (9.5/10)
- Training: Stable
- Use: Modern default for high-quality generation
4. Autoregressive Models
- Approach: Sequential generation (pixel-by-pixel, token-by-token)
- Quality: Good (7-8/10)
- Training: Stable
- Use: Explicit likelihood, sequential data
5. Flow Models
- Approach: Invertible transformations
- Quality: Good (7-8/10)
- Training: Stable
- Use: Exact likelihood, invertibility needed
Quick Comparison
| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|---|---|---|---|---|---|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |
Part 2: VAE (Variational Autoencoder)
Architecture
Components:
- Encoder: x → z (image to latent)
- Latent space: z ~ N(μ, σ²)
- Decoder: z → x' (latent to reconstruction)
Loss function:
# Negative ELBO (Evidence Lower Bound), minimized during training
loss = reconstruction_loss + KL_divergence
# Reconstruction: How well decoder reconstructs input
reconstruction_loss = MSE(x, x_reconstructed)
# KL: How close latent is to standard normal
KL_divergence = KL(q(z|x) || p(z))
Why VAE is Blurry
Problem: MSE loss encourages pixel-wise averaging
Example:
- Dataset: Faces with both smiles and no smiles
- VAE learns: "Average face has half-smile blur"
- Result: Blurry, hedges between modes
Mathematical reason:
- MSE minimization = mean prediction
- Mean of sharp images = blurry image (see the toy sketch below)
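A toy sketch of this effect in pure PyTorch (hypothetical 8×8 "images"): a single prediction trained with MSE against two different sharp targets converges to their pixel-wise average.

```python
import torch

# Two "sharp" targets: an all-black and an all-white 8x8 patch.
targets = torch.stack([torch.zeros(1, 8, 8), torch.ones(1, 8, 8)])

# One prediction, optimized with MSE against both targets at once.
pred = torch.zeros(1, 8, 8, requires_grad=True)
opt = torch.optim.SGD([pred], lr=1.0)
for _ in range(1000):
    opt.zero_grad()
    loss = ((pred - targets) ** 2).mean()
    loss.backward()
    opt.step()

print(pred.mean().item())  # ~0.5: the pixel-wise average of the two modes, i.e. a blur
```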
When to Use VAE
✅ Use VAE for:
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)
❌ DON'T use VAE for:
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs
Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 64x64 RGB image -> flattened feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder: latent vector -> 64x64 RGB reconstruction
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        # Sample latent
        z = self.reparameterize(mu, logvar)
        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction term
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        # KL divergence between q(z|x) = N(mu, sigma^2) and p(z) = N(0, I)
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_loss
```
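For the anomaly-detection use case listed earlier, the model above is already enough. A minimal sketch using reconstruction error; `threshold` is a hypothetical value you would tune on validation data:

```python
@torch.no_grad()
def is_anomaly(vae, x, threshold=0.05):
    # Inputs the VAE reconstructs poorly are flagged as anomalies.
    # Assumes `vae` was trained on normal data only.
    x_recon, _, _ = vae(x)
    error = ((x - x_recon) ** 2).mean(dim=(1, 2, 3))  # per-sample MSE
    return error > threshold
```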
Part 3: GAN (Generative Adversarial Network)
Architecture
Components:
- Generator: z → x (noise to image)
- Discriminator: x → [0, 1] (image to real/fake probability)
Adversarial Training:
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))
# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))
# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
Training Instability
Problem: Adversarial dynamics are unstable
Common issues:
- Mode collapse: Generator produces limited variety
- Non-convergence: Oscillation, never settles
- Vanishing gradients: Discriminator too strong, generator can't learn
- Hyperparameter sensitivity: Learning rates critical
Solutions:
- Spectral normalization (SNGAN, BigGAN; sketched below)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)
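Of the stabilizers above, spectral normalization is the cheapest to try: it is a one-line wrapper in PyTorch. A minimal sketch on a small discriminator (layer sizes are illustrative):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping weight-bearing layers constrains the discriminator's Lipschitz
# constant, which damps the adversarial dynamics.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),    # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),  # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),
)
```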
Mode Collapse
What is it?
- Generator produces subset of distribution
- Example: Face GAN only generates 10 face types
Why it happens:
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes
Detection:
```python
# Check diversity: generate many samples and measure pairwise distances.
# `latent_dim` and `threshold` are assumed; tune threshold empirically.
with torch.no_grad():
    z = torch.randn(1000, latent_dim)
    samples = generator(z).flatten(start_dim=1)
diversity = torch.pdist(samples).mean()
if diversity < threshold:
    print("Mode collapse detected!")
```
Solutions:
- Minibatch discrimination (see the sketch after this list)
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)
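Minibatch discrimination, in its simplest modern form, is the minibatch standard-deviation feature from ProGAN/StyleGAN: the discriminator is shown how diverse the current batch is, so a collapsed generator gets penalized. A simplified sketch (the original uses per-group statistics):

```python
import torch

def minibatch_std(x):
    # x: (N, C, H, W) discriminator features.
    # One scalar: how much the batch varies, averaged over features/pixels.
    std = x.std(dim=0, unbiased=False).mean()
    # Broadcast it as an extra channel so the discriminator can react to it.
    std_map = std * torch.ones_like(x[:, :1])
    return torch.cat([x, std_map], dim=1)  # (N, C+1, H, W)
```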
Modern GANs
StyleGAN2 (2020):
- State-of-the-art for faces
- Style-based generator
- Weight demodulation and path-length regularization for stability
- Resolution: 1024×1024
StyleGAN3 (2021):
- Alias-free architecture
- Better animation/video
✅ Use GAN when:
- Fast inference needed (50ms per image)
- Pretrained model available (StyleGAN2)
- Can tolerate training difficulty

❌ Don't use GAN when:
- Training instability unacceptable
- Mode collapse problematic
- Starting from scratch (use diffusion instead)
Implementation (Basic GAN)
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Setup
latent_dim = 100
generator = Generator(latent_dim)
discriminator = Discriminator()
optimizer_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
eps = 1e-8  # keeps log() finite

# Training loop
for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator: classify real as real, fake as fake
    optimizer_D.zero_grad()
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())
    D_loss = -torch.mean(torch.log(D_real + eps) + torch.log(1 - D_fake + eps))
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the discriminator (non-saturating loss)
    optimizer_G.zero_grad()
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake + eps))
    G_loss.backward()
    optimizer_G.step()
```
Part 4: Diffusion Models (Modern Default)
Architecture
Concept: Learn to reverse a diffusion (noising) process
Forward process (fixed):
# Gradually add noise to image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)
# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I), β_t = noise schedule
Reverse process (learned):
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)
# Model predicts the added noise: ε_θ(x_t, t)
# DDPM update: x_{t-1} = (x_t - β_t/√(1 - ᾱ_t) * ε_θ(x_t, t)) / √α_t + σ_t * z
# where α_t = 1 - β_t, ᾱ_t = ∏_{s≤t} α_s, z ~ N(0, I) (no noise added at the final step)
Training:
# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))
# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
Why Diffusion is Excellent
Advantages:
- High quality: State-of-the-art (better than GAN)
- Stable training: Standard MSE loss (no adversarial dynamics)
- No mode collapse: By design, covers full distribution
- Controllable: Easy to add conditioning (text, class, etc.)
Disadvantages:
- Slow inference: 50-1000 denoising steps (vs GAN's 1 step)
- Compute intensive: T forward passes (T = 50-1000)
Speed comparison:
GAN: 1 forward pass = 50ms
Diffusion (T=50): 50 forward passes = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds
Speedup techniques:
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler; see the sketch below)
- Latent diffusion (Stable Diffusion, denoise in latent space)
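In practice these speedups are a few lines, assuming the Hugging Face diffusers library (the model ID and exact API shown here are illustrative and may change between versions):

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default sampler for DPM-Solver and cut the step count to ~25
# (vs ~1000 DDPM steps) with little visible quality loss.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a red fox in the snow", num_inference_steps=25).images[0]
image.save("fox.png")
```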
Modern Diffusion Models
Stable Diffusion (2022+):
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable
DALL-E 2 (2022):
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)
Imagen (2022, Google):
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)
✅ Use Diffusion when:
- High-quality generation (best quality)
- Stable training (standard loss)
- Diversity needed (no mode collapse)
- Conditioning (text-to-image, class-conditional)

❌ Don't use Diffusion when:
- Fast inference needed (< 1 second)
- Real-time generation required
Implementation (DDPM)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Noise schedule (linear betas, as in the original DDPM paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t
alphas = 1.0 - betas                            # α_t
alphas_cumprod = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_s

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone (assumed available; any ε-prediction network works)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict noise ε at timestep t
        return self.model(x_t, t)

# Training
def train_step(model, x_0):
    batch_size = x_0.size(0)
    # Sample random timesteps
    t = torch.randint(0, T, (batch_size,))
    # Sample noise
    noise = torch.randn_like(x_0)
    # Create noisy image: x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * noise
    # Predict noise
    noise_pred = model(x_t, t)
    # Loss: MSE between actual and predicted noise
    return F.mse_loss(noise_pred, noise)

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)
    # Iteratively denoise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        noise_pred = model(x_t, t_batch)
        alpha, a_bar, beta = alphas[t], alphas_cumprod[t], betas[t]
        # DDPM posterior mean: (x_t - β_t/√(1 - ᾱ_t) * ε_θ) / √α_t
        x_t = (x_t - beta / torch.sqrt(1 - a_bar) * noise_pred) / torch.sqrt(alpha)
        # Add noise (except at the last step)
        if t > 0:
            x_t += torch.sqrt(beta) * torch.randn_like(x_t)
    return x_t  # x_0 (generated image)
```
Part 5: Autoregressive Models
Concept
Idea: Model probability as product of conditionals
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
For images: Generate pixel-by-pixel (or patch-by-patch)
Architectures:
- PixelCNN: Convolutional with masked kernels (sketched below)
- PixelCNN++: Improved with mixture of logistics
- VQ-VAE + PixelCNN: Two-stage (learn discrete codes, model codes)
- ImageGPT: GPT-style Transformer for images
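The core mechanism in PixelCNN is a masked convolution: each output pixel may only depend on pixels above it and to its left. A minimal sketch (mask types follow the original paper's 'A'/'B' convention; the usage line is illustrative):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # Mask type 'A' hides the center pixel (first layer); 'B' allows it.
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # center row: hide right of center
        mask[:, :, kh // 2 + 1:, :] = 0                         # hide all rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero out connections to "future" pixels
        return super().forward(x)

# First layer of a PixelCNN: must not see the pixel it is predicting.
layer = MaskedConv2d('A', in_channels=3, out_channels=64, kernel_size=7, padding=3)
```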
Advantages
✅ Explicit likelihood: Can compute p(x) exactly
✅ Stable training: Standard cross-entropy loss
✅ Theoretical guarantees: Proper probability model
Disadvantages
❌ Very slow generation: Sequential (can't parallelize)
❌ Limited quality: Worse than GAN/Diffusion for high-res
❌ Resolution scaling: Impractical for 1024×1024 (1M pixels!)
Speed comparison:
GAN: Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds
For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!
When to Use
✅ Use autoregressive for:
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)
❌ Don't use for:
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)
Modern Usage
Two-stage approach (DALL-E, VQ-GAN):
- Stage 1: VQ-VAE learns discrete codes (quantization sketched below)
- Image → 32×32 grid of codes (instead of 1M pixels)
- Stage 2: Autoregressive model (Transformer) on codes
- Much faster (32×32 = 1024 codes, not 1M pixels)
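Stage 1's quantization step, sketched: every encoder output vector is snapped to its nearest codebook entry, and Stage 2 then models the resulting index grid autoregressively (straight-through gradients and codebook losses are omitted here).

```python
import torch

def quantize(z_e, codebook):
    # z_e: (N, D) encoder output vectors; codebook: (K, D) learned code vectors.
    distances = torch.cdist(z_e, codebook) ** 2   # squared L2 to every code
    indices = distances.argmin(dim=1)             # (N,) discrete code indices
    z_q = codebook[indices]                       # (N, D) quantized vectors
    return z_q, indices
```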
Part 6: Flow Models
Concept
Idea: Invertible transformations
z ~ N(0, I) ←→ x ~ p_data
f: z → x (forward)
f⁻¹: x → z (inverse)
Requirement: f must be invertible and differentiable
Advantage: Exact likelihood via change-of-variables
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
Architectures
RealNVP (2017):
- Coupling layers (affine transformations; sketched below)
- Invertible by design
Glow (2018, OpenAI):
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture
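A minimal sketch of a RealNVP-style affine coupling layer: half the dimensions pass through unchanged, the other half get an affine transform predicted from the first half, so inversion and the log-determinant are both cheap and exact.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small net predicts log-scale and shift for the second half
        # from the first half (can be arbitrarily complex).
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)        # exact log|det Jacobian|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)
```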
✅ Use Flow when:
- Exact likelihood needed (better than VAE)
- Invertibility needed (both z→x and x→z)
- Stable training (standard loss)

❌ Drawbacks:
- Architecture constraints (must be invertible)
- Quality not as good as GAN/Diffusion
Modern Status
Mostly superseded by Diffusion:
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications
Part 7: Decision Framework
By Primary Goal
Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if pretrained available
Goal: Fast inference
→ GAN (50ms per image)
→ Avoid Diffusion (too slow for real-time)
Goal: Training stability
→ Diffusion or VAE (standard loss)
→ Avoid GAN (adversarial training hard)
Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)
Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression
Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse common)
By Data Type
Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)
Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)
Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)
Video:
→ Video Diffusion (limited)
→ OR GAN (StyleGAN-V)
Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)
By Training Budget
Large budget (millions $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on cluster
Medium budget (thousands $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on GPU
Small budget (hundreds $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on consumer GPU
Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ Few thousand images, CPU possible
Modern Recommendations (2025)
For new projects:
- Default: Diffusion
  - Fine-tune Stable Diffusion or train from scratch
  - Best quality + stability
- If you need speed: GAN
  - Use pretrained StyleGAN2 if available
  - Or train a GAN (if you can tolerate instability)
- If you need a latent space: VAE
  - For interpolation, not generation quality
AVOID:
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images
Part 8: Training from Scratch vs Fine-Tuning
Stable Diffusion Example
Pretraining (what Stability AI did):
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- DON'T DO THIS!
Fine-tuning (what users do):
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- DO THIS!
LoRA (Low-Rank Adaptation), sketched below:
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- Best for small budgets!
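The idea behind LoRA, sketched as a pure-PyTorch wrapper around a frozen linear layer (in practice you would use a library such as peft or diffusers' built-in LoRA support rather than this hand-rolled version):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank update: W x + (B A) x * scale
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only A and B (a few percent of the original parameter count) are trained, which is why LoRA fits on a consumer GPU.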
Decision
Have pretrained model in your domain:
→ Fine-tune (don't retrain!)
No pretrained model:
→ Train from scratch (small model)
→ OR find closest pretrained and fine-tune
Budget < $1000:
→ LoRA fine-tuning
→ OR train small model (64×64)
Budget < $100:
→ LoRA with free Colab
→ OR VAE from scratch (cheap)
Part 9: Common Mistakes
Mistake 1: VAE for High-Quality Generation
Symptom: Blurry outputs
Fix: Use GAN or Diffusion for quality
VAE is for: Latent space, not generation
Mistake 2: Ignoring Mode Collapse
Symptom: GAN generates same images
Fix: Spectral norm, minibatch discrimination
Better: Switch to Diffusion (no mode collapse)
Mistake 3: Training Stable Diffusion from Scratch
Symptom: Burning money, poor results
Fix: Fine-tune pretrained model
Reality: Pretraining costs $600k+
Mistake 4: Slow Inference with Diffusion
Symptom: 50 seconds per image
Fix: Use DDIM (fewer steps, 10-50)
OR: Use GAN if speed critical
Mistake 5: Wrong Loss for GAN
Symptom: Training diverges
Fix: Use Wasserstein loss (WGAN)
OR: Spectral normalization
Better: Switch to Diffusion (standard loss)
Summary: Quick Reference
Model Selection
High quality + stable training:
→ Diffusion (modern default)
Fast inference required:
→ GAN (if pretrained) or trained GAN
Latent space exploration:
→ VAE
Explicit likelihood:
→ Autoregressive or Flow
Small images (< 64×64):
→ Any model (start with VAE)
Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)
Quality Ranking
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
Training Stability Ranking
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
Modern Stack (2025)
Image generation: Stable Diffusion (fine-tuned)
Fast inference: StyleGAN2 (if available)
Latent space: VAE
Research: Diffusion (easiest to train)
Next Steps
After mastering this skill:
- llm-specialist/llm-finetuning-strategies: Apply to text generation
- architecture-design-principles: Understand design trade-offs
- training-optimization: Optimize training for your chosen model
Remember: Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).