
Generative Model Families

When to Use This Skill

Use this skill when you need to:

  • Select generative model for image/audio/video generation
  • Understand VAE vs GAN vs Diffusion trade-offs
  • Decide between training from scratch vs fine-tuning
  • Address mode collapse in GANs
  • Choose between quality, speed, and training stability
  • Understand modern landscape (Stable Diffusion, StyleGAN, etc.)

Do NOT use this skill for:

  • Text generation (use llm-specialist pack)
  • Architecture implementation details (use model-specific docs)
  • High-level architecture selection (use using-neural-architectures)

Core Principle

Generative models have fundamental trade-offs:

  • Quality vs Stability: GANs sharp but unstable, VAEs blurry but stable
  • Quality vs Speed: Diffusion high-quality but slow, GANs fast
  • Explicitness vs Flexibility: Autoregressive/Flow have likelihood, GANs don't

Modern default (2025): Diffusion models (best quality + stability)

Part 1: Model Family Overview

The Five Families

1. VAE (Variational Autoencoder)

  • Approach: Learn latent space with encoder-decoder
  • Quality: Blurry (6/10)
  • Training: Very stable
  • Use: Latent space exploration, NOT high-quality generation

2. GAN (Generative Adversarial Network)

  • Approach: Adversarial game (generator vs discriminator)
  • Quality: Sharp (9/10)
  • Training: Unstable (adversarial dynamics)
  • Use: High-quality generation, fast inference

3. Diffusion Models

  • Approach: Iterative denoising
  • Quality: Very sharp (9.5/10)
  • Training: Stable
  • Use: Modern default for high-quality generation

4. Autoregressive Models

  • Approach: Sequential generation (pixel-by-pixel, token-by-token)
  • Quality: Good (7-8/10)
  • Training: Stable
  • Use: Explicit likelihood, sequential data

5. Flow Models

  • Approach: Invertible transformations
  • Quality: Good (7-8/10)
  • Training: Stable
  • Use: Exact likelihood, invertibility needed

Quick Comparison

| Model          | Quality       | Training Stability | Inference Speed  | Mode Collapse | Likelihood  |
|----------------|---------------|--------------------|------------------|---------------|-------------|
| VAE            | 6/10 (blurry) | 10/10              | Fast             | No            | Approximate |
| GAN            | 9/10          | 3/10               | Fast             | Yes           | No          |
| Diffusion      | 9.5/10        | 9/10               | Slow             | No            | Approximate |
| Autoregressive | 7-8/10        | 9/10               | Very slow        | No            | Exact       |
| Flow           | 7-8/10        | 8/10               | Fast (both ways) | No            | Exact       |

Part 2: VAE (Variational Autoencoder)

Architecture

Components:

  1. Encoder: x → z (image to latent)
  2. Latent space: z ~ N(μ, σ²)
  3. Decoder: z → x' (latent to reconstruction)

Loss function:

# ELBO (Evidence Lower Bound)
loss = reconstruction_loss + KL_divergence

# Reconstruction: How well decoder reconstructs input
reconstruction_loss = MSE(x, x_reconstructed)

# KL: How close latent is to standard normal
KL_divergence = KL(q(z|x) || p(z))

Why VAE is Blurry

Problem: MSE loss encourages pixel-wise averaging

Example:

  • Dataset: Faces with both smiles and no smiles
  • VAE learns: "Average face has half-smile blur"
  • Result: Blurry, hedges between modes

Mathematical reason:

  • MSE minimization = mean prediction
  • Mean of sharp images = blurry image

When to Use VAE

Use VAE for:

  • Latent space exploration (interpolation, arithmetic)
  • Anomaly detection (reconstruction error)
  • Disentangled representations (β-VAE)
  • Compression (lossy, with latent codes)

DON'T use VAE for:

  • High-quality image generation (use GAN or Diffusion!)
  • Sharp, realistic outputs

Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )

        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)

        # Sample latent
        z = self.reparameterize(mu, logvar)

        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)

        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')

        # KL divergence
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        return recon_loss + kl_loss
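
A minimal training loop for this model might look like the following sketch; the dataloader, learning rate, and epoch count are illustrative, and images are assumed to be scaled to [0, 1] to match the Sigmoid output.

# Minimal VAE training loop (sketch; dataloader and hyperparameters are illustrative)
model = VAE(latent_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for x, _ in dataloader:                      # assumes (image, label) batches in [0, 1]
        x_recon, mu, logvar = model(x)
        loss = model.loss_function(x, x_recon, mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()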

Part 3: GAN (Generative Adversarial Network)

Architecture

Components:

  1. Generator: z → x (noise to image)
  2. Discriminator: x → [0, 1] (image to real/fake probability)

Adversarial Training:

# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))

# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))

# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]

Training Instability

Problem: Adversarial dynamics are unstable

Common issues:

  1. Mode collapse: Generator produces limited variety
  2. Non-convergence: Oscillation, never settles
  3. Vanishing gradients: Discriminator too strong, generator can't learn
  4. Hyperparameter sensitivity: Learning rates critical

Solutions:

  • Spectral normalization (SNGAN/BigGAN; see the sketch below)
  • Progressive growing (start low-res, increase)
  • Minibatch discrimination (penalize lack of diversity)
  • Wasserstein loss (WGAN, more stable)
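
To make the first item concrete: PyTorch ships spectral normalization as torch.nn.utils.spectral_norm, and applying it is a one-line change per layer. A sketch (layer sizes mirror the discriminator defined later in this skill):

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Sketch: spectral normalization bounds the discriminator's Lipschitz constant,
# which damps the adversarial dynamics (technique from SNGAN)
sn_discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 32, 4, 2, 1)),    # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(32, 64, 4, 2, 1)),   # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),  # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),
    nn.Sigmoid()
)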

Mode Collapse

What is it?

  • Generator produces subset of distribution
  • Example: Face GAN only generates 10 face types

Why it happens:

  • Generator exploits discriminator weaknesses
  • Finds "easy" samples that fool discriminator
  • Forgets other modes

Detection:

# Check diversity: generate many samples and measure pairwise distances
z = torch.randn(1000, latent_dim)
samples = generator(z).flatten(start_dim=1)
diversity = torch.cdist(samples, samples).mean()
if diversity < threshold:
    print("Mode collapse detected!")

Solutions:

  • Minibatch discrimination
  • Unrolled GANs (slow but helps)
  • Switch to diffusion (no mode collapse by design!)
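
Minibatch discrimination is often implemented in its simplified form, a minibatch standard-deviation layer (as in ProGAN/StyleGAN): it exposes batch-level diversity to the discriminator so a collapsed generator is easy to penalize. A sketch:

import torch
import torch.nn as nn

# Sketch: minibatch standard-deviation layer (simplified minibatch discrimination,
# as used in ProGAN/StyleGAN). Appends the average per-feature std across the
# batch as an extra channel, so the discriminator can see a lack of diversity.
class MinibatchStdDev(nn.Module):
    def forward(self, x):
        # x: (batch, channels, H, W)
        std = x.std(dim=0, unbiased=False).mean()              # scalar diversity signal
        std_map = std.expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, std_map], dim=1)                  # one extra channel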

Modern GANs

StyleGAN2 (2020):

  • State-of-the-art for faces
  • Style-based generator
  • Weight demodulation and path length regularization for stability
  • Resolution: 1024×1024

StyleGAN3 (2021):

  • Alias-free architecture
  • Better animation/video

When to use GAN:

  • Fast inference needed (~50ms per image)
  • Pretrained model available (StyleGAN2)
  • Can tolerate training difficulty

Avoid GAN when:

  • Training instability is unacceptable
  • Mode collapse is problematic
  • Starting from scratch (use diffusion instead)

Implementation (Basic GAN)

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop
for real_images in dataloader:
    batch_size = real_images.size(0)
    noise = torch.randn(batch_size, latent_dim)

    # Train discriminator: classify real as real, fake as fake
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach so G isn't updated here

    D_loss = -torch.mean(torch.log(D_real + 1e-8) + torch.log(1 - D_fake + 1e-8))
    optimizer_D.zero_grad()
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the discriminator
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake + 1e-8))
    optimizer_G.zero_grad()
    G_loss.backward()
    optimizer_G.step()

Part 4: Diffusion Models (Modern Default)

Architecture

Concept: Learn to reverse a diffusion (noising) process

Forward process (fixed):

# Gradually add noise to the image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)

# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I) and β_t follows the noise schedule

Reverse process (learned):

# The model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)

# Model predicts the noise: ε_θ(x_t, t)
# DDPM update, with α_t = 1 - β_t and ᾱ_t = ∏_{s≤t} α_s:
# x_{t-1} = (x_t - (1 - α_t)/√(1 - ᾱ_t) * ε_θ(x_t, t)) / √α_t  (+ σ_t * z for t > 0)

Training:

# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))

# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction

Why Diffusion is Excellent

Advantages:

  1. High quality: State-of-the-art (better than GAN)
  2. Stable training: Standard MSE loss (no adversarial dynamics)
  3. No mode collapse: By design, covers full distribution
  4. Controllable: Easy to add conditioning (text, class, etc.)

Disadvantages:

  1. Slow inference: 50-1000 denoising steps (vs GAN's 1 step)
  2. Compute intensive: T forward passes (T = 50-1000)

Speed comparison:

GAN: 1 forward pass = 50ms
Diffusion (T=50): 50 forward passes = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds

Speedup techniques:

  • DDIM (fewer steps, 10-50 instead of 1000)
  • DPM-Solver (fast sampler)
  • Latent diffusion (Stable Diffusion, denoise in latent space)
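
As a concrete example, with the Hugging Face diffusers library swapping in a DDIM sampler is a two-line change; the model ID, prompt, and step count below are illustrative:

# Sketch: faster sampling via DDIM in Hugging Face diffusers
# (requires `pip install diffusers transformers`; model ID and settings are illustrative)
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with DDIM and use far fewer denoising steps
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a red fox in snow", num_inference_steps=25).images[0]
image.save("fox.png")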

Modern Diffusion Models

Stable Diffusion (2022+):

  • Latent diffusion (denoise in VAE latent space)
  • Text conditioning (CLIP text encoder)
  • Pretrained on billions of images
  • Fine-tunable

DALL-E 2 (2022):

  • Prior network (text → image embedding)
  • Diffusion decoder (embedding → image)

Imagen (2022, Google):

  • Text conditioning with T5 encoder
  • Cascaded diffusion (64×64 → 256×256 → 1024×1024)

When to use Diffusion:

  • High-quality generation (best quality)
  • Stable training (standard loss)
  • Diversity needed (no mode collapse)
  • Conditioning (text-to-image, class-conditional)

Avoid Diffusion when:

  • Fast inference is needed (< 1 second)
  • Real-time generation is required

Implementation (DDPM)

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone with timestep embeddings (implementation assumed, not shown here)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict noise ε at timestep t
        return self.model(x_t, t)

# Training
def train_step(model, x_0):
    batch_size = x_0.size(0)

    # Sample a random timestep for each image
    t = torch.randint(0, T, (batch_size,))

    # Sample noise
    eps = torch.randn_like(x_0)

    # Create noisy image x_t (uses the cumulative product ᾱ_t; schedule defined below)
    alpha_bar_t = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * eps

    # Predict noise
    eps_pred = model(x_t, t)

    # Loss: MSE between actual and predicted noise
    loss = F.mse_loss(eps_pred, eps)
    return loss

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)

    # Iteratively denoise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)

        # Predict noise
        eps_pred = model(x_t, t_batch)

        # Denoise one step (DDPM update)
        alpha_t, alpha_bar_t = alphas[t], alpha_bars[t]
        x_t = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

        # Add noise (except at the final step)
        if t > 0:
            x_t += torch.sqrt(betas[t]) * torch.randn_like(x_t)

    return x_t  # x_0 (generated image)
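
The schedule tensors referenced above (T, betas, alphas, alpha_bars) can be precomputed once; a minimal sketch using the linear β schedule from the original DDPM paper:

# Linear noise schedule (values from the original DDPM paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # β_t
alphas = 1.0 - betas                         # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)    # ᾱ_t = ∏_{s≤t} α_s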

Part 5: Autoregressive Models

Concept

Idea: Model probability as product of conditionals

p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})

For images: Generate pixel-by-pixel (or patch-by-patch)

Architectures:

  • PixelCNN: Convolutional with masked kernels
  • PixelCNN++: Improved with mixture of logistics
  • VQ-VAE + PixelCNN: Two-stage (learn discrete codes, model codes)
  • ImageGPT: GPT-style Transformer for images

Advantages

  • Explicit likelihood: can compute p(x) exactly
  • Stable training: standard cross-entropy loss
  • Theoretical guarantees: a proper probability model

Disadvantages

  • Very slow generation: sequential, can't parallelize
  • Limited quality: worse than GAN/Diffusion for high resolutions
  • Resolution scaling: impractical for 1024×1024 (1M pixels!)

Speed comparison:

GAN: Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds

For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!

When to Use

Use autoregressive for:

  • Explicit likelihood needed (compression, evaluation)
  • Small images (32×32, 64×64)
  • Two-stage models (VQ-VAE + Transformer)

Don't use for:

  • High-resolution images (too slow)
  • Real-time generation
  • Quality-critical applications (use diffusion)

Modern Usage

Two-stage approach (DALL-E, VQ-GAN):

  1. Stage 1: VQ-VAE learns discrete codes
    • Image → 32×32 grid of codes (instead of 1M pixels)
  2. Stage 2: Autoregressive model (Transformer) on codes
    • Much faster (32×32 = 1024 codes, not 1M pixels)
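
A rough sketch of the Stage 1 quantization step, where encoder outputs are snapped to their nearest codebook entries (dimensions and names are illustrative):

import torch
import torch.nn as nn

# Sketch: the vector-quantization step inside a VQ-VAE (dimensions illustrative)
class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):
        # z_e: encoder output, shape (batch, H*W, code_dim)
        book = self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1)
        distances = torch.cdist(z_e, book)       # (batch, H*W, num_codes)
        codes = distances.argmin(dim=-1)          # discrete indices for Stage 2
        z_q = self.codebook(codes)                # quantized latents

        # Straight-through estimator: gradients flow to the encoder as if
        # quantization were the identity
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes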

Part 6: Flow Models

Concept

Idea: Invertible transformations

z ~ N(0, I)  ←→  x ~ p_data

f: z → x (forward)
f⁻¹: x → z (inverse)

Requirement: f must be invertible and differentiable

Advantage: Exact likelihood via change-of-variables

log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|

Architectures

RealNVP (2017):

  • Coupling layers (affine transformations)
  • Invertible by design

Glow (2018, OpenAI):

  • Actnorm, invertible 1×1 convolutions
  • Multi-scale architecture
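
To make "invertible by design" concrete, here is a minimal affine coupling layer in the RealNVP style (hidden sizes and names are illustrative): half of the dimensions pass through unchanged, and the other half are scaled and shifted by functions of the first half, so the inverse is available in closed form.

import torch
import torch.nn as nn

# Sketch: one RealNVP-style affine coupling layer (sizes are illustrative)
class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Predicts log-scale (s) and shift (t) for the second half from the first half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t            # affine transform of the second half
        log_det = s.sum(dim=-1)               # log |det Jacobian| for the likelihood
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-s)         # exact inverse, no iteration needed
        return torch.cat([y1, x2], dim=-1)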

When to use Flow:

  • Exact likelihood needed (better than VAE)
  • Invertibility needed (both z→x and x→z)
  • Stable training (standard loss)

Avoid Flow when:

  • Architecture constraints bite (every layer must be invertible)
  • Quality matters most (not as good as GAN/Diffusion)

Modern Status

Mostly superseded by Diffusion:

  • Diffusion has better quality
  • Diffusion more flexible (no invertibility constraint)
  • Flow models still used in specialized applications

Part 7: Decision Framework

By Primary Goal

Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if pretrained available

Goal: Fast inference
→ GAN (50ms per image)
→ Avoid Diffusion (too slow for real-time)

Goal: Training stability
→ Diffusion or VAE (standard loss)
→ Avoid GAN (adversarial training hard)

Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)

Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression

Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse common)

By Data Type

Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)

Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)

Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)

Video:
→ Video Diffusion (limited)
→ OR GAN (StyleGAN-V)

Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)

By Training Budget

Large budget (millions $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on cluster

Medium budget (thousands $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on GPU

Small budget (hundreds $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on consumer GPU

Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ Few thousand images, CPU possible

Modern Recommendations (2025)

For new projects:

  1. Default: Diffusion

    • Fine-tune Stable Diffusion or train from scratch
    • Best quality + stability
  2. If need speed: GAN

    • Use pretrained StyleGAN2 if available
    • Or train GAN (if can tolerate instability)
  3. If need latent space: VAE

    • For interpolation, not generation quality

AVOID:

  • Training GAN from scratch (unless necessary)
  • Using VAE for high-quality generation
  • Autoregressive for high-res images

Part 8: Training from Scratch vs Fine-Tuning

Stable Diffusion Example

Pretraining (what Stability AI did):

  • Dataset: LAION-5B (5 billion images)
  • Compute: 150,000 A100 GPU hours
  • Cost: ~$600,000
  • Time: Weeks on massive cluster
  • DON'T DO THIS!

Fine-tuning (what users do):

  • Dataset: 10k-100k domain images
  • Compute: 100-1000 GPU hours
  • Cost: $100-1,000
  • Time: Days on single A100
  • DO THIS!

LoRA (Low-Rank Adaptation):

  • Efficient fine-tuning (fewer parameters)
  • Dataset: 1k-5k images
  • Compute: 10-100 GPU hours
  • Cost: $10-100
  • Time: Hours on consumer GPU (RTX 3090)
  • Best for small budgets!
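
The core idea is small enough to sketch by hand: freeze the pretrained weight and learn only a low-rank update BA. Names and the rank below are illustrative; in practice you would use a library such as Hugging Face peft or the diffusers training scripts rather than hand-rolling this.

import torch
import torch.nn as nn

# Sketch: a LoRA-wrapped linear layer. The pretrained weight stays frozen;
# only the low-rank factors A and B (rank r << layer dims) are trained.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank update (B @ A) applied to x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)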

Decision

Have pretrained model in your domain:
→ Fine-tune (don't retrain!)

No pretrained model:
→ Train from scratch (small model)
→ OR find closest pretrained and fine-tune

Budget < $1000:
→ LoRA fine-tuning
→ OR train small model (64×64)

Budget < $100:
→ LoRA with free Colab
→ OR VAE from scratch (cheap)

Part 9: Common Mistakes

Mistake 1: VAE for High-Quality Generation

  • Symptom: Blurry outputs
  • Fix: Use GAN or Diffusion for quality
  • VAE is for: latent space, not generation quality

Mistake 2: Ignoring Mode Collapse

  • Symptom: GAN generates the same images over and over
  • Fix: Spectral normalization, minibatch discrimination
  • Better: Switch to Diffusion (no mode collapse)

Mistake 3: Training Stable Diffusion from Scratch

  • Symptom: Burning money, poor results
  • Fix: Fine-tune a pretrained model
  • Reality: Pretraining costs $600k+

Mistake 4: Slow Inference with Diffusion

  • Symptom: 50 seconds per image
  • Fix: Use DDIM (fewer steps, 10-50)
  • OR: Use a GAN if speed is critical

Mistake 5: Wrong Loss for GAN

  • Symptom: Training diverges
  • Fix: Use Wasserstein loss (WGAN) or spectral normalization
  • Better: Switch to Diffusion (standard loss)

Summary: Quick Reference

Model Selection

High quality + stable training:
→ Diffusion (modern default)

Fast inference required:
→ GAN (if pretrained) or trained GAN

Latent space exploration:
→ VAE

Explicit likelihood:
→ Autoregressive or Flow

Small images (< 64×64):
→ Any model (start with VAE)

Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)

Quality Ranking

1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)

Training Stability Ranking

1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)

Modern Stack (2025)

Image generation: Stable Diffusion (fine-tuned)
Fast inference: StyleGAN2 (if available)
Latent space: VAE
Research: Diffusion (easiest to train)

Next Steps

After mastering this skill:

  • llm-specialist/llm-finetuning-strategies: Apply to text generation
  • architecture-design-principles: Understand design trade-offs
  • training-optimization: Optimize training for your chosen model

Remember: Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).