Generative Model Families
When to Use This Skill
Use this skill when you need to:
- ✅ Select generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand modern landscape (Stable Diffusion, StyleGAN, etc.)
Do NOT use this skill for:
- ❌ Text generation (use llm-specialist pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use using-neural-architectures)
Core Principle
Generative models have fundamental trade-offs:
- Quality vs Stability: GANs sharp but unstable, VAEs blurry but stable
- Quality vs Speed: Diffusion high-quality but slow, GANs fast
- Explicitness vs Flexibility: Autoregressive/Flow have likelihood, GANs don't
Modern default (2025): Diffusion models (best quality + stability)
Part 1: Model Family Overview
The Five Families
1. VAE (Variational Autoencoder)
- Approach: Learn latent space with encoder-decoder
- Quality: Blurry (6/10)
- Training: Very stable
- Use: Latent space exploration, NOT high-quality generation
2. GAN (Generative Adversarial Network)
- Approach: Adversarial game (generator vs discriminator)
- Quality: Sharp (9/10)
- Training: Unstable (adversarial dynamics)
- Use: High-quality generation, fast inference
3. Diffusion Models
- Approach: Iterative denoising
- Quality: Very sharp (9.5/10)
- Training: Stable
- Use: Modern default for high-quality generation
4. Autoregressive Models
- Approach: Sequential generation (pixel-by-pixel, token-by-token)
- Quality: Good (7-8/10)
- Training: Stable
- Use: Explicit likelihood, sequential data
5. Flow Models
- Approach: Invertible transformations
- Quality: Good (7-8/10)
- Training: Stable
- Use: Exact likelihood, invertibility needed
Quick Comparison
| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|---|---|---|---|---|---|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |
Part 2: VAE (Variational Autoencoder)
Architecture
Components:
- Encoder: x → z (image to latent)
- Latent space: z ~ N(μ, σ²)
- Decoder: z → x' (latent to reconstruction)
Loss function:
# Negative ELBO (Evidence Lower Bound), minimized during training
loss = reconstruction_loss + KL_divergence
# Reconstruction: How well decoder reconstructs input
reconstruction_loss = MSE(x, x_reconstructed)
# KL: How close latent is to standard normal
KL_divergence = KL(q(z|x) || p(z))
Why VAE is Blurry
Problem: MSE loss encourages pixel-wise averaging
Example:
- Dataset: Faces with both smiles and no smiles
- VAE learns: "Average face has half-smile blur"
- Result: Blurry, hedges between modes
Mathematical reason:
- MSE minimization = mean prediction
- Mean of sharp images = blurry image (see the toy sketch below)
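A toy sketch of this effect in pure PyTorch (hypothetical 8×8 "images"): a single prediction trained with MSE against two different sharp targets converges to their pixel-wise average.

```python
import torch

# Two "sharp" targets: an all-black and an all-white 8x8 patch.
targets = torch.stack([torch.zeros(1, 8, 8), torch.ones(1, 8, 8)])

# One prediction, optimized with MSE against both targets at once.
pred = torch.zeros(1, 8, 8, requires_grad=True)
opt = torch.optim.SGD([pred], lr=1.0)
for _ in range(1000):
    opt.zero_grad()
    loss = ((pred - targets) ** 2).mean()
    loss.backward()
    opt.step()

print(pred.mean().item())  # ~0.5: the pixel-wise average of the two modes, i.e. a blur
```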
When to Use VAE
✅ Use VAE for:
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)
❌ DON'T use VAE for:
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs
Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 64x64 RGB image -> flattened feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder: latent vector -> 64x64 RGB reconstruction
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        # Sample latent
        z = self.reparameterize(mu, logvar)
        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction term
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        # KL divergence between q(z|x) = N(mu, sigma^2) and p(z) = N(0, I)
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_loss
```
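For the anomaly-detection use case listed earlier, the model above is already enough. A minimal sketch using reconstruction error; `threshold` is a hypothetical value you would tune on validation data:

```python
@torch.no_grad()
def is_anomaly(vae, x, threshold=0.05):
    # Inputs the VAE reconstructs poorly are flagged as anomalies.
    # Assumes `vae` was trained on normal data only.
    x_recon, _, _ = vae(x)
    error = ((x - x_recon) ** 2).mean(dim=(1, 2, 3))  # per-sample MSE
    return error > threshold
```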
Part 3: GAN (Generative Adversarial Network)
Architecture
Components:
- Generator: z → x (noise to image)
- Discriminator: x → [0, 1] (image to real/fake probability)
Adversarial Training:
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))
# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))
# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
Training Instability
Problem: Adversarial dynamics are unstable
Common issues:
- Mode collapse: Generator produces limited variety
- Non-convergence: Oscillation, never settles
- Vanishing gradients: Discriminator too strong, generator can't learn
- Hyperparameter sensitivity: Learning rates critical
Solutions:
- Spectral normalization (SNGAN, BigGAN; sketched below)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)
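Of the stabilizers above, spectral normalization is the cheapest to try: it is a one-line wrapper in PyTorch. A minimal sketch on a small discriminator (layer sizes are illustrative):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping weight-bearing layers constrains the discriminator's Lipschitz
# constant, which damps the adversarial dynamics.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),    # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),  # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),
)
```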
Mode Collapse
What is it?
- Generator produces subset of distribution
- Example: Face GAN only generates 10 face types
Why it happens:
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes
Detection:
```python
# Check diversity: generate many samples and measure pairwise distances.
# `latent_dim` and `threshold` are assumed; tune threshold empirically.
with torch.no_grad():
    z = torch.randn(1000, latent_dim)
    samples = generator(z).flatten(start_dim=1)
diversity = torch.pdist(samples).mean()
if diversity < threshold:
    print("Mode collapse detected!")
```
Solutions:
- Minibatch discrimination (see the sketch after this list)
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)
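Minibatch discrimination, in its simplest modern form, is the minibatch standard-deviation feature from ProGAN/StyleGAN: the discriminator is shown how diverse the current batch is, so a collapsed generator gets penalized. A simplified sketch (the original uses per-group statistics):

```python
import torch

def minibatch_std(x):
    # x: (N, C, H, W) discriminator features.
    # One scalar: how much the batch varies, averaged over features/pixels.
    std = x.std(dim=0, unbiased=False).mean()
    # Broadcast it as an extra channel so the discriminator can react to it.
    std_map = std * torch.ones_like(x[:, :1])
    return torch.cat([x, std_map], dim=1)  # (N, C+1, H, W)
```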
Modern GANs
StyleGAN2 (2020):
- State-of-the-art for faces
- Style-based generator
- Weight demodulation and path-length regularization for stability
- Resolution: 1024×1024
StyleGAN3 (2021):
- Alias-free architecture
- Better animation/video
✅ Use GAN when:
- Fast inference needed (50ms per image)
- Pretrained model available (StyleGAN2)
- Can tolerate training difficulty

❌ Don't use GAN when:
- Training instability unacceptable
- Mode collapse problematic
- Starting from scratch (use diffusion instead)
Implementation (Basic GAN)
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Setup
latent_dim = 100
generator = Generator(latent_dim)
discriminator = Discriminator()
optimizer_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
eps = 1e-8  # keeps log() finite

# Training loop
for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator: classify real as real, fake as fake
    optimizer_D.zero_grad()
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())
    D_loss = -torch.mean(torch.log(D_real + eps) + torch.log(1 - D_fake + eps))
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the discriminator (non-saturating loss)
    optimizer_G.zero_grad()
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake + eps))
    G_loss.backward()
    optimizer_G.step()
```
Part 4: Diffusion Models (Modern Default)
Architecture
Concept: Learn to reverse a diffusion (noising) process
Forward process (fixed):
# Gradually add noise to image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)
# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I), β_t = noise schedule
Reverse process (learned):
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)
# Model predicts the added noise: ε_θ(x_t, t)
# DDPM update: x_{t-1} = (x_t - β_t/√(1 - ᾱ_t) * ε_θ(x_t, t)) / √α_t + σ_t * z
# where α_t = 1 - β_t, ᾱ_t = ∏_{s≤t} α_s, z ~ N(0, I) (no noise added at the final step)
Training:
# Simple loss: Predict the noise
loss = MSE(ε, ε_θ(x_t, t))
# x_t = noisy image at step t
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
Why Diffusion is Excellent
Advantages:
- High quality: State-of-the-art (better than GAN)
- Stable training: Standard MSE loss (no adversarial dynamics)
- No mode collapse: By design, covers full distribution
- Controllable: Easy to add conditioning (text, class, etc.)
Disadvantages:
- Slow inference: 50-1000 denoising steps (vs GAN's 1 step)
- Compute intensive: T forward passes (T = 50-1000)
Speed comparison:
GAN: 1 forward pass = 50ms
Diffusion (T=50): 50 forward passes = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds
Speedup techniques:
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler; see the sketch below)
- Latent diffusion (Stable Diffusion, denoise in latent space)
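In practice these speedups are a few lines, assuming the Hugging Face diffusers library (the model ID and exact API shown here are illustrative and may change between versions):

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default sampler for DPM-Solver and cut the step count to ~25
# (vs ~1000 DDPM steps) with little visible quality loss.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a red fox in the snow", num_inference_steps=25).images[0]
image.save("fox.png")
```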
Modern Diffusion Models
Stable Diffusion (2022+):
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable
DALL-E 2 (2022):
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)
Imagen (2022, Google):
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)
✅ Use Diffusion when:
- High-quality generation (best quality)
- Stable training (standard loss)
- Diversity needed (no mode collapse)
- Conditioning (text-to-image, class-conditional)

❌ Don't use Diffusion when:
- Fast inference needed (< 1 second)
- Real-time generation required
Implementation (DDPM)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Noise schedule (linear betas, as in the original DDPM paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t
alphas = 1.0 - betas                            # α_t
alphas_cumprod = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_s

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone (assumed available; any ε-prediction network works)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict noise ε at timestep t
        return self.model(x_t, t)

# Training
def train_step(model, x_0):
    batch_size = x_0.size(0)
    # Sample random timesteps
    t = torch.randint(0, T, (batch_size,))
    # Sample noise
    noise = torch.randn_like(x_0)
    # Create noisy image: x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * noise
    # Predict noise
    noise_pred = model(x_t, t)
    # Loss: MSE between actual and predicted noise
    return F.mse_loss(noise_pred, noise)

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)
    # Iteratively denoise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        noise_pred = model(x_t, t_batch)
        alpha, a_bar, beta = alphas[t], alphas_cumprod[t], betas[t]
        # DDPM posterior mean: (x_t - β_t/√(1 - ᾱ_t) * ε_θ) / √α_t
        x_t = (x_t - beta / torch.sqrt(1 - a_bar) * noise_pred) / torch.sqrt(alpha)
        # Add noise (except at the last step)
        if t > 0:
            x_t += torch.sqrt(beta) * torch.randn_like(x_t)
    return x_t  # x_0 (generated image)
```
Part 5: Autoregressive Models
Concept
Idea: Model probability as product of conditionals
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
For images: Generate pixel-by-pixel (or patch-by-patch)
Architectures:
- PixelCNN: Convolutional with masked kernels (sketched below)
- PixelCNN++: Improved with mixture of logistics
- VQ-VAE + PixelCNN: Two-stage (learn discrete codes, model codes)
- ImageGPT: GPT-style Transformer for images
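The core mechanism in PixelCNN is a masked convolution: each output pixel may only depend on pixels above it and to its left. A minimal sketch (mask types follow the original paper's 'A'/'B' convention; the usage line is illustrative):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # Mask type 'A' hides the center pixel (first layer); 'B' allows it.
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # center row: hide right of center
        mask[:, :, kh // 2 + 1:, :] = 0                         # hide all rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero out connections to "future" pixels
        return super().forward(x)

# First layer of a PixelCNN: must not see the pixel it is predicting.
layer = MaskedConv2d('A', in_channels=3, out_channels=64, kernel_size=7, padding=3)
```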
Advantages
✅ Explicit likelihood: Can compute p(x) exactly
✅ Stable training: Standard cross-entropy loss
✅ Theoretical guarantees: Proper probability model
Disadvantages
❌ Very slow generation: Sequential (can't parallelize)
❌ Limited quality: Worse than GAN/Diffusion for high-res
❌ Resolution scaling: Impractical for 1024×1024 (1M pixels!)
Speed comparison:
GAN: Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds
For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!
When to Use
✅ Use autoregressive for:
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)
❌ Don't use for:
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)
Modern Usage
Two-stage approach (DALL-E, VQ-GAN):
- Stage 1: VQ-VAE learns discrete codes (quantization sketched below)
- Image → 32×32 grid of codes (instead of 1M pixels)
- Stage 2: Autoregressive model (Transformer) on codes
- Much faster (32×32 = 1024 codes, not 1M pixels)
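Stage 1's quantization step, sketched: every encoder output vector is snapped to its nearest codebook entry, and Stage 2 then models the resulting index grid autoregressively (straight-through gradients and codebook losses are omitted here).

```python
import torch

def quantize(z_e, codebook):
    # z_e: (N, D) encoder output vectors; codebook: (K, D) learned code vectors.
    distances = torch.cdist(z_e, codebook) ** 2   # squared L2 to every code
    indices = distances.argmin(dim=1)             # (N,) discrete code indices
    z_q = codebook[indices]                       # (N, D) quantized vectors
    return z_q, indices
```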
Part 6: Flow Models
Concept
Idea: Invertible transformations
z ~ N(0, I) ←→ x ~ p_data
f: z → x (forward)
f⁻¹: x → z (inverse)
Requirement: f must be invertible and differentiable
Advantage: Exact likelihood via change-of-variables
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
Architectures
RealNVP (2017):
- Coupling layers (affine transformations; sketched below)
- Invertible by design
Glow (2018, OpenAI):
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture
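A minimal sketch of a RealNVP-style affine coupling layer: half the dimensions pass through unchanged, the other half get an affine transform predicted from the first half, so inversion and the log-determinant are both cheap and exact.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small net predicts log-scale and shift for the second half
        # from the first half (can be arbitrarily complex).
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)        # exact log|det Jacobian|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)
```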
✅ Use Flow when:
- Exact likelihood needed (better than VAE)
- Invertibility needed (both z→x and x→z)
- Stable training (standard loss)

❌ Drawbacks:
- Architecture constraints (must be invertible)
- Quality not as good as GAN/Diffusion
Modern Status
Mostly superseded by Diffusion:
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications
Part 7: Decision Framework
By Primary Goal
Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if pretrained available
Goal: Fast inference
→ GAN (50ms per image)
→ Avoid Diffusion (too slow for real-time)
Goal: Training stability
→ Diffusion or VAE (standard loss)
→ Avoid GAN (adversarial training hard)
Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)
Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression
Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse common)
By Data Type
Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)
Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)
Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)
Video:
→ Video Diffusion (limited)
→ OR GAN (StyleGAN-V)
Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)
By Training Budget
Large budget (millions $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on cluster
Medium budget (thousands $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on GPU
Small budget (hundreds $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on consumer GPU
Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ Few thousand images, CPU possible
Modern Recommendations (2025)
For new projects:
- Default: Diffusion
  - Fine-tune Stable Diffusion or train from scratch
  - Best quality + stability
- If you need speed: GAN
  - Use pretrained StyleGAN2 if available
  - Or train a GAN (if you can tolerate instability)
- If you need a latent space: VAE
  - For interpolation, not generation quality
AVOID:
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images
Part 8: Training from Scratch vs Fine-Tuning
Stable Diffusion Example
Pretraining (what Stability AI did):
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- DON'T DO THIS!
Fine-tuning (what users do):
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- DO THIS!
LoRA (Low-Rank Adaptation), sketched below:
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- Best for small budgets!
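The idea behind LoRA, sketched as a pure-PyTorch wrapper around a frozen linear layer (in practice you would use a library such as peft or diffusers' built-in LoRA support rather than this hand-rolled version):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank update: W x + (B A) x * scale
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only A and B (a few percent of the original parameter count) are trained, which is why LoRA fits on a consumer GPU.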
Decision
Have pretrained model in your domain:
→ Fine-tune (don't retrain!)
No pretrained model:
→ Train from scratch (small model)
→ OR find closest pretrained and fine-tune
Budget < $1000:
→ LoRA fine-tuning
→ OR train small model (64×64)
Budget < $100:
→ LoRA with free Colab
→ OR VAE from scratch (cheap)
Part 9: Common Mistakes
Mistake 1: VAE for High-Quality Generation
Symptom: Blurry outputs
Fix: Use GAN or Diffusion for quality
VAE is for: Latent space, not generation
Mistake 2: Ignoring Mode Collapse
Symptom: GAN generates same images
Fix: Spectral norm, minibatch discrimination
Better: Switch to Diffusion (no mode collapse)
Mistake 3: Training Stable Diffusion from Scratch
Symptom: Burning money, poor results
Fix: Fine-tune pretrained model
Reality: Pretraining costs $600k+
Mistake 4: Slow Inference with Diffusion
Symptom: 50 seconds per image
Fix: Use DDIM (fewer steps, 10-50)
OR: Use GAN if speed critical
Mistake 5: Wrong Loss for GAN
Symptom: Training diverges
Fix: Use Wasserstein loss (WGAN)
OR: Spectral normalization
Better: Switch to Diffusion (standard loss)
Summary: Quick Reference
Model Selection
High quality + stable training:
→ Diffusion (modern default)
Fast inference required:
→ GAN (if pretrained) or trained GAN
Latent space exploration:
→ VAE
Explicit likelihood:
→ Autoregressive or Flow
Small images (< 64×64):
→ Any model (start with VAE)
Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)
Quality Ranking
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
Training Stability Ranking
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
Modern Stack (2025)
Image generation: Stable Diffusion (fine-tuned)
Fast inference: StyleGAN2 (if available)
Latent space: VAE
Research: Diffusion (easiest to train)
Next Steps
After mastering this skill:
- llm-specialist/llm-finetuning-strategies: Apply to text generation
- architecture-design-principles: Understand design trade-offs
- training-optimization: Optimize training for your chosen model
Remember: Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).