# Generative Model Families
## When to Use This Skill
Use this skill when you need to:
- ✅ Select generative model for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch vs fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Choose between quality, speed, and training stability
- ✅ Understand modern landscape (Stable Diffusion, StyleGAN, etc.)
**Do NOT use this skill for:**
- ❌ Text generation (use `llm-specialist` pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use `using-neural-architectures`)
## Core Principle
**Generative models have fundamental trade-offs:**
- **Quality vs Stability**: GANs sharp but unstable, VAEs blurry but stable
- **Quality vs Speed**: Diffusion high-quality but slow, GANs fast
- **Explicitness vs Flexibility**: Autoregressive/Flow have likelihood, GANs don't
**Modern default (2025):** Diffusion models (best quality + stability)
## Part 1: Model Family Overview
### The Five Families
**1. VAE (Variational Autoencoder)**
- **Approach**: Learn latent space with encoder-decoder
- **Quality**: Blurry (6/10)
- **Training**: Very stable
- **Use**: Latent space exploration, NOT high-quality generation
**2. GAN (Generative Adversarial Network)**
- **Approach**: Adversarial game (generator vs discriminator)
- **Quality**: Sharp (9/10)
- **Training**: Unstable (adversarial dynamics)
- **Use**: High-quality generation, fast inference
**3. Diffusion Models**
- **Approach**: Iterative denoising
- **Quality**: Very sharp (9.5/10)
- **Training**: Stable
- **Use**: Modern default for high-quality generation
**4. Autoregressive Models**
- **Approach**: Sequential generation (pixel-by-pixel, token-by-token)
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Explicit likelihood, sequential data
**5. Flow Models**
- **Approach**: Invertible transformations
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Exact likelihood, invertibility needed
### Quick Comparison
| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|-------|---------|-------------------|----------------|---------------|------------|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both ways) | No | Exact |
## Part 2: VAE (Variational Autoencoder)
### Architecture
**Components:**
1. **Encoder**: x → z (image to latent)
2. **Latent space**: z ~ N(μ, σ²)
3. **Decoder**: z → x' (latent to reconstruction)
**Loss function:**
```python
# Negative ELBO (Evidence Lower Bound): minimize reconstruction error + KL term
loss = reconstruction_loss + KL_divergence
# Reconstruction: how well the decoder reconstructs the input
reconstruction_loss = MSE(x, x_reconstructed)
# KL: how close the approximate posterior is to the standard normal prior
KL_divergence = KL(q(z|x) || p(z))
```
### Why VAE is Blurry
**Problem**: MSE loss encourages pixel-wise averaging
**Example:**
- Dataset: Faces with both smiles and no smiles
- VAE learns: "Average face has half-smile blur"
- Result: Blurry, hedges between modes
**Mathematical reason:**
- MSE minimization = mean prediction
- Mean of sharp images = blurry image
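A toy sketch of this effect (the "pixel" patterns and names are illustrative, not from any dataset): when two sharp targets are equally likely, the single prediction that minimizes expected MSE is their pixelwise mean, not either sharp mode.
```python
import torch

sharp_a = torch.tensor([1.0, 0.0, 1.0, 0.0])  # toy "smile" pattern
sharp_b = torch.tensor([0.0, 1.0, 0.0, 1.0])  # toy "no smile" pattern
mean_pred = (sharp_a + sharp_b) / 2            # all 0.5s -> the "blur"

def expected_mse(pred):
    # Each sharp target occurs with probability 0.5
    return 0.5 * ((pred - sharp_a) ** 2).mean() + 0.5 * ((pred - sharp_b) ** 2).mean()

print(expected_mse(sharp_a))    # 0.50 -> committing to one sharp mode is penalized
print(expected_mse(mean_pred))  # 0.25 -> the blurry average wins under MSE
```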
### When to Use VAE
**Use VAE for:**
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)
**DON'T use VAE for:**
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs
### Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε keeps sampling differentiable
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        # Sample latent
        z = self.reparameterize(mu, logvar)
        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss (assumes inputs scaled to [0, 1])
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')
        # KL divergence between q(z|x) and the standard normal prior
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_loss
```
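A minimal usage sketch for the VAE above. The batch here is random stand-in data purely for illustration; swap in a real dataloader of 64×64 RGB images scaled to [0, 1].
```python
model = VAE(latent_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 3, 64, 64)            # stand-in batch; replace with real data
x_recon, mu, logvar = model(x)
loss = model.loss_function(x, x_recon, mu, logvar)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```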
## Part 3: GAN (Generative Adversarial Network)
### Architecture
**Components:**
1. **Generator**: z → x (noise to image)
2. **Discriminator**: x → [0, 1] (image to real/fake probability)
**Adversarial Training:**
```python
# Discriminator loss: Classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))
# Generator loss: Fool discriminator
G_loss = -log(D(G(z)))
# Minimax game:
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
```
### Training Instability
**Problem**: Adversarial dynamics are unstable
**Common issues:**
1. **Mode collapse**: Generator produces limited variety
2. **Non-convergence**: Oscillation, never settles
3. **Vanishing gradients**: Discriminator too strong, generator can't learn
4. **Hyperparameter sensitivity**: Learning rates critical
**Solutions:**
- Spectral normalization (SNGAN, BigGAN)
- Progressive growing (start low-res, increase)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)
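A minimal sketch of the first stabilizer in practice: spectral normalization on the discriminator via PyTorch's built-in `torch.nn.utils.spectral_norm` wrapper. The layer sizes assume 64×64 inputs and are illustrative, not from this skill.
```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),    # constrain each layer's Lipschitz constant
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),  # 64x64 input -> 16x16 feature map
)
```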
### Mode Collapse
**What is it?**
- Generator produces subset of distribution
- Example: Face GAN only generates 10 face types
**Why it happens:**
- Generator exploits discriminator weaknesses
- Finds "easy" samples that fool discriminator
- Forgets other modes
**Detection:**
```python
# Rough diversity check: generate many samples and compare their pairwise
# distances against those of real data (threshold is problem-specific)
with torch.no_grad():
    z = torch.randn(1000, latent_dim)
    samples = generator(z).flatten(start_dim=1)
diversity = torch.cdist(samples, samples).mean()
if diversity < threshold:
    print("Mode collapse suspected: samples are unusually similar")
```
**Solutions:**
- Minibatch discrimination
- Unrolled GANs (slow but helps)
- Switch to diffusion (no mode collapse by design!)
### Modern GANs
**StyleGAN2 (2020):**
- State-of-the-art for faces
- Style-based generator
- Weight demodulation + path length regularization for stability
- Resolution: 1024×1024
**StyleGAN3 (2021):**
- Alias-free architecture
- Better animation/video
**When to use GAN:**
✅ Fast inference needed (50ms per image)
✅ Pretrained model available (StyleGAN2)
✅ Can tolerate training difficulty
❌ Training instability unacceptable
❌ Mode collapse problematic
❌ Starting from scratch (use diffusion instead)
### Implementation (Basic GAN)
```python
class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop (minimal; in practice use BCEWithLogitsLoss for numerical stability)
for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator: classify real as real, fake as fake
    optimizer_D.zero_grad()
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach: don't update G here
    D_loss = -torch.mean(torch.log(D_real) + torch.log(1 - D_fake))
    D_loss.backward()
    optimizer_D.step()

    # Train generator: fool the discriminator
    optimizer_G.zero_grad()
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake))
    G_loss.backward()
    optimizer_G.step()
```
## Part 4: Diffusion Models (Modern Default)
### Architecture
**Concept**: Learn to reverse a diffusion (noising) process
**Forward process** (fixed):
```python
# Gradually add noise to the image
x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)
# At each step:
x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
where ε ~ N(0, I) and β_t follows the noise schedule
```
**Reverse process** (learned):
```python
# Model learns to denoise
x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)
# Model predicts the noise: ε_θ(x_t, t)
# One reverse step (DDPM): x_{t-1} = (x_t - (1 - α_t)/√(1 - ᾱ_t) * ε_θ(x_t, t)) / √α_t + σ_t * z
# where α_t = 1 - β_t, ᾱ_t = α_1 * ... * α_t, and z ~ N(0, I)
```
**Training:**
```python
# Simple loss: predict the noise that was added
loss = MSE(ε, ε_θ(x_t, t))
# x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε   (noisy image at step t, sampled in closed form)
# ε = actual noise added
# ε_θ(x_t, t) = model's noise prediction
```
### Why Diffusion is Excellent
**Advantages:**
1. **High quality**: State-of-the-art (better than GAN)
2. **Stable training**: Standard MSE loss (no adversarial dynamics)
3. **No mode collapse**: By design, covers full distribution
4. **Controllable**: Easy to add conditioning (text, class, etc.)
**Disadvantages:**
1. **Slow inference**: 50-1000 denoising steps (vs GAN's 1 step)
2. **Compute intensive**: T forward passes (T = 50-1000)
**Speed comparison:**
```
GAN: 1 forward pass = 50ms
Diffusion (T=50): 50 forward passes = 2.5 seconds
Diffusion (T=1000): 1000 forward passes = 50 seconds
```
**Speedup techniques:**
- DDIM (fewer steps, 10-50 instead of 1000)
- DPM-Solver (fast sampler)
- Latent diffusion (Stable Diffusion, denoise in latent space)
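A hedged example of the DDIM speedup in practice, using the Hugging Face `diffusers` library (the library, checkpoint name, and prompt are assumptions, not part of this skill; adapt to whatever pipeline you actually use):
```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in the DDIM sampler and cut the step count from ~50 to ~25
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a red fox in snow", num_inference_steps=25).images[0]
image.save("fox.png")
```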
### Modern Diffusion Models
**Stable Diffusion (2022+):**
- Latent diffusion (denoise in VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable
**DALL-E 2 (2022):**
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)
**Imagen (2022, Google):**
- Text conditioning with T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)
**When to use Diffusion:**
✅ High-quality generation (best quality)
✅ Stable training (standard loss)
✅ Diversity needed (no mode collapse)
✅ Conditioning (text-to-image, class-conditional)
❌ Need fast inference (< 1 second)
❌ Real-time generation
### Implementation (DDPM)
```python
# Noise schedule (linear betas; alpha_bars is the cumulative product ᾱ_t)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone (any time-conditioned U-Net implementation works here)
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict the noise ε added at timestep t
        return self.model(x_t, t)

# Training
def train_step(model, x_0):
    batch_size = x_0.shape[0]
    # Sample a random timestep per example
    t = torch.randint(0, T, (batch_size,))
    # Sample noise
    eps = torch.randn_like(x_0)
    # Create noisy image in closed form: x_t = √ᾱ_t * x_0 + √(1 - ᾱ_t) * ε
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    # Loss: MSE between actual and predicted noise
    eps_pred = model(x_t, t)
    loss = F.mse_loss(eps_pred, eps)
    return loss

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)
    # Iteratively denoise from t = T-1 down to 0
    for t in reversed(range(T)):
        # Predict noise
        eps_pred = model(x_t, torch.full((shape[0],), t, dtype=torch.long))
        alpha, a_bar, beta = alphas[t], alpha_bars[t], betas[t]
        # DDPM mean: remove the predicted noise and rescale
        x_t = (x_t - (1 - alpha) / torch.sqrt(1 - a_bar) * eps_pred) / torch.sqrt(alpha)
        # Add noise back in (except at the last step)
        if t > 0:
            x_t += torch.sqrt(beta) * torch.randn_like(x_t)
    return x_t  # x_0 (generated image)
```
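A minimal usage sketch tying the pieces above together. It assumes a `UNet` implementation is available (as in the block above) and a `dataloader` of 64×64 RGB images; both are placeholders, not part of this skill.
```python
model = DiffusionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

for x_0 in dataloader:
    loss = train_step(model, x_0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

images = sample(model, shape=(4, 3, 64, 64))  # four generated samples
```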
## Part 5: Autoregressive Models
### Concept
**Idea**: Model probability as product of conditionals
```
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
```
**For images**: Generate pixel-by-pixel (or patch-by-patch)
**Architectures:**
- **PixelCNN**: Convolutional with masked kernels
- **PixelCNN++**: Improved with mixture of logistics
- **VQ-VAE + PixelCNN**: Two-stage (learn discrete codes, model codes)
- **ImageGPT**: GPT-style Transformer for images
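A minimal sketch of the masked convolution that makes PixelCNN autoregressive: each output pixel may only see pixels above it and to its left, and the first layer (mask type 'A') also hides the current pixel. Real PixelCNN additionally masks across channels; this single-channel version is simplified for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose receptive field only covers pixels above / to the left."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # center (type A) and everything right of it
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask[None, None])     # broadcast over channels

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# First layer masks its own pixel (type 'A'); deeper layers use type 'B'
layer = MaskedConv2d("A", 1, 64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))  # -> (1, 64, 28, 28)
```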
### Advantages
**Explicit likelihood**: Can compute p(x) exactly
**Stable training**: Standard cross-entropy loss
**Theoretical guarantees**: Proper probability model
### Disadvantages
**Very slow generation**: Sequential (can't parallelize)
**Limited quality**: Worse than GAN/Diffusion for high-res
**Resolution scaling**: Impractical for 1024×1024 (1M pixels!)
**Speed comparison:**
```
GAN: Generate 1024×1024 in 50ms (parallel)
PixelCNN: Generate 32×32 in 5 seconds (sequential!)
ImageGPT: Generate 256×256 in 30 seconds
For 1024×1024: 1M pixels × 5ms/pixel = 83 minutes!
```
### When to Use
**Use autoregressive for:**
- Explicit likelihood needed (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)
**Don't use for:**
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)
### Modern Usage
**Two-stage approach (DALL-E, VQ-GAN):**
1. **Stage 1**: VQ-VAE learns discrete codes
- Image → 32×32 grid of codes (instead of 1M pixels)
2. **Stage 2**: Autoregressive model (Transformer) on codes
- Much faster (32×32 = 1024 codes, not 1M pixels)
## Part 6: Flow Models
### Concept
**Idea**: Invertible transformations
```
z ~ N(0, I) ←→ x ~ p_data
f: z → x (forward)
f⁻¹: x → z (inverse)
```
**Requirement**: f must be invertible and differentiable
**Advantage**: Exact likelihood via change-of-variables
```
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
```
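A minimal sketch of how flow models achieve this in practice, via a RealNVP-style affine coupling layer: invertible by construction, with a triangular Jacobian whose log-determinant is just the sum of the predicted log-scales. Dimensions and the inner network are illustrative assumptions.
```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, 256), nn.ReLU(),
            nn.Linear(256, dim),              # outputs log-scale and shift
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t        # transform the second half
        log_det = log_s.sum(dim=1)            # log |det Jacobian| for the likelihood
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)     # exact inverse, no iteration needed
        return torch.cat([y1, x2], dim=1)

layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))  # True: invertible by design
```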
### Architectures
**RealNVP (2017):**
- Coupling layers (affine transformations)
- Invertible by design
**Glow (2018, OpenAI):**
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture
**When to use Flow:**
✅ Exact likelihood needed (better than VAE)
✅ Invertibility needed (both z→x and x→z)
✅ Stable training (standard loss)
❌ Architecture constraints (must be invertible)
❌ Quality not as good as GAN/Diffusion
### Modern Status
**Mostly superseded by Diffusion:**
- Diffusion has better quality
- Diffusion more flexible (no invertibility constraint)
- Flow models still used in specialized applications
## Part 7: Decision Framework
### By Primary Goal
```
Goal: High-quality images
  → Diffusion (modern default)
  → OR GAN if pretrained available

Goal: Fast inference
  → GAN (50ms per image)
  → Avoid Diffusion (too slow for real-time)

Goal: Training stability
  → Diffusion or VAE (standard loss)
  → Avoid GAN (adversarial training hard)

Goal: Latent space exploration
  → VAE (smooth interpolation)
  → Avoid GAN (no encoder)

Goal: Explicit likelihood
  → Autoregressive or Flow
  → For evaluation, compression

Goal: Diversity (no mode collapse)
  → Diffusion (by design)
  → OR VAE (stable)
  → Avoid GAN (mode collapse common)
```
### By Data Type
```
Images (high-quality):
  → Diffusion (Stable Diffusion)
  → OR GAN (StyleGAN2)

Images (small, 32×32):
  → Any model works
  → Try VAE first (simplest)

Audio waveforms:
  → WaveGAN
  → OR Diffusion (WaveGrad)

Video:
  → Video Diffusion (limited)
  → OR GAN (StyleGAN-V)

Text:
  → Autoregressive (GPT)
  → NOT VAE/GAN/Diffusion (discrete tokens)
```
### By Training Budget
```
Large budget (millions $, pretrain from scratch):
  → Diffusion (Stable Diffusion scale)
  → Billions of images, weeks on cluster

Medium budget (thousands $, train from scratch):
  → GAN or Diffusion
  → 10k-1M images, days on GPU

Small budget (hundreds $, fine-tune):
  → Fine-tune Stable Diffusion (LoRA)
  → 1k-10k images, hours on consumer GPU

Tiny budget (research, small scale):
  → VAE (simplest, most stable)
  → Few thousand images, CPU possible
```
### Modern Recommendations (2025)
**For new projects:**
1. **Default: Diffusion**
- Fine-tune Stable Diffusion or train from scratch
- Best quality + stability
2. **If need speed: GAN**
- Use pretrained StyleGAN2 if available
- Or train GAN (if can tolerate instability)
3. **If need latent space: VAE**
- For interpolation, not generation quality
**AVOID:**
- Training GAN from scratch (unless necessary)
- Using VAE for high-quality generation
- Autoregressive for high-res images
## Part 8: Training from Scratch vs Fine-Tuning
### Stable Diffusion Example
**Pretraining (what Stability AI did):**
- Dataset: LAION-5B (5 billion images)
- Compute: 150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on massive cluster
- **DON'T DO THIS!**
**Fine-tuning (what users do):**
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on single A100
- **DO THIS!**
**LoRA (Low-Rank Adaptation):**
- Efficient fine-tuning (fewer parameters)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on consumer GPU (RTX 3090)
- **Best for small budgets!**
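A minimal sketch of the LoRA idea itself: freeze the pretrained weight and learn only a low-rank update, so a layer trains r*(d_in + d_out) new parameters instead of d_in*d_out. In practice you would use a library such as `peft` rather than rolling your own; the layer sizes and rank here are illustrative assumptions.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Pretrained path + low-rank learned update (B @ A)
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable params vs 590,592 for full fine-tuning of this layer
```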
### Decision
```
Have pretrained model in your domain:
  → Fine-tune (don't retrain!)

No pretrained model:
  → Train from scratch (small model)
  → OR find closest pretrained and fine-tune

Budget < $1000:
  → LoRA fine-tuning
  → OR train small model (64×64)

Budget < $100:
  → LoRA with free Colab
  → OR VAE from scratch (cheap)
```
## Part 9: Common Mistakes
### Mistake 1: VAE for High-Quality Generation
**Symptom:** Blurry outputs
**Fix:** Use GAN or Diffusion for quality
**VAE is for:** Latent space, not generation
### Mistake 2: Ignoring Mode Collapse
**Symptom:** GAN generates same images
**Fix:** Spectral norm, minibatch discrimination
**Better:** Switch to Diffusion (no mode collapse)
### Mistake 3: Training Stable Diffusion from Scratch
**Symptom:** Burning money, poor results
**Fix:** Fine-tune pretrained model
**Reality:** Pretraining costs $600k+
### Mistake 4: Slow Inference with Diffusion
**Symptom:** 50 seconds per image
**Fix:** Use DDIM (fewer steps, 10-50)
**OR:** Use GAN if speed critical
### Mistake 5: Wrong Loss for GAN
**Symptom:** Training diverges
**Fix:** Use Wasserstein loss (WGAN)
**OR:** Spectral normalization
**Better:** Switch to Diffusion (standard loss)
## Summary: Quick Reference
### Model Selection
```
High quality + stable training:
  → Diffusion (modern default)

Fast inference required:
  → GAN (pretrained if available, otherwise train one)

Latent space exploration:
  → VAE

Explicit likelihood:
  → Autoregressive or Flow

Small images (< 64×64):
  → Any model (start with VAE)

Large images (> 256×256):
  → Diffusion or GAN (avoid autoregressive)
```
### Quality Ranking
```
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
```
### Training Stability Ranking
```
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
```
### Modern Stack (2025)
```
Image generation: Stable Diffusion (fine-tuned)
Fast inference: StyleGAN2 (if available)
Latent space: VAE
Research: Diffusion (easiest to train)
```
## Next Steps
After mastering this skill:
- `llm-specialist/llm-finetuning-strategies`: Apply to text generation
- `architecture-design-principles`: Understand design trade-offs
- `training-optimization`: Optimize training for your chosen model
**Remember:** Diffusion models dominate in 2025. Use them unless you have specific reason not to (speed, latent space, likelihood).