# Generative Model Families

## When to Use This Skill

Use this skill when you need to:
- ✅ Select a generative model family for image/audio/video generation
- ✅ Understand VAE vs GAN vs Diffusion trade-offs
- ✅ Decide between training from scratch and fine-tuning
- ✅ Address mode collapse in GANs
- ✅ Balance quality, speed, and training stability
- ✅ Understand the modern landscape (Stable Diffusion, StyleGAN, etc.)

**Do NOT use this skill for:**
- ❌ Text generation (use `llm-specialist` pack)
- ❌ Architecture implementation details (use model-specific docs)
- ❌ High-level architecture selection (use `using-neural-architectures`)

## Core Principle

**Generative models have fundamental trade-offs:**
- **Quality vs stability**: GANs are sharp but unstable; VAEs are blurry but stable
- **Quality vs speed**: Diffusion is high-quality but slow; GANs are fast
- **Explicitness vs flexibility**: Autoregressive and Flow models give likelihoods; GANs don't

**Modern default (2025):** Diffusion models (best combination of quality and stability)

## Part 1: Model Family Overview

### The Five Families

**1. VAE (Variational Autoencoder)**
- **Approach**: Learn a latent space with an encoder-decoder pair
- **Quality**: Blurry (6/10)
- **Training**: Very stable
- **Use**: Latent space exploration, NOT high-quality generation

**2. GAN (Generative Adversarial Network)**
- **Approach**: Adversarial game (generator vs discriminator)
- **Quality**: Sharp (9/10)
- **Training**: Unstable (adversarial dynamics)
- **Use**: High-quality generation, fast inference

**3. Diffusion Models**
- **Approach**: Iterative denoising
- **Quality**: Very sharp (9.5/10)
- **Training**: Stable
- **Use**: Modern default for high-quality generation

**4. Autoregressive Models**
- **Approach**: Sequential generation (pixel-by-pixel, token-by-token)
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Explicit likelihood, sequential data

**5. Flow Models**
- **Approach**: Invertible transformations
- **Quality**: Good (7-8/10)
- **Training**: Stable
- **Use**: Exact likelihood, invertibility

### Quick Comparison

| Model | Quality | Training Stability | Inference Speed | Mode Collapse | Likelihood |
|-------|---------|--------------------|-----------------|---------------|------------|
| VAE | 6/10 (blurry) | 10/10 | Fast | No | Approximate |
| GAN | 9/10 | 3/10 | Fast | Yes | No |
| Diffusion | 9.5/10 | 9/10 | Slow | No | Approximate |
| Autoregressive | 7-8/10 | 9/10 | Very slow | No | Exact |
| Flow | 7-8/10 | 8/10 | Fast (both directions) | No | Exact |

## Part 2: VAE (Variational Autoencoder)

### Architecture

**Components:**
1. **Encoder**: x → z (image to latent)
2. **Latent space**: z ~ N(μ, σ²)
3. **Decoder**: z → x' (latent to reconstruction)

**Loss function:**
```python
# ELBO (Evidence Lower Bound), written as a loss to minimize
loss = reconstruction_loss + KL_divergence

# Reconstruction: how well the decoder reconstructs the input
reconstruction_loss = MSE(x, x_reconstructed)

# KL: how close the approximate posterior is to the standard normal prior
KL_divergence = KL(q(z|x) || p(z))
```

### Why VAE is Blurry

**Problem**: MSE loss encourages pixel-wise averaging

**Example:**
- Dataset: Faces with and without smiles
- VAE learns: "The average face has a half-smile blur"
- Result: Blurry outputs that hedge between modes

**Mathematical reason:**
- Minimizing MSE yields the conditional mean prediction
- The mean of many sharp images is a blurry image

### When to Use VAE

✅ **Use VAE for:**
- Latent space exploration (interpolation, arithmetic)
- Anomaly detection (reconstruction error)
- Disentangled representations (β-VAE)
- Compression (lossy, with latent codes)

❌ **DON'T use VAE for:**
- High-quality image generation (use GAN or Diffusion!)
- Sharp, realistic outputs

### Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 64x64 RGB image -> flattened feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten()
        )

        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

        # Decoder: latent vector -> 64x64 RGB image
        self.fc_decode = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    # 32x32 -> 64x64
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = μ + σ * ε, with ε ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)

        # Sample latent
        z = self.reparameterize(mu, logvar)

        # Decode
        h = self.fc_decode(z)
        h = h.view(-1, 128, 8, 8)
        x_recon = self.decoder(h)

        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar):
        # Reconstruction loss
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')

        # KL divergence between q(z|x) and N(0, I), in closed form
        # (For a β-VAE, multiply this term by β > 1)
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        return recon_loss + kl_loss
```

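A minimal usage sketch, assuming 64×64 RGB inputs in `[0, 1]` and a PyTorch `DataLoader` named `dataloader` (both are illustrative assumptions, not part of the skill above): one training step, plus the reconstruction error reused as an anomaly score.

```python
model = VAE(latent_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step
for x, _ in dataloader:
    x_recon, mu, logvar = model(x)
    loss = model.loss_function(x, x_recon, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    break  # single step, for illustration only

# Anomaly detection: per-sample reconstruction error as the anomaly score
@torch.no_grad()
def anomaly_score(model, x):
    x_recon, _, _ = model(x)
    return ((x - x_recon) ** 2).flatten(1).mean(dim=1)  # higher = more anomalous
```
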
## Part 3: GAN (Generative Adversarial Network)

### Architecture

**Components:**
1. **Generator**: z → x (noise to image)
2. **Discriminator**: x → [0, 1] (image to real/fake probability)

**Adversarial training:**
```python
# Discriminator loss: classify real as real, fake as fake
D_loss = -log(D(x_real)) - log(1 - D(G(z)))

# Generator loss: fool the discriminator (non-saturating form)
G_loss = -log(D(G(z)))

# Minimax game:
# min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
```

### Training Instability

**Problem**: Adversarial dynamics are unstable

**Common issues:**
1. **Mode collapse**: Generator produces limited variety
2. **Non-convergence**: Training oscillates and never settles
3. **Vanishing gradients**: Discriminator becomes too strong, so the generator can't learn
4. **Hyperparameter sensitivity**: Learning rates are critical

**Solutions:**
- Spectral normalization (SNGAN, BigGAN; see the sketch below)
- Progressive growing (start low-res, increase resolution)
- Minibatch discrimination (penalize lack of diversity)
- Wasserstein loss (WGAN, more stable)

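A minimal sketch of applying spectral normalization to a discriminator with PyTorch's built-in `torch.nn.utils.spectral_norm`; the layer sizes here are illustrative assumptions, not prescribed by this skill.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrapping each weight layer constrains its largest singular value to ~1,
# which bounds the discriminator's Lipschitz constant and stabilizes training.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)),     # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),   # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),
)
```
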
### Mode Collapse

**What is it?**
- The generator covers only a subset of the data distribution
- Example: A face GAN that only generates ~10 distinct face types

**Why it happens:**
- The generator exploits discriminator weaknesses
- It finds "easy" samples that fool the discriminator
- It forgets the other modes

**Detection:**
```python
# Rough check: generate many samples and measure their diversity.
# `generator` and the noise dimension follow the basic GAN below;
# the threshold is dataset-dependent.
with torch.no_grad():
    samples = generator(torch.randn(1000, 100))   # (1000, C, H, W)
    flat = samples.flatten(1)                      # (1000, C*H*W)
    diversity = torch.pdist(flat).mean()           # mean pairwise distance

if diversity < threshold:
    print("Mode collapse suspected!")
```

**Solutions:**
- Minibatch discrimination
- Unrolled GANs (slow, but helps)
- Switch to diffusion (no mode collapse by design!)

### Modern GANs

**StyleGAN2 (2020):**
- State of the art for faces
- Style-based generator
- Weight demodulation and path length regularization for stability
- Resolution: 1024×1024

**StyleGAN3 (2021):**
- Alias-free architecture
- Better suited to animation/video

**When to use GAN:**
✅ Fast inference needed (~50 ms per image)
✅ Pretrained model available (StyleGAN2)
✅ Can tolerate training difficulty

❌ Training instability unacceptable
❌ Mode collapse problematic
❌ Starting from scratch (use diffusion instead)

### Implementation (Basic GAN)

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),           # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),            # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, 2, 1),  # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1),            # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),           # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

# Training loop (assumes generator, discriminator, dataloader,
# optimizer_D, optimizer_G, and latent_dim are defined)
eps = 1e-8  # numerical stability inside the logs
for real_images in dataloader:
    noise = torch.randn(real_images.size(0), latent_dim)

    # Train discriminator: real -> 1, fake -> 0
    fake_images = generator(noise)
    D_real = discriminator(real_images)
    D_fake = discriminator(fake_images.detach())  # detach: don't update G here

    D_loss = -torch.mean(torch.log(D_real + eps) + torch.log(1 - D_fake + eps))
    optimizer_D.zero_grad()
    D_loss.backward()
    optimizer_D.step()

    # Train generator: make the discriminator output 1 on fakes
    D_fake = discriminator(fake_images)
    G_loss = -torch.mean(torch.log(D_fake + eps))
    optimizer_G.zero_grad()
    G_loss.backward()
    optimizer_G.step()
```

## Part 4: Diffusion Models (Modern Default)

### Architecture

**Concept**: Learn to reverse a diffusion (noising) process

**Forward process** (fixed):
```python
# Gradually add noise to the image:
# x_0 (original) → x_1 → x_2 → ... → x_T (pure noise)

# At each step:
# x_t = √(1 - β_t) * x_{t-1} + √β_t * ε
# where ε ~ N(0, I) and β_t follows a noise schedule
```

**Reverse process** (learned):
```python
# The model learns to denoise:
# x_T (noise) → x_{T-1} → ... → x_1 → x_0 (image)

# The model predicts the noise: ε_θ(x_t, t)
# One reverse step takes the posterior mean (plus noise, except at t = 0):
# x_{t-1} = (x_t - (β_t / √(1 - ᾱ_t)) * ε_θ(x_t, t)) / √α_t
# where α_t = 1 - β_t and ᾱ_t = ∏_{s≤t} α_s
```

**Training:**
```python
# Simple loss: predict the noise
# loss = MSE(ε, ε_θ(x_t, t))

# x_t          = noisy image at step t
# ε            = actual noise added
# ε_θ(x_t, t)  = model's noise prediction
```

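A minimal sketch of the pieces the pseudocode above leaves implicit, assuming the standard DDPM linear schedule (T, the 1e-4 to 0.02 range, and the tensor shapes are illustrative assumptions): the β schedule, the cumulative ᾱ, and the closed-form jump from x_0 to x_t used during training.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # β_t: linear noise schedule
alphas = 1.0 - betas                         # α_t
alphas_bar = torch.cumprod(alphas, dim=0)    # ᾱ_t = ∏_{s≤t} α_s

def q_sample(x_0, t, eps):
    """Closed-form forward process: x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x_0 + (1 - a_bar).sqrt() * eps
```

The DDPM implementation later in this part uses exactly these quantities (`alphas_bar` plays the role of ᾱ_t in the formulas above).
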
### Why Diffusion is Excellent

**Advantages:**
1. **High quality**: State of the art (better than GANs)
2. **Stable training**: Standard MSE loss (no adversarial dynamics)
3. **No mode collapse**: Covers the full distribution by design
4. **Controllable**: Easy to add conditioning (text, class, etc.)

**Disadvantages:**
1. **Slow inference**: 50-1000 denoising steps (vs a GAN's single step)
2. **Compute intensive**: T forward passes (T = 50-1000)

**Speed comparison:**
```
GAN:                 1 forward pass      ≈ 50 ms
Diffusion (T=50):    50 forward passes   ≈ 2.5 seconds
Diffusion (T=1000):  1000 forward passes ≈ 50 seconds
```

**Speedup techniques:**
- DDIM (fewer steps: 10-50 instead of 1000; see the sketch below)
- DPM-Solver (fast sampler)
- Latent diffusion (Stable Diffusion: denoise in a compressed latent space)

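A sketch of both speedups at once using the Hugging Face `diffusers` library: a pretrained latent-diffusion model sampled with a DDIM scheduler and far fewer steps. The model id, dtype, prompt, and step count are illustrative assumptions, and the exact API can vary across `diffusers` versions.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a pretrained latent-diffusion model (model id is only an example)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default sampler for DDIM and use far fewer denoising steps
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a watercolor painting of a lighthouse", num_inference_steps=25).images[0]
image.save("lighthouse.png")
```
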
### Modern Diffusion Models

**Stable Diffusion (2022+):**
- Latent diffusion (denoising happens in a VAE latent space)
- Text conditioning (CLIP text encoder)
- Pretrained on billions of images
- Fine-tunable

**DALL-E 2 (2022):**
- Prior network (text → image embedding)
- Diffusion decoder (embedding → image)

**Imagen (2022, Google):**
- Text conditioning with a T5 encoder
- Cascaded diffusion (64×64 → 256×256 → 1024×1024)

**When to use Diffusion:**
✅ High-quality generation (best quality)
✅ Stable training (standard loss)
✅ Diversity needed (no mode collapse)
✅ Conditioning (text-to-image, class-conditional)

❌ Need fast inference (< 1 second)
❌ Real-time generation

### Implementation (DDPM)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionModel(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # U-Net backbone (assumed to be defined elsewhere) that takes a noisy
        # image and a timestep, and predicts the noise that was added
        self.model = UNet(
            in_channels=img_channels,
            out_channels=img_channels,
            time_embedding_dim=256
        )

    def forward(self, x_t, t):
        # Predict the noise ε at timestep t
        return self.model(x_t, t)

# Noise schedule (ᾱ_t is the cumulative product of α_t = 1 - β_t)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

# Training
def train_step(model, x_0):
    batch_size = x_0.size(0)

    # Sample a random timestep per example
    t = torch.randint(0, T, (batch_size,))

    # Sample noise
    eps = torch.randn_like(x_0)

    # Create the noisy image x_t in closed form
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps

    # Predict the noise
    eps_pred = model(x_t, t)

    # Loss: MSE between actual and predicted noise
    loss = F.mse_loss(eps_pred, eps)
    return loss

# Sampling (generation)
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise
    x_t = torch.randn(shape)

    # Iteratively denoise
    for t in reversed(range(T)):
        # Predict the noise
        eps_pred = model(x_t, torch.full((shape[0],), t, dtype=torch.long))

        # Denoise one step (DDPM posterior mean)
        alpha_t = alphas[t]
        beta_t = betas[t]
        a_bar_t = alphas_bar[t]
        x_t = (x_t - beta_t / torch.sqrt(1 - a_bar_t) * eps_pred) / torch.sqrt(alpha_t)

        # Add noise (except at the last step)
        if t > 0:
            x_t += torch.sqrt(beta_t) * torch.randn_like(x_t)

    return x_t  # x_0 (generated image)
```

## Part 5: Autoregressive Models

### Concept

**Idea**: Model the joint probability as a product of conditionals
```
p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... * p(x_n|x_1,...,x_{n-1})
```

**For images**: Generate pixel-by-pixel (or patch-by-patch)

**Architectures:**
- **PixelCNN**: Convolutional, with masked kernels so each pixel only sees earlier pixels (see the sketch below)
- **PixelCNN++**: Improved with a mixture-of-logistics output
- **VQ-VAE + PixelCNN**: Two-stage (learn discrete codes, then model the codes)
- **ImageGPT**: GPT-style Transformer over image tokens

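A minimal sketch of the masked convolution behind PixelCNN (the two mask types follow the original paper; this exact class and the layer sizes are illustrative assumptions): a type-'A' mask hides the current pixel, a type-'B' mask allows it.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so output pixel (i, j) depends only
    on pixels above it and to its left (raster-scan order)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        self.register_buffer('mask', torch.ones_like(self.weight))
        _, _, kh, kw = self.weight.shape
        # Zero out the current pixel (type 'A' only) and everything to its right...
        self.mask[:, :, kh // 2, kw // 2 + (mask_type == 'B'):] = 0
        # ...and every row below the center row.
        self.mask[:, :, kh // 2 + 1:] = 0

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

# Usage: the first layer uses mask 'A'; later layers use mask 'B'
layer = MaskedConv2d('A', in_channels=3, out_channels=64, kernel_size=7, padding=3)
```
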
### Advantages

✅ **Explicit likelihood**: Can compute p(x) exactly
✅ **Stable training**: Standard cross-entropy loss
✅ **Theoretical guarantees**: A proper probability model

### Disadvantages

❌ **Very slow generation**: Sequential (can't be parallelized)
❌ **Limited quality**: Worse than GAN/Diffusion for high resolution
❌ **Resolution scaling**: Impractical for 1024×1024 (about 1M pixels!)

**Speed comparison:**
```
GAN:       generate 1024×1024 in ~50 ms (parallel)
PixelCNN:  generate 32×32 in ~5 seconds (sequential!)
ImageGPT:  generate 256×256 in ~30 seconds

For 1024×1024: 1M pixels × 5 ms/pixel ≈ 83 minutes!
```

### When to Use

✅ **Use autoregressive models for:**
- Explicit likelihood (compression, evaluation)
- Small images (32×32, 64×64)
- Two-stage models (VQ-VAE + Transformer)

❌ **Don't use them for:**
- High-resolution images (too slow)
- Real-time generation
- Quality-critical applications (use diffusion)

### Modern Usage

**Two-stage approach (DALL-E, VQ-GAN):**
1. **Stage 1**: A VQ-VAE learns a codebook of discrete codes
   - Image → 32×32 grid of codes (instead of ~1M pixels)
2. **Stage 2**: An autoregressive model (Transformer) is trained on the codes
   - Much faster (32×32 = 1024 codes, not 1M pixels)

A sketch of the Stage-1 quantization step follows below.

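A minimal sketch of the quantization at the heart of Stage 1 (the codebook shapes and the straight-through trick are standard VQ-VAE ingredients, but this exact helper is an illustrative assumption):

```python
import torch

def quantize(z_e, codebook):
    """Map encoder outputs to their nearest codebook entries.

    z_e:      (N, D) continuous encoder outputs (one row per spatial position)
    codebook: (K, D) learnable embedding table
    """
    distances = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    indices = distances.argmin(dim=1)        # nearest code per position
    z_q = codebook[indices]                  # quantized latents, (N, D)

    # Straight-through estimator: copy gradients from z_q to z_e during backprop
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices                      # Stage 2 models p(indices) autoregressively
```
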
## Part 6: Flow Models

### Concept

**Idea**: Invertible transformations between noise and data
```
z ~ N(0, I) ←→ x ~ p_data

f:   z → x  (forward)
f⁻¹: x → z  (inverse)
```

**Requirement**: f must be invertible and differentiable

**Advantage**: Exact likelihood via the change-of-variables formula
```
log p(x) = log p(z) + log |det(∂f⁻¹/∂x)|
```

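A minimal RealNVP-style affine coupling layer, sketched under simple assumptions (1-D feature vectors with an even dimension; the hidden width and the `tanh` clamp are illustrative choices). Coupling layers are invertible by construction, and their log-determinant is just the sum of the log-scales.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Split features in half; transform one half conditioned on the other."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),           # outputs scale and shift for the second half
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                     # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                # log |det Jacobian| of this layer
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1)
```

Stacking several coupling layers (alternating which half is transformed) gives an invertible model whose exact log-likelihood is `log p(z) + sum(log_det)`.
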
### Architectures

**RealNVP (2017):**
- Coupling layers (affine transformations)
- Invertible by design

**Glow (2018, OpenAI):**
- Actnorm, invertible 1×1 convolutions
- Multi-scale architecture

**When to use Flow:**
✅ Exact likelihood needed (unlike the VAE's approximate bound)
✅ Invertibility needed (both z → x and x → z)
✅ Stable training (standard loss)

❌ Architecture constraints (every layer must be invertible)
❌ Quality not as good as GAN/Diffusion

### Modern Status

**Mostly superseded by diffusion:**
- Diffusion has better quality
- Diffusion is more flexible (no invertibility constraint)
- Flow models are still used in specialized applications

## Part 7: Decision Framework

### By Primary Goal

```
Goal: High-quality images
→ Diffusion (modern default)
→ OR GAN if a pretrained model is available

Goal: Fast inference
→ GAN (~50 ms per image)
→ Avoid diffusion (too slow for real-time)

Goal: Training stability
→ Diffusion or VAE (standard losses)
→ Avoid GAN (adversarial training is hard)

Goal: Latent space exploration
→ VAE (smooth interpolation)
→ Avoid GAN (no encoder)

Goal: Explicit likelihood
→ Autoregressive or Flow
→ For evaluation, compression

Goal: Diversity (no mode collapse)
→ Diffusion (by design)
→ OR VAE (stable)
→ Avoid GAN (mode collapse is common)
```

### By Data Type

```
Images (high-quality):
→ Diffusion (Stable Diffusion)
→ OR GAN (StyleGAN2)

Images (small, 32×32):
→ Any model works
→ Try VAE first (simplest)

Audio waveforms:
→ WaveGAN
→ OR Diffusion (WaveGrad)

Video:
→ Video diffusion (still limited)
→ OR GAN (StyleGAN-V)

Text:
→ Autoregressive (GPT)
→ NOT VAE/GAN/Diffusion (discrete tokens)
```

### By Training Budget

```
Large budget (millions of $, pretrain from scratch):
→ Diffusion (Stable Diffusion scale)
→ Billions of images, weeks on a cluster

Medium budget (thousands of $, train from scratch):
→ GAN or Diffusion
→ 10k-1M images, days on a GPU

Small budget (hundreds of $, fine-tune):
→ Fine-tune Stable Diffusion (LoRA)
→ 1k-10k images, hours on a consumer GPU

Tiny budget (research, small scale):
→ VAE (simplest, most stable)
→ A few thousand images, CPU possible
```

### Modern Recommendations (2025)

**For new projects:**
1. **Default: Diffusion**
   - Fine-tune Stable Diffusion or train from scratch
   - Best quality + stability

2. **If you need speed: GAN**
   - Use pretrained StyleGAN2 if available
   - Or train a GAN (if you can tolerate the instability)

3. **If you need a latent space: VAE**
   - For interpolation, not generation quality

**AVOID:**
- Training a GAN from scratch (unless necessary)
- Using a VAE for high-quality generation
- Autoregressive models for high-resolution images

## Part 8: Training from Scratch vs Fine-Tuning

### Stable Diffusion Example

**Pretraining (what Stability AI did):**
- Dataset: LAION-5B (5 billion image-text pairs)
- Compute: ~150,000 A100 GPU hours
- Cost: ~$600,000
- Time: Weeks on a massive cluster
- **DON'T DO THIS!**

**Fine-tuning (what users do):**
- Dataset: 10k-100k domain images
- Compute: 100-1000 GPU hours
- Cost: $100-1,000
- Time: Days on a single A100
- **DO THIS!**

**LoRA (Low-Rank Adaptation):**
- Efficient fine-tuning (trains far fewer parameters; see the sketch below)
- Dataset: 1k-5k images
- Compute: 10-100 GPU hours
- Cost: $10-100
- Time: Hours on a consumer GPU (RTX 3090)
- **Best for small budgets!**

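A minimal sketch of the LoRA idea itself, wrapping a single linear layer. The rank, alpha, and this wrapper class are illustrative assumptions; in practice you would use a library such as `peft` or the LoRA support built into `diffusers` rather than hand-rolling this.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = W x + (B A) x * scale."""
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen

        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Only A and B are trained, so optimizer state and checkpoints stay tiny.
```
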
### Decision

```
Have a pretrained model in your domain:
→ Fine-tune (don't retrain!)

No pretrained model:
→ Train from scratch (small model)
→ OR find the closest pretrained model and fine-tune

Budget < $1000:
→ LoRA fine-tuning
→ OR train a small model (64×64)

Budget < $100:
→ LoRA on free Colab
→ OR a VAE from scratch (cheap)
```

## Part 9: Common Mistakes

### Mistake 1: VAE for High-Quality Generation

**Symptom:** Blurry outputs
**Fix:** Use a GAN or diffusion for quality
**VAE is for:** Latent spaces, not generation quality

### Mistake 2: Ignoring Mode Collapse

**Symptom:** The GAN generates the same few images
**Fix:** Spectral normalization, minibatch discrimination
**Better:** Switch to diffusion (no mode collapse)

### Mistake 3: Training Stable Diffusion from Scratch

**Symptom:** Burning money, poor results
**Fix:** Fine-tune a pretrained model
**Reality:** Pretraining costs $600k+

### Mistake 4: Slow Inference with Diffusion

**Symptom:** 50 seconds per image
**Fix:** Use DDIM (fewer steps, 10-50)
**OR:** Use a GAN if speed is critical

### Mistake 5: Wrong Loss for GAN

**Symptom:** Training diverges
**Fix:** Use the Wasserstein loss (WGAN; see the sketch below)
**OR:** Spectral normalization
**Better:** Switch to diffusion (standard loss)

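A minimal sketch of the WGAN-GP recipe mentioned in Mistake 5 (the critic drops the sigmoid; the penalty weight of 10 is the value from the WGAN-GP paper; the `critic` module and batch shapes are illustrative assumptions):

```python
import torch

def gradient_penalty(critic, real, fake):
    """Push the critic's gradient norm toward 1 on points interpolated between real and fake."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads, = torch.autograd.grad(scores.sum(), mixed, create_graph=True)
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

# Critic and generator losses (no logs, no sigmoid on the critic output):
# D_loss = critic(fake).mean() - critic(real).mean() + 10 * gradient_penalty(critic, real, fake)
# G_loss = -critic(fake).mean()
```
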
## Summary: Quick Reference

### Model Selection

```
High quality + stable training:
→ Diffusion (modern default)

Fast inference required:
→ GAN (pretrained if possible, otherwise trained)

Latent space exploration:
→ VAE

Explicit likelihood:
→ Autoregressive or Flow

Small images (< 64×64):
→ Any model (start with VAE)

Large images (> 256×256):
→ Diffusion or GAN (avoid autoregressive)
```

### Quality Ranking

```
1. Diffusion (9.5/10)
2. GAN (9/10)
3. Autoregressive (7-8/10)
4. Flow (7-8/10)
5. VAE (6/10 - blurry)
```

### Training Stability Ranking

```
1. VAE (10/10)
2. Diffusion (9/10)
3. Autoregressive (9/10)
4. Flow (8/10)
5. GAN (3/10 - very unstable)
```

### Modern Stack (2025)

```
Image generation: Stable Diffusion (fine-tuned)
Fast inference:   StyleGAN2 (if available)
Latent space:     VAE
Research:         Diffusion (easiest to train)
```

## Next Steps

After mastering this skill:
- `llm-specialist/llm-finetuning-strategies`: Apply these ideas to text generation
- `architecture-design-principles`: Understand design trade-offs
- `training-optimization`: Optimize training for your chosen model

**Remember:** Diffusion models dominate in 2025. Use them unless you have a specific reason not to (speed, latent space, explicit likelihood).