Theoretical Foundations of scvi-tools
This document explains the mathematical and statistical principles underlying scvi-tools.
Core Concepts
Variational Inference
What is it? Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.
Why use it?
- Exact inference is computationally intractable for complex models
- Scales to large datasets (millions of cells)
- Provides uncertainty quantification
- Enables Bayesian reasoning about cell states
How does it work?
- Define a simpler approximate distribution q(z|x) with learnable parameters
- Minimize the KL divergence between q(z|x) and true posterior p(z|x)
- Equivalent to maximizing the Evidence Lower Bound (ELBO)
ELBO Objective:
ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
- Reconstruction term: Model should generate data similar to observed
- Regularization term: Latent representation should match prior
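As a concrete illustration, here is a minimal PyTorch sketch of the two ELBO terms for a Gaussian approximate posterior and a standard normal prior. The `log_likelihood` callable is a placeholder assumption for illustration; in scVI it would be an NB/ZINB log-probability.

```python
import torch

def elbo_terms(x, qz_mean, qz_logvar, log_likelihood):
    """One-sample Monte Carlo estimate of the ELBO per cell."""
    # Reparameterized sample z = mu + sigma * eps
    eps = torch.randn_like(qz_mean)
    z = qz_mean + torch.exp(0.5 * qz_logvar) * eps

    # Reconstruction term: E_q[log p(x|z)]
    recon = log_likelihood(x, z)

    # Regularization term: analytic KL(N(mu, sigma^2) || N(0, I)) per cell
    kl = -0.5 * torch.sum(1 + qz_logvar - qz_mean.pow(2) - qz_logvar.exp(), dim=-1)

    return recon - kl  # maximize this (or minimize its negative)
```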
Variational Autoencoders (VAEs)
Architecture:
x (observed data)
↓
[Encoder Neural Network]
↓
z (latent representation)
↓
[Decoder Neural Network]
↓
x̂ (reconstructed data)
Encoder: Maps cells (x) to latent space (z)
- Learns q(z|x), the approximate posterior
- Parameterized by neural network with learnable weights
- Outputs mean and variance of latent distribution
Decoder: Maps latent space (z) back to gene space
- Learns p(x|z), the likelihood
- Generates gene expression from latent representation
- Models count distributions (Negative Binomial, Zero-Inflated NB)
Reparameterization Trick:
- Allows backpropagation through stochastic sampling
- Sample z = μ + σ ⊙ ε, where ε ~ N(0, I)
- Enables end-to-end training with gradient descent
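A minimal PyTorch sketch of the reparameterization trick; because the randomness enters only through ε, gradients flow back to μ and σ:

```python
import torch

mu = torch.zeros(10, requires_grad=True)       # mean of q(z|x)
log_var = torch.zeros(10, requires_grad=True)  # log-variance of q(z|x)

eps = torch.randn(10)                    # eps ~ N(0, I), no gradient needed
z = mu + torch.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

z.sum().backward()                 # gradients reach mu and log_var
print(mu.grad is not None)         # True: backprop works through the sample
```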
Amortized Inference
Concept: Share encoder parameters across all cells.
Traditional inference: Learn separate latent variables for each cell
- n_cells × n_latent parameters
- Doesn't scale to large datasets
Amortized inference: Learn single encoder for all cells
- Fixed number of parameters regardless of cell count
- Enables fast inference on new cells
- Transfers learned patterns across dataset
Benefits:
- Scalable to millions of cells
- Fast inference on query data
- Leverages shared structure across cells
- Enables few-shot learning
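A back-of-the-envelope comparison of the two approaches; the cell, gene, hidden-layer, and latent sizes below are illustrative assumptions, not library defaults:

```python
# Rough parameter counts, assuming 1,000,000 cells, 2,000 genes,
# a 128-unit hidden layer, and a 10-dimensional latent space.
n_cells, n_genes, n_hidden, n_latent = 1_000_000, 2_000, 128, 10

# Non-amortized: free mean and variance per cell and latent dimension
per_cell_params = n_cells * n_latent * 2            # 20,000,000

# Amortized: one shared encoder (weights + biases), independent of n_cells
encoder_params = (n_genes * n_hidden + n_hidden) + 2 * (n_hidden * n_latent + n_latent)
print(per_cell_params, encoder_params)              # 20,000,000 vs. 258,708
```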
Statistical Modeling
Count Data Distributions
Single-cell data are counts (integer-valued), requiring appropriate distributions.
Negative Binomial (NB)
x ~ NB(μ, θ)
- μ (mean): Expected expression level
- θ (dispersion): Controls variance
- Variance: Var(x) = μ + μ²/θ
When to use: Gene expression without zero-inflation
- More flexible than Poisson (allows overdispersion)
- Models technical and biological variation
Zero-Inflated Negative Binomial (ZINB)
x ~ π·δ₀ + (1-π)·NB(μ, θ)
- π (dropout rate): Probability of technical zero
- δ₀: Point mass at zero
- NB(μ, θ): Expression when not dropped out
When to use: Sparse scRNA-seq data
- Models technical dropout separately from biological zeros
- Better fit for highly sparse data (e.g., 10x data)
Poisson
x ~ Poisson(μ)
- Simplest count distribution
- Mean equals variance: Var(x) = μ
When to use: Less common; ATAC-seq fragment counts
- More restrictive than NB
- Faster computation
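For reference, a hand-written sketch of the NB and ZINB log-likelihoods in the (μ, θ) parameterization used above. This is written for illustration rather than taken from scvi-tools, which ships its own NegativeBinomial and ZeroInflatedNegativeBinomial distributions.

```python
import torch

def nb_log_prob(x, mu, theta, eps=1e-8):
    """log NB(x; mu, theta): mean mu, inverse dispersion theta; x is a float tensor of counts."""
    log_theta_mu = torch.log(theta + mu + eps)
    return (
        torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
        + theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
    )

def zinb_log_prob(x, mu, theta, pi, eps=1e-8):
    """log ZINB(x; mu, theta, pi) with dropout probability pi."""
    nb_case = torch.log(1 - pi + eps) + nb_log_prob(x, mu, theta, eps)
    # A zero can come from dropout or from the NB component itself
    nb_zero = theta * (torch.log(theta + eps) - torch.log(theta + mu + eps))
    zero_case = torch.log(pi + (1 - pi) * torch.exp(nb_zero) + eps)
    return torch.where(x < eps, zero_case, nb_case)
```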
Batch Correction Framework
Problem: Technical variation confounds biological signal
- Different sequencing runs, protocols, labs
- Must remove technical effects while preserving biology
scvi-tools approach:
- Encode batch as categorical variable s
- Include s in generative model
- Latent space z is batch-invariant
- Decoder conditions on s for batch-specific effects
Mathematical formulation:
Encoder: q(z|x, s) - batch-aware encoding
Latent: z - batch-corrected representation
Decoder: p(x|z, s) - batch-specific decoding
Key insight: Batch info flows through decoder, not latent space
- z captures biological variation
- s explains technical variation
- Separable biology and batch effects
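Registering a batch covariate with scvi-tools is what puts s into the model; a minimal usage sketch, assuming an AnnData object `adata` with a `"batch"` column in `adata.obs`:

```python
import scvi

# Tell the model which obs column encodes the batch covariate s
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")

model = scvi.model.SCVI(adata)   # learns q(z|x, s) and p(x|z, s)
model.train()

# Batch-corrected latent representation z for downstream analysis
latent = model.get_latent_representation()
```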
Deep Generative Modeling
Generative model: Learns p(x), the data distribution
Process:
- Sample latent variable: z ~ p(z) = N(0, I)
- Generate expression: x ~ p(x|z)
- Joint distribution: p(x, z) = p(x|z)p(z)
Benefits:
- Generate synthetic cells
- Impute missing values
- Quantify uncertainty
- Perform counterfactual predictions
Inference network: Inverts generative process
- Given x, infer z
- q(z|x) approximates true posterior p(z|x)
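A minimal sketch of the generative process with a placeholder decoder (the decoder architecture, gene count, and Poisson likelihood are illustrative assumptions):

```python
import torch

n_latent, n_genes = 10, 2000
decoder = torch.nn.Sequential(          # placeholder for a trained decoder
    torch.nn.Linear(n_latent, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_genes), torch.nn.Softplus(),
)

z = torch.randn(5, n_latent)            # z ~ p(z) = N(0, I): 5 synthetic cells
rate = decoder(z)                       # decoder outputs a positive expression rate
x = torch.distributions.Poisson(rate).sample()  # x ~ p(x|z), Poisson for simplicity
```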
Model Architecture Details
scVI Architecture
Input: Gene expression counts x ∈ ℕ^G (G genes)
Encoder:
h = ReLU(W₁·x + b₁)
μ_z = W₂·h + b₂
log σ²_z = W₃·h + b₃
z ~ N(μ_z, σ²_z)
Latent space: z ∈ ℝ^d (typically d=10-30)
Decoder:
h = ReLU(W₄·z + b₄)
μ = softmax(W₅·h + b₅) · library_size
θ = exp(W₆·h + b₆)
π = sigmoid(W₇·h + b₇) # for ZINB
x ~ ZINB(μ, θ, π)
Loss function (ELBO):
L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
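The equations above translate roughly into the following PyTorch sketch. It mirrors the structure of scVI but is a simplified illustration, not the library implementation; layer sizes, attribute names, and the gene-wise dispersion are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySCVI(nn.Module):
    """Simplified scVI-style VAE with an NB likelihood (no zero-inflation)."""

    def __init__(self, n_genes, n_hidden=128, n_latent=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.enc_mu = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU())
        self.dec_scale = nn.Linear(n_hidden, n_genes)         # softmax -> gene proportions
        self.log_theta = nn.Parameter(torch.zeros(n_genes))   # gene-wise dispersion

    def forward(self, x):
        library = x.sum(dim=-1, keepdim=True)          # observed library size
        h = self.enc(torch.log1p(x))                   # log1p stabilizes the encoder
        mu_z, logvar_z = self.enc_mu(h), self.enc_logvar(h)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        h_dec = self.dec(z)
        mu = F.softmax(self.dec_scale(h_dec), dim=-1) * library   # NB mean
        theta = torch.exp(self.log_theta)                          # NB dispersion
        return mu, theta, mu_z, logvar_z
```

The ELBO is then the NB log-likelihood of x under (mu, theta) minus the Gaussian KL term shown earlier.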
Handling Covariates
Categorical covariates (batch, donor, etc.):
- One-hot encoded: s ∈ {0,1}^K
- Concatenate with latent: [z, s]
- Or use conditional layers
Continuous covariates (library size, percent_mito):
- Standardize to zero mean, unit variance
- Include in encoder and/or decoder
Covariate injection strategies:
- Concatenation: [z, s] fed to decoder
- Deep injection: s added at multiple layers
- Conditional batch norm: Batch-specific normalization
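A minimal sketch of the concatenation strategy (one-hot batch appended to z before decoding); the sizes are illustrative:

```python
import torch
import torch.nn.functional as F

n_latent, n_batches = 10, 3
decoder_input_dim = n_latent + n_batches     # the decoder must expect [z, s]

z = torch.randn(4, n_latent)                 # latent codes for 4 cells
batch = torch.tensor([0, 2, 1, 0])           # integer batch labels
s = F.one_hot(batch, num_classes=n_batches).float()

decoder_input = torch.cat([z, s], dim=-1)    # [z, s] fed to the decoder
```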
Advanced Theoretical Concepts
Transfer Learning (scArches)
Concept: Use pretrained model as initialization for new data
Process:
- Train reference model on large dataset
- Freeze encoder parameters
- Fine-tune decoder on query data
- Or fine-tune all with lower learning rate
Why it works:
- Encoder learns general cellular representations
- Decoder adapts to query-specific characteristics
- Prevents catastrophic forgetting
Applications:
- Query-to-reference mapping
- Few-shot learning for rare cell types
- Rapid analysis of new datasets
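A generic PyTorch sketch of the freeze-and-fine-tune idea. scvi-tools exposes its own scArches-style workflow through query-model utilities; this shows only the underlying principle, with `model.encoder` and `model.parameters()` as assumed names:

```python
import torch

# Freeze the pretrained encoder; only the remaining parameters stay trainable
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,   # lower learning rate for fine-tuning on query data
)
```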
Multi-Resolution Modeling (MrVI)
Idea: Separate shared and sample-specific variation
Latent space decomposition:
z = z_shared + z_sample
- z_shared: Common across samples
- z_sample: Sample-specific effects
Hierarchical structure:
Sample level: ρ_s ~ N(0, I)
Cell level: z_i ~ N(ρ_{s(i)}, σ²)
Benefits:
- Disentangle biological sources of variation
- Compare samples at different resolutions
- Identify sample-specific cell states
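A minimal sketch of this hierarchical prior with illustrative sizes; it is a simplification of the actual MrVI model:

```python
import torch

n_samples, n_cells_per_sample, n_latent, sigma = 3, 4, 10, 0.5

rho = torch.randn(n_samples, n_latent)     # sample level: rho_s ~ N(0, I)
sample_of_cell = torch.repeat_interleave(torch.arange(n_samples), n_cells_per_sample)
# cell level: z_i ~ N(rho_{s(i)}, sigma^2)
z = rho[sample_of_cell] + sigma * torch.randn(len(sample_of_cell), n_latent)
```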
Counterfactual Prediction
Goal: Predict outcome under different conditions
Example: "What would this cell look like if it came from a different batch?"
Method:
- Encode cell to latent: z = Encoder(x, s_original)
- Decode with new condition: x_new = Decoder(z, s_new)
- x_new is counterfactual prediction
Applications:
- Batch effect assessment
- Predicting treatment response
- In silico perturbation studies
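With a trained scvi-tools model, this kind of counterfactual decoding is available through the `transform_batch` argument; a usage sketch, assuming a batch category named `"batch_B"`:

```python
# Decode every cell as if it had been observed in batch "batch_B"
counterfactual = model.get_normalized_expression(
    adata,
    transform_batch="batch_B",   # condition the decoder on the new batch s_new
)
```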
Posterior Predictive Distribution
Definition: Distribution of new data given observed data
p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz
Estimation: Sample z from q(z|x), generate x_new from p(x_new|z)
Uses:
- Uncertainty quantification
- Robust predictions
- Outlier detection
Differential Expression Framework
Bayesian Approach
Traditional methods: Compare point estimates
- Wilcoxon, t-test, etc.
- Ignore uncertainty
- Require pseudocounts
scvi-tools approach: Compare distributions
- Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
- Compute log fold-change: LFC = log(μ_B) - log(μ_A)
- Posterior distribution of LFC quantifies uncertainty
Bayes Factor
Definition: Ratio of posterior odds to prior odds
BF = [P(H₁ | data) / P(H₀ | data)] / [P(H₁) / P(H₀)]
Interpretation:
- BF > 3: Moderate evidence for H₁
- BF > 10: Strong evidence
- BF > 100: Decisive evidence
In scvi-tools: Used to rank genes by evidence for DE
False Discovery Proportion (FDP)
Goal: Control the expected false discovery proportion (i.e., the false discovery rate)
Procedure:
- For each gene, compute posterior probability of DE
- Rank genes by evidence (Bayes factor)
- Select top k genes such that E[FDP] ≤ α
Advantage over p-values:
- Fully Bayesian
- Natural for posterior inference
- No arbitrary thresholds
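In scvi-tools, this Bayesian DE procedure is exposed through `differential_expression`; a usage sketch, where the `cell_type` column and group labels are assumptions about the data:

```python
# Compare two cell populations; returns a DataFrame with per-gene
# Bayes factors and posterior log fold-change summaries
de_results = model.differential_expression(
    groupby="cell_type",
    group1="B cells",
    group2="T cells",
)
```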
Implementation Details
Optimization
Optimizer: Adam (adaptive learning rates)
- Default lr = 0.001
- Momentum parameters: β₁=0.9, β₂=0.999
Training loop:
- Sample mini-batch of cells
- Compute ELBO loss
- Backpropagate gradients
- Update parameters with Adam
- Repeat until convergence
Convergence criteria:
- ELBO plateaus on validation set
- Early stopping prevents overfitting
- Typically 200-500 epochs
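A minimal sketch of this training loop for a generic VAE-style model; `model`, `elbo_loss`, `loader`, and `n_epochs` are placeholders:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

for epoch in range(n_epochs):
    for x in loader:                  # mini-batch of cells
        loss = -elbo_loss(model, x)   # negative ELBO for this mini-batch
        optimizer.zero_grad()
        loss.backward()               # backpropagate gradients
        optimizer.step()              # Adam parameter update
```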
Regularization
KL annealing: Gradually increase KL weight
- Prevents posterior collapse
- Starts at 0, increases to 1 over epochs
Dropout: Random neuron dropping during training
- Default: 0.1 dropout rate
- Prevents overfitting
- Improves generalization
Weight decay: L2 regularization on weights
- Prevents large weights
- Improves stability
Scalability
Mini-batch training:
- Process subset of cells per iteration
- Batch size: 64-256 cells
- Enables scaling to millions of cells
Stochastic optimization:
- Estimates ELBO on mini-batches
- Unbiased gradient estimates
- Converges to optimal solution
GPU acceleration:
- Neural networks naturally parallelize
- Order of magnitude speedup
- Essential for large datasets
Connections to Other Methods
vs. PCA
- PCA: Linear, deterministic
- scVI: Nonlinear, probabilistic
- Advantage: scVI captures complex structure, handles counts
vs. t-SNE/UMAP
- t-SNE/UMAP: Visualization-focused
- scVI: Full generative model
- Advantage: scVI enables downstream tasks (DE, imputation)
vs. Seurat Integration
- Seurat: Anchor-based alignment
- scVI: Probabilistic modeling
- Advantage: scVI provides uncertainty, works for multiple batches
vs. Harmony
- Harmony: PCA + batch correction
- scVI: VAE-based
- Advantage: scVI handles counts natively, more flexible
Mathematical Notation
Common symbols:
- x: Observed gene expression (counts)
- z: Latent representation
- θ: Model parameters
- q(z|x): Approximate posterior (encoder)
- p(x|z): Likelihood (decoder)
- p(z): Prior on latent variables
- μ, σ²: Mean and variance
- π: Dropout probability (ZINB)
- θ (in NB): Dispersion parameter
- s: Batch/covariate indicator
Further Reading
Key Papers:
- Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
- Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models"
- Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"
Concepts to explore:
- Variational inference in machine learning
- Bayesian deep learning
- Information theory (KL divergence, mutual information)
- Generative models (GANs, normalizing flows, diffusion models)
- Probabilistic programming (e.g., Pyro, which builds on PyTorch)
Mathematical background:
- Probability theory and statistics
- Linear algebra and calculus
- Optimization theory
- Information theory