# Theoretical Foundations of scvi-tools

This document explains the mathematical and statistical principles underlying scvi-tools.

## Core Concepts

### Variational Inference

**What is it?**
Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x): the probability of the latent variables z given the observed data x.

**Why use it?**
- Exact inference is computationally intractable for complex models
- Scales to large datasets (millions of cells)
- Provides uncertainty quantification
- Enables Bayesian reasoning about cell states

**How does it work?**
1. Define a simpler approximate distribution q(z|x) with learnable parameters
2. Minimize the KL divergence between q(z|x) and the true posterior p(z|x)
3. This is equivalent to maximizing the Evidence Lower Bound (ELBO)

**ELBO Objective**:
```
ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
              ↑                   ↑
        Reconstruction      Regularization
```

- **Reconstruction term**: the model should generate data similar to what was observed
- **Regularization term**: the latent representation should stay close to the prior

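To make the two terms concrete, here is a minimal sketch of a single ELBO evaluation with a Gaussian encoder and a standard normal prior. This is not the scvi-tools implementation: `encoder` and `decoder` are hypothetical callables, and a Gaussian likelihood stands in for the count likelihoods scVI actually uses.

```python
# Minimal ELBO sketch (illustrative only, not the scvi-tools implementation).
import torch
from torch.distributions import Normal, kl_divergence

def elbo_step(x, encoder, decoder):
    # Encoder outputs the parameters of q(z|x).
    mu_z, log_var_z = encoder(x)
    q_z = Normal(mu_z, torch.exp(0.5 * log_var_z))

    # Reparameterized sample z ~ q(z|x).
    z = q_z.rsample()

    # Decoder parameterizes p(x|z); a Gaussian here for simplicity,
    # whereas scVI uses NB/ZINB count likelihoods.
    mu_x = decoder(z)
    reconstruction = Normal(mu_x, 1.0).log_prob(x).sum(dim=-1)

    # KL(q(z|x) || p(z)) against the standard normal prior.
    prior = Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))
    kl = kl_divergence(q_z, prior).sum(dim=-1)

    return (reconstruction - kl).mean()  # ELBO, averaged over the mini-batch
```
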
### Variational Autoencoders (VAEs)

**Architecture**:
```
x (observed data)
        ↓
[Encoder Neural Network]
        ↓
z (latent representation)
        ↓
[Decoder Neural Network]
        ↓
x̂ (reconstructed data)
```

**Encoder**: Maps cells (x) to latent space (z)
- Learns q(z|x), the approximate posterior
- Parameterized by a neural network with learnable weights
- Outputs the mean and variance of the latent distribution

**Decoder**: Maps latent space (z) back to gene space
- Learns p(x|z), the likelihood
- Generates gene expression from the latent representation
- Models count distributions (Negative Binomial, Zero-Inflated NB)

**Reparameterization Trick**:
- Allows backpropagation through stochastic sampling
- Sample z = μ + σ ⊙ ε, where ε ~ N(0, I)
- Enables end-to-end training with gradient descent

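A compact sketch of an encoder that outputs (μ, log σ²) and applies the reparameterization trick; the layer sizes are made up for illustration, and scvi-tools' real modules (in `scvi.nn`) are more elaborate.

```python
# Toy encoder with the reparameterization trick (illustrative sizes only).
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, n_genes=2000, n_hidden=128, n_latent=10):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.mean = nn.Linear(n_hidden, n_latent)      # μ_z
        self.log_var = nn.Linear(n_hidden, n_latent)   # log σ²_z

    def forward(self, x):
        h = self.hidden(x)
        mu, log_var = self.mean(h), self.log_var(h)
        # Reparameterization: z = μ + σ ⊙ ε with ε ~ N(0, I),
        # so gradients flow through μ and σ despite the sampling step.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        return z, mu, log_var
```
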
### Amortized Inference

**Concept**: Share encoder parameters across all cells.

**Traditional inference**: Learn separate latent variables for each cell
- n_cells × n_latent parameters
- Doesn't scale to large datasets

**Amortized inference**: Learn a single encoder for all cells
- Fixed number of parameters regardless of cell count
- Enables fast inference on new cells
- Transfers learned patterns across the dataset

**Benefits**:
- Scalable to millions of cells
- Fast inference on query data
- Leverages shared structure across cells
- Enables few-shot learning

## Statistical Modeling

### Count Data Distributions

Single-cell data are counts (integer-valued), requiring appropriate distributions.

#### Negative Binomial (NB)
```
x ~ NB(μ, θ)
```
- **μ (mean)**: Expected expression level
- **θ (dispersion)**: Controls variance
- **Variance**: Var(x) = μ + μ²/θ

**When to use**: Gene expression without zero-inflation
- More flexible than Poisson (allows overdispersion)
- Models technical and biological variation

#### Zero-Inflated Negative Binomial (ZINB)
```
x ~ π·δ₀ + (1-π)·NB(μ, θ)
```
- **π (dropout rate)**: Probability of a technical zero
- **δ₀**: Point mass at zero
- **NB(μ, θ)**: Expression when not dropped out

**When to use**: Sparse scRNA-seq data
- Models technical dropout separately from biological zeros
- Better fit for highly sparse data (e.g., 10x data)

#### Poisson
```
x ~ Poisson(μ)
```
- Simplest count distribution
- Mean equals variance: Var(x) = μ

**When to use**: Less common for gene expression; ATAC-seq fragment counts
- More restrictive than NB
- Faster computation

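In scvi-tools, the choice among these count likelihoods is exposed as a model argument. A minimal sketch, assuming `adata` is an AnnData object containing raw counts:

```python
import scvi

# Register the data with the model class (raw counts expected).
scvi.model.SCVI.setup_anndata(adata)

# gene_likelihood selects the count distribution: "zinb" (default), "nb", or "poisson".
model = scvi.model.SCVI(adata, gene_likelihood="nb")
```
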
### Batch Correction Framework

**Problem**: Technical variation confounds biological signal
- Different sequencing runs, protocols, labs
- Must remove technical effects while preserving biology

**scvi-tools approach**:
1. Encode batch as a categorical variable s
2. Include s in the generative model
3. The latent space z is encouraged to be batch-invariant
4. The decoder conditions on s to capture batch-specific effects

**Mathematical formulation**:
```
Encoder: q(z|x, s)  - batch-aware encoding
Latent:  z          - batch-corrected representation
Decoder: p(x|z, s)  - batch-specific decoding
```

**Key insight**: Batch information flows through the decoder, not the latent space
- z captures biological variation
- s explains technical variation
- Biology and batch effects become separable

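In practice, this conditioning is configured when registering the data. A minimal sketch, assuming the batch label is stored in `adata.obs["batch"]`:

```python
import scvi

# Tell scvi-tools which column holds the batch label; s is built from it.
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")

model = scvi.model.SCVI(adata)
model.train()

# z: batch-corrected latent representation for downstream clustering/UMAP.
adata.obsm["X_scVI"] = model.get_latent_representation()
```
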
### Deep Generative Modeling

**Generative model**: Learns p(x), the data distribution

**Process**:
1. Sample a latent variable: z ~ p(z) = N(0, I)
2. Generate expression: x ~ p(x|z)
3. Joint distribution: p(x, z) = p(x|z)·p(z)

**Benefits**:
- Generate synthetic cells
- Impute missing values
- Quantify uncertainty
- Perform counterfactual predictions

**Inference network**: Inverts the generative process
- Given x, infer z
- q(z|x) approximates the true posterior p(z|x)

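The two-step generative process can be sketched directly with distribution objects. This is a self-contained toy example (made-up decoder and dispersion, untrained weights), not the scVI generative module:

```python
# Toy generative process: z ~ N(0, I), then x ~ NB(μ(z), θ).
import torch
from torch.distributions import Normal, NegativeBinomial

n_cells, n_latent, n_genes = 5, 10, 100
decoder = torch.nn.Sequential(torch.nn.Linear(n_latent, n_genes), torch.nn.Softplus())
theta = torch.ones(n_genes)  # made-up dispersion, one value per gene

z = Normal(0.0, 1.0).sample((n_cells, n_latent))   # step 1: z ~ p(z)
mu = decoder(z) + 1e-4                              # decoder gives the NB mean μ(z)
probs = mu / (mu + theta)                           # convert (μ, θ) to torch's NB parameters
x = NegativeBinomial(total_count=theta, probs=probs).sample()  # step 2: x ~ p(x|z)
```
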
## Model Architecture Details

### scVI Architecture

**Input**: Gene expression counts x ∈ ℕ^G (G genes)

**Encoder**:
```
h        = ReLU(W₁·x + b₁)
μ_z      = W₂·h + b₂
log σ²_z = W₃·h + b₃
z        ~ N(μ_z, σ²_z)
```

**Latent space**: z ∈ ℝ^d (typically d = 10-30)

**Decoder**:
```
h = ReLU(W₄·z + b₄)
μ = softmax(W₅·h + b₅) · library_size   # expected counts
θ = exp(W₆·h + b₆)                      # dispersion
π = sigmoid(W₇·h + b₇)                  # dropout probability, for ZINB
x ~ ZINB(μ, θ, π)
```

**Loss function (ELBO)**:
```
L = E_q[log p(x|z)] - KL(q(z|x) || N(0, I))
```

### Handling Covariates

**Categorical covariates** (batch, donor, etc.):
- One-hot encoded: s ∈ {0,1}^K
- Concatenated with the latent: [z, s]
- Or injected via conditional layers

**Continuous covariates** (library size, percent_mito):
- Standardized to zero mean, unit variance
- Included in the encoder and/or decoder

**Covariate injection strategies**:
- **Concatenation**: [z, s] fed to the decoder
- **Deep injection**: s added at multiple layers
- **Conditional batch norm**: Batch-specific normalization

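Both kinds of covariates are declared when registering the data. A sketch, assuming `adata.obs` contains `batch`, `donor`, and `percent_mito` columns:

```python
import scvi

# Categorical covariates are one-hot encoded; continuous ones are passed as numeric values.
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="batch",
    categorical_covariate_keys=["donor"],
    continuous_covariate_keys=["percent_mito"],
)
model = scvi.model.SCVI(adata)
```
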
## Advanced Theoretical Concepts

### Transfer Learning (scArches)

**Concept**: Use a pretrained model as the initialization for new data

**Process**:
1. Train a reference model on a large dataset
2. Freeze the encoder parameters
3. Fine-tune the decoder on query data
4. Or fine-tune everything with a lower learning rate

**Why it works**:
- The encoder learns general cellular representations
- The decoder adapts to query-specific characteristics
- Prevents catastrophic forgetting

**Applications**:
- Query-to-reference mapping
- Few-shot learning for rare cell types
- Rapid analysis of new datasets

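In scvi-tools this workflow is the scArches-style query mapping. A minimal sketch, assuming a trained reference model `ref_model` and a query AnnData `adata_query`:

```python
import scvi

# Align the query genes/fields with what the reference model expects.
scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)

# Initialize from the reference weights, then fine-tune on the query data.
query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)
query_model.train(max_epochs=100)

# Query cells now live in the reference latent space.
query_latent = query_model.get_latent_representation()
```
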
### Multi-Resolution Modeling (MrVI)

**Idea**: Separate shared and sample-specific variation

**Latent space decomposition**:
```
z = z_shared + z_sample
```
- **z_shared**: Common across samples
- **z_sample**: Sample-specific effects

**Hierarchical structure**:
```
Sample level: ρ_s ~ N(0, I)
Cell level:   z_i ~ N(ρ_{s(i)}, σ²)
```

**Benefits**:
- Disentangle biological sources of variation
- Compare samples at different resolutions
- Identify sample-specific cell states

### Counterfactual Prediction

**Goal**: Predict the outcome under different conditions

**Example**: "What would this cell look like if it came from a different batch?"

**Method**:
1. Encode the cell to the latent space: z = Encoder(x, s_original)
2. Decode with the new condition: x_new = Decoder(z, s_new)
3. x_new is the counterfactual prediction

**Applications**:
- Batch effect assessment
- Predicting treatment response
- In silico perturbation studies

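scvi-tools exposes this decode-under-another-condition step through the `transform_batch` argument. A sketch, assuming a trained `model` and a registered batch category named `"batch2"`:

```python
# Counterfactual decoding: expression of every cell as if it came from "batch2".
counterfactual = model.get_normalized_expression(
    adata,
    transform_batch="batch2",
    library_size="latent",
)
```
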
### Posterior Predictive Distribution

**Definition**: Distribution of new data given the observed data

```
p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz
```

**Estimation**: Sample z from q(z|x), then generate x_new from p(x_new|z)

**Uses**:
- Uncertainty quantification
- Robust predictions
- Outlier detection

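A sketch of drawing posterior predictive samples from a trained scVI model (`model` and `adata` assumed from earlier):

```python
# Draw samples x_new ~ p(x_new | x_observed) from the fitted model.
samples = model.posterior_predictive_sample(adata, n_samples=25)

# Spread across the draws gives a per-cell, per-gene notion of predictive uncertainty.
```
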
## Differential Expression Framework

### Bayesian Approach

**Traditional methods**: Compare point estimates
- Wilcoxon, t-test, etc.
- Ignore uncertainty in the estimates
- Require pseudocounts to compute log fold-changes

**scvi-tools approach**: Compare posterior distributions
- Sample from the posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
- Compute the log fold-change: LFC = log(μ_B) - log(μ_A)
- The posterior distribution of the LFC quantifies uncertainty

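The corresponding call, sketched under the assumption that cell-type labels live in `adata.obs["cell_type"]` and that groups named "B cells" and "T cells" exist:

```python
# Bayesian DE between two groups; returns per-gene LFC estimates and Bayes factors.
de_results = model.differential_expression(
    groupby="cell_type",
    group1="B cells",
    group2="T cells",
    fdr_target=0.05,
)
```
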
### Bayes Factor

**Definition**: Ratio of posterior odds to prior odds

```
     P(H₁|data) / P(H₀|data)
BF = ───────────────────────
         P(H₁) / P(H₀)
```

**Interpretation**:
- BF > 3: Moderate evidence for H₁
- BF > 10: Strong evidence
- BF > 100: Decisive evidence

**In scvi-tools**: Used to rank genes by the evidence for differential expression

### False Discovery Proportion (FDP)

**Goal**: Control the expected false discovery proportion (the false discovery rate)

**Procedure**:
1. For each gene, compute the posterior probability of DE
2. Rank genes by evidence (Bayes factor)
3. Select the top k genes such that E[FDP] ≤ α

**Advantage over p-values**:
- Fully Bayesian
- Natural for posterior inference
- No arbitrary thresholds

## Implementation Details

### Optimization

**Optimizer**: Adam (adaptive learning rates)
- Default lr = 0.001
- Momentum parameters: β₁ = 0.9, β₂ = 0.999

**Training loop**:
1. Sample a mini-batch of cells
2. Compute the ELBO loss
3. Backpropagate gradients
4. Update parameters with Adam
5. Repeat until convergence

**Convergence criteria**:
- ELBO plateaus on a validation set
- Early stopping prevents overfitting
- Typically 200-500 epochs

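These defaults map onto the training call. A sketch (the values shown are illustrative; the appropriate epoch count depends on dataset size):

```python
model.train(
    max_epochs=400,
    batch_size=128,            # mini-batch of cells per gradient step
    early_stopping=True,       # stop when the validation ELBO plateaus
    plan_kwargs={"lr": 1e-3},  # Adam learning rate
)
```
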
### Regularization

**KL annealing**: Gradually increase the KL weight
- Prevents posterior collapse
- Starts at 0 and increases to 1 over the first epochs

**Dropout**: Randomly drop neurons during training
- Default: 0.1 dropout rate
- Prevents overfitting
- Improves generalization

**Weight decay**: L2 regularization on the weights
- Prevents large weights
- Improves stability

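A sketch of where each of these knobs is typically set; the specific keyword values below are illustrative assumptions rather than recommended settings:

```python
model = scvi.model.SCVI(adata, dropout_rate=0.1)  # dropout in encoder/decoder layers

model.train(
    plan_kwargs={
        "n_epochs_kl_warmup": 400,  # KL annealing: warm the KL weight up from 0 to 1
        "weight_decay": 1e-6,       # L2 regularization on the weights
    },
)
```
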
### Scalability

**Mini-batch training**:
- Process a subset of cells per iteration
- Batch size: 64-256 cells
- Enables scaling to millions of cells

**Stochastic optimization**:
- Estimates the ELBO on mini-batches
- Unbiased gradient estimates
- Converges to a local optimum of the ELBO

**GPU acceleration**:
- Neural networks parallelize naturally
- Order-of-magnitude speedup
- Essential for large datasets

## Connections to Other Methods

### vs. PCA
- **PCA**: Linear, deterministic
- **scVI**: Nonlinear, probabilistic
- **Advantage**: scVI captures complex structure and handles counts directly

### vs. t-SNE/UMAP
- **t-SNE/UMAP**: Visualization-focused embeddings
- **scVI**: Full generative model
- **Advantage**: scVI enables downstream tasks (DE, imputation)

### vs. Seurat Integration
- **Seurat**: Anchor-based alignment
- **scVI**: Probabilistic modeling
- **Advantage**: scVI provides uncertainty estimates and scales to many batches

### vs. Harmony
- **Harmony**: PCA followed by batch correction
- **scVI**: VAE-based
- **Advantage**: scVI handles counts natively and is more flexible

## Mathematical Notation

**Common symbols**:
- x: Observed gene expression (counts)
- z: Latent representation
- θ: Model parameters
- q(z|x): Approximate posterior (encoder)
- p(x|z): Likelihood (decoder)
- p(z): Prior on latent variables
- μ, σ²: Mean and variance
- π: Dropout probability (ZINB)
- θ (in NB): Dispersion parameter
- s: Batch/covariate indicator

## Further Reading

**Key Papers**:
1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models"
3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"

**Concepts to explore**:
- Variational inference in machine learning
- Bayesian deep learning
- Information theory (KL divergence, mutual information)
- Generative models (GANs, normalizing flows, diffusion models)
- Probabilistic programming (e.g., Pyro, built on PyTorch)

**Mathematical background**:
- Probability theory and statistics
- Linear algebra and calculus
- Optimization theory
- Information theory